-------------- Task 1 --------------

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 20 times.
    The *best* time for each kernel (excluding the first iteration)
    will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 116886 microseconds.
    (= 116886 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           29569.4     0.048585     0.043288     0.059164
Scale:          17644.0     0.082248     0.072546     0.102548
Add:            21030.1     0.100620     0.091298     0.124700
Triad:          21230.7     0.100758     0.090435     0.120631
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
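Note on the "Best Rate" column: the numbers above are consistent with
bytes moved divided by the minimum time, counting 2 arrays for Copy and
Scale and 3 arrays for Add and Triad. A minimal sketch of that arithmetic
follows; the function name stream_rate_mbs is hypothetical and not part of
STREAM itself.

```cpp
#include <cassert>
#include <cmath>

// Bandwidth in MB/s (1 MB = 1e6 bytes), assuming 8-byte elements.
// arrays_touched: 2 for Copy/Scale, 3 for Add/Triad (assumption that
// matches the table above).
inline double stream_rate_mbs(int arrays_touched, double elements,
                              double min_seconds) {
    const double bytes_moved = arrays_touched * 8.0 * elements;
    return bytes_moved / min_seconds * 1.0e-6;
}
```

For example, stream_rate_mbs(2, 80000000, 0.043288) reproduces the Copy
rate up to the rounding of the printed minimum time.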
./flops.exe

    FLOPS C Program (Double Precision), V2.0 18 Dec 1992

    Module     Error        RunTime      MFLOPS
                            (usec)
        1      4.0146e-13      0.0024   5827.9076
        2     -1.4166e-13      0.0007  10037.8942
        3      4.7184e-14      0.0039   4371.9185
        4     -1.2557e-13      0.0034   4355.5711
        5     -1.3800e-13      0.0066   4415.6439
        6      3.2380e-13      0.0065   4441.6299
        7     -8.4583e-11      0.0053   2277.1707
        8      3.4867e-13      0.0069   4367.6094

    Iterations      =  512000000
    NullTime (usec) =     0.0000
    MFLOPS(1)       =  7050.6178
    MFLOPS(2)       =  3461.6233
    MFLOPS(3)       =  4175.0442
    MFLOPS(4)       =  4389.7311

-------------- Task 2 --------------

Memory needed (double 64-bit, 8 bytes):
(A) (2N + 1) * 8 bytes
(B) (M*N + M + N) * 8 bytes
(C) (M*L + L*N + M*N) * 8 bytes
(D) (N + N + p) * 8 bytes

Floating point operations:
(A) 2N
(B) M * 2N
(C) M * 2L * N
(D) 2 * N * p (Horner scheme)

Read/Write operations:
(A) Read: 2N         Write: 1
(B) Read: M*2N       Write: M*N
(C) Read: M*2L*N     Write: M*L*N
(D) Read: 2*N*p      Write: N*p
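
The memory and flop counts above can be turned into benchmark rates once a
loop duration is measured. The following is a minimal sketch under stated
assumptions: the struct KernelCost and all function names are hypothetical
(they are not taken from task_3.cpp), sizes are passed as doubles to avoid
integer overflow, and GiByte means 2^30 bytes.

```cpp
#include <cassert>
#include <cmath>

// Cost model for kernels (A)-(D), assuming 8-byte doubles.
struct KernelCost {
    double bytes;  // memory footprint in bytes
    double flops;  // floating point operations per kernel call
};

// (A) scalar product of two vectors of length N
inline KernelCost cost_scalar_product(double N) {
    return {(2.0 * N + 1.0) * 8.0, 2.0 * N};
}

// (B) matrix-vector product with an M x N matrix
inline KernelCost cost_matvec(double M, double N) {
    return {(M * N + M + N) * 8.0, 2.0 * M * N};
}

// (C) matrix-matrix product, (M x L) times (L x N)
inline KernelCost cost_matmat(double M, double L, double N) {
    return {(M * L + L * N + M * N) * 8.0, 2.0 * M * L * N};
}

// (D) polynomial with p coefficients evaluated at N points (Horner scheme)
inline KernelCost cost_polynomial(double N, double p) {
    return {(N + N + p) * 8.0, 2.0 * N * p};
}

// Rates derived from a measured duration of one kernel call.
inline double gflops(const KernelCost& c, double seconds) {
    return c.flops / seconds * 1.0e-9;
}

inline double gib_per_sec(const KernelCost& c, double seconds) {
    return c.bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

As an illustration with made-up inputs, gflops(cost_scalar_product(5.0e7), 0.05)
evaluates to about 2.0.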

-------------- Task 3 --------------

Functions implemented in task_3.cpp

-------------- Task 4 --------------

----- Benchmark (A) -----
Memory allocated  : 0.745 GByte
Duration per loop : 0.036 sec
GFLOPS            : 2.579
GiByte/s          : 20.630
-------------------------
----- Benchmark (B) -----
Memory allocated  : 0.715 GByte
Duration per loop : 0.105 sec
GFLOPS            : 1.704
GiByte/s          : 6.818
-------------------------
----- Benchmark (C) -----
Memory allocated  : 0.026 GByte
Duration per loop : 0.459 sec
GFLOPS            : 4.062
GiByte/s          : 0.057
-------------------------
----- Benchmark (D) -----
Memory allocated  : 0.015 GByte
Duration per loop : 0.310 sec
GFLOPS            : 1.201
GiByte/s          : 0.048
-------------------------


-------------- Task 5 --------------

----- Benchmark norm -----
||x|| = 897124.301552
Memory allocated  : 0.373 GByte
Duration per loop : 0.022 sec
GFLOPS            : 4.222
GiByte/s          : 16.890
-------------------------
What do you observe? Why?
-> The norm is faster per loop than the scalar product from Task 4 (0.022 sec vs. 0.036 sec),
   because it only loads the elements of one vector instead of two.
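
To make the comparison concrete, here is a minimal sketch of the two
kernels. It is not the code from task_3.cpp; the names scalar_product and
norm2 are hypothetical. Both loops do 2 flops per element, but the norm
reads one array while the scalar product reads two.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar product: reads x[i] and y[i], i.e. 16 bytes per 2 flops.
double scalar_product(const std::vector<double>& x,
                      const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Euclidean norm: reads only x[i], i.e. 8 bytes per 2 flops.
double norm2(const std::vector<double>& x) {
    double sum = 0.0;
    for (double xi : x) {
        sum += xi * xi;
    }
    return std::sqrt(sum);
}
```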

-------------- Task 6 --------------

Benchmarks using cBLAS
----- Benchmark (A) -----
Memory allocated  : 0.745 GByte
Duration per loop : 0.023 sec
GFLOPS            : 4.006
GiByte/s          : 32.052
-------------------------
----- Benchmark (B) -----
Memory allocated  : 0.715 GByte
Duration per loop : 0.026 sec
GFLOPS            : 7.010
GiByte/s          : 28.045
-------------------------
----- Benchmark (C) -----
Memory allocated  : 0.026 GByte
Duration per loop : 0.020 sec
GFLOPS            : 91.320
GiByte/s          : 1.278
-------------------------


-------------- Task 7 --------------

A =
4.000000 1.000000 0.250000 0.111111 0.062500 
1.000000 4.000000 1.000000 0.250000 0.111111 
0.250000 1.000000 4.000000 1.000000 0.250000 
0.111111 0.250000 1.000000 4.000000 1.000000 
0.062500 0.111111 0.250000 1.000000 4.000000 


b =
0.000000 1.000000 
0.000000 1.000000 
0.000000 1.000000 
0.000000 1.000000 
0.000000 1.000000 


L + U =
4.000000 1.000000 0.250000 0.111111 0.062500 
0.250000 3.750000 0.937500 0.222222 0.095486 
0.062500 0.250000 3.750000 0.937500 0.222222 
0.027778 0.059259 0.250000 3.749370 0.937050 
0.015625 0.025463 0.059259 0.249922 3.749234 


x =
0.000000 0.196259 
0.000000 0.148391 
0.000000 0.151272 
0.000000 0.148391 
0.000000 0.196259 


Check solution:
A * x = 
0.000000 1.000000 
0.000000 1.000000 
0.000000 1.000000 
0.000000 1.000000 
0.000000 1.000000 


N =        | 1      | 2      | 4      | 8      | 16     | 32
-----------|--------|--------|--------|--------|--------|-------
Nrhs = 2   | 0.0047 | 0.0045 | 0.0046 | 0.0130 | 0.0203 | 0.0476
Nrhs = 4   | 0.0027 | 0.0031 | 0.0033 | 0.0046 | 0.0085 | 0.0250
Nrhs = 8   | 0.0035 | 0.0035 | 0.0045 | 0.0061 | 0.0119 | 0.0300
Nrhs = 16  | 0.0085 | 0.0062 | 0.0221 | 0.0113 | 0.0599 | 0.0757
Nrhs = 32  | 0.0122 | 0.0165 | 0.0112 | 0.0123 | 0.0238 | 0.0834
Nrhs = 64  | 0.0072 | 0.0078 | 0.0164 | 0.0133 | 0.0421 | 0.0666
Nrhs = 128 | 0.0073 | 0.0189 | 0.0269 | 0.0199 | 0.0337 | 0.1041
Nrhs = 256 | 0.0107 | 0.0135 | 0.0279 | 0.0351 | 0.0582 | 0.1438
Nrhs = 512 | 0.0276 | 0.0174 | 0.0237 | 0.1027 | 0.1113 | 0.2417

For fixed N, the solution time per right-hand side does not increase consistently as Nrhs grows,
so the solver scales very well with the number of right-hand sides.
It is faster than expected.
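
One possible reason for the good scaling is that the factorization is
computed once and then reused for every right-hand side. The sketch below
illustrates that split, assuming an unpivoted in-place Doolittle LU with
row-major storage. It is not necessarily the routine used for the timings
above, and the names lu_factor, lu_solve and make_test_matrix are
hypothetical. The test matrix has 4 on the diagonal and 1/(i-j)^2 elsewhere,
which matches the 5 x 5 example printed above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place LU factorization without pivoting (assumption: A needs none,
// e.g. because it is diagonally dominant). L has a unit diagonal and is
// stored below the diagonal, U on and above it ("L + U" layout).
void lu_factor(std::vector<double>& A, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i * n + k] /= A[k * n + k];
            for (std::size_t j = k + 1; j < n; ++j) {
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
            }
        }
    }
}

// Solves for nrhs right-hand sides stored row-major in B (n x nrhs).
// B is overwritten with the solution. Cost is O(n^2) per right-hand side.
void lu_solve(const std::vector<double>& LU, std::size_t n,
              std::vector<double>& B, std::size_t nrhs) {
    for (std::size_t r = 0; r < nrhs; ++r) {
        // Forward substitution: L y = b
        for (std::size_t i = 1; i < n; ++i) {
            for (std::size_t k = 0; k < i; ++k) {
                B[i * nrhs + r] -= LU[i * n + k] * B[k * nrhs + r];
            }
        }
        // Backward substitution: U x = y
        for (std::size_t i = n; i-- > 0;) {
            for (std::size_t k = i + 1; k < n; ++k) {
                B[i * nrhs + r] -= LU[i * n + k] * B[k * nrhs + r];
            }
            B[i * nrhs + r] /= LU[i * n + i];
        }
    }
}

// Test matrix: 4 on the diagonal, 1/(i-j)^2 off the diagonal.
std::vector<double> make_test_matrix(std::size_t n) {
    std::vector<double> A(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double d = static_cast<double>(i) - static_cast<double>(j);
            A[i * n + j] = (i == j) ? 4.0 : 1.0 / (d * d);
        }
    }
    return A;
}
```

The O(n^3) factorization is paid once in lu_factor, while each additional
right-hand side only costs the two O(n^2) substitution sweeps in lu_solve.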


-------------- Task 8 --------------


 There are 1 processes running.

Intervalls: 100 x 100

 Start Jacobi solver for 10201 d.o.f.s
aver. Jacobi rate :  0.997922  (1000 iter)
final error: 0.124971 (rel)   0.000194029 (abs)
JacobiSolve: timing in sec. : 0.079399
ASCI file  square_100.txt  opened
17361  2  34320  3

 Start Jacobi solver for 17361 d.o.f.s
aver. Jacobi rate :  0.998401  (1000 iter)
final error: 0.201744 (rel)   0.000265133 (abs)
JacobiSolve: timing in sec. : 0.18853
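
Note on the "aver. Jacobi rate": the printed values are consistent with the
geometric mean of the error reduction per iteration, i.e. the relative
final error raised to the power 1/iterations. This is an observation about
the numbers above, not something confirmed from the solver source, and the
function name average_rate is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Geometric-mean error reduction factor per iteration, assuming the
// relative error after `iterations` steps is `rel_error`.
inline double average_rate(double rel_error, int iterations) {
    return std::pow(rel_error, 1.0 / iterations);
}
```

A rate this close to 1 means each Jacobi sweep removes only about 0.2 % of
the remaining error.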