-------------- Task 1 --------------

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 20 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 116886 microseconds.
(= 116886 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           29569.4     0.048585     0.043288     0.059164
Scale:          17644.0     0.082248     0.072546     0.102548
Add:            21030.1     0.100620     0.091298     0.124700
Triad:          21230.7     0.100758     0.090435     0.120631
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
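The reported rates can be checked by hand: each rate is the number of bytes a kernel moves in one pass divided by its minimum time. A minimal sketch, assuming STREAM's usual counting (Copy and Scale touch two arrays, Add and Triad touch three) and 1 MB = 10^6 bytes; the helper name `stream_rate_mb_s` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bandwidth in MB/s (1 MB = 1e6 bytes) of a STREAM-style
// kernel that touches `arrays` arrays of `n` doubles in `min_time_s` seconds.
// Copy (c = a) and Scale (b = s*c) touch 2 arrays;
// Add (c = a + b) and Triad (a = b + s*c) touch 3 arrays.
double stream_rate_mb_s(std::size_t n, int arrays, double min_time_s)
{
    assert(min_time_s > 0.0);
    const double bytes = static_cast<double>(arrays) * sizeof(double) * static_cast<double>(n);
    return 1.0e-6 * bytes / min_time_s;
}
```

With n = 80000000, `stream_rate_mb_s(n, 2, 0.043288)` gives about 29569 MB/s (Copy) and `stream_rate_mb_s(n, 3, 0.090435)` about 21231 MB/s (Triad), which matches the table up to the rounding of the printed times.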

./flops.exe

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module     Error        RunTime      MFLOPS
                         (usec)
  1      4.0146e-13      0.0024    5827.9076
  2     -1.4166e-13      0.0007   10037.8942
  3      4.7184e-14      0.0039    4371.9185
  4     -1.2557e-13      0.0034    4355.5711
  5     -1.3800e-13      0.0066    4415.6439
  6      3.2380e-13      0.0065    4441.6299
  7     -8.4583e-11      0.0053    2277.1707
  8      3.4867e-13      0.0069    4367.6094

Iterations      =  512000000
NullTime (usec) =     0.0000
MFLOPS(1)       =  7050.6178
MFLOPS(2)       =  3461.6233
MFLOPS(3)       =  4175.0442
MFLOPS(4)       =  4389.7311

-------------- Task 2 --------------

Memory needed (double 64-bit, 8 bytes):
(A) (2N + 1) * 8 bytes
(B) (M*N + M + N) * 8 bytes
(C) (M*L + L*N + M*N) * 8 bytes
(D) (N + N + p) * 8 bytes

Floating point operations:
(A) 2N
(B) M * 2N
(C) M * 2L * N
(D) 2 * N * p (Horner scheme)

Read/Write operations:
(A) Read: 2N        Write: 1
(B) Read: M*2N      Write: M*N
(C) Read: M*2L*N    Write: M*L*N
(D) Read: 2*N*p     Write: N*p

In (A) the sum is accumulated in a scalar and written once. The write counts for (B) to (D) instead assume that the result element is updated in memory in every inner iteration; with a scalar accumulator, as in (A), they reduce to M, M*N and N, respectively.
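The formulas translate directly into code, which makes it easy to predict the memory footprint of a benchmark before running it. A minimal sketch; the names `Counts`, `count_A` to `count_D` and `to_gib` are hypothetical, and (D) uses the memory formula exactly as written above. For instance, N = 50,000,000 is an assumed vector length that reproduces the 0.745 GByte reported for Benchmark (A) in Task 4, which suggests that figure is given in GiB (2^30 bytes).

```cpp
#include <cassert>
#include <cstdint>

// Memory footprint and floating point operations for one benchmark loop.
struct Counts {
    std::uint64_t bytes;  // memory needed, 8 bytes per double
    std::uint64_t flops;  // floating point operations
};

constexpr std::uint64_t kDouble = 8;

// (A) two vectors of length N plus one scalar result
Counts count_A(std::uint64_t N) { return {(2 * N + 1) * kDouble, 2 * N}; }

// (B) M x N matrix, input vector of length N, result of length M
Counts count_B(std::uint64_t M, std::uint64_t N) { return {(M * N + M + N) * kDouble, 2 * M * N}; }

// (C) (M x L) times (L x N) gives M x N
Counts count_C(std::uint64_t M, std::uint64_t L, std::uint64_t N)
{
    return {(M * L + L * N + M * N) * kDouble, 2 * M * L * N};
}

// (D) N points, N results, p coefficients; p Horner steps per point
Counts count_D(std::uint64_t N, std::uint64_t p) { return {(2 * N + p) * kDouble, 2 * N * p}; }

// Convert a byte count to GiB (2^30 bytes).
double to_gib(std::uint64_t bytes) { return static_cast<double>(bytes) / static_cast<double>(1ull << 30); }
```

`to_gib(count_A(50000000).bytes)` evaluates to about 0.7451.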

-------------- Task 3 --------------

Functions implemented in task_3.cpp
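The contents of task_3.cpp are not shown in this report. As an illustration only, the four kernels counted in Task 2 might look like the following sketch. It assumes, as the formulas suggest, that (A) is a scalar product, (B) a matrix-vector product, (C) a matrix-matrix product and (D) a polynomial evaluated with the Horner scheme, all with row-major `std::vector` storage. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// (A) scalar product x . y: 2N flops, reads 2N values, one result.
double scalar(const Vec& x, const Vec& y)
{
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// (B) y = A x with A stored row-major as M x N: 2*M*N flops.
Vec matvec(const Vec& A, const Vec& x, std::size_t M, std::size_t N)
{
    assert(A.size() == M * N && x.size() == N);
    Vec y(M, 0.0);
    for (std::size_t i = 0; i < M; ++i) {
        double s = 0.0;  // accumulate in a local variable, store y[i] once
        for (std::size_t j = 0; j < N; ++j) s += A[i * N + j] * x[j];
        y[i] = s;
    }
    return y;
}

// (C) C = A B with A (M x L) and B (L x N), row-major: 2*M*L*N flops.
// The i-k-j loop order walks B and C contiguously in the innermost loop.
Vec matmat(const Vec& A, const Vec& B, std::size_t M, std::size_t L, std::size_t N)
{
    assert(A.size() == M * L && B.size() == L * N);
    Vec C(M * N, 0.0);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < L; ++k) {
            const double a = A[i * L + k];
            for (std::size_t j = 0; j < N; ++j) C[i * N + j] += a * B[k * N + j];
        }
    return C;
}

// (D) evaluate a[0] + a[1] x + ... + a[p] x^p at every x[i] with the Horner
// scheme: p multiply-add steps per point, i.e. 2*N*p flops for degree p.
Vec horner(const Vec& a, const Vec& x)
{
    assert(!a.empty());
    Vec y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        double v = a.back();
        for (std::size_t k = a.size() - 1; k-- > 0;) v = v * x[i] + a[k];
        y[i] = v;
    }
    return y;
}
```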

-------------- Task 4 --------------

----- Benchmark (A) -----
Memory allocated : 0.745 GByte
Duration per loop : 0.036 sec
GFLOPS : 2.579
GiByte/s : 20.630
-------------------------
----- Benchmark (B) -----
Memory allocated : 0.715 GByte
Duration per loop : 0.105 sec
GFLOPS : 1.704
GiByte/s : 6.818
-------------------------
----- Benchmark (C) -----
Memory allocated : 0.026 GByte
Duration per loop : 0.459 sec
GFLOPS : 4.062
GiByte/s : 0.057
-------------------------
----- Benchmark (D) -----
Memory allocated : 0.015 GByte
Duration per loop : 0.310 sec
GFLOPS : 1.201
GiByte/s : 0.048
-------------------------
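
-------------- Task 5 --------------

----- Benchmark norm -----
||x|| = 897124.301552
Memory allocated : 0.373 GByte
Duration per loop : 0.022 sec
GFLOPS : 4.222
GiByte/s : 16.890
-------------------------

What do you observe? Why?
-> One loop of the norm takes 0.022 sec, compared with 0.036 sec for the scalar product (Task 4, Benchmark (A)) on vectors of the same length (0.373 GByte is half of 0.745 GByte). Both kernels perform 2N floating point operations, but the norm loads the elements of only one vector instead of two, so it moves half as much data. Since both loops are limited by memory bandwidth rather than by arithmetic, the smaller memory traffic makes the norm faster per loop, and its GFLOPS rate is correspondingly higher (4.222 vs. 2.579).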
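The difference in memory traffic is visible directly in the loop body. A minimal sketch (the names `norm2` and `bytes_streamed` are hypothetical, not taken from the benchmark code): the norm does one multiply and one add per element, like the scalar product, but loads one double per iteration instead of two.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean norm ||x||: 2N flops (plus one sqrt), one load per iteration,
// so a sweep reads N doubles = 8N bytes.
double norm2(const std::vector<double>& x)
{
    double s = 0.0;
    for (double xi : x) s += xi * xi;
    return std::sqrt(s);
}

// Bytes a single sweep must load from `vectors` double arrays of length n:
// 1 array for the norm, 2 arrays for the scalar product.
std::size_t bytes_streamed(std::size_t n, std::size_t vectors)
{
    return vectors * n * sizeof(double);
}
```

For the same length N, `bytes_streamed(N, 1)` is half of `bytes_streamed(N, 2)`, so at a fixed memory bandwidth the norm needs roughly half the time of the scalar product.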

-------------- Task 6 --------------

Benchmarks using CBLAS
----- Benchmark (A) -----
Memory allocated : 0.745 GByte
Duration per loop : 0.023 sec
GFLOPS : 4.006
GiByte/s : 32.052
-------------------------
----- Benchmark (B) -----
Memory allocated : 0.715 GByte
Duration per loop : 0.026 sec
GFLOPS : 7.010
GiByte/s : 28.045
-------------------------
----- Benchmark (C) -----
Memory allocated : 0.026 GByte
Duration per loop : 0.020 sec
GFLOPS : 91.320
GiByte/s : 1.278
-------------------------

-------------- Task 7 --------------

A =
4.000000 1.000000 0.250000 0.111111 0.062500
1.000000 4.000000 1.000000 0.250000 0.111111
0.250000 1.000000 4.000000 1.000000 0.250000
0.111111 0.250000 1.000000 4.000000 1.000000
0.062500 0.111111 0.250000 1.000000 4.000000

b =
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000

L + U =
4.000000 1.000000 0.250000 0.111111 0.062500
0.250000 3.750000 0.937500 0.222222 0.095486
0.062500 0.250000 3.750000 0.937500 0.222222
0.027778 0.059259 0.250000 3.749370 0.937050
0.015625 0.025463 0.059259 0.249922 3.749234

x =
0.000000 0.196259
0.000000 0.148391
0.000000 0.151272
0.000000 0.148391
0.000000 0.196259

Check solution:
A * x =
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
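The printed factors and solution can be reproduced with a plain LU decomposition. A minimal sketch, assuming A_ij = 4 for i = j and 1/(i-j)^2 otherwise (which matches the printed A). It uses Doolittle elimination without pivoting and stores L (unit diagonal, not stored) and U in one array, like the "L + U" output. The actual program may have used a library routine with partial pivoting; because A is strictly diagonally dominant, partial pivoting would not swap any rows here, so the factors coincide. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Test matrix (assumed formula): 4 on the diagonal, 1/(i-j)^2 elsewhere.
Matrix make_A(std::size_t n)
{
    Matrix A(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            const double d = static_cast<double>(i) - static_cast<double>(j);
            A[i][j] = (i == j) ? 4.0 : 1.0 / (d * d);
        }
    return A;
}

// LU without pivoting: the strict lower part of the result holds the
// multipliers of L (unit diagonal implied), the upper part holds U.
Matrix lu_factor(Matrix A)
{
    const std::size_t n = A.size();
    for (std::size_t k = 0; k < n; ++k) {
        assert(A[k][k] != 0.0);  // no pivoting: pivots must be nonzero
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];  // multiplier l_ik
            for (std::size_t j = k + 1; j < n; ++j) A[i][j] -= A[i][k] * A[k][j];
        }
    }
    return A;
}

// Solve (L U) x = b for one right-hand side: forward, then back substitution.
std::vector<double> lu_solve(const Matrix& LU, std::vector<double> b)
{
    const std::size_t n = LU.size();
    assert(b.size() == n);
    for (std::size_t i = 0; i < n; ++i)      // L y = b
        for (std::size_t j = 0; j < i; ++j) b[i] -= LU[i][j] * b[j];
    for (std::size_t i = n; i-- > 0;) {      // U x = y
        for (std::size_t j = i + 1; j < n; ++j) b[i] -= LU[i][j] * b[j];
        b[i] /= LU[i][i];
    }
    return b;
}
```

Factoring once and calling `lu_solve` for each column of b (the zero vector and the vector of ones) gives x ≈ (0.196259, 0.148391, 0.151272, 0.148391, 0.196259) for the second column. Each further right-hand side then costs only the two triangular solves, about 2N^2 flops for an N x N matrix, instead of a new factorization.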

Solution time per right-hand side in milliseconds: sec*1000/Nrhs

N =          |    1   |    2   |    4   |    8   |   16   |   32
-------------|--------|--------|--------|--------|--------|--------
Nrhs =  2000 | 0.2122 | 0.0048 | 0.0079 | 0.0544 | 0.0850 | 0.1523
Nrhs =  4000 | 0.0036 | 0.0053 | 0.0085 | 0.0142 | 0.0748 | 0.1596
Nrhs =  6000 | 0.0033 | 0.0043 | 0.0069 | 0.0147 | 0.0327 | 0.1801
Nrhs =  8000 | 0.0036 | 0.0041 | 0.0069 | 0.0197 | 0.0342 | 0.1795
Nrhs = 10000 | 0.0028 | 0.0042 | 0.0067 | 0.0140 | 0.0328 | 0.1883
Nrhs = 12000 | 0.0025 | 0.0044 | 0.0067 | 0.0157 | 0.0320 | 0.1981
Nrhs = 14000 | 0.0023 | 0.0038 | 0.0065 | 0.0157 | 0.0333 | 0.1916
Nrhs = 16000 | 0.0022 | 0.0036 | 0.0073 | 0.0140 | 0.0324 | 0.1824
Nrhs = 18000 | 0.0028 | 0.0038 | 0.0066 | 0.0145 | 0.0337 | 0.1741

For fixed N, the solution time per right-hand side stays roughly constant as Nrhs grows, so the total solution time scales linearly with the number of right-hand sides. The time per right-hand side does grow with N, from roughly 0.003 ms at N = 1 to roughly 0.18 ms at N = 32. The larger values in the first rows (e.g. 0.2122 ms for N = 1, Nrhs = 2000) are outliers, most likely caused by warm-up effects in the first measurements.

-------------- Task 8 --------------

There are 1 processes running.

Intervalls: 100 x 100

Start Jacobi solver for 10201 d.o.f.s
aver. Jacobi rate : 0.997922 (1000 iter)
final error: 0.124971 (rel) 0.000194029 (abs)
JacobiSolve: timing in sec. : 0.079399

ASCI file square_100.txt opened
17361 2 34320 3

Start Jacobi solver for 17361 d.o.f.s
aver. Jacobi rate : 0.998401 (1000 iter)
final error: 0.201744 (rel) 0.000265133 (abs)
JacobiSolve: timing in sec. : 0.18853
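The reported average rate is consistent with rate = (rel. error)^(1/iterations): 0.997922^1000 ≈ 0.125 and 0.998401^1000 ≈ 0.202, matching the final relative errors. Likewise, 10201 = 101 x 101 is the number of nodes of a 100 x 100 grid. A minimal sketch of a Jacobi iteration that reports the same kind of average rate, assuming a 5-point finite-difference Laplacian on the unit square with zero Dirichlet boundary values and right-hand side f = 1. The solver above works on its own mesh and discretization, so the numbers will differ. `JacobiResult` and `jacobi_poisson` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative residual after `iters` sweeps and the average reduction per
// sweep, rate = rel^(1/iters).
struct JacobiResult {
    double rel;
    double rate;
};

// Jacobi iteration for -Laplace(u) = 1 on the unit square with n x n
// intervals, 5-point stencil and u = 0 on the boundary; the (n-1)^2 interior
// nodes are the unknowns, the initial guess is u = 0.
JacobiResult jacobi_poisson(std::size_t n, std::size_t iters)
{
    assert(n >= 2 && iters > 0);
    const double h2 = 1.0 / (static_cast<double>(n) * static_cast<double>(n));  // h^2
    const std::size_t m = n + 1;  // nodes per direction
    std::vector<double> u(m * m, 0.0), un(m * m, 0.0);
    auto at = [m](std::size_t i, std::size_t j) { return i * m + j; };

    // Euclidean norm of the residual r = f - A v over the interior nodes.
    auto residual = [&](const std::vector<double>& v) {
        double s = 0.0;
        for (std::size_t i = 1; i < n; ++i)
            for (std::size_t j = 1; j < n; ++j) {
                const double Av = (4.0 * v[at(i, j)] - v[at(i - 1, j)] - v[at(i + 1, j)]
                                   - v[at(i, j - 1)] - v[at(i, j + 1)]) / h2;
                s += (1.0 - Av) * (1.0 - Av);
            }
        return std::sqrt(s);
    };

    const double r0 = residual(u);
    for (std::size_t k = 0; k < iters; ++k) {
        // Jacobi update: solve each stencil row for its diagonal unknown
        // using only values from the previous sweep.
        for (std::size_t i = 1; i < n; ++i)
            for (std::size_t j = 1; j < n; ++j)
                un[at(i, j)] = 0.25 * (h2 + u[at(i - 1, j)] + u[at(i + 1, j)]
                                       + u[at(i, j - 1)] + u[at(i, j + 1)]);
        u.swap(un);  // boundary entries stay zero in both arrays
    }
    const double rel = residual(u) / r0;
    return {rel, std::pow(rel, 1.0 / static_cast<double>(iters))};
}
```

Because the average rate is a geometric mean over all sweeps, a rate close to 1 (as for both meshes above) means many iterations are needed for a modest error reduction.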