task8
parent 3763c53dab
commit 3882aee07a
71 changed files with 160045 additions and 0 deletions
222
ex3/ex3_results.txt
Normal file
@ -0,0 +1,222 @@
-------------- Task 1 --------------

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 116886 microseconds.
   (= 116886 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           29569.4     0.048585     0.043288     0.059164
Scale:          17644.0     0.082248     0.072546     0.102548
Add:            21030.1     0.100620     0.091298     0.124700
Triad:          21230.7     0.100758     0.090435     0.120631
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
./flops.exe

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1      4.0146e-13      0.0024   5827.9076
     2     -1.4166e-13      0.0007  10037.8942
     3      4.7184e-14      0.0039   4371.9185
     4     -1.2557e-13      0.0034   4355.5711
     5     -1.3800e-13      0.0066   4415.6439
     6      3.2380e-13      0.0065   4441.6299
     7     -8.4583e-11      0.0053   2277.1707
     8      3.4867e-13      0.0069   4367.6094

   Iterations      =  512000000
   NullTime (usec) =     0.0000
   MFLOPS(1)       =  7050.6178
   MFLOPS(2)       =  3461.6233
   MFLOPS(3)       =  4175.0442
   MFLOPS(4)       =  4389.7311

-------------- Task 2 --------------

Memory needed (double 64-bit, 8 bytes):
(A) (2N + 1) * 8 bytes
(B) (M*N + M + N) * 8 bytes
(C) (M*L + L*N + M*N) * 8 bytes
(D) (N + N + p) * 8 bytes

Floating point operations:
(A) 2N
(B) M * 2N
(C) M * 2L * N
(D) 2 * N * p (Horner scheme)

Read/Write operations:
(A) Read: 2N Write: 1
(B) Read: M*2N Write: M*N
(C) Read: M*2L*N Write: M*L*N
(D) Read: 2*N*p Write: N*p
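
The memory and flop formulas above can be written down as a small cost model. This is only a sketch: the struct and function names are hypothetical, and reading (A) as the scalar product (as Task 5 suggests) and (B) to (D) as matrix-vector product, matrix-matrix product and polynomial evaluation is an interpretation of the formulas, not something this file states.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cost model: memory footprint and flop count per kernel call,
// copied term by term from the Task 2 formulas.
struct KernelCost {
    std::size_t bytes;  // memory needed, assuming 8-byte doubles
    std::size_t flops;  // floating point operations
};

constexpr std::size_t kDoubleBytes = 8;

// (A) two vectors of length N plus one scalar result
KernelCost cost_a(std::size_t N) {
    assert(N > 0);
    return {(2 * N + 1) * kDoubleBytes, 2 * N};
}

// (B) M x N matrix, input vector of length N, output vector of length M
KernelCost cost_b(std::size_t M, std::size_t N) {
    assert(M > 0 && N > 0);
    return {(M * N + M + N) * kDoubleBytes, M * 2 * N};
}

// (C) M x L matrix times L x N matrix, result M x N
KernelCost cost_c(std::size_t M, std::size_t L, std::size_t N) {
    assert(M > 0 && L > 0 && N > 0);
    return {(M * L + L * N + M * N) * kDoubleBytes, M * 2 * L * N};
}

// (D) N input values, N output values, p coefficients (Horner scheme)
KernelCost cost_d(std::size_t N, std::size_t p) {
    assert(N > 0 && p > 0);
    return {(N + N + p) * kDoubleBytes, 2 * N * p};
}
```

For example, cost_a(10) gives 168 bytes and 20 flops. Dividing the flop count by a measured duration per loop gives the flop rate for that kernel.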

-------------- Task 3 --------------

Functions implemented in task_3.cpp

-------------- Task 4 --------------

----- Benchmark (A) -----
Memory allocated : 0.745 GByte
Duration per loop : 0.036 sec
GFLOPS : 2.579
GiByte/s : 20.630
-------------------------
----- Benchmark (B) -----
Memory allocated : 0.715 GByte
Duration per loop : 0.105 sec
GFLOPS : 1.704
GiByte/s : 6.818
-------------------------
----- Benchmark (C) -----
Memory allocated : 0.026 GByte
Duration per loop : 0.459 sec
GFLOPS : 4.062
GiByte/s : 0.057
-------------------------
----- Benchmark (D) -----
Memory allocated : 0.015 GByte
Duration per loop : 0.310 sec
GFLOPS : 1.201
GiByte/s : 0.048
-------------------------


-------------- Task 5 --------------

----- Benchmark norm -----
||x|| = 897124.301552
Memory allocated : 0.373 GByte
Duration per loop : 0.022 sec
GFLOPS : 4.222
GiByte/s : 16.890
-------------------------
What do you observe? Why?
-> Faster per loop than the scalar product in Task 4 (A) (0.022 sec vs. 0.036 sec), because the norm only loads the elements of one vector instead of two.
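
The reported numbers are consistent with this explanation. Assuming both kernels perform 2 flops per index (one multiply, one add) and stream 8-byte doubles, the scalar product moves 8 bytes per flop and the norm only 4. A minimal sketch (the function name is hypothetical) that recovers the reported bandwidths from the reported flop rates:

```cpp
#include <cassert>

// Bandwidth implied by a flop rate for a streaming kernel that loads
// `doubles_loaded` 8-byte values for every `flops` floating point operations.
// The result has the same unit prefix as `gflops` (G-bytes per second).
double implied_bandwidth(double gflops, double doubles_loaded, double flops) {
    assert(flops > 0.0);
    const double bytes_per_flop = 8.0 * doubles_loaded / flops;
    return gflops * bytes_per_flop;
}
```

With the values above, implied_bandwidth(2.579, 2, 2) is about 20.63 for the scalar product and implied_bandwidth(4.222, 1, 2) is about 16.89 for the norm, matching the reported GiByte/s up to rounding. The norm reaches a higher flop rate at a lower bandwidth, which suggests both kernels are limited by memory traffic rather than arithmetic.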

-------------- Task 6 --------------

Benchmarks using cBLAS
----- Benchmark (A) -----
Memory allocated : 0.745 GByte
Duration per loop : 0.023 sec
GFLOPS : 4.006
GiByte/s : 32.052
-------------------------
----- Benchmark (B) -----
Memory allocated : 0.715 GByte
Duration per loop : 0.026 sec
GFLOPS : 7.010
GiByte/s : 28.045
-------------------------
----- Benchmark (C) -----
Memory allocated : 0.026 GByte
Duration per loop : 0.020 sec
GFLOPS : 91.320
GiByte/s : 1.278
-------------------------


-------------- Task 7 --------------

A =
4.000000 1.000000 0.250000 0.111111 0.062500
1.000000 4.000000 1.000000 0.250000 0.111111
0.250000 1.000000 4.000000 1.000000 0.250000
0.111111 0.250000 1.000000 4.000000 1.000000
0.062500 0.111111 0.250000 1.000000 4.000000


b =
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000


L + U =
4.000000 1.000000 0.250000 0.111111 0.062500
0.250000 3.750000 0.937500 0.222222 0.095486
0.062500 0.250000 3.750000 0.937500 0.222222
0.027778 0.059259 0.250000 3.749370 0.937050
0.015625 0.025463 0.059259 0.249922 3.749234


x =
0.000000 0.196259
0.000000 0.148391
0.000000 0.151272
0.000000 0.148391
0.000000 0.196259


Check solution:
A * x =
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000


N =        |    1   |    2   |    4   |    8   |   16   |   32
-----------|--------|--------|--------|--------|--------|-------
Nrhs = 2   | 0.0047 | 0.0045 | 0.0046 | 0.0130 | 0.0203 | 0.0476
Nrhs = 4   | 0.0027 | 0.0031 | 0.0033 | 0.0046 | 0.0085 | 0.0250
Nrhs = 8   | 0.0035 | 0.0035 | 0.0045 | 0.0061 | 0.0119 | 0.0300
Nrhs = 16  | 0.0085 | 0.0062 | 0.0221 | 0.0113 | 0.0599 | 0.0757
Nrhs = 32  | 0.0122 | 0.0165 | 0.0112 | 0.0123 | 0.0238 | 0.0834
Nrhs = 64  | 0.0072 | 0.0078 | 0.0164 | 0.0133 | 0.0421 | 0.0666
Nrhs = 128 | 0.0073 | 0.0189 | 0.0269 | 0.0199 | 0.0337 | 0.1041
Nrhs = 256 | 0.0107 | 0.0135 | 0.0279 | 0.0351 | 0.0582 | 0.1438
Nrhs = 512 | 0.0276 | 0.0174 | 0.0237 | 0.1027 | 0.1113 | 0.2417

For fixed N, the solution time per rhs does not increase consistently as Nrhs grows, so the solve scales very well.
It's faster than expected.
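
One reason many right-hand sides are cheap is that the factorization is computed once and only the two triangular solves are repeated per rhs. The sketch below illustrates this on the 5 x 5 matrix printed above. It is a hand-written LU without pivoting, not the routine that produced these results (this file does not say which one was used); the function names are hypothetical, and the rule "4 on the diagonal, 1/(i-j)^2 elsewhere" is read off the printed A.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Matrix as printed in Task 7 (assumed rule: 4 on the diagonal, 1/(i-j)^2 elsewhere).
Matrix make_matrix(std::size_t n) {
    Matrix A(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            const double d = static_cast<double>(i) - static_cast<double>(j);
            A[i][j] = (i == j) ? 4.0 : 1.0 / (d * d);
        }
    return A;
}

// In-place LU factorization without pivoting. Afterwards A holds U on and
// above the diagonal and the multipliers of L (unit diagonal) below it,
// i.e. the "L + U" layout. Cost: O(n^3), paid once.
void lu_factor(Matrix& A) {
    const std::size_t n = A.size();
    for (std::size_t k = 0; k < n; ++k) {
        assert(A[k][k] != 0.0);  // no pivoting, so a zero pivot would be fatal
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];
            for (std::size_t j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];
        }
    }
}

// Solve L U x = b for one right-hand side, reusing the factorization.
// Cost: O(n^2) per right-hand side.
std::vector<double> lu_solve(const Matrix& LU, std::vector<double> b) {
    const std::size_t n = LU.size();
    assert(b.size() == n);
    for (std::size_t i = 0; i < n; ++i)      // forward substitution: L y = b
        for (std::size_t j = 0; j < i; ++j)
            b[i] -= LU[i][j] * b[j];
    for (std::size_t i = n; i-- > 0;) {      // back substitution: U x = y
        for (std::size_t j = i + 1; j < n; ++j)
            b[i] -= LU[i][j] * b[j];
        b[i] /= LU[i][i];
    }
    return b;
}
```

Calling lu_factor once on make_matrix(5) and then lu_solve with a vector of ones reproduces the second column of x above (0.196259, 0.148391, ...). Each further right-hand side needs only another lu_solve call.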


-------------- Task 8 --------------


There are 1 processes running.

Intervalls: 100 x 100

Start Jacobi solver for 10201 d.o.f.s
aver. Jacobi rate : 0.997922 (1000 iter)
final error: 0.124971 (rel) 0.000194029 (abs)
JacobiSolve: timing in sec. : 0.079399
ASCI file square_100.txt opened
17361 2 34320 3

Start Jacobi solver for 17361 d.o.f.s
aver. Jacobi rate : 0.998401 (1000 iter)
final error: 0.201744 (rel) 0.000265133 (abs)
JacobiSolve: timing in sec. : 0.18853