-------------- Task 1 --------------
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 116886 microseconds.
   (= 116886 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           29569.4     0.048585     0.043288     0.059164
Scale:          17644.0     0.082248     0.072546     0.102548
Add:            21030.1     0.100620     0.091298     0.124700
Triad:          21230.7     0.100758     0.090435     0.120631
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

./flops.exe

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1      4.0146e-13      0.0024    5827.9076
     2     -1.4166e-13      0.0007   10037.8942
     3      4.7184e-14      0.0039    4371.9185
     4     -1.2557e-13      0.0034    4355.5711
     5     -1.3800e-13      0.0066    4415.6439
     6      3.2380e-13      0.0065    4441.6299
     7     -8.4583e-11      0.0053    2277.1707
     8      3.4867e-13      0.0069    4367.6094

   Iterations      =  512000000
   NullTime (usec) =     0.0000
   MFLOPS(1)       =  7050.6178
   MFLOPS(2)       =  3461.6233
   MFLOPS(3)       =  4175.0442
   MFLOPS(4)       =  4389.7311

-------------- Task 2 --------------
Memory needed (double 64-bit, 8 bytes):
(A) (2N + 1) * 8 bytes
(B) (M*N + M + N) * 8 bytes
(C) (M*L + L*N + M*N) * 8 bytes
(D) (N + N + p) * 8 bytes

Floating point operations:
(A) 2N
(B) M * 2N
(C) M * 2L * N
(D) 2 * N * p (Horner scheme)

Read/Write operations:
(A) Read: 2N        Write: 1
(B) Read: M*2N      Write: M*N
(C) Read: M*2L*N    Write: M*L*N
(D) Read: 2*N*p     Write: N*p

-------------- Task 3 --------------
Functions implemented in task_3.cpp

-------------- Task 4 --------------
----- Benchmark (A) -----
Memory allocated  : 0.745 GByte
Duration per loop : 0.036 sec
GFLOPS            : 2.579
GiByte/s          : 20.630
-------------------------

----- Benchmark (B) -----
Memory allocated  : 0.715 GByte
Duration per loop : 0.105 sec
GFLOPS            : 1.704
GiByte/s          : 6.818
-------------------------

----- Benchmark (C) -----
Memory allocated  : 0.026 GByte
Duration per loop : 0.459 sec
GFLOPS            : 4.062
GiByte/s          : 0.057
-------------------------

----- Benchmark (D) -----
Memory allocated  : 0.015 GByte
Duration per loop : 0.310 sec
GFLOPS            : 1.201
GiByte/s          : 0.048
-------------------------

-------------- Task 5 --------------
----- Benchmark norm -----
||x|| = 897124.301552
Memory allocated  : 0.373 GByte
Duration per loop : 0.022 sec
GFLOPS            : 4.222
GiByte/s          : 16.890
-------------------------

What do you observe? Why?
-> The norm takes less time per loop than the scalar product of benchmark (A)
   (0.022 sec vs. 0.036 sec). It loads the elements of only one vector
   instead of two, so it moves half the data for the same 2N floating point
   operations.

-------------- Task 6 --------------
Benchmarks using cBLAS
----- Benchmark (A) -----
Memory allocated  : 0.745 GByte
Duration per loop : 0.023 sec
GFLOPS            : 4.006
GiByte/s          : 32.052
-------------------------

----- Benchmark (B) -----
Memory allocated  : 0.715 GByte
Duration per loop : 0.026 sec
GFLOPS            : 7.010
GiByte/s          : 28.045
-------------------------

----- Benchmark (C) -----
Memory allocated  : 0.026 GByte
Duration per loop : 0.020 sec
GFLOPS            : 91.320
GiByte/s          : 1.278
-------------------------

-------------- Task 7 --------------
A =
4.000000 1.000000 0.250000 0.111111 0.062500
1.000000 4.000000 1.000000 0.250000 0.111111
0.250000 1.000000 4.000000 1.000000 0.250000
0.111111 0.250000 1.000000 4.000000 1.000000
0.062500 0.111111 0.250000 1.000000 4.000000

b =
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000

L + U =
4.000000 1.000000 0.250000 0.111111 0.062500
0.250000 3.750000 0.937500 0.222222 0.095486
0.062500 0.250000 3.750000 0.937500 0.222222
0.027778 0.059259 0.250000 3.749370 0.937050
0.015625 0.025463 0.059259 0.249922 3.749234

x =
0.000000 0.196259
0.000000 0.148391
0.000000 0.151272
0.000000 0.148391
0.000000 0.196259

Check solution: A * x =
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000
0.000000 1.000000

N =        |   1    |   2    |   4    |   8    |   16   |   32
-----------|--------|--------|--------|--------|--------|-------
Nrhs =   2 | 0.0047 | 0.0045 | 0.0046 | 0.0130 | 0.0203 | 0.0476
Nrhs =   4 | 0.0027 | 0.0031 | 0.0033 | 0.0046 | 0.0085 | 0.0250
Nrhs =   8 | 0.0035 | 0.0035 | 0.0045 | 0.0061 | 0.0119 | 0.0300
Nrhs =  16 | 0.0085 | 0.0062 | 0.0221 | 0.0113 | 0.0599 | 0.0757
Nrhs =  32 | 0.0122 | 0.0165 | 0.0112 | 0.0123 | 0.0238 | 0.0834
Nrhs =  64 | 0.0072 | 0.0078 | 0.0164 | 0.0133 | 0.0421 | 0.0666
Nrhs = 128 | 0.0073 | 0.0189 | 0.0269 | 0.0199 | 0.0337 | 0.1041
Nrhs = 256 | 0.0107 | 0.0135 | 0.0279 | 0.0351 | 0.0582 | 0.1438
Nrhs = 512 | 0.0276 | 0.0174 | 0.0237 | 0.1027 | 0.1113 | 0.2417

For fixed N, the solution time per right-hand side does not increase
consistently as Nrhs grows, so the solver scales very well with the number
of right-hand sides. It is faster than expected.

-------------- Task 8 --------------
There are 1 processes running.
Intervalls: 100 x 100
Start Jacobi solver for 10201 d.o.f.s
aver. Jacobi rate : 0.997922 (1000 iter)
final error: 0.124971 (rel) 0.000194029 (abs)
JacobiSolve: timing in sec. : 0.079399
ASCI file square_100.txt opened
17361 2 34320 3
Start Jacobi solver for 17361 d.o.f.s
aver. Jacobi rate : 0.998401 (1000 iter)
final error: 0.201744 (rel) 0.000265133 (abs)
JacobiSolve: timing in sec. : 0.18853