

3.4.3 Communication expenditure

The previous sections neglected the time needed for communication (exchange of data) in all investigations of speedup and efficiency.

Let
$ t_{\mathrm{Startup}}$ - latency (time until communication starts),
$ t_{\mathrm{Word}}\;\;$ - time for transferring one word ( $ \leftarrow$ bandwidth),

then the time to transfer a message of $ N$ words is:

$\displaystyle \boxed{ t_k \;=\; t_{\mathrm{Startup}} + N\cdot t_{\mathrm{Word}}\enspace. }$ (3.6)

According to formula (3.6) and to practical experience, the transfer of one long message is faster than the transfer of several short messages.
The operating system and/or the hardware splits huge messages into several smaller ones; the size of these smaller messages depends on the manufacturer ($ > 1$ kByte).
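The advantage of one long message over many short ones follows directly from formula (3.6): the payload cost $N\cdot t_{\mathrm{Word}}$ is the same, but every extra message pays the startup latency again. A minimal sketch, with purely illustrative values for $t_{\mathrm{Startup}}$ and $t_{\mathrm{Word}}$:

```python
# Transfer time according to formula (3.6): t_k = t_startup + N * t_word.
# The latency and per-word time below are illustrative assumptions only.

def transfer_time(n_words, t_startup=100e-6, t_word=4e-7):
    """Time [s] to send one message of n_words words (formula 3.6)."""
    return t_startup + n_words * t_word

N = 10_000
one_message = transfer_time(N)              # one long message
ten_messages = 10 * transfer_time(N // 10)  # same data split into ten messages

# Both variants move N words, but the split variant pays the
# startup latency ten times instead of once.
print(one_message < ten_messages)   # True
```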

Table 3.1: Comparison of latency and bandwidth

machine/OS      | latency             | bandwidth
Xplorer/Parix   | 100 $ \mu$s         | (meas.) 1.3 MB/sec
                |                     | (peak) 8.8 MB/sec
T9000           | 10 $ \mu$s          | (est.) 80 MB/sec
Intel iPSC/860  | 82.5 $ \mu$s        | 2.5 MB/sec
Convex SPP-1    | 0.2 $ \mu$s         | 16.0 MB/sec
nCube2          |                     | 71.0 MB/sec
Cray T3D        | 6 $ \mu$s           | 120.0 MB/sec
Intel Paragon   |                     | 200.0 MB/sec
Origin2000      |                     | (sust.) 680.0 MB/sec



The sustained bisection bandwidth of an Origin2000 scales linearly up to 10 GB/sec on 32 processors; this value remains the same for 64 processors and doubles to 20 GB/sec with 128 processors. There are algorithms that are not worth parallelizing, even though they may look like good candidates.

Example: fine-grained FFT

Computation of the FFT coefficients $ e^{2\pi i \frac{k}{N}}$, $ k = 0,\ldots,N-1$.

Denote by $ t_x$ the time to compute one coefficient; then we have
sequential : $ t_{\mathrm{sequential}} \;=\; N\cdot t_x$ for a vector of length $ N$.
parallel :
a) Calculate locally a subvector of length $ \frac{N}{P}$ , time: $ \frac{N}{P}\cdot t_x$ .
b) Data exchange (all processes need the results), time:
$ {\mathcal O}(N)\cdot t_{\mathrm{Word}} + \underbrace{{\mathcal O}(\log{P})}_{\mbox{\tiny best case}}\cdot t_{\mathrm{Startup}} $

$ t_{\mathrm{par}} \;=\; \frac{N}{P}\cdot t_x + {\mathcal O}(N)\cdot t_{\mathrm{Word}} + {\mathcal O}(\log{P})\cdot t_{\mathrm{Startup}}$
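The two times above can be compared numerically. A sketch with hypothetical machine parameters ($t_{\mathrm{Startup}}$, $t_{\mathrm{Word}}$ are assumed values, and the constants hidden in the ${\mathcal O}$-terms are set to 1):

```python
import math

# Illustrative, assumed machine parameters (not measured values).
T_STARTUP = 100e-6   # latency per message [s]
T_WORD    = 4e-7     # transfer time per word [s]

def t_sequential(N, t_x):
    """Sequential time: N coefficients at t_x seconds each."""
    return N * t_x

def t_parallel(N, P, t_x):
    """Parallel time: local work + O(N) word transfers + O(log P) startups.

    The constants in the O(N) and O(log P) terms are assumed to be 1.
    """
    return (N / P) * t_x + N * T_WORD + math.log2(P) * T_STARTUP

N, P = 4096, 16
# Cheap coefficients: communication dominates, the parallel version loses.
print(t_sequential(N, t_x=1e-7) > t_parallel(N, P, t_x=1e-7))  # False
# Expensive coefficients: parallelization pays off.
print(t_sequential(N, t_x=1e-4) > t_parallel(N, P, t_x=1e-4))  # True
```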


Question : When do we achieve $ t_{\mathrm{sequential}} > t_{\mathrm{par}}$ ?
$ \longrightarrow$ Only if $ t_x$ is big enough, i.e., parallelization does not pay off for small $ t_x$ - even if the latency is small and the bandwidth is big.
$ \longrightarrow$ The ratio $ \frac{\mbox{{\scriptsize arithmetics}}}{\mbox{{\scriptsize communication}}}$ should be as big as possible.
$ \longrightarrow$ Coarse-grain algorithms should be preferred (see Sect. 6.4).
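The break-even point can be made explicit by solving $t_{\mathrm{sequential}} > t_{\mathrm{par}}$ for $t_x$. A sketch under the same assumptions as before (illustrative latency and per-word time, ${\mathcal O}$-constants set to 1):

```python
import math

def breakeven_tx(N, P, t_startup=100e-6, t_word=4e-7):
    """Smallest t_x for which t_sequential > t_par.

    Derived from  N*t_x > (N/P)*t_x + N*t_word + log2(P)*t_startup,
    i.e.          t_x   > (N*t_word + log2(P)*t_startup) / (N*(1 - 1/P)).
    The latency/bandwidth values are assumptions, O-constants are 1.
    """
    return (N * t_word + math.log2(P) * t_startup) / (N * (1.0 - 1.0 / P))

# Only coefficients more expensive than this threshold make
# the fine-grained parallel FFT worthwhile.
tx_min = breakeven_tx(N=4096, P=16)
print(tx_min > 0)   # True
```

For growing $N$ the threshold approaches $t_{\mathrm{Word}}/(1-\frac{1}{P})$, i.e. even with many coefficients the arithmetic per coefficient must outweigh the per-word transfer cost.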

Gundolf Haase 2000-03-20