

3.4.3 Communication expenditure

The previous sections neglected the time needed for communication (exchange of data) in all investigations of speedup and efficiency.

Let
$ t_{\mathrm{Startup}}$ - latency (time until communication starts),
$ t_{\mathrm{Word}}\;\;$ - time for transferring one word ( $ \leftarrow$ bandwidth),

then the time to transfer a message of $ N$ words is:

$\displaystyle \boxed{ t_k \;=\; t_{\mathrm{Startup}} + N\cdot t_{\mathrm{Word}}\enspace. }$ (3.6)

According to formula (3.6) and to practical experience, the transfer of one long message is faster than the transfer of several short messages.
The operating system and/or the hardware splits huge messages into several smaller ones; the size of these smaller messages depends on the manufacturer ($ > 1$ kByte).
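The advantage of one long message over many short ones follows directly from formula (3.6): the payload cost $N\cdot t_{\mathrm{Word}}$ is the same, but every extra message pays the startup latency again. A minimal sketch, with purely illustrative values for $t_{\mathrm{Startup}}$ and $t_{\mathrm{Word}}$:

```python
# Transfer time according to formula (3.6): t_k = t_startup + N * t_word.
# The latency and per-word time below are illustrative assumptions only.

def transfer_time(n_words, t_startup=100e-6, t_word=4e-7):
    """Time [s] to send one message of n_words words (formula 3.6)."""
    return t_startup + n_words * t_word

N = 10_000
one_message = transfer_time(N)              # one long message
ten_messages = 10 * transfer_time(N // 10)  # same data split into ten messages

# Both variants move N words, but the split variant pays the
# startup latency ten times instead of once.
print(one_message < ten_messages)   # True
```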

Table 3.1: Comparison of latency and bandwidth

machine/OS      | latency             | bandwidth
Xplorer/Parix   | 100 $ \mu$s         | (meas.) 1.3 MB/sec
                |                     | (peak) 8.8 MB/sec
T9000           | 10 $ \mu$s          | (est.) 80 MB/sec
Intel iPSC/860  | 82.5 $ \mu$s        | 2.5 MB/sec
Convex SPP-1    | 0.2 $ \mu$s         | 16.0 MB/sec
nCube2          |                     | 71.0 MB/sec
Cray T3D        | 6 $ \mu$s           | 120.0 MB/sec
Intel Paragon   |                     | 200.0 MB/sec
Origin2000      |                     | (sust.) 680.0 MB/sec



The sustained bisection bandwidth of an Origin2000 scales linearly up to 10 GB/sec on 32 processors; this value remains the same for 64 processors and doubles to 20 GB/sec with 128 processors. There are algorithms that are not worth parallelizing, even though they may look like good candidates.

Example: fine-grained FFT

Computation of the FFT coefficients $ e^{2\pi i \frac{k}{N}}$, $ k = 0,\ldots,N-1$.

Denote by $ t_x$ the time to compute one coefficient; then we have
sequential : $ t_{\mathrm{sequential}} \;=\; N\cdot t_x$ for a vector of length $ N$.
parallel :
a) Calculate locally a subvector of length $ \frac{N}{P}$ , time: $ \frac{N}{P}\cdot t_x$ .
b) Data exchange (all processes need the results), time:
$ {\mathcal O}(N)\cdot t_{\mathrm{Word}} + \underbrace{{\mathcal O}(\log{P})}_{\mbox{\tiny best case}}\cdot t_{\mathrm{Startup}} $

$ t_{\mathrm{par}} \;=\; \frac{N}{P}\cdot t_x + {\mathcal O}(N)\cdot t_{\mathrm{Word}} + {\mathcal O}(\log{P})\cdot t_{\mathrm{Startup}}$
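The two times above can be compared numerically. A sketch with hypothetical machine parameters ($t_{\mathrm{Startup}}$, $t_{\mathrm{Word}}$ are assumed values, and the constants hidden in the ${\mathcal O}$-terms are set to 1):

```python
import math

# Illustrative, assumed machine parameters (not measured values).
T_STARTUP = 100e-6   # latency per message [s]
T_WORD    = 4e-7     # transfer time per word [s]

def t_sequential(N, t_x):
    """Sequential time: N coefficients at t_x seconds each."""
    return N * t_x

def t_parallel(N, P, t_x):
    """Parallel time: local work + O(N) word transfers + O(log P) startups.

    The constants in the O(N) and O(log P) terms are assumed to be 1.
    """
    return (N / P) * t_x + N * T_WORD + math.log2(P) * T_STARTUP

N, P = 4096, 16
# Cheap coefficients: communication dominates, the parallel version loses.
print(t_sequential(N, t_x=1e-7) > t_parallel(N, P, t_x=1e-7))  # False
# Expensive coefficients: parallelization pays off.
print(t_sequential(N, t_x=1e-4) > t_parallel(N, P, t_x=1e-4))  # True
```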


Question : When do we achieve $ t_{\mathrm{sequential}} > t_{\mathrm{par}}$ ?
$ \longrightarrow$ Only if $ t_x$ is big enough, i.e., parallelization does not pay off for small $ t_x$ - even if the latency is small and the bandwidth is big.
$ \longrightarrow$ The ratio $ \frac{\mbox{{\scriptsize arithmetics}}}{\mbox{{\scriptsize communication}}}$ should be as big as possible.
$ \longrightarrow$ Coarse-grain algorithms should be preferred (see Sect. 6.4).
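The break-even point can be made explicit by solving $t_{\mathrm{sequential}} > t_{\mathrm{par}}$ for $t_x$. A sketch under the same assumptions as before (illustrative latency and per-word time, ${\mathcal O}$-constants set to 1):

```python
import math

def breakeven_tx(N, P, t_startup=100e-6, t_word=4e-7):
    """Smallest t_x for which t_sequential > t_par.

    Derived from  N*t_x > (N/P)*t_x + N*t_word + log2(P)*t_startup,
    i.e.          t_x   > (N*t_word + log2(P)*t_startup) / (N*(1 - 1/P)).
    The latency/bandwidth values are assumptions, O-constants are 1.
    """
    return (N * t_word + math.log2(P) * t_startup) / (N * (1.0 - 1.0 / P))

# Only coefficients more expensive than this threshold make
# the fine-grained parallel FFT worthwhile.
tx_min = breakeven_tx(N=4096, P=16)
print(tx_min > 0)   # True
```

For growing $N$ the threshold approaches $t_{\mathrm{Word}}/(1-\frac{1}{P})$, i.e. even with many coefficients the arithmetic per coefficient must outweigh the per-word transfer cost.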

Gundolf Haase 2000-03-20