next up previous contents
Next: 3.3.5 Barrier Up: 3.3 Basic global operations Previous: 3.3.3 Broadcast   Contents


3.3.4 Reduce- and All-Reduce operations

Here, we perform an arithmetical/logical operation on the distributed data of all processes (e.g., $ s = \sum\limits_{i=0}^{p-1} a_i$). At the end, the result is available on one root process (Reduce operation) or on all processes (All-Reduce).

$ \bullet$ Reduce : We use the tree topologie with root $ = 000$.

Figure 3.14: Reduce operation in binary tree
\begin{figure}\unitlength0.04\textwidth
\begin{center}
\begin{picture}(14,9)(0...
...5}} \put(5.9,8.4){\makebox(0,0)[bl]{c)}}
\end{picture} \end{center} \end{figure}

Sequence a), b), c): a) = link 3, b) = link 2, c) = link 1
$ s_4 := a_4$, $ s_6 := a_6$, $ s_5 := a_5$, $ s_7 := a_7$
$ s_2 := (a_2+a_6)$, $ s_3 := (a_3+a_7)$
$ s_1 := [(a_1+a_5)+(a_3+a_7)]$
$ s_0 := [(a_0+a_4)+(a_2+a_6)] + [(a_1+a_5) + (a_3+a_7)]
 =  \sum\limits_{i=0}^{p-1} a_i$

Thus only process $ 000$ possesses the correct result, all other processes have only partial results at the end.

Realization via the function TREE/SMALL>_UPDO($ n$,$ X$,$ Y$,$ H$,$ VtOp$), which can also handle vectors and operations $ VtOp$ on them in the parameter list (in example above the call is : CALL Tree_UpDo(1,S,A,AuxArray,VdPlus)).


$ \bullet$ All-Reduce

Variant a): Combines a Reduce with a Scatter operation.
$\left.{\begin{array}{l}
\mbox{{\sc Tree\_UpDo}} \ \mbox{{\sc Tree\_Down}}\;\;...
...Tree\_Do}}\begin{smallmatrix}I \ D \ ensuremath{\mathbb{R}}\end{smallmatrix} $


Variant b): Accumulation in hypercube (Cube accumulation)
In the cube accumulation, one half of the processes (sub-cube of dimension nCube-1) exchanges data with the appropriate processes of the second sub-cube. This exchange will be repeated nCube-times between all proper sub-cube combinations.

Figure 3.15: All-Reduce in hypercube
\begin{figure}\unitlength0.04\textwidth
\begin{center}
\begin{picture}(11,11)
...
...akebox(0,0)[br]{$\scriptstyle a)$}}
%
\par\end{picture}\end{center}\end{figure}

Sequence in time: a) b) c), are mapped on links 3, 2 and 1.

Result:
$ s_0 = [(a_0+a_4)+(a_2+a_6)] + [(a_1+a_5)+(a_3+a_7)]$
$ \;\;\; \vdots$
$ s_7 = [(a_0+a_4)+(a_2+a_6)] + [(a_1+a_5)+(a_3+a_7)]$
$ \;\;\; \Downarrow$
$ s_i = \sum\limits_{i=0}^{p-1} a_i \;\;\; \forall i=0,p-1$


Realization via function {\sc Cube\_Do}$\begin{smallmatrix}I \ D \ ensuremath{\mathbb{R}} \end{smallmatrix} $($n$,$X$,$Y$,$H$,$VtOp$).



Comparison of variants a) and b)

  CUBE/SMALL>_DO  TREE/SMALL>_DO
$ \bullet$ equivalent arithmetical costs  
$ \bullet$ nCube bi-directional communications or 2$ \ast$nCube uni-directional communications  2$ \ast$nCube uni-directional communications
$ \bullet$ same bandwidth on all links required  decreasing bandwith required with increasing/decreasing link number
$ \bullet$ best choice if all links have the same bandwidth Ideal bei gleicher Bandbreite aller Links  Selection of the tree embedded in the hypercube with respect to the bandwidth of the links
$ \bullet$ best in a physical hypercube, e,g, NCUBE2  With larger systems technologically caused smaller bandwidths can occur on some links, e.g., POWER-GC


Ex.: Links in POWER-GC

The following relations hold in a POWER-GC by PARSYTEC (64 processors or more) :

Links 1-4 : bandwidth $ a$
Links 5-6 : bandwidth $ b$
Links 7-8 : bandwidth $ c$


with $ a > b > c$. Since the higher links have a smaller bandwidth, they should be used with the TREE/SMALL>_.. operation in the optimal tree close to the root process. There only a few processes have to communicate to each other.
Test have been performed on a POWER-GC with 128 processors at the TU Chemnitz by Dr. M.Pester resulted in a significantly better run time behavior ( $ \leftarrow$ communication time) of the accordingly chosen TREE/SMALL>_DO routines in relation to the equivalent CUBE/SMALL>_DO routines.
The nasty problem of the different bandwidths in the CUBE/SMALL>_DO routine is closely connected with the

Bisection bandwidth :
Available bandwidth, if one half of the processes communicates with the other half.

The bisection bandwidth is anisotrop in the example above, i.e., it depends on which parts of the computer want to communicate to each other. This problem also occures in the Origin2000 with more than 32 processes ([SGI97], page 34ff).
In an idealistic system, the bisection bandwidth grows with the number of processors.


Exercise 10:
Change algorithm for TREE/SMALL>_DOWN at page [*] such that links with larger numbers transport as little data as possible.

next up previous contents
Next: 3.3.5 Barrier Up: 3.3 Basic global operations Previous: 3.3.3 Broadcast   Contents
Gundolf Haase 2000-03-20