3.3.4 Reduce- and All-Reduce operations

Next: 3.3.5 Barrier Up: 3.3 Basic global operations Previous: 3.3.3 Broadcast Contents

3.3.4 Reduce- and All-Reduce operations

Here, we perform an arithmetical/logical operation on the distributed data of all processes (e.g., $s = \sum\limits_{i=0}^{p-1} a_i$ ). At the end, the result is available on one root process (Reduce operation) or on all processes (All-Reduce).

$\bullet$ Reduce : We use the tree topologie with root

**Figure 3.14:** Reduce operation in binary tree
$\begin{figure}\unitlength0.04\textwidth \begin{center} \begin{picture}(14,9)(0... ...5}} \put(5.9,8.4){\makebox(0,0)[bl]{c)}} \end{picture} \end{center} \end{figure}$

Sequence a), b), c): a) = link 3, b) = link 2, c) = link 1

$s_0 := [(a_0+a_4)+(a_2+a_6)] + [(a_1+a_5) + (a_3+a_7)] = \sum\limits_{i=0}^{p-1} a_i$

Thus only process

possesses the correct result, all other processes have only partial results at the end.

Realization via the function TREE/SMALL>_UPDO(

), which can also handle vectors and operations

on them in the parameter list (in example above the call is : CALL Tree_UpDo(1,S,A,AuxArray,VdPlus)).

$\bullet$ All-Reduce

Variant a): Combines a Reduce with a Scatter operation.
$\left.{\begin{array}{l} \mbox{{\sc Tree\_UpDo}} \ \mbox{{\sc Tree\_Down}}\;\;... ...Tree\_Do}}\begin{smallmatrix}I \ D \ ensuremath{\mathbb{R}}\end{smallmatrix}$

Variant b): Accumulation in hypercube (Cube accumulation)
In the cube accumulation, one half of the processes (sub-cube of dimension nCube-1) exchanges data with the appropriate processes of the second sub-cube. This exchange will be repeated nCube-times between all proper sub-cube combinations.

**Figure 3.15:** All-Reduce in hypercube
$\begin{figure}\unitlength0.04\textwidth \begin{center} \begin{picture}(11,11) ... ...akebox(0,0)[br]{$\scriptstyle a)$}} % \par\end{picture}\end{center}\end{figure}$

Sequence in time: a) b) c), are mapped on links 3, 2 and 1.

Result:

$\;\;\; \vdots$

$\;\;\; \Downarrow$
$s_i = \sum\limits_{i=0}^{p-1} a_i \;\;\; \forall i=0,p-1$

Realization via function ${\sc Cube\_Do}$\begin{smallmatrix}I \ D \ ensuremath{\mathbb{R}} \end{smallmatrix} $($n$,$X$,$Y$,$H$,$VtOp$).$

Comparison of variants a) and b)

	CUBE/SMALL>_DO	TREE/SMALL>_DO
$\bullet$	equivalent arithmetical costs
$\bullet$	`nCube` bi-directional communications or 2 $\ast$ `nCube` uni-directional communications	2 $\ast$ `nCube` uni-directional communications
$\bullet$	same bandwidth on all links required	decreasing bandwith required with increasing/decreasing link number
$\bullet$	best choice if all links have the same bandwidth Ideal bei gleicher Bandbreite aller Links	Selection of the tree embedded in the hypercube with respect to the bandwidth of the links
$\bullet$	best in a physical hypercube, e,g, NCUBE2	With larger systems technologically caused smaller bandwidths can occur on some links, e.g., POWER-GC

Ex.: Links in POWER-GC

The following relations hold in a POWER-GC by PARSYTEC (64 processors or more) :

Links 1-4	:	bandwidth
Links 5-6	:	bandwidth
Links 7-8	:	bandwidth

with

. Since the higher links have a smaller bandwidth, they should be used with the TREE/SMALL>_.. operation in the optimal tree close to the root process. There only a few processes have to communicate to each other.
Test have been performed on a POWER-GC with 128 processors at the TU Chemnitz by Dr. M.Pester resulted in a significantly better run time behavior ( $\leftarrow$ communication time) of the accordingly chosen TREE/SMALL>_DO routines in relation to the equivalent CUBE/SMALL>_DO routines.

The nasty problem of the different bandwidths in the CUBE/SMALL>_DO routine is closely connected with the

Bisection bandwidth :
Available bandwidth, if one half of the processes communicates with the other half.

The bisection bandwidth is anisotrop in the example above, i.e., it depends on which parts of the computer want to communicate to each other. This problem also occures in the Origin2000 with more than 32 processes ([SGI97], page 34ff).
In an idealistic system, the bisection bandwidth grows with the number of processors.

Exercise 10:

Change algorithm for TREE/SMALL>_DOWN at page

such that links with larger numbers transport as little data as possible.

Next: 3.3.5 Barrier Up: 3.3 Basic global operations Previous: 3.3.3 Broadcast Contents

Gundolf Haase 2000-03-20