next up previous contents
Next: 4.1.2 Matrix-by-Vector operations (BLAS2) Up: 4.1.1 Vector-by-Vector operations (BLAS1) Previous: 4.1.1.1 Determining the inner   Contents

4.1.1.2 Inner product on a parallel machine with distributed memory

We split vectors $ \underline{x}$, $ \underline{y}$ into $ P$ disjoint subvectors of lengths $ N_j$ ( $ \scriptstyle j=\overline{1,P} \makebox[0pt]{}$) and store each of them on the appropriate processor $ P_j$.

     DO IN PARALLEL 
$ j  =  1,  P $ 

$ s_j \;:=\; 0 $
DO $ i  =  1,  N_j $
$ s_j \;:=\; x_i \ast y_j $
END DO
END DO
CALL ALL/SMALL>_REDUCE($ s_j$) $ \longrightarrow$ $ s$ ,i.e. $ s = \sum\limits_{j=1}^P s_j$



Exercise 11:
Write a small program which calculates the global inner product of two disjoint stored vectors on a distributed memory machine.


Remark : If one wants to create his own vector library including BLAS1-routines as a subset, then the following two points are advisable.

1 $ ^{\text{st}}$ level of implementation :

A common tool for accelerating code is the loop unrolling, wherein the ratio between arithmetics/load/store operations and the loop overhead will be improved. Let us investigate an inner product with stride 1.

\fbox{\begin{minipage}[t]{0.3\textwidth}
$s \;=\; 0$ \\
DO $i = 1, N$ \\
\hspace*{3em} $ s\;=\;s + x_i \ast y_i $ \\
END DO
\end{minipage}} $ \longrightarrow$ \fbox{\begin{minipage}[t]{0.5\textwidth}
$m \;=\; N  \mbox{mod} 4$ \\
$s \...
...m} + x_{i+2} \ast y_{i+2} + x_{i+3} \ast y_{i+3} $ \\
END DO
\end{minipage}}

The best choice for the modulo (in example 4) is strongly hardware dependent (number of pipes $ \Longrightarrow$ software pipelining, caches).


2 $ ^{\text{nd}}$ level of implementation :

Use same techniques as in the previous point but use BLAS1-routines whenever possible, i.e., VDPLUS( $ n,X,ix,Y,iy,Z,iz$), d.h. $ \underline{x} = \underline{y}+\underline{z}$ .

    IF (adr($ X$)==adr($ Y$) AND $ ix==iy$) 

THEN CALL DAXPY( $ n,1d0,X,ix,Z,iz$)
ELSE IF (adr($ X$)==adr($ Z$) AND $ ix==iz$)
THEN CALL DAXPY( $ n,1d0,X,ix,Z,iz$)
ELSE Loop unrolling as in 1 $ ^{\text{st}}$ level
END IF
END IF



next up previous contents
Next: 4.1.2 Matrix-by-Vector operations (BLAS2) Up: 4.1.1 Vector-by-Vector operations (BLAS1) Previous: 4.1.1.1 Determining the inner   Contents
Gundolf Haase 2000-03-20