Next: 4.1.2 Matrix-by-Vector operations (BLAS2)
Up: 4.1.1 Vector-by-Vector operations (BLAS1)
Previous: 4.1.1.1 Determining the inner
  Contents
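We split the vectors $x$, $y$ into $P$ disjoint subvectors $x_i$, $y_i$ of lengths $N_i$ ($i = 1, \ldots, P$) and store each pair on the appropriate processor $P_i$. Each processor computes the inner product of its local subvectors, and a global reduction sums the partial results:

DO i = 1, P IN PARALLEL
   s_i = 0
   DO j = 1, N_i
      s_i = s_i + x_i(j) * y_i(j)
   END DO
END DO
CALL ALL_REDUCE( … )

i.e., $s = \sum_{i=1}^{P} s_i$.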
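The scheme of local inner products followed by a global sum can be sketched in C. This is a serial stand-in under stated assumptions, not the course's reference solution: the names `local_dot` and `distributed_dot` are hypothetical, the blocks are assumed to be contiguous and of nearly equal length, and the loop over `i` merely imitates what the processes would do concurrently. On a real distributed-memory machine each process would call only `local_dot` on its own data, and the final summation would be carried out by a reduction such as `MPI_Allreduce` with `MPI_SUM`.

```c
#include <assert.h>
#include <stddef.h>

/* Local inner product of one subvector pair, as a single process would compute it. */
static double local_dot(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; ++j)
        s += x[j] * y[j];
    return s;
}

/* Serial stand-in for the parallel scheme (hypothetical helper):
   nproc imaginary processes each own one contiguous block of x and y;
   adding up the partial results plays the role of the global reduction. */
double distributed_dot(const double *x, const double *y, size_t n, size_t nproc)
{
    assert(nproc > 0);
    size_t base = n / nproc;   /* minimal block length            */
    size_t rest = n % nproc;   /* first 'rest' blocks get one more */
    size_t offset = 0;
    double s = 0.0;

    for (size_t i = 0; i < nproc; ++i) {
        size_t n_i = base + (i < rest ? 1 : 0);       /* length of block i   */
        s += local_dot(x + offset, y + offset, n_i);  /* partial result s_i  */
        offset += n_i;
    }
    return s;
}
```

For example, with `x = (1, 2, ..., 10)`, all entries of `y` equal to 2 and three blocks of lengths 4, 3 and 3, the function returns 110, the same value as the undistributed inner product.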
Exercise 11:
Write a small program that calculates the global inner product
of two vectors stored disjointly, i.e. distributed without overlap, on a distributed-memory machine.
Remark:
If one wants to create one's own vector library
that includes the BLAS1 routines as a subset, then the following two points are
advisable.
1st level of implementation:
A common tool for accelerating code is loop unrolling, which improves
the ratio of arithmetic/load/store operations to loop overhead.
Let us investigate an inner product with stride 1.
The best choice for the modulo (in example 4) is strongly hardware dependent
(number of pipes, software pipelining, caches).
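As an illustration of loop unrolling for a stride-1 inner product, here is a minimal C sketch. The function name `dot_unrolled4` is hypothetical, and the unrolling depth of 4 is an arbitrary assumption rather than a tuned value; as noted above, the best depth depends on the hardware. The entries left over when the length is not a multiple of 4 are handled in a short clean-up loop first.

```c
#include <stddef.h>

/* Inner product with stride 1, unrolled by an assumed (not tuned) depth of 4. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    size_t rem = n % 4;   /* number of entries that do not fill a group of 4 */
    double s = 0.0;

    /* Clean-up loop: process the first 'rem' entries one by one. */
    for (size_t j = 0; j < rem; ++j)
        s += x[j] * y[j];

    /* Main loop: four multiply-adds per iteration, so the loop overhead
       (increment, compare, branch) is paid once per four entries.
       Separate partial sums keep the four updates independent. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t j = rem; j < n; j += 4) {
        s0 += x[j]     * y[j];
        s1 += x[j + 1] * y[j + 1];
        s2 += x[j + 2] * y[j + 2];
        s3 += x[j + 3] * y[j + 3];
    }
    return s + (s0 + s1) + (s2 + s3);
}
```

Because the partial sums are added in a different order than in the plain loop, the result may differ from it in the last bits for general floating-point data.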
2nd level of implementation:
Use the same techniques as in the previous point, but call BLAS1 routines
whenever possible, for instance in VDPLUS( … ), i.e. … :

IF (adr( … ) == adr( … ) AND … ) THEN
   CALL DAXPY( … )
ELSE
   IF (adr( … ) == adr( … ) AND … ) THEN
      CALL DAXPY( … )
   ELSE
      Loop unrolling as in the 1st level
   END IF
END IF
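The dispatch idea can be sketched in C under explicit assumptions, since the arguments and conditions of the routine did not survive in the text. The sketch assumes that VDPLUS computes z := x + alpha*y, that the address tests check whether the result vector coincides with one of the inputs, and that the second test additionally requires alpha = 1. The names `vdplus` and `axpy_unit` are hypothetical; `axpy_unit` is a local stand-in with the unit-stride semantics of BLAS `DAXPY` (y := alpha*x + y), so that the example compiles without a BLAS library. With a BLAS installation one would call `DAXPY`, or `cblas_daxpy` from C, in its place.

```c
#include <stddef.h>

/* Stand-in with the unit-stride semantics of BLAS DAXPY: y := alpha*x + y. */
static void axpy_unit(size_t n, double alpha, const double *x, double *y)
{
    for (size_t j = 0; j < n; ++j)
        y[j] += alpha * x[j];
}

/* Hypothetical VDPLUS, assumed to compute z := x + alpha*y. */
void vdplus(size_t n, double *z, const double *x, double alpha, const double *y)
{
    if (z == x) {
        /* Result overwrites x: z := z + alpha*y is exactly an AXPY. */
        axpy_unit(n, alpha, y, z);
    } else if (z == y && alpha == 1.0) {
        /* Result overwrites y and alpha is 1: z := x + z is an AXPY with factor 1. */
        axpy_unit(n, 1.0, x, z);
    } else {
        /* General case: hand-written loop; the unrolled variant of the
           1st level would go here. */
        for (size_t j = 0; j < n; ++j)
            z[j] = x[j] + alpha * y[j];
    }
}
```

The benefit of this design is that the aliased cases are handed to a vendor-tuned routine, while only the case that BLAS1 does not cover directly falls back to one's own loop.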
Gundolf Haase
2000-03-20