Current
CPU (processor) and GPU (graphics) architectures are able to perform
many more floating-point operations per second than typical numerical
codes actually achieve. Processors rely heavily on data and instruction
parallelism at several levels, yet standard programming approaches
cannot fully exploit this parallelism. Moreover, numerical algorithms
on these systems tend to be limited by memory bandwidth.
In addition to algorithmic optimizations, we will compare different
programming techniques, including Java and NVIDIA CUDA, in a numerical
Android app. Furthermore, we consider a given finite difference stencil
kernel and discuss cache-aware algorithms and vectorization strategies,
including non-standard memory layouts.
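
To illustrate the kind of kernel in question, the following Java sketch shows a cache-blocked 5-point finite difference stencil on a row-major grid. It is only a minimal sketch: the stencil shape, the blocking scheme, and all names are illustrative assumptions, not the project's actual kernel.

// Sketch of a cache-blocked 5-point finite difference stencil (assumed
// example, not the kernel studied in this work).
public final class StencilSketch {
    // in, out: n x n grids stored row-major; b: tile (block) size.
    public static void step(double[] in, double[] out, int n, int b) {
        // Traverse the grid in b x b tiles so each tile stays resident
        // in cache while its interior points are updated.
        for (int ii = 1; ii < n - 1; ii += b) {
            for (int jj = 1; jj < n - 1; jj += b) {
                int iMax = Math.min(ii + b, n - 1);
                int jMax = Math.min(jj + b, n - 1);
                for (int i = ii; i < iMax; i++) {
                    for (int j = jj; j < jMax; j++) {
                        // Classic 5-point Laplace stencil.
                        out[i * n + j] = 0.25 * (in[(i - 1) * n + j]
                                               + in[(i + 1) * n + j]
                                               + in[i * n + j - 1]
                                               + in[i * n + j + 1]);
                    }
                }
            }
        }
    }
}

The tiled traversal is one example of the cache-aware strategies mentioned above; changing the memory layout of in and out (for example to a blocked layout) is where the non-standard layouts come into play.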