Current
CPU (processor) and GPU (graphics) architectures are able to perform
many more floating-point operations per second than typical numerical
codes actually achieve. Processors rely heavily on data and instruction
parallelism at several levels, yet standard programming approaches
cannot fully exploit this parallelism. Moreover, numerical algorithms
on these systems tend to be limited by memory bandwidth.
In addition to algorithmic optimizations, we will compare different
programming techniques, including Java and NVIDIA CUDA, in a numerical
Android app. Furthermore, we consider a given finite difference stencil
kernel and discuss cache-aware algorithms and vectorization strategies,
including non-standard memory layouts.
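
To illustrate the kind of kernel in question, the following Java sketch shows a cache-blocked 5-point finite difference stencil on a row-major grid. It is only a minimal sketch: the stencil shape, the blocking scheme, and all names are illustrative assumptions, not the project's actual kernel.

// Sketch of a cache-blocked 5-point finite difference stencil (assumed
// example, not the kernel studied in this work).
public final class StencilSketch {
    // in, out: n x n grids stored row-major; b: tile (block) size.
    public static void step(double[] in, double[] out, int n, int b) {
        // Traverse the grid in b x b tiles so each tile stays resident
        // in cache while its interior points are updated.
        for (int ii = 1; ii < n - 1; ii += b) {
            for (int jj = 1; jj < n - 1; jj += b) {
                int iMax = Math.min(ii + b, n - 1);
                int jMax = Math.min(jj + b, n - 1);
                for (int i = ii; i < iMax; i++) {
                    for (int j = jj; j < jMax; j++) {
                        // Classic 5-point Laplace stencil.
                        out[i * n + j] = 0.25 * (in[(i - 1) * n + j]
                                               + in[(i + 1) * n + j]
                                               + in[i * n + j - 1]
                                               + in[i * n + j + 1]);
                    }
                }
            }
        }
    }
}

The tiled traversal is one example of the cache-aware strategies mentioned above; changing the memory layout of in and out (for example to a blocked layout) is where the non-standard layouts come into play.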