High Performance Computing (Master Course)

Contents:

We will start with an introduction to the basic principles and algorithms of hardware-aware and parallel computing, followed by transferring selected algorithms onto many-core architectures.

Lecturer:

Prof. Gundolf Haase, Heinrichstr. 36, Zi 506, Tel. 5178,

Appointments:

Wednesday 8:15 - 9:45 in Heinrichstr. 36, SR 11.33

Contents:

Introduction to hardware and parallel concepts: pdf.
Iterative Methods: pdf.
Intro to Finite Elements: pdf.
Parallel Finite Elements (OpenMP, MPI): pdf. Commented code for OpenMP and for MPI.
Geometric Multigrid: pdf.
Algebraic Multigrid: pdf.
Cardiovascular applications: pdf.
Parallel improvements: pdf.
Non-linear problems: pdf.
MPI+X: pdf.

Exercises (Follow the links in the pdf document):

Recover your C++: Exercise 1.

Download: utils (doc), example skalar.
Some explanation of the example structure (makefile etc.).

Sequential programming: Exercise 2. Templates: skalar (tar, doc), jacobi (tar, doc), vector-matrix (tar, doc)
Workaround for slow MKL on AMD hardware.
Benchmark results:
Shared memory programming (OpenMP): Exercise 3.
Benchmark results: (ods, pdf),
Distributed memory programming (MPI): Exercise 4, see also hint and Intro MPI, parallel algebra.
Results:
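
For the shared-memory exercise (Exercise 3), a scalar product is a natural first loop to parallelize. The following is only a minimal sketch and not the course's skalar template; the function name dot_product is hypothetical. It assumes a compiler with OpenMP support (e.g. g++ -fopenmp); without that flag the pragma is ignored and the loop simply runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the course templates):
// scalar (dot) product of two equally sized vectors.
double dot_product(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    double sum = 0.0;

    // Each thread accumulates a private partial sum; the reduction clause
    // combines the partial sums into 'sum' at the end of the loop.
#pragma omp parallel for reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the order in which the partial sums are combined depends on the number of threads, the result may differ from the sequential one in the last bits for general floating-point data.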

Hardware (login from outside KFU only via VPN):

Mephisto at IMSC; try the Jupyter interface.
Clusters in Graz: sauron (queuing system); hostfile, see hints.
Remote login to servers:

VPN to KFU is needed: install the AnyConnect software via the VPN Service (configure the server as https://univpn.uni-graz.at; login: KFU e-mail)
Linux: use ssh -X 143.50.47.xxx to connect to the compute server
Windows: Install WinSSHTerm with a guided installation of further packages (PuTTY, WinSCP, X server)

Software:

Ubuntu and Windows (thanks to Simon Stadler):

Installing Ubuntu on Windows; Update Oct. 2023
Accessing files on hard drive
WSL FAQ

Intel Parallel Studio XE (free for students on Windows/Linux/macOS)
Linux with gcc. Install the most recent GNU compiler.
Linux with llvm: clang++ (and alternatives with g++).
Makefiles (tutorial, specific)
[AMD compiler: AOCC]
clang-tidy, cppcheck, SonarCloud.

Material:

Link for the impatient.
Winter/Summer School 2019.
Course I held in Chile, its presentations, the fast entry.
Using BLAS, and other linear algebra packages.
OpenMP: 5.1, Quick Reference, LLNL tutorial on OpenMP, nice tutorial, guide to OpenMP; German tutorial; compiler
OpenMP Accelerator Offload (Intel)
SYCL,
MPI: OpenMPI(home, doc), LLNL-tutorial on MPI
Intel optimized BLAS/Lapack: MKL
AMD optimized BLAS: ACML (good documentation)
valgrind on MPI, see §4.9 of manual.
likwid (code, wiki)
use of restrict,
Dining philosophers problem
Some words on Linux/Unix.
R. Grimm: C++ Core Guidelines: Mehr Fallen in der Concurrency (in German; "More Traps in Concurrency")
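
Regarding the "use of restrict" entry above: ISO C++ has no restrict keyword, but g++ and clang++ accept __restrict__ (MSVC spells it __restrict). The sketch below is an illustration under that assumption and not code from the linked material; the macro HPC_RESTRICT and the functions axpy_kernel and axpy are hypothetical names. The qualifier is a promise by the caller that x and y do not overlap, which can allow the compiler to vectorize the loop without runtime alias checks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical portability macro: restrict is a compiler extension in C++.
#if defined(_MSC_VER)
#define HPC_RESTRICT __restrict
#else
#define HPC_RESTRICT __restrict__
#endif

// y := y + alpha * x. The caller guarantees that x and y do not overlap.
void axpy_kernel(std::size_t n, double alpha,
                 const double* HPC_RESTRICT x, double* HPC_RESTRICT y)
{
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}

// Convenience wrapper that checks the preconditions in debug builds.
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y)
{
    assert(x.size() == y.size());
    assert(&x != &y);  // distinct vectors, so the no-overlap promise holds
    axpy_kernel(x.size(), alpha, x.data(), y.data());
}
```

Passing overlapping arrays to axpy_kernel would break the promise and leads to undefined behavior, so the qualifier should only be used where the calling code can guarantee disjoint storage.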

Further Links

Agner Fog: Software optimization;
SIMD instruction list. Intel Intrinsics Guide.
Top 500 [Nov 2019]: ARM A64FX with 16.9 GFLOPS/Watt (+Nvidia), Groq,
Top 5 [June 2020]
Wiki for semiconductors ( AMD Zen (Epyc, EPYC 7551, Threadripper), Intel Coffee Lake, Skylake, Haswell (i7-4770), Alder Lake)
AI chip TPU by Google
Vienna Supercomputer [VSC-4, VSC-3]
Chinese #1: Sunway TaihuLight [May 2016]
Intel Xeon E7-8890-v4 (24 x 2.20 GHz, 60 MB Cache, 102 GB/s bandwidth, 844 GFLOPS(D)) [June '16: Intel, press, info]
Intel® Xeon® Phi Knights Landing (72 cores, 1.5 GHz, 16 GB MCDRAM, 490/102 GB/s bandwidth, 3 TFLOPS(D)) [June '16: Intel, heise]
Intel Xeon W-3175X (28 x 3.1 GHz, 38.5 MB Cache) [Oct'18: info]
AMD Ryzen Threadripper (32 x 3.0 GHz, 80 MB Cache) [Aug.'18, heise]
NVIDIA: Pascal-Architecture, Titan X; 12 GB with 480 GB/s, 11 TFLOPS (single; double 1/32).
2020: Nvidia Ampere (1,2,3, RTX 3090, ditto; 36 Shader-Teraflops, 69 Raytracing-Teraflops, 285 Tensor-Teraflops)
2020: Arm (Ampere Altra, Neoverse, 72-core)
Top 500 (2020: Fugaku, 2022: Frontier)

Books

Sterling/Anderson/Brodowicz: "High Performance Computing", Elsevier, 2018 (e-book)
Thomas Rauber and Gudula Rünger: "Parallel Programming: for Multicore and Cluster Systems", Springer, Berlin, 2013, 3rd edition (e-book, 1st ed: e-book)
Craig C. Douglas and Gundolf Haase and Ulrich Langer: "A Tutorial on Elliptic PDE Solvers and their Parallelization", SIAM, 2003 (e-book)
B1, B2
Bartłomiej Filipek: C++17 in Detail.
C++20: official standard, blog by Rainer Grimm

Nov 6, 2024