Introduction to recent processor development
An introduction to recent processor development and its keywords. These
include superscalarity, vector units, threads, caches and NUMA on a single
CPU.
Examples of recent HPC processors:
- Xeon E7-8870-v3 by Intel (block-diagram, p.16, p.21) (pictures 1, 2)
- 12 cores; 96 GFLOPS (peak); 43 GByte/sec; 130 Watt
- Xeon E7-8890-v4 by Intel (block-diagram)
- 2.20 GHz, 60 MB L3-Cache, 102 GB/s bandwidth, 844 GFLOPS(D)
- Xeon Phi 5110P by Intel (block-diagram, p.13, p.17, p.34, p.37, p.45) (pictures 1, 2)
- 60 cores; 320 GByte/sec; 1 TFLOPS (peak); 300 Watt
- Intel® Xeon® Phi Knights Landing (heise, info)
- 72 cores, 1.5 GHz, 16 GB MCDRAM, 490/102 GB/s bandwidth, 3 TFLOPS(D)
- Tesla K20X by NVIDIA (block-diagram; outside p.6)
- 14×192 cores; 1.31 TFLOPS-DP; 250 GB/s; 235 Watt
- NVIDIA: Pascal architecture
- GTX 1080 (info): 8 GB with 320 GB/s, 8.9 TFLOPS (single; double 1/32)
- Titan X: 12 GB with 480 GB/s, 11 TFLOPS (single; double 1/32)
- Tesla P100: 16 GB with 720 GB/s, 4.7 TFLOPS(D) (info)
- Firepro S10000 by AMD (official)
- 3584 cores; 1.48 TFLOPS-DP; 480 GB/s; 375 Watt
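The peak figures in the list above can be cross-checked with the usual rule of thumb: peak FLOPS = cores × clock rate × floating-point operations per cycle and core. The following is a minimal sketch, not a vendor formula. The function names are hypothetical. The core count of 24 for the Xeon E7-8890-v4 and the FLOPs-per-cycle values (16 for AVX2 with two FMA units, 32 for AVX-512 with two vector units, both in double precision) are assumptions that do not appear in the list. Real chips may also clock lower under heavy vector load, so these are upper bounds.

```python
# Back-of-the-envelope check of two peak figures from the list above.
# Core count (Xeon) and FLOPs per cycle are assumptions, not taken from the list.

def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores x clock rate x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


def bytes_per_flop(bandwidth_gb_s, gflops):
    """Machine balance: memory bytes that can be delivered per floating-point operation."""
    return bandwidth_gb_s / gflops


# Xeon E7-8890-v4: assumed 24 cores; 2 FMA units x 4 doubles (AVX2) x 2 ops = 16 FLOPs/cycle
xeon = peak_gflops(cores=24, clock_ghz=2.2, flops_per_cycle=16)

# Knights Landing: 72 cores; assumed 2 AVX-512 units x 8 doubles x 2 ops = 32 FLOPs/cycle
knl = peak_gflops(cores=72, clock_ghz=1.5, flops_per_cycle=32)

print(f"Xeon E7-8890-v4: {xeon:.1f} GFLOPS, {bytes_per_flop(102, xeon):.3f} bytes/FLOP")
print(f"Knights Landing: {knl:.1f} GFLOPS, {bytes_per_flop(490, knl):.3f} bytes/FLOP (MCDRAM)")
```

Under these assumptions the Xeon comes out at roughly 845 GFLOPS and Knights Landing at roughly 3.5 TFLOPS, in line with the listed values. The machine balance is far below one byte per operation in both cases, which is why memory access and caches matter so much in what follows.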
Contents:
- We will start with the classical von Neumann computer
- Concepts in one core
- FLOPS
- Superscalarity (multiple functional units)
- Pipelining
- Out-of-order execution
- Speculative execution
- SSE units (AVX, AVX2, AVX-512)
- Memory
- again latency and bandwidth
- DRAM
- SRAM
- Caches [Rauber §2.7]
- NUMA
- CPU speed vs. memory speed
- Moore's law
- clock rate vs. energy consumption
- multiple cores
- ccNUMA
- BBC report on the transputer in 1986
Demonstration:
- Scalar product: vectorization, loop unrolling
- Matrix-vector product: row-wise vs. column-wise access, tiling
- Profiling
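The loop transformations named in the demonstration list can be sketched as follows. This is an illustration of the loop structure only, not the code used in the demonstration. All function names, the row-major flat storage of the matrix, and the tile size are assumptions. In Python the variants will not show the speed differences one measures in a compiled language. The point is the access pattern: independent partial sums in the unrolled scalar product, stride 1 versus stride n in the matrix-vector product, and column blocks that reuse a small part of x in the tiled version.

```python
def dot_simple(x, y):
    """Scalar product with a single accumulator (one long dependency chain)."""
    s = 0.0
    for i in range(len(x)):
        s += x[i] * y[i]
    return s


def dot_unrolled4(x, y):
    """Scalar product, 4-fold unrolled with four independent partial sums."""
    n = len(x)
    n4 = n - n % 4                      # largest multiple of 4 not exceeding n
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, n4, 4):
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
    for i in range(n4, n):              # remainder loop for the last n % 4 entries
        s0 += x[i] * y[i]
    return (s0 + s1) + (s2 + s3)


def matvec_rowwise(a, x, m, n):
    """y = A x with A stored row-major in a flat list; inner loop has stride 1."""
    y = [0.0] * m
    for i in range(m):
        s = 0.0
        for j in range(n):
            s += a[i * n + j] * x[j]
        y[i] = s
    return y


def matvec_columnwise(a, x, m, n):
    """Same product, but the inner loop walks down a column (stride n in memory)."""
    y = [0.0] * m
    for j in range(n):
        xj = x[j]
        for i in range(m):
            y[i] += a[i * n + j] * xj
    return y


def matvec_tiled(a, x, m, n, tile=2):
    """Same product, processed in column blocks of width `tile` to reuse part of x."""
    y = [0.0] * m
    for jj in range(0, n, tile):
        j_end = min(jj + tile, n)
        for i in range(m):
            s = y[i]
            for j in range(jj, j_end):
                s += a[i * n + j] * x[j]
            y[i] = s
    return y


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0]                 # 2 x 3 matrix, row-major
    x = [1.0, 2.0, 3.0]
    print(matvec_rowwise(a, x, 2, 3), matvec_columnwise(a, x, 2, 3), matvec_tiled(a, x, 2, 3))
    print(dot_simple(x, x), dot_unrolled4(x, x))
```

All variants compute the same mathematical result, but they add the terms in a different order. For general floating-point data the results may therefore differ in the last bits, so a comparison should use a tolerance rather than exact equality.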