Vectorization approaches to efficient code
Modern CPUs are equipped with SIMD vector extensions such as SSE and Advanced Vector Extensions (AVX). This means that the floating point unit (FPU) on each core of the CPU is capable of executing most operations (like +, -, *, /, but also exp, sqrt, sin, ...) simultaneously on several arguments rather than on a single argument. This is known as SIMD vectorization (single instruction, multiple data). On older CPUs, e.g. Westmere (Turing), the SIMD registers are 128 bits wide, allowing a vector instruction to operate on 4 single precision numbers, or 2 double precision numbers, simultaneously. On more recent CPUs, e.g. Ivy Bridge EP (Hopper), the SIMD registers are 256 bits wide, so a single instruction can operate on 8 single precision or 4 double precision numbers simultaneously.
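The element counts quoted above follow from dividing the register width by the width of one number (32 bits for single precision, 64 bits for double precision). As a minimal sketch, written in C purely for illustration (the case study itself is in FORTRAN) and using the hypothetical names `simd_lanes` and `add_arrays`, the arithmetic and the kind of element-wise loop a compiler can map onto SIMD instructions look roughly like this:

```c
#include <stddef.h>

/* Number of elements that fit in one SIMD register.
   Hypothetical helper, for illustration only. */
size_t simd_lanes(size_t register_bits, size_t element_bits)
{
    return register_bits / element_bits;
}

/* Element-wise addition: every iteration is independent of the others,
   so an optimizing compiler may process several elements per instruction.
   Whether it actually does so depends on the compiler and its flags. */
void add_arrays(const float *restrict a, const float *restrict b,
                float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Here `simd_lanes(128, 32)` gives 4 and `simd_lanes(256, 64)` gives 4, matching the single precision count for the 128-bit registers and the double precision count for the 256-bit registers.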
As usual, there is no free lunch: developers need to ensure that certain conditions are fulfilled before the compiler can make use of these highly efficient SIMD instructions. Failing to meet these conditions can reduce the performance of your program by up to a factor of 8 (single precision) or 4 (double precision) on 256-bit hardware, which is hard to tolerate for codes running for days on expensive supercomputers.
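This abstract does not list the conditions themselves. One commonly cited example, offered here as an assumption about the kind of condition meant rather than as the session's own material, is that loop iterations must not depend on each other. A minimal sketch in C (the function names `scale_array` and `prefix_sum` are hypothetical):

```c
#include <stddef.h>

/* Independent iterations: y[i] depends only on x[i],
   so a compiler is free to compute several elements at once. */
void scale_array(const double *restrict x, double *restrict y,
                 double factor, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = factor * x[i];
    }
}

/* Loop-carried dependence: x[i] needs the freshly updated x[i - 1],
   so the iterations cannot simply be executed side by side. */
void prefix_sum(double *x, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        x[i] += x[i - 1];
    }
}
```

Many compilers can report which loops they vectorized; with GCC, for instance, the `-fopt-info-vec` option prints such a report, which is one way to check a loop like the first one above.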
In this second session of HPC Tips&Tricks you will learn what these conditions are and how to meet them. The principles will be demonstrated with a case study of an in-house developed FORTRAN physics code that computes the magnetization of bulk ferromagnets. Initially this code failed to vectorize, but with a few very simple tricks its performance was improved by a factor of 40.