ThinkPad P50: mind your bindings!
Last time we were puzzled by the strange behavior of MPICH in one of the benchmarking runs - both in comparison to its normal self and to OpenMPI. The picture above highlights just how remarkable this behavior was (here, the logarithmic scaling of the message length axis was replaced by a linear one). It almost feels as if this were completely different hardware.
In fact, it was not. To understand why this happens, one has to look into the internals of the Intel Xeon E3-1535M v5 processor. It has 4 cores running 2 HW threads each (if the so-called Hyper-Threading is on). Using the output of a handy script
cat /proc/cpuinfo | egrep "processor|physical id|core id" |\
sed 's/^processor/\nprocessor/g'
one can build the following table of correspondence between the HW threads and the physical cores:
Thread | Core |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 0 |
5 | 1 |
6 | 2 |
7 | 3 |
(imagine that LinkedIn has finally decided to support markdown syntax).
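By the way, if parsing /proc/cpuinfo by hand feels tedious, the same thread-to-core mapping can typically be obtained in one go with lscpu from util-linux (a minimal alternative, assuming lscpu is installed):
lscpu --extended=CPU,CORE
This prints one line per HW thread together with the physical core it belongs to.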
In human terms, the table means that HW threads #0 and #4 share the same physical core, and so do HW threads #1 and #5, etc. Thus, these HW thread pairs share more than just the L3 cache (they also share the L1 and L2 caches of their core), and hence may enjoy substantially better small-message latency and bandwidth than the HW thread pairs that have to cross the core boundary during communication. So, since the MPI processes were not bound to any cores this time, their odd performance behavior may be explained by process migration.
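As a quick sanity check (not part of the original runs), one can watch where an unbound process actually executes via the psr field of ps, which reports the processor a process last ran on; here <PID> is just a placeholder for the process of interest:
watch -n 1 'ps -o pid,psr,comm -p <PID>'
For an unbound MPI rank, the psr value will typically wander across HW threads over time.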
Indeed, setting OpenMPI aside for a spell and playing with the MPICH process binding settings alone, we can get a confirmation of this hypothesis:
Here, the blue graph corresponds to the (cross-core) binding of MPI processes 0 and 1 to HW threads #0 and #1, respectively, while the yellow graph corresponds to the (intra-core) binding of these processes to HW threads #0 and #4. Apart from yielding an almost flat latency curve, the latter binding also leads to bandwidth that grows linearly:
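For reference, both placements can be requested explicitly from MPICH's Hydra launcher via its user-defined binding (a sketch, assuming a Hydra-based mpiexec; the benchmark executable name is just a placeholder):
# cross-core: rank 0 on HW thread #0, rank 1 on HW thread #1 (different physical cores)
mpiexec -n 2 -bind-to user:0,1 ./pingpong
# intra-core: rank 0 on HW thread #0, rank 1 on HW thread #4 (the same physical core)
mpiexec -n 2 -bind-to user:0,4 ./pingpong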
To sum up: when placing your MPI processes, mind their bindings! Otherwise your mileage may vary from run to run, as MPI processes may (and over time will) jump cores on a context switch. However, process binding is tricky and should be used with care, as we will see next time.
(to be continued)