[updated on April 6, 2016]
Software-based data prefetch on Intel MIC coprocessors is very useful for Monte Carlo transport code. It helps hide the long latency when loading microscopic cross-section data from DRAM. There are a total of 8 different types of prefetch with subtle differences. Here we tell them apart.
Cache hierarchy
A MIC has a 32 KB L1 cache and a 512 KB L2 cache per core. Here by “cache” we mean the data cache rather than the instruction cache, and by “core” we mean the physical core rather than the logical core. Both levels of cache implement the MESI coherency protocol and have a cache line size of 64 bytes (i.e., 8 consecutive FP64 values).
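Since one prefetch brings in one cache line, it helps to know how many lines a block of data spans. The following is a minimal sketch of that arithmetic, assuming the 64-byte line size stated above; the helper names (`doubles_per_line`, `cache_line_index`, `lines_touched`) are made up for illustration and are not part of any Intel API.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64  /* line size stated above for both L1 and L2 */

/* Number of FP64 values that fit in one cache line. */
static size_t doubles_per_line(void) {
    return CACHE_LINE_BYTES / sizeof(double);
}

/* Index of the cache line that contains the byte at address p. */
static uintptr_t cache_line_index(const void *p) {
    return (uintptr_t)p / CACHE_LINE_BYTES;
}

/* Number of distinct cache lines covered by n consecutive doubles
 * starting at p, i.e. how many prefetches would be needed to cover them. */
static size_t lines_touched(const double *p, size_t n) {
    if (n == 0) return 0;
    uintptr_t first = cache_line_index(p);
    uintptr_t last  = cache_line_index((const char *)(p + n) - 1);
    return (size_t)(last - first + 1);
}
```

An 8-value record that starts on a line boundary needs a single prefetch, while the same record shifted by one element straddles two lines and needs two.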
Prefetch instruction
Let’s take a look at two orthogonal concepts first:
- non-temporal hint (NTA): tells the processor that the data will be used only once, so the line is evicted from the cache after its first use (the most recently used data is evicted first).
- exclusive hint (E): puts the cache line on the current core in the “exclusive” state and invalidates the copies of that line on other cores.
The combination of temporality, exclusiveness, and locality (L1 or L2) together yields 8 types of instructions supported by the present-day Knights Corner MIC. They specify how the data are expected to be uniquely handled in the cache, enumerated below.
instruction | hint | purpose |
vprefetchnta | _MM_HINT_NTA | loads data to L1 and L2 cache, marks it as NTA |
vprefetch0 | _MM_HINT_T0 | loads data to L1 and L2 cache |
vprefetch1 | _MM_HINT_T1 | loads data to L2 cache only |
vprefetch2 | _MM_HINT_T2 | loads data to L2 cache only, marks it as NTA (the mnemonic is counter-intuitive, as there is no NTA in it) |
vprefetchenta | _MM_HINT_ENTA | exclusive version of vprefetchnta |
vprefetche0 | _MM_HINT_ET0 | exclusive version of vprefetch0 |
vprefetche1 | _MM_HINT_ET1 | exclusive version of vprefetch1 |
vprefetche2 | _MM_HINT_ET2 | exclusive version of vprefetch2 |
Note that the L2 cache of the MIC is inclusive, in the sense that it holds a copy of all the data in L1.
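The table above is a 2 x 2 x 2 combination, which can be made explicit in code. The sketch below composes the instruction mnemonic from the three attributes; `knc_prefetch_name` is a hypothetical helper written for this illustration, not something provided by Intel.

```c
#include <string.h>

/* Hypothetical helper, not an Intel API: writes the Knights Corner prefetch
 * mnemonic for the given attributes into out (at least 16 bytes).
 *   into_l1      : nonzero = load to L1 and L2, zero = load to L2 only
 *   non_temporal : nonzero = mark the line as NTA
 *   exclusive    : nonzero = request the line in the exclusive state */
static void knc_prefetch_name(char out[16], int into_l1,
                              int non_temporal, int exclusive) {
    /* Suffix indexed by [into_l1][non_temporal], as listed in the table. */
    static const char *const suffix[2][2] = {
        { "1", "2"   },  /* L2 only:   temporal, non-temporal */
        { "0", "nta" },  /* L1 and L2: temporal, non-temporal */
    };
    strcpy(out, "vprefetch");
    if (exclusive) {
        strcat(out, "e");
    }
    strcat(out, suffix[into_l1 != 0][non_temporal != 0]);
}
```

Written this way, the odd one out is easy to spot: the non-temporal L2 variant gets the suffix `2` rather than anything containing `nta`.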
There are two ways of implementing prefetch in C: intrinsic and assembly.
// method 1: intrinsic
_mm_prefetch((const char*)addr, hint);
// method 2: assembly
asm volatile ("prefetch_inst [%0]"::"m"(addr));
Here addr is the address of the first byte to prefetch, prefetch_inst is one of the prefetch instructions listed above, and hint is the corresponding parameter for the compiler intrinsic. We would like to emphasize again that _MM_HINT_T2 and _MM_HINT_ET2 are counter-intuitive. In fact they are misnomers, as both are non-temporal. Intel should have named them _MM_HINT_NTA2 and _MM_HINT_ENTA2.
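To show where the intrinsic sits in a real loop, here is a minimal sketch of an indexed table lookup, loosely in the spirit of a cross-section lookup, that prefetches a fixed number of iterations ahead. The function name `gather_sum`, the `dist` parameter, and the choice of `<immintrin.h>` as the header are assumptions made for this example; a good prefetch distance has to be found by measurement.

```c
#include <stddef.h>
#include <immintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Sum table[idx[i]] for i in [0, n). While working on element i, request
 * the cache line holding the element needed dist iterations later, so that
 * its memory latency overlaps with the current work. */
static double gather_sum(const double *table, const size_t *idx,
                         size_t n, size_t dist) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + dist < n) {
            /* Temporal prefetch; the hint must be a compile-time constant. */
            _mm_prefetch((const char *)&table[idx[i + dist]], _MM_HINT_T0);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

A prefetch is only a hint, so the result is the same for any `dist`; only the timing changes. Swapping `_MM_HINT_T0` for another hint from the table selects a different instruction.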
Prefetch on CPUs
So how about prefetch on Intel Xeon CPUs? Well, it turns out to be very different! Check the list below.
instruction | hint | purpose |
prefetchnta | _MM_HINT_NTA | loads data to L2 and L3 cache, marks as NTA |
prefetcht0 | _MM_HINT_T0 | loads data to L2 and L3 cache |
prefetcht1 | _MM_HINT_T1 | equivalent to prefetcht0 |
prefetcht2 | _MM_HINT_T2 | equivalent to prefetcht0 |
prefetchw | n/a[2] | exclusive version of prefetcht0 [1] |
prefetchwt1 | n/a[3] | equivalent to prefetchw [1] |
[1] not confirmed
[2] icpc does not compile _mm_prefetch((const char*)addr, _MM_HINT_ENTA);
[3] icpc does not compile _mm_prefetch((const char*)addr, _MM_HINT_ET1);
Note that the L3 cache of the Intel Xeon CPU is inclusive, in the sense that it holds a copy of all the data in L2.
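Because the exclusive hints are not accepted when compiling for the CPU, code shared between the two targets needs a guard. One possible sketch is shown below; it relies on the `__MIC__` macro that the Intel compiler defines when targeting the coprocessor, while the macro name `PREFETCH_FOR_WRITE` and the function `scale_in_place` are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <immintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Hypothetical wrapper: request the line in the exclusive state on the MIC,
 * and fall back to a plain temporal prefetch on the CPU. */
#if defined(__MIC__)
#define PREFETCH_FOR_WRITE(p) _mm_prefetch((const char *)(p), _MM_HINT_ET0)
#else
#define PREFETCH_FOR_WRITE(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#endif

/* Multiply every element by factor, prefetching the element that will be
 * written dist iterations later. */
static void scale_in_place(double *a, size_t n, double factor, size_t dist) {
    for (size_t i = 0; i < n; ++i) {
        if (i + dist < n) {
            PREFETCH_FOR_WRITE(&a[i + dist]);
        }
        a[i] *= factor;
    }
}
```

The loop is deliberately simple and only shows where the guard goes; whether an exclusive prefetch pays off depends on how much the line is shared between cores.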
Reference
[1] Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual, 2012.
[2] Intel 64 and IA-32 Architectures Software Developer’s Manual, 2015.