
Porting code from Nvidia GPUs to AMD: lessons learned

In the past several weeks I have been porting a codebase from the Nvidia CUDA platform to AMD HIP. Several critical issues were encountered: some were solved, some were attributed to compiler bugs, and some remain unfathomable to me.

There are 3 important things I have learned so far from the painstaking debugging process.

  • An unsigned integer with n bits only allows bitwise left shifts (<<) by 0 to n-1 bits; shifting by n or more bits is undefined behavior. On the Nvidia platform 0 bits are shifted in, whereas on AMD 1 bits are shifted in!!!
  • Currently there is a serious compiler bug: the wavefront vote function __any(pred), which is supposed to work like __any_sync(__activemask(), pred) in CUDA, yields incorrect results in divergent branches!!!
  • This one is very easy to miss: the parameter of the wavefront vote functions __any(pred), __all(pred), etc. is a 32-bit integer on both the Nvidia and AMD platforms. If a 64-bit integer is passed to the function, its higher bits are silently truncated!!! The solution is to explicitly cast the 64-bit integer to bool, which is then implicitly cast to int (see the sketch after this list).
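A minimal sketch of the truncation pitfall and its fix, written as a HIP kernel (the kernel name and variable names are made up for illustration):

#include <hip/hip_runtime.h>

// Hypothetical kernel illustrating the 64-bit truncation pitfall of __any().
__global__ void vote_demo(const unsigned long long* flags, int* out)
{
    unsigned long long f = flags[threadIdx.x];

    // BUG: __any() takes a 32-bit int, so the upper 32 bits of f are silently dropped.
    // A thread whose flag has bits set only in the upper half votes "false".
    int wrong = __any(f);

    // FIX: collapse the 64-bit value to bool first (the bool is then converted to int).
    int right = __any(f != 0ull);

    out[threadIdx.x] = wrong + 2 * right;
}

The same bool conversion also works on the CUDA side with __any_sync(__activemask(), pred).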

FindNVML.cmake done correctly — how to have CMake find Nvidia Management Library (NVML) on Windows and Linux

[Last updated on Feb 26, 2020]

The latest CMake, 3.17, has just started to officially support a new module, FindCUDAToolkit, in which the NVML library is conveniently exposed as the target CUDA::nvml. With this new feature, this article is now deprecated.

Nvidia Management Library (NVML) is a powerful API for getting and setting GPU states. At the time of writing there was no official CMake support. The first couple of Google search results point to a script on GitHub, which unfortunately is only partially correct and does not work on Windows. Here we provide a working solution, tested on Scientific Linux 6 and Windows 10, with CUDA 9.1 and CMake 3.11.
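For a quick taste of the API, a minimal NVML query looks roughly like the sketch below (error handling trimmed to a single check); it assumes only nvml.h and the NVML library, whose locations are listed next:

#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) return 1;   // load and initialize NVML

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        unsigned int temp = 0;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, NVML_DEVICE_NAME_BUFFER_SIZE);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        std::printf("GPU %u: %s, %u C\n", i, name, temp);
    }

    nvmlShutdown();
    return 0;
}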

The NVML API is spread across several locations:

  • Linux
    • Header: ${CUDA_INCLUDE_DIRS}/nvml.h
    • Shared library: ${CUDA_TOOLKIT_ROOT_DIR}/lib64/stubs/libnvidia-ml.so
  • Windows
    • Header: ${CUDA_INCLUDE_DIRS}/nvml.h
    • Shared library: C:/Program Files/NVIDIA Corporation/NVSMI/nvml.dll
    • Import library: ${CUDA_TOOLKIT_ROOT_DIR}/lib/x64/nvml.lib

It is critical to note that on Windows a dynamic library (.dll) is accompanied by an import library (.lib), which is different from a static library (also .lib). In CMake the target binary should link directly against the import library (.lib) rather than the .dll file. With that, the correct FindNVML.cmake script is shown in Listing 1.

Listing 1

# FindNVML.cmake

if(${CUDA_VERSION_STRING} VERSION_LESS "9.1")
    string(CONCAT ERROR_MSG "--> ARCHER: Current CUDA version "
                         ${CUDA_VERSION_STRING}
                         " is too old. Must upgrade it to 9.1 or newer.")
    message(FATAL_ERROR ${ERROR_MSG})
endif()

# windows, including both 32-bit and 64-bit
if(WIN32)
    set(NVML_NAMES nvml)
    set(NVML_LIB_DIR "${CUDA_TOOLKIT_ROOT_DIR}/lib/x64")
    set(NVML_INCLUDE_DIR ${CUDA_INCLUDE_DIRS})

    # .lib import library full path
    find_file(NVML_LIB_PATH
              NO_DEFAULT_PATH
              NAMES nvml.lib
              PATHS ${NVML_LIB_DIR})

    # .dll full path
    find_file(NVML_DLL_PATH
              NO_DEFAULT_PATH
              NAMES nvml.dll
              PATHS "C:/Program Files/NVIDIA Corporation/NVSMI")
# linux
elseif(UNIX AND NOT APPLE)
    set(NVML_NAMES nvidia-ml)
    set(NVML_LIB_DIR "${CUDA_TOOLKIT_ROOT_DIR}/lib64/stubs")
    set(NVML_INCLUDE_DIR ${CUDA_INCLUDE_DIRS})

    find_library(NVML_LIB_PATH
                 NO_DEFAULT_PATH
                 NAMES ${NVML_NAMES}
                 PATHS ${NVML_LIB_DIR})
else()
    message(FATAL_ERROR "Unsupported platform.")
endif()

find_path(NVML_INCLUDE_PATH
          NO_DEFAULT_PATH
          NAMES nvml.h
          PATHS ${NVML_INCLUDE_DIR})

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(NVML DEFAULT_MSG NVML_LIB_PATH NVML_INCLUDE_PATH)

Once find_package(NVML) is called in user CMake code, two cache variables are generated: NVML_LIB_PATH and NVML_INCLUDE_PATH. For Windows, there is an additional NVML_DLL_PATH.
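A consuming target can then use these variables directly. Here is a minimal sketch, assuming FindNVML.cmake sits in a cmake/ subdirectory of the project (the target name my_gpu_app is made up):

list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")  # location of FindNVML.cmake

find_package(NVML REQUIRED)

add_executable(my_gpu_app main.cpp)
target_include_directories(my_gpu_app PRIVATE ${NVML_INCLUDE_PATH})
target_link_libraries(my_gpu_app PRIVATE ${NVML_LIB_PATH})

# On Windows the import library is linked at build time; nvml.dll (NVML_DLL_PATH)
# must still be reachable at run time, e.g. on PATH or copied next to the executable.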

The correct way of building an MPI program using CMake

[Last update on Feb 18, 2018]

Many posts on this topic appear outdated. Modern CMake is centered around target-specific configuration. A correct way of building an MPI program with CMake (version 3.10.2, for instance) would be:

find_package(MPI REQUIRED)
add_executable(my_mpi_bin src1.cpp src2.cpp)
target_include_directories(my_mpi_bin PRIVATE ${MPI_CXX_INCLUDE_PATH} my_include_dirs)
target_compile_options(my_mpi_bin PRIVATE ${MPI_CXX_COMPILE_FLAGS} my_compile_flags)
target_link_libraries(my_mpi_bin PRIVATE ${MPI_CXX_LIBRARIES} ${MPI_CXX_LINK_FLAGS} my_link_flags)
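Since CMake 3.9, FindMPI also provides imported targets, so the include directories, compile options, and link flags above can be pulled in with a single call. A sketch:

find_package(MPI REQUIRED)
add_executable(my_mpi_bin src1.cpp src2.cpp)
# The imported target carries the MPI include directories, compile options, and libraries.
target_link_libraries(my_mpi_bin PRIVATE MPI::MPI_CXX)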

If the MPI implementation (MPICH 3.2, for instance) is installed at a location that CMake cannot find automatically, specify the paths explicitly. For example:

cmake \
-DMPI_CXX_COMPILER=/usr/local/mpich-install/bin/mpicxx \
-DMPI_C_COMPILER=/usr/local/mpich-install/bin/mpicc \
-DMPIEXEC_EXECUTABLE=/usr/local/mpich-install/bin/mpiexec.hydra

MPI_CXX_COMPILER and MPI_C_COMPILER are merely the MPI compiler wrappers; they are not the actual compiler/linker. To specify a particular compiler/linker:

cmake \
-DCMAKE_CXX_COMPILER=/usr/local/bin/g++-6.4.0 \
-DCMAKE_C_COMPILER=/usr/local/bin/gcc-6.4.0 \
-DMPI_CXX_COMPILER=/usr/local/mpich-install/bin/mpicxx \
-DMPI_C_COMPILER=/usr/local/mpich-install/bin/mpicc \
-DMPIEXEC_EXECUTABLE=/usr/local/mpich-install/bin/mpiexec.hydra

*Never ever specify CMAKE_CXX_COMPILER and CMAKE_C_COMPILER by hardcoding them in the CMake script. This is such a common anti-pattern.

To create a test for the MPI program:

add_test(NAME my_mpi_test
         COMMAND ${MPIEXEC_EXECUTABLE}
         ${MPIEXEC_NUMPROC_FLAG}
         ${MPIEXEC_MAX_NUMPROCS}
         ${MPIEXEC_PREFLAGS}
         ${CMAKE_CURRENT_BINARY_DIR}/my_mpi_bin
         ${MPIEXEC_POSTFLAGS}
         my_arg_1 my_arg_2 ...)

*Prior to cmake 3.10.2, use MPIEXEC instead of MPIEXEC_EXECUTABLE.
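Optionally, tell CTest how many processors the test occupies, so that parallel test runs (ctest -j) are scheduled correctly:

set_tests_properties(my_mpi_test PROPERTIES PROCESSORS ${MPIEXEC_MAX_NUMPROCS})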

What phrase is considered toxic but may apply now? — GG EZ

Reference:
https://cmake.org/cmake/help/v3.9/module/FindMPI.html
http://www.urbandictionary.com/define.php?term=ggez

Build Geant4 (including OpenGL visualization) using Cygwin on Windows

[Updated on Nov 15, 2019]

There has been a lack of official documentation on how to build Geant4 using Cygwin on Windows. This post is intended to fill that gap. All we need to do is install several Cygwin packages, modify a couple of CMake scripts, and change a few lines of Geant4 source code.

Test conditions

  • Geant4 10.5.1
  • Windows 10 (64-bit)
  • Cygwin 64 version 3.0.7
  • gcc and g++ version 7.4.0
  • cmake version 3.14.5

The following Cygwin packages are required to build a C++ project with CMake.

  • gcc-g++
  • cmake
  • make

The following Cygwin packages are required to build the Geant4 core engine.

  • expat, libexpat-devel
  • zlib, zlib-devel

The following Cygwin packages are required to build the Geant4 OpenGL visualization module.

  • libX11-devel
  • libXmu-devel
  • libGL-devel
  • xinit
  • xorg-server
  • xorg-x11-fonts-*

Steps

  • Modify cmake scripts.
    • In both cmake\Modules\G4BuildSettings.cmake and cmake\Modules\G4ConfigureCMakeHelpers.cmake, change CMAKE_CXX_EXTENSIONS from OFF to ON, i.e.
      set(CMAKE_CXX_EXTENSIONS ON)
      

      The above edits are crucial: they let the compiler flag -std=gnu++11 be added automatically in place of the initial flag -std=c++11. On Cygwin, -std=c++11 makes the POSIX function posix_memalign() inaccessible, which causes Geant4 compile errors.

  • Modify source code.
    • In source\processes\electromagnetic\dna\utils\include\G4MoleculeGun.hh, add declarations of the explicit specializations immediately after the class definition:
      template<typename TYPE>
      class TG4MoleculeShoot : public G4MoleculeShoot
      {
      public:
          TG4MoleculeShoot() : G4MoleculeShoot(){;}
          virtual ~TG4MoleculeShoot(){;}
          void Shoot(G4MoleculeGun*){}
      
      protected:
          void ShootAtRandomPosition(G4MoleculeGun*){}
          void ShootAtFixedPosition(G4MoleculeGun*){}
      };
      
      // Above is class definition in Geant4
      // We need to add three lines of code here
      // to declare explicit specialization
      
      template<> void TG4MoleculeShoot<G4Track>::ShootAtRandomPosition(G4MoleculeGun* gun);
      template<> void TG4MoleculeShoot<G4Track>::ShootAtFixedPosition(G4MoleculeGun* gun);
      template<> void TG4MoleculeShoot<G4Track>::Shoot(G4MoleculeGun* gun);
      

      Otherwise the compiler would complain about multiple definitions.

    • In source\global\management\src\G4Threading.cc, comment out the syscall.h include. Cygwin does not provide the OS-specific header syscall.h, so the Geant4 multithreading that relies on it is not supported.
      // #include <sys/syscall.h>  // comment out this line
      
  • Create an out-of-source build script. Due to the lack of syscall.h in Cygwin, only a single-threaded Geant4 can be built.
    • Release build
      cmake ../geant4_src -DCMAKE_C_COMPILER=/usr/bin/gcc.exe \
      -DCMAKE_CXX_COMPILER=/usr/bin/g++.exe \
      -DCMAKE_INSTALL_PREFIX=/opt/geant4/release \
      -DCMAKE_BUILD_TYPE=Release \
      -DGEANT4_USE_SYSTEM_EXPAT=ON \
      -DGEANT4_USE_SYSTEM_ZLIB=ON \
      -DGEANT4_INSTALL_DATA=ON \
      -DGEANT4_USE_OPENGL_X11=ON
      
  • Build.
    make
    

    For a faster build, use make -j6, which runs 6 parallel jobs.

  • Install.
    make install
    
  • Visualization
    To visualize the B1 example, start the X server in one Cygwin terminal:

    startxwin
    

    In another terminal:

    export DISPLAY=:0.0
    ./exampleB1.exe
    
  • Have fun with Geant4 !!! … and remember: If you love something, set it free.

Acknowledgement

Thanks to Charlie for making me aware of the issues in newer Geant4.

Prefetch on Intel MIC coprocessor

[updated on April 6, 2016]

Software-based data prefetch on Intel MIC coprocessors is very useful for Monte Carlo transport codes: it helps hide the long latency of loading microscopic cross-section data from DRAM. There are a total of 8 different prefetch instructions with subtle differences. Here we tell them apart.

Cache hierarchy

A MIC has a 32 KB L1 cache and a 512 KB L2 cache per core. Here by “cache” we mean the data cache rather than the instruction cache, and by “core” we mean a physical core rather than a logical core. Both cache levels implement the MESI coherency protocol and have a 64-byte cache line (i.e. 8 consecutive FP64 values).

Prefetch instruction

Let’s take a look at two orthogonal concepts first:

  • non-temporal hint (NTA) — indicates that the data will be used only once and causes it to be evicted from the cache after that first use (the most recently used data is the first to be evicted).
  • exclusive hint (E) — puts the cache line on the current core into the “exclusive” state and invalidates the copies of that line on other cores.

Combining temporality, exclusiveness, and cache level (L1 or L2) yields the 8 prefetch instructions supported by the present-day Knights Corner MIC. They are enumerated below, along with how the prefetched data are handled in the cache.

  • vprefetchnta (_MM_HINT_NTA): loads data into the L1 and L2 caches and marks it as NTA
  • vprefetch0 (_MM_HINT_T0): loads data into the L1 and L2 caches
  • vprefetch1 (_MM_HINT_T1): loads data into the L2 cache only
  • vprefetch2 (_MM_HINT_T2): loads data into the L2 cache only and marks it as NTA (a counter-intuitive mnemonic, as it contains no NTA)
  • vprefetchenta (_MM_HINT_ENTA): exclusive version of vprefetchnta
  • vprefetche0 (_MM_HINT_ET0): exclusive version of vprefetch0
  • vprefetche1 (_MM_HINT_ET1): exclusive version of vprefetch1
  • vprefetche2 (_MM_HINT_ET2): exclusive version of vprefetch2
Note that the L2 cache of the MIC is inclusive, in the sense that it holds a copy of all the data in L1.

There are two ways of implementing prefetch in C — intrinsic and assembly.

#include <immintrin.h>  // provides _mm_prefetch and the _MM_HINT_* constants

// method 1: compiler intrinsic
_mm_prefetch((const char*)addr, hint);

// method 2: inline assembly (substitute one of the instructions above for prefetch_inst)
asm volatile ("prefetch_inst %0" :: "m"(*(const char*)addr));

Here addr is the address of the first byte to prefetch, prefetch_inst is one of the prefetch instructions listed above, and hint is the corresponding parameter of the compiler intrinsic. We would like to emphasize again that _MM_HINT_T2 and _MM_HINT_ET2 are counter-intuitive; in fact they are misnomers, as both are non-temporal. Intel should have named them _MM_HINT_NTA2 and _MM_HINT_ENTA2.
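As an illustration, a gather loop that prefetches a cross-section lookup a fixed number of iterations ahead might look like the sketch below; the function name, table layout, and prefetch distance are arbitrary choices for illustration:

#include <immintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

/* Sum table entries at irregular indices, prefetching the entry needed
   `dist` iterations ahead into L2 while the current one is being consumed. */
double gather_xs(const double* xs_table, const int* idx, long n)
{
    const long dist = 16;   /* prefetch distance in iterations (tune empirically) */
    double sum = 0.0;

    for (long i = 0; i < n; ++i) {
        if (i + dist < n)
            _mm_prefetch((const char*)&xs_table[idx[i + dist]], _MM_HINT_T1);
        sum += xs_table[idx[i]];
    }
    return sum;
}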

Prefetch on CPUs

So how about prefetch on Intel Xeon CPUs? Well, it turns out to be very different! Check the list below.

  • prefetchnta (_MM_HINT_NTA): loads data into the L2 and L3 caches and marks it as NTA
  • prefetcht0 (_MM_HINT_T0): loads data into the L2 and L3 caches
  • prefetcht1 (_MM_HINT_T1): equivalent to prefetcht0
  • prefetcht2 (_MM_HINT_T2): equivalent to prefetcht0
  • prefetchw (no intrinsic hint [b]): exclusive version of prefetcht0 [a]
  • prefetchwt1 (no intrinsic hint [c]): equivalent to prefetchw [a]

[a] not confirmed
[b] icpc does not compile _mm_prefetch((const char*)addr, _MM_HINT_ENTA);
[c] icpc does not compile _mm_prefetch((const char*)addr, _MM_HINT_ET1);
Note that the L3 cache of the Intel Xeon CPU is inclusive, in the sense that it holds a copy of all the data in L2.

References
[1] Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual, 2012.
[2] Intel 64 and IA-32 Architectures Software Developer’s Manual, 2015.