Over the past several weeks I have been porting a codebase from the Nvidia CUDA platform to AMD HIP. I ran into several critical issues: some were solved, some were attributed to compiler bugs, and some remain unfathomable to me.
There are 3 important things I have learned so far from the painstaking debugging process.
- An unsigned integer with n bits only supports bitwise left-shift (`<<`) counts from 0 to n-1. Shifting by n or more is undefined behavior. On the Nvidia platform, 0 bits are shifted in, whereas on AMD, 1 bits are shifted in!!!
- Currently there is a serious compiler bug: the wavefront vote function `__any(pred)`, which is supposed to work like `__any_sync(__activemask(), pred)` in CUDA, yields incorrect results in divergent threads!!!
- This is very easy to miss: the parameter of the wavefront vote functions `__any(pred)`, `__all(pred)`, etc. is a 32-bit integer on both the Nvidia and AMD platforms. If a 64-bit integer is passed instead, the higher bits are truncated!!! The solution is to explicitly cast the 64-bit integer to bool, which is then implicitly converted to int.
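
The shift and truncation pitfalls can be reproduced on the host in plain C++, without a GPU. The block below is only a minimal sketch. The names `safe_shl`, `low_bits_mask`, `vote_like`, `vote_like_truncating`, and `vote_like_safe` are hypothetical helpers made up for illustration. In particular, `vote_like` is a stand-in that only mimics the 32-bit `int` parameter of the vote functions; it is not the real `__any()`. Using these helpers in device code would presumably require adding the usual `__host__ __device__` qualifiers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: a left shift that is well defined for any count.
// Counts >= 64 return 0 instead of invoking undefined behavior.
constexpr std::uint64_t safe_shl(std::uint64_t value, unsigned count) {
    return count < 64 ? (value << count) : 0;
}

// Hypothetical helper: mask with the lowest n bits set, valid for n in [0, 64].
// The naive form (1 << n) - 1 is undefined for n == 64.
constexpr std::uint64_t low_bits_mask(unsigned n) {
    return n >= 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << n) - 1;
}

// Host-side stand-in for a vote function with a 32-bit predicate parameter.
// This is NOT the real __any(); it only reproduces the parameter type.
inline bool vote_like(int pred) { return pred != 0; }

// Mimics passing a 64-bit value where a 32-bit int is expected:
// only the low 32 bits survive the narrowing.
inline bool vote_like_truncating(std::uint64_t value) {
    return vote_like(static_cast<int>(static_cast<std::uint32_t>(value)));
}

// The fix described above: collapse to bool first, then let bool widen to int.
inline bool vote_like_safe(std::uint64_t value) {
    return vote_like(static_cast<bool>(value));
}

// Small self-check that exercises both pitfalls.
inline void run_checks() {
    // Nonzero value whose low 32 bits are all zero.
    const std::uint64_t high_only = std::uint64_t{1} << 40;

    assert(safe_shl(1, 64) == 0);                    // guarded excess shift
    assert(low_bits_mask(64) == ~std::uint64_t{0});  // full-width mask
    assert(!vote_like_truncating(high_only));        // predicate silently lost
    assert(vote_like_safe(high_only));               // predicate preserved
}
```

Calling `run_checks()` from a `main` function should pass silently. The guard in `safe_shl` costs one comparison, but it removes any dependence on what a particular platform happens to do with an out-of-range shift count.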