
The CUDA block size is likely to be a good proxy for register pressure, so if the block size is small you can try running with a smaller subgroup, and so on.

NVIDIA used to discourage code that relies on the subgroup (warp) size. I'm not sure how closely real-world code follows that advice, though.
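To make "not relying on the warp size" concrete, here is a rough host-side sketch in plain C++. The helper names `warps_per_block` and `lane_id` are made up for illustration. The only point is that the warp size is passed in as a parameter instead of being written as a literal 32; in actual CUDA code that value would presumably come from `cudaDeviceProp::warpSize` on the host or the built-in `warpSize` variable in device code.

```cpp
// Hypothetical helper: number of warps (subgroups) needed to cover one block.
// warp_size is a parameter, not a hard-coded 32.
constexpr int warps_per_block(int block_size, int warp_size) {
    // Round up so a partially filled last warp is still counted.
    return (block_size + warp_size - 1) / warp_size;
}

// Hypothetical helper: index of a thread within its warp.
constexpr int lane_id(int thread_idx, int warp_size) {
    return thread_idx % warp_size;
}
```

With this shape, a 256-thread block is 8 warps at a warp size of 32 and 4 subgroups at a subgroup size of 64, and nothing else in the code has to change.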


