The CUDA block size is likely a good proxy for register pressure, so if the block size is small you can try running with a small subgroup, and so on.
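
A rough sketch of what such a heuristic could look like, written as host-side C++. Everything here is an assumption for illustration: the function name `pick_subgroup_size`, the list of supported subgroup sizes, and the threshold of 128 threads are all hypothetical placeholders, not values taken from any driver or vendor guidance.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical heuristic: treat a small CUDA block size as a hint that the
// kernel is register-hungry, and pick the smallest supported subgroup size
// in that case. Otherwise pick the largest supported size.
//
// block_size            - threads per block the kernel was launched with
// supported_sizes       - subgroup sizes the target device offers (must be non-empty)
// small_block_threshold - arbitrary placeholder cutoff; tune by measurement
int pick_subgroup_size(int block_size,
                       const std::vector<int>& supported_sizes,
                       int small_block_threshold = 128) {
    assert(!supported_sizes.empty());
    const int smallest =
        *std::min_element(supported_sizes.begin(), supported_sizes.end());
    const int largest =
        *std::max_element(supported_sizes.begin(), supported_sizes.end());
    return block_size <= small_block_threshold ? smallest : largest;
}
```

For a device that offers subgroup sizes of 8, 16, and 32, this picks 8 for a 64-thread block and 32 for a 1024-thread block. The cutoff would need to be tuned against real kernels, since block size is only a proxy for register pressure.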

NVIDIA used to discourage code that relies on a particular subgroup (warp) size. I'm not sure how closely real-world code follows that advice, though.