This is something that fundamentally can't work, unfortunately. One showstopper ...

AnthonyMouse · on June 15, 2023

Things like this are often useful even if they're not optimal. Before you had a piece of code that simply would not run on your GPU. Now it runs. Even if it's slower than it should be, that's better than not running at all. Which makes more people willing to buy the GPU.

Then they go to the developers and ask why the implementation isn't optimized for hardware that lots of people have, and the solution is to write an implementation in Vulkan, etc.

fancyfredbot · on June 15, 2023

The CUDA block size is likely to be a good proxy for register pressure, so if the block size is small you can try running with a small subgroup, etc.
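To make that heuristic concrete, here is a rough sketch, not anything taken from a real translator. `pick_subgroup_size` is a made-up helper name, and it assumes the device exposes a power-of-two range of allowed subgroup sizes (the kind of range Vulkan reports as `minSubgroupSize`/`maxSubgroupSize` with `VK_EXT_subgroup_size_control`). It just picks the largest allowed subgroup size that does not exceed the CUDA block's thread count:

```cpp
#include <cstdint>

// Hypothetical helper (not part of any real API): choose a subgroup size for
// a translated CUDA kernel, using the block's thread count as a rough proxy
// for register pressure.
// Assumption: min_subgroup and max_subgroup are powers of two with
// min_subgroup <= max_subgroup, and every power of two in between is allowed.
std::uint32_t pick_subgroup_size(std::uint32_t cuda_block_threads,
                                 std::uint32_t min_subgroup,
                                 std::uint32_t max_subgroup) {
    std::uint32_t size = min_subgroup;
    // Double the candidate while it stays within the device limit and does
    // not exceed the number of threads in the CUDA block.
    while (size * 2 <= max_subgroup && size * 2 <= cuda_block_threads) {
        size *= 2;
    }
    return size;
}
```

So a 16-thread block on a device allowing 8 to 64 would get a subgroup of 16, while a 256-thread block would get the full 64. A block smaller than the minimum still gets the minimum, since the device can't go lower.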

NVIDIA used to discourage code that relies on the subgroup or warp size. I'm not sure how closely real-world code follows that advice, though.

pjmlp · on June 15, 2023

Only if SPIR-V tooling ever gets half as good as the PTX ecosystem.