Unfortunately, this fundamentally can't work. One showstopper (and there may be others) is subgroup size. Nvidia hardware has a subgroup (warp) size of 32, while Intel's subgroup size story is far more complicated and depends on a compiler heuristic. The short version is that it's usually 16, but it can be 8 if there's a lot of register pressure, or 32 for a big workgroup without much register pressure. For those who might reasonably ask whether forcing the subgroup size to 32 solves the compatibility issue: it will frequently cause registers to spill and performance to tank. CUDA code is not written to be agile in subgroup size, so there is no automated translation that works efficiently on Intel GPU hardware.
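To make "not agile in subgroup size" concrete, here is a rough CPU-side sketch in Python, not real GPU code. It simulates a shuffle-down tree reduction, a common pattern in warp-level CUDA code. The function names are hypothetical, and the rule that an out-of-range lane reads back its own value is an assumption meant to model how CUDA's `__shfl_down_sync` behaves; other APIs may differ. The size-agnostic variant also assumes the subgroup size is a power of two.

```python
def shuffle_down(values, offset):
    """Each lane reads the value held by lane + offset.

    Assumption: a lane whose source is out of range reads its own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def reduce_hardcoded_32(values):
    """Tree reduction with offsets fixed for a 32-wide subgroup."""
    vals = list(values)
    for offset in (16, 8, 4, 2, 1):
        shifted = shuffle_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
    return vals[0]  # lane 0 is expected to hold the total


def reduce_size_agnostic(values):
    """Same reduction, but the offsets are derived from the actual subgroup size."""
    vals = list(values)
    offset = len(vals) // 2  # assumes a power-of-two subgroup size
    while offset > 0:
        shifted = shuffle_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]


if __name__ == "__main__":
    for size in (32, 16):
        lanes = list(range(size))
        print(size, sum(lanes), reduce_hardcoded_32(lanes), reduce_size_agnostic(lanes))
```

In this model the hardcoded version is only correct when the subgroup really has 32 lanes. On a 16-lane subgroup the first step adds each lane to itself, so lane 0 ends up with double the true sum. Writing the size-agnostic version is easy here; the hard part is that existing CUDA code bakes the assumption in across many places.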
Longer term, I think we can write GPU code that is portable, but it will require building out the infrastructure for it. Vulkan compute shaders are one good starting point, and as of Vulkan 1.3 the "subgroup size control" feature is mandatory. WebGPU is another possible path to get there, but it currently lacks a lot of important features, including any subgroup support. There's more discussion of subgroups as a potential WebGPU feature in [1], including how to handle subgroup size.
Things like this are often useful even if they're not optimal. Before, you had a piece of code that simply would not run on your GPU. Now it runs. Even if it's slower than it should be, that's better than not running at all, which makes more people willing to buy the GPU.
Then they go to the developers and ask why the implementation isn't optimized for hardware that lots of people have, and the solution is to write an implementation in Vulkan or something similar.
[1]: https://github.com/gpuweb/gpuweb/issues/3950