This is something that fundamentally can't work, unfortunately. One showstopper (and there may be others) is subgroup size. Nvidia hardware has a subgroup (warp) size of 32, while Intel's subgroup size story is far more complicated and depends on a compiler heuristic to tune: usually 16, but 8 under heavy register pressure, or 32 for a big workgroup with little register pressure. (For those who might reasonably ask whether forcing the subgroup size to 32 would solve the compatibility issue: it frequently causes registers to spill and performance to tank.) CUDA code is not written to be agile in subgroup size, so there is no automated translation that works efficiently on Intel GPU hardware.
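To make the hazard concrete, here is a toy model in plain Python (an illustration only; real subgroup reductions use shuffle intrinsics, not lists): a kernel that hard-wires 32 lanes produces a different set of partial results than 16-wide hardware actually delivers.

```python
# Toy model (not real GPU code): 64 "threads" hold the values 0..63,
# and each hardware subgroup reduces its own lanes.

def subgroup_sums(values, subgroup_size):
    """Ground truth: one partial sum per hardware subgroup."""
    return [sum(values[i:i + subgroup_size])
            for i in range(0, len(values), subgroup_size)]

def kernel_sums(values, assumed_warp=32):
    """What a kernel that hard-wires a 32-lane warp computes: each thread
    adds its value into the slot of its *assumed* warp (tid // 32)."""
    sums = {}
    for tid, v in enumerate(values):
        sums[tid // assumed_warp] = sums.get(tid // assumed_warp, 0) + v
    return [sums[k] for k in sorted(sums)]

threads = list(range(64))
hw = subgroup_sums(threads, 16)   # 16-wide hardware: four partial sums
assumed = kernel_sums(threads)    # kernel hard-wired for 32: only two
# hw == [120, 376, 632, 888] but assumed == [496, 1520]; any host code
# expecting one result per hardware subgroup now gets the wrong shape.
```

The same shape mismatch shows up in practice in ballot masks, lane-index arithmetic, and shared-memory staging sized to `warpSize`.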
Longer term, I think we can write GPU code that is portable, but it will require building out the infrastructure for it. Vulkan compute shaders are one good starting point, and as of Vulkan 1.3 the "subgroup size control" feature is mandatory. WebGPU is another possible path to get there, but it's currently lacking a lot of important features, including subgroups at all. There's more discussion of subgroups as a potential WebGPU feature in [1], including how to handle subgroup size.
Things like this are often useful even if they're not optimal. Before, you had a piece of code that simply would not run on your GPU. Now it runs. Even if it's slower than it should be, that's better than not running at all, which in turn makes more people willing to buy the GPU.
Then they go to the developers and ask why the implementation isn't optimized for this hardware that lots of people have, and the solution is to do a native implementation in Vulkan etc.
Both Intel and AMD have the opportunity to create some actual competition for NVIDIA in the GPGPU space. Intel, at least, I can forgive since they only just entered the market. Why AMD has struggled so hard to get anything going for so long, I don't know...
AMD has made several attempts; their most recent effort is apparently the ROCm [0] software platform. There is an official PyTorch distro for Linux that supports ROCm [1] for acceleration. There are also frameworks like tinygrad [2] that claim support for all sorts of accelerators. That's as far as the claims go; I don't know how it handles the real world. If the occasional livestream by George Hotz (creator of tinygrad) is anything to go by, AMD has to iron out a lot of driver issues before it can be actual competition for team green.
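For anyone experimenting with the ROCm PyTorch builds: as I understand it, they deliberately present themselves through the `torch.cuda` API, with `torch.version.hip` set instead of `torch.version.cuda`. A hedged sketch of telling them apart (attribute names per the PyTorch docs; behavior may vary across versions, and the fallback strings are my own):

```python
def detect_gpu_backend():
    """Best-effort guess at which accelerator backend a PyTorch build uses."""
    try:
        import torch
    except ImportError:
        return "no-torch"          # PyTorch not installed at all
    if not torch.cuda.is_available():
        return "cpu"
    # ROCm wheels reuse the torch.cuda namespace but set torch.version.hip.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"

print(detect_gpu_backend())
```

Handy when the same `model.to("cuda")` code path may actually be running through HIP underneath.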
I really hope AMD manages a comeback like they pulled off a few years ago with their CPUs. Intel joining the market is certainly helping, but having three big players competing would be desirable for all sorts of applications that require GPUs. AMD cards like the 7900 XTX are already fairly promising on paper, with generous VRAM; they'd probably be much more cost effective than NVIDIA cards if software support were anywhere near comparable.
The weirdest thing to me is that ROCm works fine on Linux (at least some dedicated workstations can use it with specific cards), and it has existed for many years. Yet somehow they haven't made a single card work on Windows in all that time (or they don't want to, for some reason?). It's really strange, given that they already have a working implementation, just not for Windows, so they clearly don't lack the ability to make it work.
To get an idea, the 1050ti is a card with an MSRP of $140 - almost seven years ago when it was released. Between the driver and CUDA support matrix it will likely end up with a 10 year support life.
While it's not going to impress with an LLM or such, it's the minimum supported card for speech to text with Willow Inference Server (I'm the creator), and it still puts up impressive numbers.
Same for the GTX 1060/1070, which you can get with up to 8GB VRAM today for ~$100 used. Again, not impressive for the hotter LLMs, etc but it will do ASR, TTS, video encoding/decoding, Frigate, Plex transcoding, and any number of other things with remarkable performance (considering cost). People also run LLMs on them and from a price/performance/power standpoint it's not even close compared to CPU.
The 15-year investment in and commitment to universal support for any Nvidia GPU across platforms (with very long support lifecycles) is extremely hard to compete with (as we see again and again with AMD's attempts in the space).
The 390.x driver series (last updated 11/22) supports cards that are at least 13 years old[0]. From a quick glance at the available drivers for AMD cards of that vintage, they haven't been updated since 2014/2015.
By comparison, Nvidia's last driver release from the 2014-2015 range supports cards going back to at least 2004[1].
Agreed. What they've also been doing is stalling on/removing support for gfx803, and that hardware literally could have let people keep running many decent small nets.
Yes, but I don't think it's debatable that the entire ecosystem is firmly behind NVIDIA's. Usually it comes as a surprise when something does support AMD's framework, whether ROCm directly or even HIP, which should be easier....
I shouldn't be surprised that AMD's ecosystem is lagging behind, since their GPU division spent a good decade struggling to stay relevant. Not to mention that NVIDIA has spent a lot of effort on their HPC tools.
I don't want this to be too negative towards AMD, they have been steadily growing in this space. Some things do work well, e.g. stable diffusion is totally fine on AMD GPUs. So they seem to be catching up. I just feel a little impatient, especially since their cards are more than powerful enough to be useful. I suppose my point is that the gap in HPC software between NVIDIA and AMD is much larger than the actual capability gap in their hardware, and that's a shame.
Apple was a few months from bankruptcy during most of the 90s competing with IBM and Microsoft, then turned around to become the most profitable company on the planet. It takes a leader and a plan and a lot of talent and the exact right conditions, but industry behemoths get pulled down from the top spot all the time.
Apple's success is mostly UX and marketing, with a walled garden for an application tax. AMD has to actually achieve on the hardware side, not just in marketing. Beyond this, AMD has demonstrated that they are, indeed, working on closing the gap and pulling ahead: AMD is well ahead of Intel on the server CPU front, and they're neck and neck on desktop, with stretches of being clearly ahead in the past few years. And on the GPU side, they've closed a lot of gaps.
While I'm a bit of a fan of AMD, there's still work to do. I think AMD really needs to take advantage of their production margins to gain market share. They also need something a bit closer to the 4090: a performance GPU that doubles as an entry workstation/GPGPU card. The 7900 XTX is really close, but if they had something with, say, 32-48GB of VRAM in the sub-$2000 space, it would get a lot of the hobbyist and SOHO types to consider them.
Yeah, sure, changing their platform 3 times in the space of some twenty years is just marketing and UX from Apple.
They are just a bunch of MBAs.
Sometimes I feel like I am reading slashdot.
The platform changes had very little to do with their success. They switched from PowerPC to Intel because PowerPC was uncompetitive, but that doesn't explain why they did any better than Dell or anyone else using the exact same chips. Then they developed their own chips because Intel was stagnant, but they barely came out before AMD had something competitive and it's not obvious they'd have been in a meaningfully different position had they just used that.
Their hardware is good but if all they were selling was Macbooks and iPhones with Windows and Android on them, they wouldn't have anything near their current margins.
If they hadn’t made platform changes they would never have been able to turn into what they are today. I hardly think that is ‘little to do’.
They would likely barely exist. They have ‘achieved product market fit’ as the saying goes.
Which requires more than just a sharp UI, as their history shows
I think the point here is that their first platform shift, onto x86/x86-64, was driven by how far PowerPC had fallen behind. Even their fans were having difficulty justifying the slowness of their comparatively expensive computers.
It was more forced upon them than anything else.
The move to M1 was an actual innovation that came after their success.
The real story of the company though is the iPhone, which is absolutely their own technical innovation.
I didn't say they didn't have technical prowess... I said that wasn't the key to their overall success. iPod/iPhone/iPad came out ahead of the competition with better UX than what came before, generally. They marketed consistently and did it well. They built up buzz. They paid for placement throughout TV and movies.
They are a brand first, and a technical company second. That doesn't mean they aren't doing cool technical things. But a lot of companies do cool technical things and still fail.
NV has a huge advantage over AMD: they only do one thing. And that has helped them to relentlessly focus on optimizing that one thing. AMD is fighting on three different fronts at once.
Yeah, but try running ML projects on your AMD card and you'll quickly see that they're an afterthought nearly everywhere, even in projects that use PyTorch (which has backend support for AMD). If consumers can't use it, they're going to learn Nvidia instead, and experience has shown that people opt for the enterprise tech they're familiar with; most people get familiar by hacking on it locally.
Oh wow, I think this translates PTX (nvidia's high-level assembly code) to SPIR-V? Am I reading this right? That's...a lot.
A note to any systems hackers who might try something like this, you can also retarget clang's CUDA support to SPIR-V via a LLVM-to-SPIR-V translator. I can say with confidence that this works. :)
It's a drop-in replacement in the sense that you don't need to modify your code, but it has limitations/incompatibilities. Contrast that with something that isn't a drop-in replacement, which would require changes to the application.
"It's compatible with CUDA as long as you don't use all the features of CUDA."
So it's a drop-in replacement for some subset of modern CUDA. I feel like most folks who are upvoting this don't program CUDA professionally or aren't very advanced in their usage of it.
The issue being that in the original question “drop-in” implies complete, that’s what being a drop-in replacement actually means in other contexts. If it’s not complete, then it’s not really drop-in, even though I don’t necessarily disagree with your definition. You can be right, and parent can be right too, IMHO. The FAQ question is stated ambiguously and in misleadingly black and white terms, and the answer really does look kinda funny starting with the word “Yes” and then following that with “but… not exactly”. Wouldn’t it be better to say drop-in is the goal, and because it’s not complete, we’re not there yet?
Does the statement make no sense or you simply don't like it? I can understand what it says, sounds like plain English to me.
I think that what you're trying to say is that before claiming to be a "drop-in replacement", make sure that your supported feature set is representative enough of mainline CUDA development.
"This project is a Proof of Concept. About the only thing that works currently is Geekbench. It's amazingly buggy and incomplete. You should not rely on it for anything serious."
> Is ZLUDA a drop-in replacement for CUDA?
> Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA
I think this might get better if and when people redesign their dev workflows, CI/CD pipelines, builds, etc, to deploy code to both hardware platforms to ensure matching functionality and stability. I'm not going to hold my breath just yet. But it would be really great to have two viable platforms/players in this space where code can be run and behave equally.
Why is this trending when there hasn't been a commit since Jan 2021? There's a comment here like "it's early days"... the repo has been dormant for longer than it was active.
> Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA
> What is the status of the project?
> This project is a Proof of Concept. About the only thing that works currently is Geekbench. It's amazingly buggy and incomplete. You should not rely on it for anything serious
It is a cool proof of concept but we don’t know how far away it is from becoming something that a company would willingly endorse. And I suspect AMD or Intel wouldn’t want to put a ton of effort into… helping people continue to write code in their competitor’s ecosystem.
CUDA has won, though. It's not about helping people write code for your competitor's ecosystem; it's about allowing the most-used ML packages to run on your hardware.
Yes it is a legal issue. AMD cannot implement CUDA.
However, they have worked around that by creating HIP, a CUDA-adjacent language that runs on AMD and also translates to CUDA for Nvidia GPUs. There is also the HIPify tool to automatically convert existing CUDA sources to HIP.
https://docs.amd.com/bundle/HIP-Programming-Guide-v5.3/page/...
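For a sense of what that conversion involves: as I understand it, hipify-perl is essentially pattern substitution over the source (hipify-clang works on the AST instead). A toy Python rendition of the renaming idea; the mapping entries are real CUDA/HIP API pairs, but this is nothing like the real tool:

```python
import re

# A few real CUDA -> HIP runtime API correspondences (the full tables
# live in the HIPify docs); this toy just renames them textually.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source: str) -> str:
    # Longest names first, so cudaMemcpyHostToDevice beats cudaMemcpy.
    pattern = re.compile("|".join(sorted(CUDA_TO_HIP, key=len, reverse=True)))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)

print(toy_hipify("cudaMalloc(&buf, n); cudaDeviceSynchronize();"))
# hipMalloc(&buf, n); hipDeviceSynchronize();
```

The kernel-launch syntax and most device code carry over unchanged, which is why the conversion can be this mechanical for simple programs.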
x86 is protected by patents and only a few companies can use them; that's why you don't see random companies making x86 CPUs the way they can with ARM (which also needs licensing, but is much easier to get).
The first x86-64 processor was released more than 20 years ago, so any patents on that base architecture (which includes SSE2) have already expired.
Good luck enforcing those software patents, though, in the current environment. My sense is that hardware companies put more faith in the enforceability of patent claims to something like CUDA than software companies with experience litigating patent claims to APIs would. Post Oracle v. Google, something like CUDA is vulnerable to being knocked off.
And FWIW, that seems to be a reasonable result given the overall market structure at the moment. Having all eggs in the Nvidia basket is great for Nvidia shareholders, but not for customers and probably not even for the health of the surrounding industry.
Genuine question: why don't these protections apply to emulators? How can emulators get away with emulating x86 when chip manufacturers can't use x86 in their chips without a license?
Probably fair use, which is a subjective thing, but one where Intel must be confident enough that they would lose. Or just lack of incentive: there is no money for Intel to gain even if they win.
HIPify is such a half-baked effort, in everything from installation to benchmarks to marketing. It doesn't look like they are trying to support this approach at all.
No, they ruled that APIs are copyrightable, but that Google's re-implementation was fair use. Based on the reasoning of the decision one would expect that in most cases independently reimplementing an API would generally be fair use. However, from a practical point of view, if you are defending yourself in a copyright lawsuit, fair use decisions happen much later in the process and are more subjective.
Furthermore, CUDA is a language (dialect of C/C++) not an API, so that precedent may not have much weight.
In the end, Google reimplemented Java, and the Supreme Court ruled on a very narrow piece of the reimplementation. I think it came down to a former Sun/Oracle employee at Google actually copy-pasting code from the original Java code base.
I'm reasonably sure they could reimplement CUDA from a copyright / trademark perspective. It's possible that they could be blocked with patents though.
> think it came down to a former sun/oracle employee at google actually copy pasting code from the original java code base.
IIRC, the verbatim copying of rangeCheck didn't make it to SCOTUS. They really did instead rule on the copyrightability of the "structure, sequence, and organization" of the Java API as a whole.
Because they ran their numbers and realized they have more to lose by going against Valve than by... amicably finding a compromise.
A more aggressive approach was tried during the Xbox 360 era, with the Games For Windows Live framework and by removing their games from the Steam store. It ended catastrophically, and they had to backtrack on both decisions.
The irony of Proton de facto killing any chance for native Linux ports of Windows games isn't lost on them, either.
MS has been building lots of goodwill with gamers by bringing games to PC and subverting expectations by not being opposed to using game pass on Steam Deck. Suing Valve or trying to shut down Proton/DXVK would instantly burn all that.
Because these lawsuits are costly, a PR nightmare, losing them is a serious possibility, and going around fighting your competition with lawsuits can put you in a bad place with government agencies.
Playing games on Linux is not a threat to Microsoft. The money they lose on that is minuscule.
This looks really interesting but also early days. "this is a very incomplete proof of concept. It's probably not going to work with your application." I hope it develops into something broadly usable!
Llama.cpp just added CUDA GPU acceleration yesterday, so this would be very interesting for the emerging space of running local LLMs on commodity hardware.
Running CUDA on an AMD RDNA3 APU is what I'd like to see, as it's probably the cheapest 16GB shared-VRAM solution (via the UMA Frame Buffer BIOS setting) and creates the possibility of running a 13b LLM locally on an underutilized iGPU.
- llama.cpp already has OpenCL acceleration, and has had it for some time.
- AMD already has a CUDA translation layer: ROCm. It should work with llama.cpp's CUDA backend, but in practice... shrug.
- The buffer copies the CUDA/OpenCL code makes (which are unavoidable for discrete GPUs) are problematic for IGPs. Right now acceleration actually regresses performance on IGPs.
llama.cpp would need tailor-made IGP acceleration, and I'm not even sure which API has the most appropriate zero-copy mechanism. Vulkan? oneAPI? Something inside ROCm?
Are there any projects that go the opposite direction, to run CPU code on GPU? I understand that there might be limitations, like not being able to access system calls or the filesystem. What I'm mainly looking for is a way to write C-style code and have it run auto-parallelized, without having to drop into a different language or annotate my code or manually manage buffers.
I think C code is full of branches, and CPUs are designed to guess and parallelise those (possible) decisions. Graphics cards are designed for running tiny programs against millions of pixels per second. I'm not sure it's possible to make these two different concepts the same.
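The closest existing answers to the parent's ask are annotation-based compilers like Numba: you write ordinary loop code and opt in with a decorator (its `@cuda.jit` variant targets the GPU, though that one does require explicit thread indexing). A hedged sketch of the CPU-parallel flavor, written so it degrades to plain Python when Numba isn't installed:

```python
import numpy as np

try:
    from numba import njit, prange    # parallelizing JIT, if available
except ImportError:                   # otherwise run as ordinary Python
    prange = range
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap

@njit(parallel=True)
def saxpy(a, x, y):
    out = np.empty_like(x)
    for i in prange(x.shape[0]):      # iterations may run in parallel
        out[i] = a * x[i] + y[i]
    return out

print(saxpy(2.0, np.array([1.0, 2.0]), np.array([10.0, 20.0])))  # [12. 24.]
```

No buffer management and no separate kernel language, which is roughly the ergonomics being asked for, though only for loops the compiler can actually analyze.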
[1]: https://github.com/gpuweb/gpuweb/issues/3950