Zluda: Run CUDA code on Intel GPUs, unmodified (github.com/vosen)
237 points by goranmoomin on June 15, 2023 | hide | past | favorite | 90 comments


This is something that fundamentally can't work, unfortunately. One showstopper (and there may be others) is subgroup size. Nvidia hardware has a subgroup (warp) size of 32, while Intel's subgroup size story is far more complicated, and depends on a compiler heuristic to tune. The short version of the story is that it's usually 16 but can be 8 if there's a lot of register pressure, or 32 for a big workgroup and not much register pressure (and for those who might reasonably question whether forcing subgroup size to 32 can solve the compatibility issue, the answer is that it will frequently cause registers to spill and performance to tank). CUDA code is not written to be agile in subgroup size, so there is no automated translation that works efficiently on Intel GPU hardware.
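To make that concrete, the canonical warp-sum reduction found all over CUDA code bakes the number 32 into both the shuffle mask and the starting offset (a generic sketch of the common pattern, not from any particular codebase):

```cuda
// Warp-level sum reduction, written the way most CUDA code writes it:
// the full-warp mask (0xffffffff = 32 lanes) and the starting offset of 16
// both silently assume warpSize == 32.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}
// On a 16-wide subgroup, the offset-16 step and the 32-lane mask no longer
// describe the hardware, so a translator cannot map this 1:1.
```

A translator would have to recognize patterns like this and rewrite them per subgroup width, which is exactly the kind of whole-program reasoning that makes an automatic, efficient mapping to 8/16-wide hardware so hard.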

Longer term, I think we can write GPU code that is portable, but it will require building out the infrastructure for it. Vulkan compute shaders are one good starting point, and as of Vulkan 1.3 the "subgroup size control" feature is mandatory. WebGPU is another possible path to get there, but it's currently lacking a lot of important features, including subgroups at all. There's more discussion of subgroups as a potential WebGPU feature in [1], including how to handle subgroup size.

[1]: https://github.com/gpuweb/gpuweb/issues/3950


Things like this are often useful even if they're not optimal. Before you had a piece of code that simply would not run on your GPU. Now it runs. Even if it's slower than it should be, that's better than not running at all. Which makes more people willing to buy the GPU.

Then they go to the developers and ask why the implementation isn't optimized for this hardware lots of people have and the solution is to do an implementation in Vulkan etc.


The CUDA block size is likely to be a good proxy for register pressure, so if the block size is small you can try running with a small subgroup, etc.

NVIDIA used to discourage code which relies on the subgroup or warp size. I'm not sure how much this is true of real world code though.


Only if SPIR-V tooling ever gets half as good as the PTX ecosystem.


This project has been unmaintained for a while.

Both Intel and AMD have the opportunity to create some actual competition for NVIDIA in the GPGPU space. Intel, at least, I can forgive since they only just entered the market. Why AMD has struggled so hard to get anything going for so long, I don't know...


AMD has made several attempts; their most recent effort is the ROCm [0] software platform. There is an official PyTorch distro for Linux that supports ROCm [1] for acceleration. There are also frameworks like tinygrad [2] that claim support for all sorts of accelerators. That's as far as the claims go; I don't know how it holds up in the real world. If the occasional George Hotz livestream (creator of tinygrad) is anything to go by, AMD has to iron out a lot of driver issues to be any actual competition for team green.

I really hope AMD manages a comeback like they pulled off a few years ago with their CPUs. Intel joining the market certainly helps, but having three big players competing would be desirable for all sorts of applications that require GPUs. AMD cards like the 7900 XTX already look fairly promising on paper, with fairly large VRAM; they'd probably be much more cost-effective than NVIDIA cards if software support were anywhere near comparable.

[0]: https://www.amd.com/en/graphics/servers-solutions-rocm

[1]: https://pytorch.org/

[2]: https://github.com/geohot/tinygrad


The weirdest thing to me is this: ROCm works fine on Linux, at least on dedicated workstations with specific cards, and it has existed for many years. Yet somehow they haven't made a single card work on Windows in all that time (or they don't want to, for some reason?). It's really strange, given that they already have a working implementation, just not for Windows, so it's not that they lack the ability to make it work.


The issue with ROCm is that it's completely inaccessible for most users. It only supports high-end GPUs.

CUDA, meanwhile, works on a 1050 Ti.


Supported on CUDA 12 no less!

To give an idea: the 1050 Ti is a card with an MSRP of $140, and that was almost seven years ago when it was released. Between the driver and CUDA support matrix it will likely end up with a 10-year support life.

While it's not going to impress with an LLM or such it's the lowest minimum supported card for speech to text with Willow Inference Server (I'm the creator) and it still puts up impressive numbers.

Same for the GTX 1060/1070, which you can get with up to 8GB VRAM today for ~$100 used. Again, not impressive for the hotter LLMs, etc but it will do ASR, TTS, video encoding/decoding, Frigate, Plex transcoding, and any number of other things with remarkable performance (considering cost). People also run LLMs on them and from a price/performance/power standpoint it's not even close compared to CPU.

The 15 year investment and commitment to universal support for any Nvidia GPU across platforms (with very long support lifecycles) is extremely hard to compete with (as we see again and again with AMD attempts in the space).


To be fair, AMD does offer good long term support for cards, just not with ROCm.


The 390.x driver series (updated 11/22) supports cards that are at least 13 years old[0]. From a quick glance of available drivers for AMD cards of that vintage the drivers haven't been updated since 2014/2015.

When comparing the last driver release from Nvidia in the 2014-2015 range, it supports cards going back to at least 2004[1].

[0] - https://download.nvidia.com/XFree86/Linux-x86_64/390.157/REA...

[1] - https://download.nvidia.com/XFree86/Linux-x86_64/96.43.23/RE...


Agreed. What they've also been doing is stalling on and removing support for gfx803, hardware that could otherwise still let people run plenty of decent small nets.


Does ROCm count as an attempt? They burned so many people by not supporting any of the cards anyone cares about.


All it would take to remedy that, is actually providing good support going forward and a bit of advertising. Not a huge barrier IMO.


"AMD has made several attempts..."

And failed to make any of them work, which to my mind means they've burned their possibilities more than if they flat-out did nothing.


> Both Intel and AMD have the opportunity to create some actual competition for NVIDIA in the GPGPU space.

Apple Silicon being on Metal Performance Shaders (I think they deprecated OpenCL support?) kind of makes this all more confusing.

It definitely feels like CUDA is the leader and anything else is a backseat option / a non-starter, which is fine. The community support isn't there.

I haven't heard anybody talk about AMD Radeon GPUs in a looong time.


All the competition falls short on tooling and on being polyglot the way CUDA is.

So it is already a non starter if they can't meet those baselines.


AMD has HIP.


Yes, but I don't think it's debatable to say that the entire ecosystem is firmly behind NVIDIA's. Usually it comes as a surprise when something does support their framework, whether directly via ROCm or even HIP, which should be easier...

I shouldn't be surprised that AMD's ecosystem is lagging behind, since their GPU division spent a good decade struggling to even stay relevant. Not to mention that NVIDIA has spent a lot of effort on their HPC tools.

I don't want this to be too negative towards AMD, they have been steadily growing in this space. Some things do work well, e.g. stable diffusion is totally fine on AMD GPUs. So they seem to be catching up. I just feel a little impatient, especially since their cards are more than powerful enough to be useful. I suppose my point is that the gap in HPC software between NVIDIA and AMD is much larger than the actual capability gap in their hardware, and that's a shame.


Apple was a few months from bankruptcy during most of the 90s competing with IBM and Microsoft, then turned around to become the most profitable company on the planet. It takes a leader and a plan and a lot of talent and the exact right conditions, but industry behemoths get pulled down from the top spot all the time.


Apple's success is mostly UX and marketing, with a walled garden for an application tax. AMD has to actually deliver on the hardware side, not just marketing. Beyond this, AMD has demonstrated that they are indeed closing the gap and pulling ahead: AMD is well ahead of Intel on the server CPU front, and they're neck and neck on desktop, having pulled ahead at times in the past few years. And on the GPU side, they've closed a lot of gaps.

While I am a bit of a fan of AMD, there's still work to do. I think AMD really needs to take advantage of their production margins to gain more market share. They also need something a bit closer to the 4090 as a performance GPU plus entry workstation API/GPGPU workload card. The 7900 XTX is really close, but if they had something with, say, 32-48GB VRAM in the sub-$2,000 space, it would really get a lot of the hobbyist and SOHO types to consider them.


Yeah, sure, changing their platform 3 times in the space of some twenty years is just marketing and UX from Apple. They are just a bunch of MBAs. Sometimes I feel like I am reading slashdot.


The platform changes had very little to do with their success. They switched from PowerPC to Intel because PowerPC was uncompetitive, but that doesn't explain why they did any better than Dell or anyone else using the exact same chips. Then they developed their own chips because Intel was stagnant, but they barely came out before AMD had something competitive and it's not obvious they'd have been in a meaningfully different position had they just used that.

Their hardware is good but if all they were selling was Macbooks and iPhones with Windows and Android on them, they wouldn't have anything near their current margins.


You’re not really making sense.

If they hadn’t made platform changes they would have never been able to turn into what they are today. I hardly think that counts as ‘little to do’.

They would likely barely exist. They have ‘achieved product market fit’, as the saying goes, which requires more than just a sharp UI, as their history shows.


I think the point here is that their first platform shift onto x86/x86-64 was driven by how far Power had fallen behind. Even their fans were having difficulty justifying the slowness of their comparatively expensive computers.

It was more forced upon them than anything else.

The move to M1 was an actual innovation that came after their success.

The real story of the company though is the iPhone, which is absolutely their own technical innovation.


I would say the turning point was either the iPod or the rainbow Macs.


I didn't say they didn't have technical prowess... I said that wasn't the key to their overall success. iPod/Phone/Pad came out ahead of competition with better UX than what, generally, came before. They marketed consistently and did it well. They built up buzz. They paid for placement throughout TV and Movies.

They are a brand first, and a technical company second. That doesn't mean they aren't doing cool technical things. But a lot of companies do cool technical things and still fail.


NV has a huge advantage over AMD: they only do one thing. And that has helped them to relentlessly focus on optimizing that one thing. AMD is fighting on three different fronts at once.


Yeah, but try running ML projects on your AMD card and you'll quickly see that they're an afterthought nearly everywhere, even in projects that use PyTorch (which has backend support for AMD). If consumers can't use it, they're going to learn Nvidia instead, and experience has shown that people opt for the enterprise tech they're familiar with; most people get familiar by hacking on it locally.


* and Apple.


Oh wow, I think this translates PTX (nvidia's high-level assembly code) to SPIR-V? Am I reading this right? That's...a lot.

A note to any systems hackers who might try something like this, you can also retarget clang's CUDA support to SPIR-V via a LLVM-to-SPIR-V translator. I can say with confidence that this works. :)
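For the curious, the route I mean looks roughly like this; the tools (clang++, llvm-spirv from the Khronos SPIRV-LLVM-Translator) are real, but the exact flags vary by LLVM version, so treat this as a sketch rather than a recipe:

```shell
# Sketch: retarget clang's CUDA front end, then translate LLVM IR to SPIR-V.
msg=""
if command -v clang++ >/dev/null 2>&1 && command -v llvm-spirv >/dev/null 2>&1; then
    # A throwaway kernel so the sketch is self-contained.
    printf '__global__ void scale(float *v, float k) { v[threadIdx.x] *= k; }\n' > /tmp/kernel.cu
    # 1. Compile only the device side of the .cu file to LLVM bitcode.
    if clang++ -x cuda --cuda-device-only -emit-llvm -c /tmp/kernel.cu -o /tmp/kernel.bc 2>/dev/null; then
        # 2. Translate the bitcode to a SPIR-V module.
        if llvm-spirv /tmp/kernel.bc -o /tmp/kernel.spv 2>/dev/null; then
            msg="wrote /tmp/kernel.spv"
        else
            msg="translation failed (NVPTX intrinsics may need extra handling)"
        fi
    else
        msg="device-side compile failed (is a CUDA SDK installed for clang to find?)"
    fi
else
    msg="clang++ and/or llvm-spirv not installed; skipping"
fi
echo "$msg"
```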


> Is ZLUDA a drop-in replacement for CUDA? Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA

So no then.


It's a drop-in replacement in the sense that you don't need to modify your code. But it has limitations/incompatibilities. Contrast to something that isn't a drop-in replacement... That would require changes to the application.


This statement makes no sense.

"It's compatible with CUDA as long as you don't use all the features of CUDA."

So it's a drop in replacement for some subset of modern CUDA. I feel like most folks who are upvoting this don't program CUDA professionally or aren't very advanced in their usage of it.


“Drop-in” = to the extent it works, it requires no app changes.

“Complete” = it covers everything.

It is drop-in, but not complete.


The issue being that in the original question “drop-in” implies complete, that’s what being a drop-in replacement actually means in other contexts. If it’s not complete, then it’s not really drop-in, even though I don’t necessarily disagree with your definition. You can be right, and parent can be right too, IMHO. The FAQ question is stated ambiguously and in misleadingly black and white terms, and the answer really does look kinda funny starting with the word “Yes” and then following that with “but… not exactly”. Wouldn’t it be better to say drop-in is the goal, and because it’s not complete, we’re not there yet?


Does the statement make no sense or you simply don't like it? I can understand what it says, sounds like plain English to me.

I think that what you're trying to say is that before claiming to be a "drop-in replacement", make sure that your supported feature set is representative enough of mainline CUDA development.


50% of the time it works every time.


that’s not quite accurate.

it works 100% of the time. until it does not.


yeah would probably be good to include where it doesn't work. 1% of the time? 10%?


It's near the bottom:

"What is the status of the project?

This project is a Proof of Concept. About the only thing that works currently is Geekbench. It's amazingly buggy and incomplete. You should not rely on it for anything serious."


I appreciate how the name translates to "delusion", considering how "cuda" translates to "miracles" in the same language (Polish).


> Is ZLUDA a drop-in replacement for CUDA? Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA

I think this might get better if and when people redesign their dev workflows, CI/CD pipelines, builds, etc, to deploy code to both hardware platforms to ensure matching functionality and stability. I'm not going to hold my breath just yet. But it would be really great to have two viable platforms/players in this space where code can be run and behave equally.


Related question: what is the best way to handle kernel compatibility across CUDA, OpenCL, etc.?

I had to write a cross-platform kernel a few weeks ago, and I ended up using pre-processor guards to make it work with the OpenCL and CUDA compilers [1].

[1] https://github.com/RaphaelJ/libhum/blob/main/libhum/match.ke...


The answer is unconventional: run CI/CD. It's way easier to see if things will break when the code is being run-tested on each of these stacks.


Why is this trending when there hasn't been a commit since Jan 2021? There's a comment here like "it's early days"... the repo has been dormant for longer than it was active.


Intel should pour money into this project until the code is hosted in scrooge's money bin


Related:

Zluda: CUDA on Intel GPUs - https://news.ycombinator.com/item?id=26262038 - Feb 2021 (77 comments)


If a single dev could do it, why can't AMD do the same for their GPUs?


From the README:

> Is ZLUDA a drop-in replacement for CUDA?

> Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA

> What is the status of the project?

> This project is a Proof of Concept. About the only thing that works currently is Geekbench. It's amazingly buggy and incomplete. You should not rely on it for anything serious

It is a cool proof of concept but we don’t know how far away it is from becoming something that a company would willingly endorse. And I suspect AMD or Intel wouldn’t want to put a ton of effort into… helping people continue to write code in their competitor’s ecosystem.


CUDA has won though. It's not about helping people write code for your competitors; it's about allowing the most-used packages for ML to run on your hardware.


Probably for the same reason JWZ reimplemented OpenGL 1.3 on top of OpenGL ES 1.1 in three days, but the vendors can't do it:

https://www.jwz.org/blog/2012/06/i-have-ported-xscreensaver-...

https://news.ycombinator.com/item?id=4134426


It's probably a good idea to hide the referrer on links to jwz's site; he holds some fairly strong opinions about HN.

https://dereferer.me/?https%3A//www.jwz.org/blog/2012/06/i-h...


Although true, I don't think we should be trying to circumvent his block.


A better suggestion is to just not visit JWZ's site :P

Went there a few days ago. Got a colonoscopy picture. Even without a dereferrer.


I imagine they could, but it is probably more of a legal thing.


Yes it is a legal issue. AMD cannot implement CUDA.

However, they have worked around that by creating HIP, a CUDA-adjacent language that runs on AMD and also translates to CUDA for Nvidia GPUs. There is also the HIPify tool to automatically convert existing sources from CUDA to HIP. https://docs.amd.com/bundle/HIP-Programming-Guide-v5.3/page/...


Is that true? AMD implemented the x86 instruction set, Google implemented the Java APIs, what is different about CUDA?


Don't AMD and Intel have an agreement on x86?


AMD has an x86 license from Intel and Intel has a x86-64 license from AMD.

You can guess how much money lawyers have been paid over that circumstance.


I see


x86 is protected by patents and only a few companies can use them, that’s why you don’t see random companies making x86 CPUs like they can with ARM (which also needs licensing but it’s much easier to get)


The first x86-64 processor was released more than 20 years ago, so any patents on that base architecture (which includes SSE2) have already expired.


Good luck enforcing those software patents in the current environment, though. My sense is that hardware companies put more faith in the enforceability of patent claims to something like CUDA than software companies with experience litigating patent claims to APIs would. Post Oracle v. Google, something like CUDA is vulnerable to being knocked off.

And FWIW, that seems to be a reasonable result given the overall market structure at the moment. Having all eggs in the Nvidia basket is great for Nvidia shareholders, but not for customers and probably not even for the health of the surrounding industry.


genuine question, why don't these protections apply to emulators? How could emulators get away with emulating x86 but chip manufacturers cannot use x86 for their chips without a license?


Probably fair use, which is a subjective thing, but presumably one Intel is confident enough they would lose. Or just lack of incentive: there is no money for Intel to gain even if they won.


I guess so; it just never occurred to me that x86 emulators would be in a legal gray zone.


Legal question:

Let's suppose that an open-source CUDA API is in a legal gray zone that could only be clarified by a judge.

Could a company like AMD create a wholly owned subsidiary to make an attempt, without exposing the parent company to legal liability?


HIPify is such a half-baked effort, in everything from installation to benchmarks to marketing. It doesn't look like they are trying to support this method at all.


Didn't Scotus affirm that APIs are not copyrightable?


No, they ruled that APIs are copyrightable, but that Google's re-implementation was fair use. Based on the reasoning of the decision one would expect that in most cases independently reimplementing an API would generally be fair use. However, from a practical point of view, if you are defending yourself in a copyright lawsuit, fair use decisions happen much later in the process and are more subjective.

Furthermore, CUDA is a language (dialect of C/C++) not an API, so that precedent may not have much weight.


In the end, Google reimplemented Java, and the Supreme Court ruled on a very narrow piece of the reimplementation. I think it came down to a former Sun/Oracle employee at Google actually copy-pasting code from the original Java code base.

I'm reasonably sure they could reimplement CUDA from a copyright / trademark perspective. It's possible that they could be blocked with patents though.


> think it came down to a former sun/oracle employee at google actually copy pasting code from the original java code base.

IIRC, the verbatim copying of rangeCheck didn't make it to SCOTUS. They really did instead rule on the copyrightability of the "structure, sequence, and organization" of the Java API as a whole.


Then why isn't Microsoft suing Valve for Proton/DXVK?


Because they ran their numbers and realized they have more to lose by going against Valve than by amicably finding a compromise.

A more aggressive approach was tried during the Xbox 360 era with the Games For Windows Live framework and by removing their games from the Steam store. It ended up catastrophically bad and they had to backtrack on both decisions.

The irony of Proton de facto killing any chance for native Linux ports of Windows games isn't lost on them, either.


MS has been building lots of goodwill with gamers by bringing games to PC and subverting expectations by not being opposed to using game pass on Steam Deck. Suing Valve or trying to shut down Proton/DXVK would instantly burn all that.


Not sure if this is relevant, but IIUC Proton and Wine implement Windows' ABI, rather than something involving copyrighted header files.


Because these lawsuits are costly, a PR nightmare, losing them is a serious possibility, and going around fighting your competition with lawsuits can put you in a bad place with government agencies.

Playing games on Linux is not a threat to Microsoft. The money they lose on that is minuscule.


Probably because they don't care?


Patents and software licenses probably.


it's because AMD's drivers get stuck in dead loops https://youtu.be/Mr0rWJhv9jU?t=320


This looks really interesting but also early days. "this is a very incomplete proof of concept. It's probably not going to work with your application." I hope it develops into something broadly usable!


CUDA is a polyglot development environment; all these projects usually fall short by focusing only on C++.

I failed to find information regarding Fortran, Haskell, Julia, .NET, or Java support for CUDA workloads.


How good is the tool at reporting missing CUDA functionality?


Llama.cpp just added CUDA GPU acceleration yesterday, so this would be very interesting for the emerging space of running local LLMs on commodity hardware.

Running CUDA on an AMD RDNA3 APU is what I'd like to see, as it's probably the cheapest 16GB shared-VRAM solution (via the UMA Frame Buffer BIOS setting) and creates the possibility of running a 13B LLM locally on an underutilized iGPU.

Aaand it's been dead for years; shame.


- llama.cpp already has OpenCL acceleration. It has had it for some time.

- AMD already has a CUDA translation path: ROCm. It should work with llama.cpp's CUDA backend, but in practice... shrug

- The copies that CUDA/OpenCL code makes (unavoidable for discrete GPUs) are problematic for iGPUs. Right now acceleration regresses performance on iGPUs.

Llama.cpp would need tailor-made iGPU acceleration. And I'm not even sure which API has the most appropriate zero-copy mechanism. Vulkan? oneAPI? Something inside ROCm?


Apple has a way to do zero copy, since they have one memory pool for both the GPU and the CPU.

But… I don’t know if it’s possible to do for iGPUs that partition memory in BIOS. I am curious for the answer.


Are there any projects that go the opposite direction, to run CPU code on GPU? I understand that there might be limitations, like not being able to access system calls or the filesystem. What I'm mainly looking for is a way to write C-style code and have it run auto-parallelized, without having to drop into a different language or annotate my code or manually manage buffers.


I think C code is full of branches, and CPUs are designed to guess and parallelise those (possible) decisions. Graphics cards are designed for running tiny programs against millions of pixels per second. I'm not sure it's possible to make these two different concepts the same.


Beautiful! Waiting for someone to use this and get benchmarks with PyTorch now.



