AMD Expands AI Product Lineup with GPU-Only Instinct Mi300X with 192GB Memory (anandtech.com)
114 points by mfiguiere on June 13, 2023 | 104 comments


I'm genuinely rooting for AMD to develop a competitive alternative to NVIDIA. Currently, NVIDIA dominates the machine learning landscape, and there doesn't seem to be a justifiable reason for the price discrepancy between the RTX 4090 and the A100.


It's not so black and white. The A100 is far more difficult to manufacture even though it is older. The silicon is far more specialized. You pay a price for such a "fat" node and for server hardware guarantees.

At the same time, the cost is outrageous. It's not a low-volume product.

Also, a 48GB 4090 (or 3090) would be trivial. So would a 48GB 7900. It's not done, for purely anticompetitive reasons, a segmentation that AMD and Nvidia are both unfortunately happy to go along with.


A 48GB 4090 exists already, just with a different name, branding, and pricing strategy: https://www.nvidia.com/en-us/design-visualization/rtx-6000/

A 48GB 3090: https://www.nvidia.com/en-us/design-visualization/rtx-a6000/


It looks like the RTX 6000 costs around USD $10k. Quite the step up from the 4090.

https://www.bhphotovideo.com/c/product/1753962-REG/pny_vcnrt...


I'm always surprised that people bring up cost as a main factor determining price, especially here on HN. Maybe it would be on a commoditized market with many competitors, which the GPU market clearly isn't.

As any Economics 101 lecture would tell you, price is determined by supply and demand, and in this case Nvidia is essentially maximizing its profit based on the demand and segmentation of the market.


It's not that simple, as Nvidia is supply constrained and technically not a monopoly. They also have to manage a gaming market and share TSMC/Samsung supply with other markets.


> Also, a 48GB 4090 (or 3090) would be trivial. So would a 48GB 7900. It's not done, for purely anticompetitive reasons, a segmentation that AMD and Nvidia are both unfortunately happy to go along with.

Is this why Intel started taking dGPU production more seriously in recent years?


There kind of are 48 GB 4090s and 7900s; they're just under different names.


Yeah, but Quadros (and the AMD equivalents) are outrageously priced too.

Nvidia and AMD used to "let" their board partners double up the VRAM on gaming cards, but no more. It's pure, collusive, anticompetitive market segmentation.


what names?


The comma.ai dude is building a whole company to achieve just that: https://geohot.github.io/blog/jekyll/update/2023/05/24/the-t...


Nope, he got a look at the absolute state of ROCm and realized he was wasting his time.

https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...

One might speculate he's perhaps pivoting to Intel. They're not well-developed in application terms, but that's a piece he can develop, and with oneAPI that's a lot of potential bang for the buck. Intel has actual ML accelerators, has some relatively powerful GPGPU stuff, and is currently in a position of being forced to offer a lot of bang for the buck to drive adoption, all of which makes sense for what he's trying to do.

But AMD basically wants you to write and debug their runtime for them, and nah, it's not worth it. After fighting the installer on the officially supported system config and then filing a couple of bugs for demo apps reproducibly crashing the kernel, it's just not worth the time.

ROCm is unserious even when you’re operating on supported hardware. This is the experience most people have with it.



Ah yes, simply be famous enough to get the closed internal build that actually works.

You know, it doesn't surprise me that's what they're doing, because the HPC people aren't paying megabucks and then tolerating something that crashes the kernel... but people's problems/experiences aren't invalid either, and the delta is they're not running the same software. I don't like it but it makes sense.


Interesting, seems like it's very similar to the runtime portion of Modular's business case.


>a justifiable reason for the price discrepancy between the RTX 4090 and the A100

1. They're a publicly listed company operating in a free market, in an industry they themselves helped develop, whose purpose is to make returns for their investors, not to be liked by the public. They don't need to justify their pricing to their buyers. They're not selling essentials for survival like insulin, baby formula, or housing; they can charge as much as the market will bear for their consumer electronics products. Don't like the pricing? Don't buy it. Simple. Buy from the competition instead, or buy older generations off the second-hand market that fit your budget.

2. The price justification is that cutting-edge silicon is, and will always be, in short supply. Buyers of that silicon in A100 form, like datacenters, use it to make money, so it's an investment that will yield returns, and they can justify spending far more to outbid gamers. Gamers buy the same silicon in RTX 4090 form but don't use it to make money; they use it to play games, so for them the product is worth less, and it earns Nvidia smaller margins than selling it to datacenters does. It's basic price segmentation that's been going on for decades.

I also don't like the GPU pricing situation, but that's a market reality I can't change, and downvoting the messenger won't change it either. My 2 cents.


It's because you're responding as if OP said it wasn't justifiable legally. They're just saying that AMD is, in their estimation, making a bad business decision.


That was really not his point. His point was that Nvidia is price gouging, and I explained why they can get away with it, since current market conditions favor their products a lot.

AMD has been consistently making bad business decisions in the GPU space for years now. What's new? They're pricing their hardware near Nvidia's, but with less performance, far fewer features, and a poor track record, especially for ML users.


Nvidia is price gouging because there is no credible competition. That's the entire explanation and there is nothing more to it.


Why are people here continually surprised by this, and why do they keep asking the same redundant question: "Why does Nvidia charge so much?" I just don't get it.

They think Nvidia is some evil villain doing it to consumers out of spite, when it's just doing what any other company in its dominant position would do: charge as much as the market will bear.

Apple also doesn't have to charge you $200 to configure the 512GB SSD over the 256GB SSD, when the upgrade only costs them an extra $5 NAND chip, but they do it because they can. So does Nvidia, and so does any other company in a dominant position with virtually no competition.


> His point was that Nvidia is price gouging

You might be reading too much into their comment; they don't really blame Nvidia, they just wish for more competition. IMHO you're being downvoted because you come across as needlessly confrontational toward an innocent comment.


I sound confrontational because it seems like the upvoted poster didn't understand the laws of supply and demand, and neither do the people who upvoted it to the top, or they were just karma farming by asking redundant, self-explanatory questions hoping to get votes from the mob with its pitchforks out.

Maybe I'm wrong and the OP really didn't understand the current supply/demand situation, which is why I gave a lengthy explanation of how it looks.


Yeah, but the issue is that they could make more money if they actually made affordable ML hardware and didn't purposefully gimp the gaming cards.

LLMs running on affordable standalone boxes could become a ubiquitous home accessory, not to mention the potential to leverage learning from widespread use to retrain the models, like Tesla does with its fleet.


NVIDIA is not stupid; they understand their marginal revenue and marginal cost analysis, and they obviously decided they're making the most money doing what they're doing. It's asinine to believe that we, with access to almost none of the requisite data, could tell NVIDIA how to turn a better profit.


>decided they're making the most money doing what they're doing

Yes, exactly like all the big tech companies with vast market research departments that hired a shitload of people and then, when the economy turned, started doing layoffs. A.k.a. short-term profit seeking.


>Yeah, but the issue is that they could make more money if they actually made affordable ML hardware and didn't purposefully gimp the gaming cards.

It's a dick move but it's not illegal (at least not now). Their purpose is to make greater returns for their investors, not make affordable ML hardware for every consumer.

>LLMs running on affordable standalone boxes could become a ubiquitous home accessory

They could be, but why is it Nvidia's fault if they don't happen? They're in the business of making money, not fulfilling dreams, and currently they can barely fulfill their orders for datacenter customers. Consumers could move the LLM stuff to cheaper Apple hardware or Intel or AMD SoCs if they can't outbid the datacenter companies for Nvidia silicon.

Seems like a market opening for Nvidia's competitors to undercut them on price. If they don't exploit it and let Nvidia dominate, that's their fault. It's not Nvidia's fault their competitors have been asleep at the wheel since CUDA launched in 2007, while Nvidia spent over 15 years helping researchers put their GPUs to use for parallel computing.


If AMD were smart(er), and their CPUs are awesome at least, they would team up with Intel, Google, Apple, etc. to create an open CUDA alternative for machine learning, such that all GPUs could essentially be used by ML software equally. I think Nvidia's real hold on AI is that most AI libraries use CUDA, and nobody else can really compete when all the tools use X. It's as if Firefox had launched with its own different HTML syntax because HTML was proprietary, and every other browser did the same thing.

Something like the Chrome/Chromium dynamic would work wonders for bringing about a sort of standard, not to mention that more minds could create better and better implementations.


They did try: Apple made OpenCL and released it to Khronos to make a standard, but it never got wide support (well, they got plenty of manufacturers on board, just not so many users).

By that time Nvidia already had an advantage with CUDA, and their OpenCL support always lagged.


If you want to go and run inference or train one of these fresh and sweet LLMs using bitsandbytes/PEFT, which is really where the excitement is, you pretty much have to use CUDA. This is the story now, and the story for the innovation before was the same: use CUDA, or wait for AMD to catch up a year late with a worse version of everything.
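
For context, here's roughly what that bitsandbytes/PEFT recipe looks like in practice (a minimal sketch; the model id and LoRA hyperparameters are placeholders, not recommendations). The 4-bit bitsandbytes kernels are what tie this to CUDA today:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "tiiuae/falcon-7b"  # placeholder checkpoint

    # bitsandbytes 4-bit quantization config; these kernels are the CUDA-only part
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )

    # attach small trainable LoRA adapters instead of fine-tuning the full model
    lora = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["query_key_value"],  # Falcon's fused attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()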

I mean, sure, you could compile your stuff to XLA or just, I dunno, set up 800 of these cards and train the whole thing on ROCm. But then would you really, really use AMD instead of some TPUs?

AMD made machine learning either impossible, unsupported, or a chore on most of their hardware for years. Their stack sucks, no one wanted to implement against it, and in fact they rarely supported most of their own GPUs.

Yes, you have GPUs. But we also need drivers. And software. People have been yelling at AMD about this for years and years.

Instead, AMD has made it clear that machine learning is not a priority for the company. Hence, they deserve the lack of traction. Investing in AMD hardware for ML has been a mistake at every point in recent history. Imagine how dumb you'd look if you had bought a bunch of (insert last-gen card that is no longer supported by their stack).

Releasing a GPGPU card now? Honestly, why bother? No one is gonna buy it.


Can someone who works at AMD please print this out, roll it up, and smack some senior managers on the nose with it?

NVIDIA is about to walk off with a trillion dollars because nobody at AMD “gets it”.

With no meaningful competition, NVIDIA will gouge as hard as they can. Such as charging $50K for a card that's not too different from a 4090, just with more memory.


AMD took aim at HPC and shipped Frontier. The ROCm stack is quite HPC-themed because that was the driving project. Porting it to run AI models is a work in progress, but the back end is the same compiler stack that was written for graphics (largely games consoles, IIUC) and then upgraded for HPC. It'll get there.


> NVIDIA is about to walk off with a trillion dollars because nobody at AMD "gets it".

I think they "get it" OK. Whether or not they can formulate a viable strategy and execute it is one question, but they get the idea that "AI is important" and they know where they stand vis-a-vis NVIDIA.

This is another reason I'm willing to invest some time and money in working with AMD products for AI/ML. History has shown us their ability to go toe-to-toe with a seemingly unassailable industry titan before, and they came out in pretty good shape then.

https://www.forbes.com/sites/iainmartin/2023/05/31/lisa-su-s...


Been playing with Stable Diffusion on a 6750 XT 12GB for a while now with minimal issues (occasional low-memory issues at higher resolutions, but it's less of a problem using a grid upscaler).

ROCm is getting there, slowly.


Also, good news: running Olive and using the ONNX conversion to run Stable Diffusion shows something like a 120% speedup on Nvidia and a 50% speedup on AMD, from the numbers I see. Some UIs are beginning to support the ONNX format. ONNX is the model format that has Microsoft's support behind it for AI models, to my understanding.
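
If you want to try the ONNX path outside of a UI, Hugging Face Optimum wraps it fairly neatly; a rough sketch (the execution provider and model id are assumptions, and Olive's optimization passes are a separate step on top of the plain export):

    from optimum.onnxruntime import ORTStableDiffusionPipeline

    # export the PyTorch checkpoint to ONNX and run it through ONNX Runtime;
    # swap the provider for "ROCMExecutionProvider" on AMD, or leave the default
    # CPU provider if unsure
    pipe = ORTStableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        export=True,
        provider="CUDAExecutionProvider",
    )

    image = pipe("a photo of an astronaut riding a horse",
                 num_inference_steps=30).images[0]
    image.save("astronaut.png")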


There's more to machine learning and AI than just training the latest LLMs, though. Speaking for myself, I am very interested in supporting AMD and the ROCm ecosystem, whatever AMD's past sins may be. I'm building a machine now which will be based on a high-end consumer GPU from AMD. Not going for something like this as it's almost certainly way out of my budget, but perhaps in the future.

So basically, I'm betting that AMD has had (or is having) a change of heart and is genuinely committed to AI/ML on their products. Time may ultimately prove me wrong, but so be it if that proves to be the case.

And FWIW, one reason for my optimism is that, whatever you think about the state of ROCm today, they are clearly investing heavily into the platform and continually working on it. You can see that just from looking at the activity on their Github repos:

https://github.com/orgs/ROCmSoftwarePlatform/repositories

There is constant activity and has been for some time, which I take as a good sign. Yes, it's just one signal among many that one could consider, but I think it's an important one.


Yeah, but you see, we have been here before: "This time we are serious about ML/AI..." And if you had gone and bought an AMD card then, you'd have been wrong.

My example about LLMs was just to show that AMD is simply not part of the conversation. Three months before, you could have made the same point about another approach.

And still, if you actually had to risk money, you and I both know you'd never invest in AMD hardware for AI or start developing something high-stakes on it.

I mean, look at geohot: he tried and just gave up on AMD entirely.


> And still, if you actually had to risk money, you and I both know you'd never invest in AMD hardware for AI or start developing something high-stakes on it.

I assume you meant "you" in the royal sense. Since I personally am, in point of fact, investing in AMD hardware for AI. Yeah, it's a gamble, but that's my style. And I have to admit, part of it is driven by ideological reasons (ROCm being open source) and a simple desire to support AMD since I want them to become a serious competitor to NVIDIA. I believe that's an outcome that would benefit everybody.

> I mean, look at geohot: he tried and just gave up on AMD entirely.

I have to admit, I don't find that particularly compelling in any regard. Nothing against geohot, he's clearly a smart dude, but... I'm not judging a company based on his interactions with them. shrug


I wish you luck then. Right now, you need CUDA to run things like bf16 or quantization, which are truly enablers in this space.

So for now, I don't see AMD getting any traction. And apparently, the quality of the drivers has not been improving. Time will tell.


Agreed. And FWIW, I'm not saying I use AMD kit exclusively. I'm fine with using NVIDIA hardware and CUDA when needed. In addition to the AMD based box I'm building for the lab at home, I'm also building a sister box which will be nearly identical, except using an NVIDIA GPU. I am an ideologue but I'm also at least a little bit pragmatic. :-)


Speaking for yourself, what kind of workloads have you planned on running on ROCm when your machine is built?


Tough to say, as my interests are pretty spread around. Definitely some fairly vanilla neural networks in different domains, as a lot of my interest is in multi-modal AI. So I'll be working on combining vision, audio, language and "other" where "other" corresponds to things that aren't senses we humans have but which my "AI bot" does - GPS, temperature, altitude, barometric pressure, etc.

Beyond that, it's just going to be wherever the research takes me. And not everything is necessarily going to be particularly suited for running on ROCm. I'll use CUDA on NVIDIA hardware as and when needed, and I'm also open to dabbling in low-level hardware stuff and playing around with DSPs, analog computing, FPGAs, etc. One thing I want to work with a bit is some spiking neural network stuff, neuromorphic approaches, etc. So not sure where this will end up.

Heck, it's possible that some of the stuff I work with will turn out to be well suited for running on a plain old CPU, which is one reason I spec'd a relatively high-end Ryzen CPU and 64GB of system RAM for these machines, which otherwise wouldn't necessarily be all that important for "pure" GPU computation.

Maybe I'll even wind up going "old school" and building myself a Beowulf cluster! So I can finally answer the age old question about "imagine a Beowulf cluster of these..."


This is a naive question, but how hard would it be for AMD to make their cards/firmware CUDA compatible? Feels like that's what they would need to do to sell hardware in the space, other than banking on sufficiently severe shortages.


CUDA runs in the layer above firmware. Compiling CUDA to the amdgpu ISA could be done, but might invite lawsuits.

There's a language called HIP, which is a fairly close approximation of CUDA. You can probably convert one to the other with determination and regex. The GPUs themselves are fundamentally different in ways that hopefully don't matter to your application (warp synchronisation is the big one in my opinion, but I suspect CUDA applications ignore it and just live with the race conditions).


The issue with AMD and AI is, as always, the software stack. Even if the hardware is great, ROCm simply doesn't have industry traction or accessibility.


Doesn't have the traction for now. Cloud providers (MS, Google, Amazon, etc.) are quickly tiring of paying Nvidia monopoly premiums for their GPU hardware. Google has already invested in TPUs, and it wouldn't surprise me at all if they got together to fund ROCm development or even went so far as to develop their own NN ASICs.

CUDA is great, but it's not strictly necessary for much of the latest AI/ML development.


ROCm is so terrible that the cloud providers rolled out their own chips rather than use AMD, which has perfectly good GPUs and the worst software stack ever.


ROCm still doesn't support consumer GPUs; that means people building random things (as opposed to more serious work things) won't be using their stack, so none of the innovation will be there.

It may be possible to use it with consumer GPUs anyway, but many won't try because it's not officially supported.

https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... https://developer.nvidia.com/cuda-gpus


Intel, AMD, Google, Amazon, etc. should team up to create some sort of standards consortium around an open-source CUDA alternative, something that anyone who can fabricate chips could use. The consortium could have its own team of devs and researchers to make improvements and next-gen versions of the alternative.

Something like the Chrome/Chromium arrangement, or even a foundation like the Linux Foundation, where you have multiple distros contributing packages, etc. back into the ecosystem.



They are already doing that with XLA; Google has TPUs, and Amazon has Trainium/Inferentia. A common interface where, in the future, you basically just cast the model with `.toDevice` to an enum of accelerated computing types seems to be the goal.
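
In PyTorch terms the pattern already mostly looks like that today; here's a small sketch of the device-agnostic idiom (the torch_xla import is optional and only there to illustrate the TPU/XLA side; Trainium has its own XLA-based plugin):

    import torch

    def pick_device() -> torch.device:
        # ROCm builds of PyTorch also report as "cuda", so AMD GPUs land here too
        if torch.cuda.is_available():
            return torch.device("cuda")
        try:
            # TPUs are exposed through torch_xla's XLA device
            import torch_xla.core.xla_model as xm
            return xm.xla_device()
        except ImportError:
            return torch.device("cpu")

    device = pick_device()
    model = torch.nn.Linear(16, 4).to(device)
    x = torch.randn(2, 16, device=device)
    print(model(x).shape)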


> Cloud providers (MS, Google, Amazon, etc.) are quickly tiring of paying Nvidia monopoly premiums for their GPU hardware.

I think cloud providers love the exclusivity (Nvidia's MSRP is significantly higher than the price available to clouds), and based on pricing compared to competitors like Lambda Labs, they have the highest profit margins on GPU instances. Also, based on availability, they likely have the highest utilisation. They definitely wouldn't want to commoditize the space. Google already has TPUs that it could scale and sell to everyone, but doing so would make the margins significantly smaller.


One thing AMD could do is work with ggml to get llama.cpp running on AMD GPUs. Compiling a modern ML framework is quite complex due to the number of ops. However, running LLMs does not need a lot of ops: just einsum, relu, and softmax. Getting LLMs working with llama.cpp could be done by a team within a week or so.
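
To illustrate how small that core op set is, here's a toy numpy sketch of one decoder block built from nothing but einsum, a relu-style activation, and softmax (llama.cpp's real kernels are quantized C/C++ and use SiLU, RMSNorm, RoPE, etc., so treat this purely as an illustration of the op count):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def relu(x):
        return np.maximum(x, 0.0)

    def block(x, wq, wk, wv, wo, w1, w2):
        # single-head self-attention: three projections, one softmax
        q = np.einsum("td,dk->tk", x, wq)
        k = np.einsum("td,dk->tk", x, wk)
        v = np.einsum("td,dk->tk", x, wv)
        scores = softmax(np.einsum("qk,tk->qt", q, k) / np.sqrt(q.shape[-1]))
        x = x + np.einsum("tk,kd->td", np.einsum("qt,tk->qk", scores, v), wo)
        # feed-forward: two matmuls and an activation
        return x + np.einsum("th,hd->td", relu(np.einsum("td,dh->th", x, w1)), w2)

    d, h, t = 64, 256, 8
    rng = np.random.default_rng(0)
    ws = [rng.standard_normal(s) * 0.02 for s in [(d, d)] * 4 + [(d, h), (h, d)]]
    print(block(rng.standard_normal((t, d)), *ws).shape)  # (8, 64)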


I agree that AMD should be dedicating an engineer to making sure all the top popular ML/AI projects can run on their hardware, but maybe they also need to spin up a wiki listing compatibility...

AMD GPU acceleration via CLBlast was merged into llama.cpp master back in mid-May. It works and gives a boost (although not all AMD GPUs have been tuned for CLBlast; this is something AMD should be doing, tbh: https://github.com/RadeonOpenCompute/ROCm/issues/2161)

There is also a hipBLAS fork, which is slightly faster (~10% on my old Radeon VII), and which someone at AMD should maybe be helping make its way into master: https://github.com/ggerganov/llama.cpp/pull/1087

I'll also note that exllama merged ROCm support and it runs pretty impressively: it runs 2x faster than the hipBLAS llama.cpp, and in fact, on exllama my old Radeon VII manages to run inference >50% faster than my old (roughly equal class/FP32 perf) GTX 1080 Ti, for a relatively easy port (GCN5 has 2x fp16 and 2x the memory bandwidth, so there's probably even more headroom). That's really impressive: https://github.com/turboderp/exllama/pull/7

(It's worth noting that for the latter, all you need to do is install the ROCm version of PyTorch and "it just works," which is refreshing: https://pytorch.org/get-started/locally/)
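
For anyone wanting to sanity-check that setup, the ROCm wheel presents the GPU to PyTorch as a "cuda" device, so the usual checks work unchanged; a quick sketch (the exact wheel index URL changes per ROCm release, so grab it from the pytorch.org selector):

    # e.g. something like: pip install torch --index-url https://download.pytorch.org/whl/rocm5.4.2
    import torch

    print(torch.__version__)          # ROCm builds carry a "+rocm..." suffix
    print(torch.version.hip)          # None on CUDA/CPU builds, a version string on ROCm
    print(torch.cuda.is_available())  # True on ROCm too; HIP is mapped onto the "cuda" device
    print(torch.cuda.get_device_name(0))

    x = torch.randn(1024, 1024, device="cuda")
    print((x @ x).sum().item())       # runs on the AMD GPU through ROCm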


> An engineer

More like a dozen engineers over a couple of weeks. Make sure popular LLMs can run on their hardware, as well as Stable Diffusion and other popular projects, and they will see consumers flock to their hardware.

By consumers I mean something like what "gamers" have been over all these past decades (and still are): those who won't be using it for business cases (what Quadro used to be for) but for their hobby; the same thing, but oriented towards AI, which currently means LLMs, image generation, and ASR.

If they focus on this, the community will start helping, maybe even cleaning up the mess of repositories they have on GitHub.

There are two benefits over Nvidia: they have more VRAM and an open-source software stack.

They just need to get the basics working for all those hobbyists, those who want to run little projects on their home hardware.


The MI300A looks really nice, if only anyone could buy one and build a DIY "Mac Studio"-like Linux machine. Imagine just swapping it for the next generation in 2 or 3 years and keeping all the rest: peripherals, case, motherboard, etc.


And the x86/GPU version would be an awesome workstation. Just add M.2 and ethernet (and USB-C with video) and we are in business.

As for a Hackintosh, I'd imagine an Nvidia Grace or Grace Hopper as a good option, even though the 500+W TDP would require a MacPro-sized heatsink. And it can have up to 960GB of RAM.

edit: got a lot of details confused between the MI300 and the Grace.


Honestly, I find the x86+GPU parts more interesting. One could make a very Apple-like PC with a SoM and very little glue around it (a bit like what you can do with some top-of-the-line Xeons that have HBM chiplets). And with between 1 and 6 x86 chiplets, the MI300 could have between 8 and 96 Zen 4 cores.

These things are so interesting it's a shame they aren't cheap.


AMD and Intel are both making quad channel APUs akin to the M2 Pro.

https://videocardz.com/newz/intel-arrow-lake-p-with-320eu-gp...

(Sorry, I cannot find the AMD rumor link atm)

But TBH the hybrid design is less interesting than you think, just because nothing really takes advantage of it. Hence Intel canceled their datacenter APU in favor of a pure Falcon Shores GPU due to a lack of interest from customers.


AMD Strix Halo is the future APU with a 256-bit wide interface. It boggles my mind that, during a multi-year GPU shortage, AMD didn't bring a wider memory interface to iGPUs. Obviously they can do it, since the PS5 and Xbox Series X have been shipping for some time.

Looks like someone already ported llama to Apple's Metal 3 and is getting 5 tok/s on a 65B model.


AMD should have re-used the Xbox APU, but other than that it makes no economic sense?

The tape-out cost would be huge, and the die would be huge. Either the mobo/socket would be super expensive and niche, or consumers would be pissed about non-expandable RAM.

Laptop OEMs didn't even want the Steam Deck chip or Broadwell with eDRAM back then, much less a big expensive APU.


AMD has the pieces: they ship an Xbox version, a PS5 version, and the sTRX4 socket for Threadripper, and for any of their chiplet products the IOD would be the only changed piece of silicon. Any updates could be amortized over their pending Threadripper upgrades (4-channel), Siena products (6-channel), or Threadripper Pro (8-channel).

Much like how Apple sells 128-bit wide (Mini and MBA), 256-bit wide (M1/M2 Pro), 512-bit wide (M1/M2 Max), and 1024-bit wide (M1/M2 Ultra) parts.

I think the desktop/laptop vendors would have jumped at a nice iGPU/APU when they couldn't get normal GPUs.

Obviously AMD agrees: they shipped the PS5/Xbox Series X and have a 256-bit wide APU planned for 2024. Just way later than I had hoped.


The big reason why parallel memory buses aren’t wider is pin count. If you have all the memory interfaces over the substrate, the bus can be as wide as the substrate can accommodate.

BTW, what is the usual width of memory buses in discrete GPUs?


It boggles my mind that Apple can do 128-bit, 256-bit, and 512-bit wide memory interfaces on thin-and-light laptops that are price competitive with PC laptops in the same segment, while having excellent battery life.

In the previous generation even a relatively low-end card like the 3060 Ti has a 256-bit wide memory interface. In the current generation the 4070 (which is higher in the product stack) is 192 bits. The RTX 3080 (previous gen) is 320 bits wide; the 4080 (current gen) is 256 bits.

Generally the trend is more cache, less width, and less bandwidth. Which is great ... for things that are cache friendly, but not everything is.


> just because nothing really takes advantage of it.

This is precisely why AMD, Intel, and Nvidia should think about making workstation-class machines with the lowest end of these - because until more people have one to play with, there won't be much to do with them.


There were rumors of an AMD one... That also never materialized.

It's a chicken-and-egg problem, I think. A big APU is so expensive that it doesn't really make sense without a very specific workload (like in a console), and the workloads don't really appear without the APUs.


An AMD Ryzen APU + RAM soldered onto the motherboard is very similar to what Apple's doing.

AFAIK, because the memory controller is part of the CPU, the CPU-RAM connection on the consumer chips is entirely passive - just copper traces on the motherboard with no ICs in between?


IIRC the APUs aren't exactly typical APUs with shared memory, but more a CPU with a few GPU cores attached, operating independently memory-wise (but I might be wrong).


The ones that have been shipping for a few years have one block of DDR that GPU and CPU cores both access. You can do things like synchronise code running in a thread on the CPU with a wavefront running in a kernel on the GPU using atomic operations on that shared memory.


AMD ROCm vs Nvidia CUDA has been discussed to death, but I'm curious how AMD fares compared to some of the AI training accelerator vendors. I think it would be much more damning if AMD were worse than some upstart, because the upstart wouldn't have the huge resource advantage and decade long head start of Nvidia. From my limited experience it seems like Google TPU and Cerebras are much nicer to use for AI training, from the standpoint of driver and software stability, documentation, and ecosystem support.

Perhaps that's not a fair comparison. From what I know AMD and NVIDIA use GPGPU cores (now with AI-focused instructions) plus separate AI-specific accelerator blocks. Conceptually, GPGPU + NPU on one die. NPUs can be much simpler than general-purpose GPUs. So AMD's driver and software stack likely needs to be an order of magnitude more complex than the NPU vendors' in order to accommodate other non-AI use cases. But to an end user it doesn't really matter why it sucks, only that it does.


...Not precisely. AMD implements the "AI" matrix instructions in the shaders themselves, not as big separate blocks like Nvidia. But unlike Nvidia, the instructions are different on the consumer (RDNA3) and server lines. I don't know anything about building ROCm, but supporting RDNA must indeed make things more difficult.

Intel takes this approach too.


Bit of a noob on these things here, but why 192GB? Why not, say, 256GB? I noticed the new Mac Pro also caps out at 192GB; is there a specific reason for that number?


In this case it's because memory scaling has slowed down, so the manufacturers have introduced sizes that are halfway between powers of two. It would simply take too long to wait for 2x before making a new product line.

So the biggest HBM stack you can get is 24GB, and 8 of them (8 × 24GB = 192GB) is a reasonable max.


The largest DRAM dies are 24 gigabit.


The MI300 has 8x 24GB stacks of memory, which is physically the absolute max they can do. And those 24GB stacks are brand new.

I believe the Apple M series is limited by package space? The LPDDR5(X?) bus is not physically/electrically limited to 192GB like the MI300.


The article says that HBM3 memory comes in 24GB stacks. (Not 32GB.)


Is it possible to do inference with Falcon 40B on this type of hardware or similar?


They showed Falcon 40B running on the MI300X.


Hm. Do you have a link to that?

It would be nice if this became an option for something like RunPod or Modal. Especially if it could be slightly cheaper than the Nvidia hardware somehow.



The Falcon appears at t=5033: https://youtu.be/l3pe_qx95E0?t=5033


Yes, but it would be massive overkill. Falcon 40B takes ~35GB of VRAM to load now, and will probably take less in the future with better quantization from llama.cpp and such.

https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ

Large context size is becoming less of an issue now too.

But maybe it would be good for batched inference?


I was thinking maybe RunPod or Modal could use it to handle multiple containers per GPU, possibly including some that are much lighter, like Stable Diffusion.


If it's multiple LLM client requests, batched inference would be massively more efficient.

And for multiple services... Probably just better to run multiple cheap instances and/or load dynamically? The MI300 is super expensive.


How expensive is it compared to A100?


That is an excellent question.

In theory it's more expensive to produce than an H100, but in practice... shrug.

But I was thinking 3090 or lesser tesla/Quadro instances would be sufficient.


I mean the 35GB version of Falcon is maybe not something you'd want to use in production

Also ironically, this version of Falcon will require CUDA.


It might work on ROCm? I am not sure about the status of GPTQ on ROCm.


GPTQ for LLaMA with ROCm works via https://github.com/turboderp/exllama/ but Falcon inference is a different beast.


I hope these new Instinct GPUs are less of a paper launch than previous Instinct GPUs. Historically they’ve been hard for consumers to buy.


Does anyone know what type of API you'd use for this? I know AMD has ROCm for their dedicated GPUs, but it's barely usable from what I've heard.


This tools page makes me think they are sticking with ROCm compilers and OpenCL. https://docs.amd.com/category/compilers_and_tools They have a tool called "hipify" that supposedly converts CUDA code to something else. But the fact it's not just integrated as a CUDA compiler makes it look like they don't really trust it to work. Also they misspelled it.


HIP itself is a C++ dialect that can be used directly. It's extremely similar to CUDA. I think the use case AMD desires is the reverse: that software is written in HIP and then converted to CUDA to also run on Nvidia GPUs.


ROCm just doesn't have the support these days. Supposedly people are working on getting stable diffusion working on it. https://www.videogames.ai/2022/11/06/Stable-Diffusion-AMD-GP...

But it's just too much of an investment for me in something that MAY work. I ended up just buying an RTX 4080.


Stable Diffusion works fine on ROCm (and Intel OpenVINO); the issue is out-of-the-box support in popular UIs.
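
For what it's worth, the non-UI path is just stock diffusers on the ROCm build of PyTorch (which exposes the GPU as "cuda"); a minimal sketch, assuming the usual SD 1.5 checkpoint:

    import torch
    from diffusers import StableDiffusionPipeline

    # on ROCm, HIP devices are mapped to "cuda", so this is identical to the Nvidia version
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe("a watercolor painting of a red panda",
                 num_inference_steps=30).images[0]
    image.save("panda.png")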

TBH the whole space is kind of a mess. Tons of optimizations (like most ML compilers), even on Nvidia cards, are left on the table because the SD UI devs just don't have the throughput or motivation to implement them.

At the other end, hardware makers, ML compiler devs, researchers, and such are making quick demos, but are not making any integration attempts for popular frameworks.

There is no one in the middle, so we are stuck with PyTorch eager mode and a perception that it only works on big Nvidia GPUs.


Stable Diffusion worked fine for me with ROCm and my RX 580 (once I had compiled a custom torch library, IIRC).

But I don’t know whether it works with the more recent RDNA2 cards.


It works without a hitch on my RX 6900 XT. The only pain was getting the AMDGPU-PRO drivers.


Yeah, that's kind of what I mean. I've heard not to even consider AMD for ML applications, specifically ROCm, so I'm curious whether these chips will use ROCm as their primary API or not.

It "technically" works, but their own examples crash, you get a fraction of the performance you'd expect for the level of hardware you have, chicken&egg problem with little other software having good support of it, etc.


I'm running Stable Diffusion on my 7900 XTX and it's working fine. I had to screw around a little to get the newest ROCm and torch libraries, since they aren't packaged on my OS, but it wasn't that bad. I made a Docker image if anybody is struggling to get it working: https://hub.docker.com/r/delusional/sd-rx7900xtx


Out of curiosity, how many it/s do you get with DPM2 at 512x512 with a batch size of 1, and then the it/s for whatever the max batch size you can fit?


It does 15-16 it/s with Euler a, and 2.16 it/s at a batch size of 8 (the max in AUTOMATIC1111), and that's only using 15GiB of VRAM.


To be honest, if I were building a CUDA alternative from scratch, I'd build it for Apple Silicon. Apple will have a monopoly on TSMC's state of the art.


AMD's relations with TSMC are definitely much tighter than Nvidia's. AMD has been helping develop HPC-optimized nodes and modern packaging techniques (3D stacking and chiplets), while Nvidia tried to force TSMC to drop prices by going to Samsung fabs for 1.5 generations (which did not go too well for them).

The problem with AMD is definitely not modern node access, but software, and with some investment it can probably change really fast.

Apple, on the other hand, is definitely not going to release standalone Apple Silicon products (GPUs or CPUs), so I don't really believe in Apple Silicon as an AI platform (apart from on-device inference).


Maybe they have something in the oven?

I am skeptical too, but hardware design takes a looong time, and surely Apple sees the writing on the wall and the potential of their own hardware.


The issue is Apple has zero presence in datacenters, and they don't seem very interested in changing that.

And they are optimizing more for power efficiency in multimedia workloads than for raw AI throughput at any cost, like the AI ASIC makers do.



