The problem is that performance achievements on AMD consumer-grade GPUs (RX 7900 XTX) are not representative of, or transferable to, datacenter-grade GPUs (MI300X). Consumer GPUs are based on the RDNA architecture, while datacenter GPUs are based on the CDNA architecture, and AMD is only expected to release the unifying UDNA architecture sometime around 2026 [1]. At CentML we are currently working on integrating AMD CDNA and HIP support into our Hidet deep learning compiler [2], which will also power inference workloads for all Nvidia GPUs, AMD GPUs, Google TPU and AWS Inf2 chips on our platform [3].
The problem is that the specs of AMD consumer-grade GPUs do not translate into compute performance when you try to chain more than one together.
I have 7 NVidia 4090s under my desk happily chugging along on week long training runs. I once managed to get a Radeon VII to run for six hours without shitting itself.
The Radeon VII is special compared to most older (and current) affordable GPUs in that it used HBM, giving it memory bandwidth comparable to modern cards (~1 TB/s), and it has reasonable FP64 throughput (1:4) instead of (1:64). So this card can still be pretty interesting for running memory-bandwidth-intensive FP64 workloads. Anything affordable released afterward by either AMD or Nvidia crippled realistic FP64 throughput to below what an AVX-512 many-core CPU can do.
On the other hand, for double precision a Radeon Pro VII is many times faster than an RTX 4090 (due to a 1:2 vs. 1:64 FP64:FP32 ratio).
Moreover, for workloads limited by memory bandwidth, a Radeon Pro VII and an RTX 4090 will run at about the same speed, regardless of what kind of computations are performed. Memory-bandwidth-limited workloads are said to be common in ML/AI inference.
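For a rough sense of scale, here is a back-of-the-envelope comparison; the peak FP32 figures, ratios and bandwidths below are approximate spec-sheet values I'm assuming, not measurements:

    # Rough FP64 and bandwidth comparison from (assumed) published peak numbers.
    cards = {
        # name:            (peak FP32 TFLOPS, FP64:FP32 ratio, memory BW GB/s)
        "Radeon VII":      (13.4, 1 / 4,  1024),
        "Radeon Pro VII":  (13.1, 1 / 2,  1024),
        "RTX 4090":        (82.6, 1 / 64, 1008),
    }
    for name, (fp32, ratio, bw) in cards.items():
        print(f"{name:15s} FP64 ~{fp32 * ratio:5.2f} TFLOPS, mem BW ~{bw} GB/s")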
You might find the journey of Tinycorp's Tinybox interesting. It's a machine with 6 to 8 4090 GPUs, and you should be able to track down a lot of their hardware choices, including pictures, on their Twitter, plus more info in George Hotz's livestreams.
It looks like AMD's CDNA GPUs are supported by Mesa, which ought to suffice for Vulkan Compute and SYCL support. So there should be ways to run ML workloads on the hardware without going through HIP/ROCm.
I have come across quite a few startups who are trying a similar idea: break the Nvidia monopoly by utilizing AMD GPUs (for inference at least): Felafax, Lamini, TensorWave (partially), SlashML. I even saw optimistic claims from some of them that the CUDA moat is only 18 months deep [1]. Let's see.
AMD GPUs are becoming a serious contender for LLM inference. vLLM is already showing impressive performance on AMD [1], even with consumer-grade Radeon cards (which even support GGUF) [2]. This could be a game-changer for folks who want to run LLMs without shelling out for expensive NVIDIA hardware.
Peculiar business model, at a glance. It seems like they're doing work that AMD ought to be doing, and is probably doing behind the scenes. Who is the customer for a third-party GPU driver shim?
I've recently been poking around with Intel oneAPI and IPEX-LLM. While there are things that I find refreshing (like their ability to actually respond to bug reports in a timely manner, or at all), on the whole the support/maturity actually doesn't match the current state of ROCm.
PyTorch requires its own support kit separate from the oneAPI Toolkit (and runs slightly different versions of everything), the vLLM xpu support doesn't work - both the source build and the docker failed to build/run for me. The IPEX-LLM whisper support is completely borked, etc, etc.
I've recently been trying to get IPEX working as well, apparently picking Ubuntu 24.04 was a mistake, because while things compile, everything fails at runtime. I've tried native, docker, different oneAPI versions, threw away a solid week of afternoons for nothing.
SYCL with llama.cpp is great though, at least at FP16 (since it supports nothing else); even Arc iGPUs easily give 2-4x the performance of CPU inference.
Intel should've just contributed to SYCL instead of trying to make their own thing and then forgetting to keep maintaining it halfway through.
My testing has been w/ a Lunar Lake Core 258V chip (Xe2 - Arc 140V) on Arch Linux. It sounds like you've tried a lot of things already, but in case it helps, my notes for installing llama.cpp and PyTorch: https://llm-tracker.info/howto/Intel-GPUs
I have some benchmarks as well, and the IPEX-LLM backend performed a fair bit better than the SYCL llama.cpp backend for me (almost +50% pp512 and almost 2X tg128) so worth getting it working if you plan on using llama.cpp much on an Intel system. SYCL still performs significantly better than Vulkan and CPU backends, though.
As an end-user, I agree that it'd be way better if they could just contribute upstream somehow (whether to the SYCL backend, or if not possible, to a dependency-minimized IPEX backend). The IPEX backend is one of the more maintained parts of IPEX-LLM, btw. I found a lot of stuff in that repo that depends on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
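In case it saves someone a debugging session: a minimal sanity-check sketch for confirming the Intel GPU is visible from PyTorch at all, assuming a PyTorch build with XPU support (roughly 2.5+, or one paired with intel_extension_for_pytorch); on older stacks the torch.xpu namespace may simply not exist:

    import torch

    # Assumes an XPU-enabled PyTorch build; the matmul is just a smoke test.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        print("XPU device:", torch.xpu.get_device_name(0))
        x = torch.randn(4096, 4096, device="xpu", dtype=torch.float16)
        y = x @ x
        torch.xpu.synchronize()
        print("matmul ok:", y.shape)
    else:
        print("No XPU visible - check the driver / oneAPI runtime install.")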
Well that's funny, I think we already spoke on Reddit. I'm the guy who was testing the 125H recently. I guess there's like 5 of us who have intel hardware in total and we keep running into each other :P
Honestly I think there's just something seriously broken with the way IPEX expects the GPU driver to be on 24.04 and there's nothing I can really do about it except wait for them to fix it if I want to keep using this OS.
I am vaguely considering adding another drive and installing 22.04 or 20.04 with the exact kernel they want, to see if that might finally work in the meantime, but honestly I'm fairly satisfied with the speed I get from SYCL already. The problem is more that it's annoying to integrate it directly through the server endpoint; every project expects a damn ollama API or llama-cpp-python these days, and I'm a fan of neither, since it's just another layer of headaches to get those compiled with SYCL.
> I found a lot of stuff in that repo that depends on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
Yeah, well, the fact that oneAPI 2025 got released, broke IPEX, and they still haven't figured out a way to patch it months later makes me think it's total chaos internally, where teams work against each other instead of talking and coordinating.
More cynical take: this would be a bad strategy, because Intel hasn't shown much competence in its leadership for a long time, especially in regards to GPUs.
The B580 being a "success" is purely a business decision: it's a loss leader to get their name into the market. A larger die on a newer node than either Nvidia or AMD means their per-unit costs are higher, and they are selling it at a lower price.
That's not a long-term success strategy. Maybe good for getting your name in the conversation, but not sustainable.
It’s a long term strategy to release a hardware platform with minimal margins in the beginning to attract software support needed for long term viability.
I was reading this whole thread as about technical accomplishment and non-nvidia GPU capabilities, not business. So I think you're talking about different definitions of "Success". Definitely counts, but not what I was reading.
Yeah but MLID says they are losing money on every one and have been winding down the internal development resources. That doesn't bode well for the future.
I want to believe he's wrong, but on the parts of his show where I am in a position to verify, he generally checks out. Whatever the opposite of Gell-Mann Amnesia is, he's got it going for him.
MLID on Intel is starting to become the same as UserBenchmark on AMD (except for the generally reputable sources)... he's beginning to sound like he simply wants Intel to fail, to my insider-info-lacking ears. For competition's sake I really hope that MLID has it wrong (at least the opining about the imminent failure of Intel's GPU division), and that the B series will encourage Intel to push farther to spark more competition in the GPU space.
The die size of the B580 is 272 mm2, which is a lot of silicon for $249. The performance of the GPU is good for its price but bad for its die size. Manufacturing cost is closely tied to die size.
272 mm2 puts the B580 in the same league as the Radeon 7700XT, a $449 card, and the GeForce 4070 Super, which is $599. The idea that Intel is selling these cards at a loss sounds reasonable to me.
At a loss seems a bit overly dramatic. I'd guess Nvidia sells SKUs for three times their marginal cost. Intel is probably operating at cost without any hopes of recouping R&D with the current SKUs, but that's reasonable for an aspiring competitor.
Though you assume the prices of the competition are reasonable. There are plenty of reasons for them not to be. Availability issues, lack of competition, other more lucrative avenues etc.
Intel has neither, or at least not as much of them.
Wait, are they losing money on every one in the sense that they haven't broken even on research and development yet? Or in the sense that they cost more to manufacture than they're sold at? Because one is much worse than the other.
Interesting. I wonder if focusing on GPUs and CPUs is something that requires two companies instead of one, whether the concentration of resources just leads to one arm of your company being much better than the other.
> Tinygrad was another one, but they ended up getting frustrated with AMD and semi-pivoted to Nvidia.
From their announcement on 2024-12-19[^0]:
"We are the only company to get AMD on MLPerf, and we have a completely custom driver that's 50x simpler than the stock one. A bit shocked by how little AMD cared, but we'll take the trillions instead of them."
From 2024-12-11[^1]:
"We gave up and soon tinygrad will depend on 0 AMD code except what's required by code signing.
We did this for the 7900XTX (tinybox red). If AMD was thinking strategically, they'd be begging us to take some free MI300s to add support for it."
Is there no hope for AMD anymore? After George Hotz/Tinygrad gave up on AMD I feel there’s no realistic chance of using their chips to break the CUDA dominance.
Maybe from Modular (the company Chris Lattner is working for). In this recent announcement they said they had achieved competitive ML performance… on NVIDIA GPUs, but with their own custom stack completely replacing CUDA. And they’re targeting AMD next.
tl;dr there's a not-insubstantial number of people who learn a lot from geohot. I'd say about 3% of people here would be confused if you treated him as anything less than a top technical expert across many comp sci fields.
And he did the geohot thing recently, way tl;dr: acted like there was a scandal being covered up by AMD around drivers that was causing them to "lose" to nVidia.
He then framed AMD not engaging with him on this topic as further covering-up and choosing to lose.
So if you're of a certain set of experiences, you see an anodyne quote from the CEO that would have been utterly unsurprising dating back to when ATI was still a company, and you read it as the CEO breezily admitting in public that geohot was right: that there was malfeasance, followed by a cover-up, implying extreme dereliction of duty, because she either helped or didn't realize until now.
I'd argue this is partially due to the stonk-ification of discussions; there was a vague, yet often communicated, sense that something illegal was happening. The idea was that it was a financial dereliction of duty to shareholders.
IMO the hope shouldn't be that AMD specifically wins, rather it's best for consumers that hardware becomes commoditized and prices come down.
And that's what's happening, slowly anyway. Google, Apple and Amazon all have their own AI chips, Intel has Gaudi, AMD had their thing, and the software is at least working on more than just Nvidia. Which is a win. Even if it's not perfect. I'm personally hoping that everyone piles in on a standard like SYCL.
In CPUs, AMD has made many innovations that were copied by Intel only after many years, and this delay contributed significantly to Intel's downfall.
The most important has been that AMD correctly predicted that big monolithic CPUs would no longer be feasible in future CMOS fabrication technologies, so they designed the Zen family from the beginning with a chiplet-based architecture. Intel attempted to ridicule them, but after losing many billions Intel has been forced to copy this strategy.
Also, in the microarchitecture of their CPUs, AMD has made the right choices from the beginning and has improved it constantly with each generation. The result is that the latest Intel big core, Lion Cove, now has a microarchitecture that is much more similar to AMD's Zen 5 than to any of the previous Intel cores, because they had to do this to get a competitive core.
In the distant past, AMD also introduced a lot of innovations long before they were copied by Intel. It is true that those had not been invented by AMD but were themselves copied from more expensive CPUs, like the DEC Alpha or Cray machines, but Intel still copied them only after being forced to by competition with AMD.
Everything is comparative. AMD isn't perfect. As an ex-shareholder, I have argued they did well partly because of Intel's downfall. In terms of execution they are far from perfect.
But Nvidia is a different beast. It is a bit like Apple in the late 00s: take business, forecasting, marketing, operations, software, hardware, sales, etc., and any part of it you pick is industry leading. And having industry-leading capability is only part of the game; having it all work together is another thing entirely. And unlike Apple, which lost direction once Steve Jobs passed away and wasn't sure how to deploy capital, Jensen is still here, and they have more resources now, making Nvidia even more competitive.
It is often the case that people underestimate the magnitude of the task required (I like to tell the story of an Intel GPU engineer in 2016 arguing they could take dGPU market share by 2020, and we are now in 2025), overestimate the capability of an organisation, and underestimate the rival's speed of innovation and execution. These three things combined are why most people are often off in their estimates by an order of magnitude.
We are in the middle of a monopoly squeeze by NVidia on the most innovative part of the economy right now. I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
By comparison if AMD could write a driver that didn't shit itself when it had to multiply more than two matrices in a row they'd be selling cards faster than they can make them. You don't need to sell the best shovels in a gold rush to make mountains of money, but you can't sell teaspoons as premium shovels and expect people to come back.
They... do have a monopoly on foundry capacity, especially if you're looking at the most advanced nodes? Nobody's going to Intel or Samsung to build 3nm processors. Hell, there have been whispers over the past month that even Samsung might start outsourcing Exynos to TSMC; Intel already did that with Lunar Lake.
Having a monopoly doesn't mean that you are engaging in anticompetitive behavior, just that you are the only real option in town.
What effect did the DOJ have on MS in the 90s? Didn't all of that get rolled back before they had to pay a dime, and all it amounted to was that browser choice screen that was around for a while? Hardly a crippling blow. If anything that showed the weakness of regulators in fights against big tech, just outlast them and you're fine.
>I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
It sounds like you're expecting extreme competence from the DOJ. Given their history with regulating big tech companies, and even worse, the incoming administration, I think this is a very unrealistic expectation.
Also, I'd rate HN as an amazing platform for the overall consistency and quality of moderation. Anything beyond that depends more on who you're talking to than where you're talking.
Everyone who's dug deep into what AMD is doing has left in disgust if they are lucky, and in bankruptcy if they are not.
If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
> If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
This seems like unuseful advice if you've already given up on them.
You tried it and at some point in the past it wasn't ready. But by not being ready they're losing money, so they have a direct incentive to fix it. Which would take a certain amount of time, but once you've given up you no longer know if they've done it yet or not, at which point your advice would be stale.
Meanwhile the people who attempt it apparently seem to get acquired by Nvidia, for some strange reason. Which implies it should be a worthwhile thing to do. If they've fixed it by now which you wouldn't know if you've stopped looking, or they fix it in the near future, you have a competitive advantage because you have access to lower cost GPUs than your rivals. If not, but you've demonstrated a serious attempt to fix it for everyone yourself, Nvidia comes to you with a sack full of money to make sure you don't finish, and then you get a sack full of money. That's win/win, so rather than nobody doing it, it seems like everybody should be doing it.
I've seen people try it every six months for two decades now.
At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.
I'm deeply worried about stagnation in the CPU space now that they are top dog and Intel is dead in the water.
Here's hoping China and RISC-V save us.
>Meanwhile the people who attempt it apparently seem to get acquired by Nvidia
Everyone I've seen base jumping has gotten a sponsorship from Red Bull; ergo, everyone should base jump.
Have you tried compute shaders instead of that weird HPC-only stuff?
Compute shaders are widely used by millions of gamers every day. GPU vendors have a huge incentive to make them reliable and efficient: modern game engines use them for lots of things, e.g. UE5 can even render triangle meshes with GPU compute instead of the graphics pipeline (the tech is called Nanite virtualized geometry). In practice they work fine on all GPUs, ML included: https://github.com/Const-me/Cgml
I'd be very concerned if somebody makes a $100K decision based on a comment where the author couldn't even differentiate between the words "constitutionally" and "institutionally", while providing as much substance as any other random techbro on any random forum and being overwhelmingly oblivious to it.
I have been playing around with Phi-4 Q6 on my 7950x and 7900XT (with HSA_OVERRIDE_GFX_VERSION). It's bloody fast, even with CPU alone - in practical terms it beats hosted models due to the roundtrip time. Obviously perf is more important if you're hosting this stuff, but we've definitely reached AMD usability at home.
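A minimal sketch of that kind of setup, assuming a ROCm build of PyTorch; the "11.0.0" override value is the commonly used one for RDNA3 cards the runtime doesn't recognize (officially supported cards don't need it), so treat it as an assumption to adjust for your GPU:

    import os
    # Must be set before the ROCm runtime initializes, i.e. before importing torch.
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")

    import torch

    # PyTorch's ROCm backend reuses the torch.cuda API.
    print("ROCm/HIP build:", torch.version.hip)
    print("GPU visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))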
It’s not terribly hard to port ML inference to alternative GPU APIs. I did it for D3D11 and the performance is pretty good too: https://github.com/Const-me/Cgml
The only catch is, for some reason developers of ML libraries like PyTorch aren't interested in open GPU APIs like D3D or Vulkan. Instead, they focus on proprietary ones, i.e. CUDA and, to a lesser extent, ROCm. I don't know why that is.
D3D-based videogames have been heavily using GPU compute for more than a decade now. Since Valve shipped the Steam Deck, the same applies to Vulkan on Linux. By now, both technologies are stable, reliable and performant.
Isn't part of it because the first-party libraries like cuDNN are only available through CUDA? Nvidia has poured a ton of effort into tuning those libraries so it's hard to justify not using them.
Unlike training, ML inference is almost always bound by memory bandwidth as opposed to computations. For this reason, tensor cores, cuDNN, and other advanced shenanigans make very little sense for the use case.
OTOH, general-purpose compute instead of fixed-function blocks used by cuDNN enables custom compression algorithms for these weights which does help, by saving memory bandwidth. For example, I did custom 5 bits/weight quantization which works on all GPUs, no hardware support necessary, just simple HLSL codes: https://github.com/Const-me/Cgml?tab=readme-ov-file#bcml1-co...
Only local (read: batch size 1) ML inference is memory bound; production loads are pretty much compute bound. The prefill phase is very compute bound, and with continuous batching the generation phase gets mixed with prefill, which makes the whole process compute bound as well. So no, tensor cores and all the other shenanigans are absolutely critical for a performant inference infrastructure.
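A rough roofline-style estimate illustrates both sides of this argument; the model size, quantization, bandwidth and FLOPS numbers below are illustrative assumptions, not measurements:

    # Assumed: 7B params at ~4.5 bits/weight, ~1 TB/s memory BW, ~80 TFLOPS FP16.
    params = 7e9
    weight_bytes = params * 4.5 / 8
    bw, flops = 1.0e12, 80e12

    # Decode at batch size 1: every token streams all weights once (~2 FLOPs/param).
    print("decode, bandwidth limit:", round(bw / weight_bytes), "t/s")
    print("decode, compute limit:  ", round(flops / (2 * params)), "t/s")
    # -> bandwidth is the binding limit at batch 1.

    # Prefill of 2048 tokens: weights are reused across tokens, so compute dominates.
    n = 2048
    print("prefill compute time: %.2fs" % (2 * params * n / flops))
    print("prefill weight-streaming time: %.3fs" % (weight_bytes / bw))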
PyTorch is a project of the Linux Foundation. The foundation's about page states its mission with phrases like “empowering generations of open source innovators”, “democratize code”, and “removing barriers to adoption”.
I would argue running local inference with batch size=1 is more useful for empowering innovators compared to running production loads on shared servers owned by companies. Local inference increases count of potential innovators by orders of magnitude.
BTW, in the long run it may also benefit these companies because in theory, an easy migration path from CUDA puts a downward pressure on nVidia’s prices.
Most people running local inference do so through quants with llama.cpp (which runs on everything) or AWQ/EXL2/MLX with vLLM/TabbyAPI/LM Studio, which are much faster than using PyTorch directly.
llama.cpp has a much bigger supported-model list, as does vLLM, and of course PyTorch/HF Transformers covers everything else; all of these work w/ ROCm on RDNA3 w/o too much fuss these days.
For inference, the biggest caveat is that Flash Attention is only available as an aotriton implementation, which besides sometimes being less performant, also doesn't support SWA. For CDNA there is a better CK-based version of FA, but CK does not have RDNA support. There are a couple of people at AMD apparently working on native FlexAttention, so I guess we'll see how that turns out.
(Note the recent SemiAccurate piece was on training, which I'd agree is in a much worse state - I have personal experience with it being often broken for even the simplest distributed training runs. Funnily enough, if you're running simple fine-tunes on a single RDNA3 card, you'll probably have a better time. OOTB, a 7900 XTX will train at about the same speed as an RTX 3090; 4090s blow both of those away, but you'll probably want more cards and VRAM, or just move to H100s.)
Great. I have yet to understand why the ML community doesn't really push to move away from CUDA. To me, it feels like a dinosaur move to build on top of CUDA, which screams proprietary: nothing about it is open source or cross platform.
The reason why I say it's a dinosaur move is: imagine we as a dev community had continued to build on top of Flash or Microsoft Silverlight...
LLMs and ML have been around for quite a while; with the pace of AI/LLM advancement, the transition to cross-platform should have happened much more quickly. But it hasn't yet, and I'm not sure when it will.
Building a translation layer on top of CUDA is not the answer to this problem either.
For me personally, hacking together projects as a hobbyist, two reasons:
1. It just works. When I tried to build things on Intel Arcs, I spent way more hours bikeshedding IPEX and driver issues than developing.
2. LLMs seem to have more CUDA code in their training data. I can leverage Claude and 4o to help me build things with CUDA, but trying to get them to help me do the same things on IPEX just doesn't work.
I'd very much love a translation layer for Cuda, like a dxvk or wine equivalent.
It would save a lot of money, since Arc GPUs are in the bargain bin and Nvidia cloud servers are double the price of AMD.
As it stands now, my dual Intel Arc rig is now just a llama.cpp inference server for the family to use.
Except I never hear complaints about CUDA from a quality perspective. The complaints are always about lock in to the best GPUs on the market. The desire to shift away is to make cheaper hardware with inferior software quality more usable. Flash was an abomination, CUDA is not.
Flash was popular because it was an attractive platform for the developer. Back then there was no HTML5 and browsers didn't otherwise support a lot of the things Flash did. Flash Player was an abomination, it was crashy and full of security vulnerabilities, but that was a problem for the user rather than the developer and it was the developer choosing what to use to make the site.
This is pretty much exactly what happens with CUDA. Developers like it but then the users have to use expensive hardware with proprietary drivers/firmware, which is the relevant abomination. But users have some ability to influence developers, so as soon as we get the GPU equivalent of HTML5, what happens?
There are far more people running llama.cpp, various image generators, etc. than there are people developing that code.
We're also likely to see a stronger swing away from "do inference in the cloud" because of the aligned incentives of "companies don't want to pay for all that hardware and electricity" and "users have privacy concerns" such that companies doing inference on the local device will have both lower costs and a feature they can advertise over the competition.
What this is waiting for is hardware in the hands of users that can actually do this at a mass-market price, but there is no shortage of companies wanting a piece of that. In particular, Apple is going to push this hard, and despite the price they do a lot of volume; then you're going to start seeing more PCs with high-VRAM GPUs or iGPUs with dedicated GDDR/HBM on the package as their competitors want feature parity for the thing everybody is talking about. The cost of that isn't actually high, e.g. 40GB of GDDR6 is less than $100.
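As a rough illustration of why memory capacity is the gating spec for local inference, a back-of-the-envelope weight-size estimate (parameter counts and bits-per-weight are example values; KV cache and activations come on top):

    # Weights-only VRAM estimate: params * bits / 8.
    def weight_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8

    for params in (8, 32, 70):
        for bits in (16, 8, 4.5):
            print(f"{params:>3}B @ {bits:>4} bpw ~ {weight_gb(params, bits):6.1f} GB")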
The CUDA situation is definitely better. The Nvidia struggles are now with the higher-level software they're pushing (Triton, TensorRT-LLM, Riva, etc), tools that are the most performant option when they work, but a garbage developer experience when you step outside the golden path.
I want to double down on this statement, and call attention to the competitive nature of it. Specifically, I have recently tried to set up Triton on Arm hardware. One might presume Nvidia would give attention to an architecture they develop for, but the way forward is not easy. For some versions of Ubuntu you might have the correct version of Python (usually older than what's packaged), but the current LTS is out of luck for guidance or packages.
Modular claims that it achieves 93% GPU utilization on AMD GPUs [1], official preview release coming early next year, we'll see. I must say I'm bullish because of feedback I've seen people give about the performance on Nvidia GPUs
Reality check for anyone considering this: I just got a used 3090 for $900 last month. It works great.
I would not recommend buying one for $600, it probably either won’t arrive or will be broken. Someone will reply saying they got one for $600 and it works, that doesn’t mean it will happen if you do it.
I’d say the market is realistically $900-1100, maybe $800 if you know the person or can watch the card running first.
All that said, this advice will expire in a month or two when the 5090 comes out.
I've bought 5 used and they're all perfect. But that's what buyer protection on ebay is for. Had to send back an Epyc mobo with bent pins and ebay handled it fine.
I believe these efforts are very important. If we want this stuff to be practical we are going to have to work on efficiency. Price efficiency is good. Power and compute efficiency would be better.
I have been playing with llama.cpp to run inference on conventional CPUs. No conclusions, but it's interesting. I need to look at llamafile next.
Just an FYI, this is writeup from August 2023 and a lot has changed (for the better!) for RDNA3 AI/ML support.
That being said, I did some very recent inference testing on a W7900 (using the same testing methodology used by Embedded LLM's recent post, to compare with vLLM's recently added Radeon GGUF support [1]) and MLC continues to perform quite well. On Llama 3.1 8B, MLC's q4f16_1 (4.21GB weights) performed about 35% faster than llama.cpp w/ Q4_K_M on their ROCm/HIP backend (4.30GB weights, a 2% size difference).
That makes MLC still the generally fastest standalone inference engine for RDNA3 by a country mile. However, you have much less flexibility with quants and by and large have to compile your own for every model, so llama.cpp is probably still more flexible for general use. Also llama.cpp's (recently added to llama-server) speculative decoding can also give some pretty sizable performance gains. Using a 70B Q4_K_M + 1B Q8_0 draft model improves output token throughput by 59% on the same ShareGPT testing. I've also been running tests with Qwen2.5-Coder and using a 0.5-3B draft model for speculative decoding gives even bigger gains on average (depends highly on acceptance rate).
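How big that gain is tracks the usual expected-tokens formula from the speculative decoding papers; a small sketch with acceptance rates and draft lengths picked purely for illustration (real speedup also depends on the draft model's own cost and the acceptance rate you actually measure):

    # Expected tokens emitted per target-model verification pass,
    # with draft length k and per-token acceptance rate a.
    def expected_tokens(k, a):
        return (1 - a ** (k + 1)) / (1 - a)

    for a in (0.6, 0.75, 0.9):
        for k in (2, 4, 8):
            print(f"accept={a:.2f} k={k}: ~{expected_tokens(k, a):.2f} tokens/pass")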
Note, I think for local use, vLLM GGUF is still not suitable at all. When testing w/ a 70B Q4_K_M model (only 40GB), loading, engine warmup, and graph compilation took on avg 40 minutes. llama.cpp takes 7-8s to load the same model.
At this point for RDNA3, basically everything I need works/runs for my use cases (primarily LLM development and local inferencing), but almost always slower than an RTX 3090/A6000 Ampere (a new 24GB 7900 XTX is $850 atm; used or refurbished 24GB RTX 3090s are in the same ballpark, about $800 atm; a new 48GB W7900 goes for $3600 while a 48GB A6000 (Ampere) goes for $4600). The efficiency difference can be sizable. Eg, on my standard llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of 168 t/s while the 7900 XTX only gets 118 t/s, even though both have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also worth noting that since the beginning of the year, the llama.cpp CUDA implementation has gotten almost 25% faster, while the ROCm version's performance has stayed static.
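To put those tg128 numbers in bandwidth terms, a quick estimate, assuming a ~3.9 GB llama2-7b Q4_0 weight file that has to be streamed once per generated token (the file size is an approximation):

    weights_gb = 3.9
    cards = {"RTX 3090": (168, 936.2), "7900 XTX": (118, 960.0)}  # (t/s, peak GB/s)

    for name, (tps, peak) in cards.items():
        eff = tps * weights_gb
        print(f"{name}: ~{eff:.0f} GB/s effective ({eff / peak:.0%} of peak)")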
There is an actively maintained (solo dev) fork of llama.cpp that sticks close to HEAD but basically applies a rocWMMA patch that can improve performance if you use llama.cpp's FA (which still performs worse than with FA disabled) and in certain long-context inference generations (on llama-bench and with this ShareGPT serving test you won't see much difference): https://github.com/hjc4869/llama.cpp - The fact that no one from AMD has shown any interest in helping improve llama.cpp performance (despite often citing llama.cpp-based apps in marketing/blog posts, etc) is disappointing... but sadly on brand for AMD GPUs.
Anyway, for those interested in more information and testing for AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc with lots of details here: https://llm-tracker.info/howto/AMD-GPUs
Intriguing. I thought AMD GPUs didn't have tensor cores (or matrix multiplication units) like NVidia's. I believe they only have dot product / fused multiply-accumulate instructions.
Are these LLMs just absurdly memory bound so it doesn't matter?
They absolutely do have cores similar to tensor cores; they're called matrix cores. And they have particular instructions to utilize them (MFMA).
Note I'm talking about DC compute chips, like MI300.
LLMs aren't memory bound in production loads; they are pretty much compute bound, at least in the prefill phase, but in practice in general as well.
They don't, but GPUs were designed for matrix multiplication even without special hardware instructions for matrix-multiplication tiles. Also, the forward pass of a transformer is memory bound, and that is what does token generation.
> If RAM is the main bottleneck then CPUs should be on the table
That's certainly not the case. The graphics memory model is very different from the CPU memory model. Graphics memory is explicitly designed for many simultaneous reads (spread across several different buses) at the cost of generality (only portions of memory may be available on each bus) and speed (the extra complexity means individual reads are slower). This makes them fast at doing simple operations on a large amount of data.
CPU memory only has one bus, so only a single read can happen at a time (a cache line read), but can happen relatively quickly. So CPUs are better for workloads with high memory locality and frequent reuse of memory locations (as is common in procedural programs).
Gigabytes per second? What is this, bandwidth for ants?
My years old pleb tier non-HBM GPU has more than 4 times the bandwidth you would get from a PCIe Gen 7 x16 link, which doesn't even officially exist yet.
Yes, CXL will soon benefit from PCIe Gen 7 x16, with an expected 64GB/s in 2025, and non-HBM bandwidth I/O alternatives are improving rapidly. For most near-real-time LLM inference it will be feasible. For the majority of SME companies and other DIY users (humans or ants) with their localized LLMs, it should not be an issue [1],[2]. In addition, new techniques for more efficient LLMs are being discovered that reduce memory consumption [3].
No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7x16 = 256GB/s), which is 4x less than the memory bandwidth on my 2 year old pleb GPU (1TB/s), which is 10x less than a state of the art professional GPU (10TB/s), which is what the cloud services will be using.
That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
Nothing says it can't be useful. My most-used model is running in a microcontroller. Just keep those expectations tempered.
(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)
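To make those ratios concrete, here is the batch-1 decode ceiling at each bandwidth tier, assuming a model whose weights occupy ~20 GB (an arbitrary example size) and must be streamed once per token:

    model_gb = 20
    tiers = {
        "CXL / PCIe Gen7 x4 (~64 GB/s)":        64,
        "PCIe Gen7 x16 (~256 GB/s)":           256,
        "consumer GPU GDDR/HBM (~1 TB/s)":    1000,
        "datacenter HBM (~10 TB/s)":         10000,
    }
    for name, gbps in tiers.items():
        print(f"{name:35s} <= {gbps / model_gb:6.1f} tokens/s")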
Bandwidth between where the LLM is stored and where your matrix*vector multiplies are done is the important figure for inference. You want to measure this in terabytes per second, not gigabytes per second.
A 7900XTX also has 1TB/s on paper, but you'll need awkward workarounds every time you want to do something (see: article) and half of your workloads will stop dead with driver crashes and you need to decide if that's worth $500 to you.
Stacking 3090s is the move if you want to pinch pennies. They have 24GB of memory and 936GB/s of bandwidth each, so almost as good as the 4090, but they're as cheap as the 7900XTX with none of the problems. They aren't as good for gaming or training workloads, but for local inference 3090 is king.
It's not a coincidence that the article lists the same 3 cards. These are the 3 cards you should decide between for local LLM, and these are the 3 cards a true competitor should aim to exceed.
I got a "gaming" PC for LLM inference with an RTX 3060. I could have gotten more VRAM for my buck with AMD, but didn't, because at the time a lot of inference needed CUDA.
As soon as AMD is as good as Nvidia for inference, I'll switch over.
But I've read on here that their hardware engineers aren't even given enough hardware to test with...
[1] https://www.jonpeddie.com/news/amd-to-integrate-cdna-and-rdn.... [2] https://centml.ai/hidet/ [3] https://centml.ai/platform/
I have 6 Radeon Pro VII under my desk (in a single system BTW), and they run hard for weeks until I choose to reboot e.g. for Linux kernel updates.
I bought them "new old stock" for $300 apiece. So that's $1800 for all six.
(I realize it will be significantly lower; I'm just trying to get as much of a comparison as possible.)
There are a bunch of similar setups, and a couple dozen people have done something similar on /r/LocalLLaMA.
[1] https://www.linkedin.com/feed/update/urn:li:activity:7275885...
[1] https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html [2] https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...
https://x.com/nisten/status/1871325538335486049
One of the benefits of being Intel.
Is this something anyone sets out to do?
We just onboarded a customer to move from openai API to on-prem solution, currently evaluating MI300x for inference.
Email me at my profile email.
[1] https://youtube.com/watch?v=dNrTrx42DGQ&t=3218
---
[^0]: https://x.com/__tinygrad__/status/1869620002015572023
[^1]: https://x.com/__tinygrad__/status/1866889544299319606
https://www.modular.com/blog/introducing-max-24-6-a-gpu-nati...
But that is irrelevant to the conversation because this is not about Mojo but something they call MAX. [1]
1. https://www.modular.com/max
I assume the part where she said there's "gaps in the software stack", because that's the only part that's attributed to her.
But I must be wrong because that hasn't been in dispute or in the news in a decade, it's not a geohot discovery from last year.
Hell, I remember a subargument of a subargument re: this being an issue a decade ago in macOS dev (tl;dr: whether to invest in OpenCL).
AMD is constitutionally incapable of shipping anything but mid range hardware that requires no innovation.
The only reason why they are doing so well in CPUs right now is that Intel has basically destroyed itself without any outside help.
I am not sure which part of Nvidia is a monopoly. That is like suggesting TSMC has a monopoly.
Maybe the US will do something if GPU price becomes the limit instead of the supply of chips and power.
Personally, I think that's when somebody who has no real information to contribute doesn't try to pretend that they do.
So thanks for the offer, but I think I'm already delivering on that realm.
Ignore the red smears around the parking lot.
Ignoring the very old (in ML time) date of the article...
What's the catch? People are still struggling with this a year later so I have to assume it doesn't work as well as claimed.
I'm guessing this is buggy in practice and only works for the HF models they chose to test with?
What do you mean by that? People trying to run their own models are not “the users” they are a tiny insignificant niche segment.
https://github.com/triton-lang/triton/issues/4978
1.https://www.modular.com/max
Making AMD GPUs competitive for LLM inference https://news.ycombinator.com/item?id=37066522 (August 9, 2023 — 354 points, 132 comments)
For now, it is as if AMD does not exist in this field for me.
[1] https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...
https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-is...
And AMD has passable Raytracing units (NVidias are better but the difference is bigger than these LLM results).
If RAM is the main bottleneck then CPUs should be on the table.
If people are paying $15,000 or more per GPU, then I can choose $15,000 CPUs like EPYC that have 12-channels or dual-socket 24-channel RAM.
Even desktop CPUs are dual-channel at a minimum, and arguably DDR5 is closer to 2 or 4 buses per channel.
Now yes, GPU RAM can be faster, but guess what?
https://www.tomshardware.com/pc-components/cpus/amd-crafts-c...
GPUs are about extremely parallel performance, above and beyond what traditional single-threaded (or limited-SIMD) CPUs can do.
But if you're waiting on RAM anyway? Then the compute method doesn't matter. It's all about RAM.
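For a rough comparison of the numbers being thrown around here: theoretical peak DRAM bandwidth is just channels x transfer rate x 8 bytes per 64-bit channel; the DDR5 speeds below are assumptions for illustration:

    def ddr_bw_gbs(channels, mts):
        return channels * mts * 8 / 1000  # GB/s

    configs = {
        "desktop, 2ch DDR5-5600":   ddr_bw_gbs(2, 5600),    # ~90 GB/s
        "EPYC, 12ch DDR5-4800":     ddr_bw_gbs(12, 4800),   # ~461 GB/s
        "dual-socket EPYC, 24ch":   ddr_bw_gbs(24, 4800),   # ~922 GB/s
        "GDDR6X/HBM-class GPU":     1000.0,
    }
    for name, bw in configs.items():
        print(f"{name:28s} ~{bw:6.0f} GB/s")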
Though the distinction between the two categories is blurring.
1) Compute Express Link (CXL):
https://en.wikipedia.org/wiki/Compute_Express_Link
PCIe vs. CXL for Memory and Storage:
https://news.ycombinator.com/item?id=38125885
[1] Forget ChatGPT: why researchers now run small AIs on their laptops:
https://news.ycombinator.com/item?id=41609393
[2] Welcome to LLMflation – LLM inference cost is going down fast:
https://a16z.com/llmflation-llm-inference-cost/
[3] New LLM optimization technique slashes memory costs up to 75%:
https://news.ycombinator.com/item?id=42411409
So you have a full terabyte per second of bandwidth? What GPU is that?
(The 64GB/s number is an x4 link. If you meant you have over four times that, then it sounds like CXL would be pretty competitive.)
Later consumer GPUs have regressed and only RTX 4090 offers the same memory bandwidth in the current NVIDIA generation.
During inference? Definitely. Training is another story.
Btw, this is from MLC-LLM which makes WebLLM and other good stuff.