Natanox

Natanox@discuss.tchncs.de · 9 hours ago

Depends on which GPU you compare it with, which model you use, what kind of RAM it has to work with, etc. NPUs are purpose-built chips after all. Unfortunately the whole tech is still very young, so we’ll have to wait for stuff like ollama to introduce native support before an apples-to-apples comparison is possible. The raw numbers do look promising, however.

Natanox@discuss.tchncs.de · edited 13 hours ago

Maybe take a look at systems with the newer AMD SoCs first. They use the system’s RAM and come with a proper NPU; once ollama or mistral.rs support those, they might give you sufficient performance for your needs at way lower cost (incl. power consumption). Depending on how NPU support gets implemented, it might even become possible to use NPU and GPU in tandem, which would probably let pretty powerful models run on consumer-grade hardware at reasonable speed.

Natanox@discuss.tchncs.de · 7 days ago

They would each run at x8 speed. That shouldn’t be much of a bottleneck though; I don’t expect performance to drop by noticeably more than 5% because of it. Annoying, but getting a CPU + board with 32 lanes or more would throw off the price/performance ratio.
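
For a rough feel of what x8 vs. x16 means in raw numbers, here is a minimal sketch. It assumes PCIe 3.0 or 4.0 links (the generation isn't stated above) and uses the per-lane transfer rates and 128b/130b encoding from those specs. The function names are made up for illustration, and the result is a theoretical per-direction ceiling, not a measured figure and not where the 5% estimate comes from.

```python
# Theoretical PCIe bandwidth per direction (protocol overhead ignored).
# Assumption: PCIe 3.0 or 4.0; both use 128b/130b line encoding.
GT_PER_S = {3: 8.0, 4: 16.0}   # giga-transfers per second, per lane
ENCODING = 128 / 130           # usable fraction of the raw bit rate


def pcie_gbytes_per_s(gen: int, lanes: int) -> float:
    """Approximate link bandwidth in GB/s for one direction."""
    return GT_PER_S[gen] * ENCODING * lanes / 8  # 8 bits per byte


def transfer_seconds(size_gb: float, gen: int, lanes: int) -> float:
    """Idealized time to push size_gb of data over the link."""
    return size_gb / pcie_gbytes_per_s(gen, lanes)


if __name__ == "__main__":
    for lanes in (8, 16):
        bw = pcie_gbytes_per_s(3, lanes)
        t = transfer_seconds(16, 3, lanes)
        print(f"PCIe 3.0 x{lanes}: {bw:.2f} GB/s, 16 GB in {t:.2f} s")
```

In this idealized picture, going from x16 to x8 halves the link bandwidth, so filling a 16 GB card takes about twice as long. How much that matters during inference depends on how much data actually crosses the bus per token, which this sketch does not model.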

Natanox@discuss.tchncs.de · 8 days ago

I’m currently looking for this as well. As far as my research has gotten so far, I’ll probably go for 2x AMD Instinct MI50. Each of them has equivalent to slightly higher performance than a P40, but usually only 16 GB VRAM (if you’re super lucky you might get one with 32 GB, though those are usually not labeled as such; probably binned MI60s). With two of them you get 32 GB VRAM and quite the performance for, right now, 200€ per card. Alternatively, you should be able to run quantized models on a single card as well.

If you don’t mind running ROCm instead of CUDA, this seems like good bang for the buck. Alternatively, you might look into AMD’s new line of “AI” SoCs (for example, Framework’s Desktop computer). They seem to be really good as well, and depending on your use case might be more useful than an equally priced 4090.
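
To sanity-check whether a model fits on one 16 GB card or needs both (32 GB), a back-of-the-envelope estimate like the one below can help. It's only a sketch: it counts the weights as parameters times bits per weight, and the 2 GB allowance for KV cache and runtime buffers is a placeholder assumption, not a measured value. Function names are hypothetical.

```python
# Rough VRAM estimate for model weights at a given quantization level.
# Assumption: memory ~= parameters * bits_per_weight / 8, plus a fixed
# overhead for KV cache and buffers (2 GB here is a placeholder guess).

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights in GB (decimal, 1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus the assumed overhead fit into vram_gb."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for params, bits in [(13, 4), (32, 4), (32, 16)]:
        size = weight_gb(params, bits)
        print(f"{params}B @ {bits}-bit: ~{size:.1f} GB weights, "
              f"one 16 GB card: {fits(params, bits, 16)}, "
              f"two cards (32 GB): {fits(params, bits, 32)}")
```

Real quantization formats carry some extra metadata per block and the KV cache grows with context length, so treat the result as a lower bound rather than a guarantee.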