a 128GB framework desktop could do that job. it’s increased a bit in price since i last looked at it but €4500 isn’t that much for a company.
Maybe to serve an aggressively quantized model to one very patient user.
i’m running moderately quantized models on 24GB VRAM and getting like 30-40 tokens a second. add a zero to the price and it’s still not a lot for a company.
Yes. It will probably work for 1-2 users at peak.
Sure, but you’re running a very small model compared to what we are talking about.
GLM-5.1 is over 200GB even when quantized to 1-bit. Kimi K2.6 is even bigger. A Framework Desktop cannot run either of these. Qwen3.6 is significantly smaller and the model weights could fit, but consider the KV-cache you’d need for all of the company’s users, and the throughput required to serve them all.
You’re right that it is within reach for a company, but a Framework Desktop makes zero sense for this.
isn’t qwen like 40-50GB? that could work i think. performance is okay even quantised down to 10.
And then add 200k context on top
And then add hundreds of users needing to do things in parallel
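To put rough numbers on the context and parallel-user point, here is a minimal sketch of the usual KV-cache estimate for a model with standard (grouped-query) attention. The layer count, KV head count and head size below are made-up placeholders, not values from any Qwen, GLM or Kimi model card, and models with hybrid or linear attention layers would need less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache for ONE sequence: a key and a value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value

# Hypothetical architecture (assumption, for illustration only), fp16 cache.
per_user_bytes = kv_cache_bytes(
    n_layers=64, n_kv_heads=8, head_dim=128, context_tokens=200_000
)
per_user_gib = per_user_bytes / 2**30

# Worst case: every user holds a full 200k-token context at the same time.
users = 100
total_gib = per_user_gib * users

print(f"per user: {per_user_gib:.1f} GiB, {users} users: {total_gib:.0f} GiB")
```

In practice not every user sits at full context at once, and cache quantization or paging reduces the total, so treat this as an upper bound under the stated assumptions rather than a sizing recommendation.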
If it’s a large enough company to have hundreds of users, it can afford several beefy machines tbh
It’s a capex, and that type of hardware needs to be replaced every 3 years minimum, and you need people to set it up and maintain a cluster. And it’s not straightforward.
You are never going to get that approved without a serious business case.
Claude on the other hand is an opex and much easier to just try out and then build a solution on it
Not saying it doesn’t happen but it’s not as easy as people make it sound like
nobody said anything about it being a large company :P
anyway, seems the framework is hampered by a slow gpu so the memory issues are apparently moot.
Qwen3.6 27b beats Claude Opus 4.5 in most benchmarks. Qwen3.6 35b beats Opus 4.5 in a few specific benchmarks, but most benchmarks have Opus 4.5 beating Qwen3.6 35b, although there is not a big gap between Opus 4.5 and Qwen3.6 27b or 35b either way.
https://github.com/QwenLM/Qwen3.6#benchmarks
we were talking about 3.6.
deepseek distilled is an alternative that works on more modest hardware.
and i’m not really interested in what claude and chatgpt, mistral and the others are doing, i would never touch those models with a ten foot pole. if i can’t run it it does not get run.
At Q8 it is around 35-40GB I think + memory for required context.
I have a Framework desktop. It gets you around 6t/s. Not suitable for professional use but for personal use I think it is fine. I do prefer Gemma 4 though, but that comes with similar requirements.
huh, i thought that ryzen ai thing would perform better than that. my 7900xtx regularly gets 30+tps with qwen, up to hundreds with more compressed models.
My system runs at 100W TDP though. That is maybe 140W at the power outlet, incl. monitor and everything.
This is also the dense 27B model at Q8. But yeah, it is not terribly fast. I think the best use case is on MoE models. GPT-OSS-120B runs on it for example and at 50T/s speed is not an issue anymore either. (I could get it to run even on just 64GB but the new llama.cpp might need a tiny bit more memory which pushed it just across the limit. yeah I know, for seriously using it you’d need the 128GB version)
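A rough way to see why the dense model crawls while the MoE one flies: single-user decoding is mostly limited by how many bytes of weights have to be read from memory per token. A minimal sketch of that ceiling follows. The 256 GB/s bandwidth, the 5B active parameters for the MoE model and the bits-per-weight figures are my assumptions for illustration, not measurements from this thread.

```python
def active_weight_gb(active_params_billion, bits_per_weight):
    """Approximate GB of weights read per generated token."""
    return active_params_billion * bits_per_weight / 8

def decode_ceiling_tps(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound on tokens/s if decoding were purely memory-bandwidth bound."""
    return bandwidth_gb_s / active_weight_gb(active_params_billion, bits_per_weight)

BANDWIDTH_GB_S = 256  # assumed memory bandwidth of the machine

# Dense 27B at roughly 8.5 bits per weight (8-bit weights plus per-block scales).
dense_tps = decode_ceiling_tps(BANDWIDTH_GB_S, 27, 8.5)

# MoE model with an ASSUMED ~5B active parameters at roughly 4.25 bits per weight.
moe_tps = decode_ceiling_tps(BANDWIDTH_GB_S, 5, 4.25)

print(f"dense ceiling: {dense_tps:.1f} t/s, MoE ceiling: {moe_tps:.1f} t/s")
```

Real speeds land below these ceilings because of KV-cache reads, compute and overhead, which is at least consistent with the roughly 6 t/s and 50 t/s reported above.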
that’s fair, i’m at like 7x the power. the gpu alone easily pulls 350-400W and the rest of the system isn’t exactly running lean either.
…man now i really want more vram.
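For what it's worth, the power numbers in this exchange can be turned into a tokens-per-watt-hour comparison. This is only a sketch using the figures quoted above (about 6 t/s at about 140W, and 30 t/s at "like 7x the power", which I am reading as 7 x 140W). The two machines are not running the same model or quantization, so it is not an apples-to-apples benchmark.

```python
def tokens_per_wh(tokens_per_second, watts):
    """Tokens generated per watt-hour of energy drawn at the wall."""
    return tokens_per_second * 3600 / watts

# Figures as quoted in the thread; the 7x multiplier is a rough estimate by the poster.
framework = tokens_per_wh(tokens_per_second=6, watts=140)
gpu_rig = tokens_per_wh(tokens_per_second=30, watts=7 * 140)

print(f"framework: {framework:.0f} tokens/Wh, gpu rig: {gpu_rig:.0f} tokens/Wh")
```

On these numbers the low-power box comes out a bit ahead per unit of energy even though it is about five times slower, but that could easily flip with a different model or a faster quantization on the GPU.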