Qwen3.5 122B Achieves 25 tok/s on 72GB VRAM Setup

r/LocalLLaMA community

A significant milestone for local LLM practitioners: Qwen3.5 122B can run efficiently on consumer-grade hardware when distributed across multiple GPUs. Users report roughly 25 tokens per second with the full model and context held in VRAM; the key breakthrough was proper configuration to avoid the infinite "but wait" reasoning loops that plagued early deployments.
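The post does not name the inference runtime, so as an illustration only, a llama.cpp `llama-server` launch spreading a quantized GGUF across three GPUs might look like the sketch below. The model filename, split ratios, and parameter values are all hypothetical, not taken from the source.

```shell
# Hypothetical llama.cpp launch; filename and values are illustrative.
# --n-gpu-layers 999 offloads every layer to GPU; --tensor-split spreads
# the weights evenly across the three cards; --ctx-size keeps the full
# context in VRAM; a mild --repeat-penalty is one common way to curb
# repetition loops in local deployments.
llama-server \
  -m qwen3.5-122b-q4_k_m.gguf \
  --n-gpu-layers 999 \
  --tensor-split 1,1,1 \
  --ctx-size 32768 \
  --repeat-penalty 1.1
```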

This performance metric matters because it proves that state-of-the-art 120B+ parameter models are now accessible to enthusiasts with ~$2000-3000 in GPU hardware (three RTX 3090s), not just enterprise deployments. The model's speed makes it practical for real-world applications beyond simple testing, opening possibilities for locally-hosted coding assistance, analysis, and creative tasks without relying on commercial APIs.
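A back-of-envelope check shows why three 24 GB cards suffice. The quantization level is an assumption (the post does not state it), but at roughly 4 bits per parameter the weights alone come in under the 72 GB budget:

```python
# Rough VRAM budget for a 122B-parameter model on 3x RTX 3090 (24 GB each).
# Assumption not stated in the post: weights quantized to ~4 bits/parameter.
PARAMS = 122e9
BITS_PER_PARAM = 4            # assumed quantization level
GPUS, GB_PER_GPU = 3, 24

weights_gb = PARAMS * BITS_PER_PARAM / 8 / 1e9   # bytes of weights, in GB
total_vram_gb = GPUS * GB_PER_GPU
headroom_gb = total_vram_gb - weights_gb         # left for KV cache/overhead

print(f"weights ~= {weights_gb:.0f} GB, headroom ~= {headroom_gb:.0f} GB")
```

This leaves on the order of 11 GB for the KV cache and runtime overhead, consistent with the claim that both model and context fit entirely in VRAM.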

The community is actively sharing optimization techniques and configurations to help others replicate these results, indicating strong momentum for practical Qwen3.5 deployments on consumer hardware.


Source: r/LocalLLaMA · Relevance: 9/10