Qwen3.5-397B Achieves 282 tok/s on 4x RTX PRO 6000 Blackwell Through Custom CUTLASS Kernel
A developer in the LocalLLaMA community has achieved a notable performance breakthrough with the Qwen3.5-397B model, reaching 282 tokens per second on 4x RTX PRO 6000 Blackwell GPUs. The optimization journey involved identifying and fixing a critical bottleneck on NVIDIA's SM120 architecture, where the MoE GEMM tile configurations were broken, yielding a roughly 5x speedup over the initial 55 tok/s baseline.
The optimization proceeded in stages: moving from WSL2 to native Linux (55→119 tok/s), applying driver and configuration optimizations (119→142 tok/s), and finally implementing a custom CUTLASS kernel with K=64 tiling (142→282 tok/s). The complete work includes a PR submitted to FlashInfer and a pre-built Docker image, making it accessible to other practitioners.
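As a quick sanity check on the reported figures, the per-stage and cumulative speedups can be computed directly from the throughput numbers (stage labels below are paraphrased from the post, not exact quotes):

```python
# Throughput (tok/s) after each optimization stage, as reported.
stages = [
    ("WSL2 baseline", 55),
    ("native Linux", 119),
    ("driver/config tuning", 142),
    ("custom K=64 CUTLASS kernel", 282),
]

# Speedup contributed by each individual stage.
for (prev_name, prev), (name, cur) in zip(stages, stages[1:]):
    print(f"{prev_name} -> {name}: {cur / prev:.2f}x")

# Cumulative speedup over the baseline.
overall = stages[-1][1] / stages[0][1]
print(f"overall: {overall:.2f}x")  # ~5.13x, consistent with the ~5x claim
```

The native-Linux move accounts for the largest single jump (about 2.2x), with the custom kernel contributing a further ~2x on top of the tuned configuration.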
For local LLM operators, this result is significant because it shows that even when software support lags behind cutting-edge hardware, custom kernel optimization can unlock dramatically better performance for massive models. It opens up viable local inference paths for 397B-parameter models that were previously impractical, shifting the economics of self-hosted deployment.
Source: r/LocalLLaMA · Relevance: 10/10