Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config

1 min read

Achieving 1.1M tokens per second with Qwen 3.5 27B represents a significant throughput milestone for practical inference deployment. The optimization journey, from 9,500 to 1.1M tokens/second, was driven by strategic configuration changes: choosing 8-way data parallelism over 8-way tensor parallelism (eight independent replicas rather than one sharded model), reducing the context window from 131K to 4K tokens, enabling FP8 KV-cache quantization, and adding MTP-1 speculative decoding. Notably, speculative decoding alone provided the largest single performance jump.
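To make those configuration changes concrete, here is a minimal sketch of a single-replica vLLM setup in this spirit. The model ID, the speculative-decoding method string, and the exact parameter values are assumptions inferred from the summary above, not the published GitHub configuration; the 8-way data parallelism corresponds to running eight such replicas, one per B200 (e.g. via `vllm serve --data-parallel-size 8`), instead of sharding one model with tensor parallelism.

```python
# Hypothetical single-replica vLLM setup reflecting the described choices.
# Model ID and the speculative "method" string are assumptions, not the posted config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B",      # placeholder model ID
    max_model_len=4096,            # shrink context from 131K to 4K to free KV-cache memory
    kv_cache_dtype="fp8",          # FP8 KV-cache quantization roughly halves the cache footprint
    gpu_memory_utilization=0.95,   # leave a little headroom for activations
    speculative_config={           # MTP-1: one speculative token per step via multi-token prediction
        "method": "mtp",           # method name varies by model family and vLLM version
        "num_speculative_tokens": 1,
    },
)

# Throughput-oriented workload: many short prompts, modest generation length.
prompts = [f"Summarize item {i} in one sentence." for i in range(1024)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
print(len(outputs), "completions")
```

The data-parallel choice pays off because a 27B model fits comfortably on a single B200, so independent replicas avoid the per-layer inter-GPU communication that tensor parallelism incurs and aggregate throughput scales nearly linearly with replica count, while the 4K context and FP8 KV cache let each replica batch far more concurrent sequences.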

The community benefits immensely from the published GitHub configurations, which let others replicate these results and adapt the strategies to their own hardware. For organizations serving high-volume inference workloads, these techniques demonstrate how careful configuration alone can dramatically improve throughput on existing hardware. This is particularly valuable for production deployments, where latency and cost per token directly impact service quality and profitability.


Source: r/LocalLLaMA · Relevance: 8/10