Qwen3.5-35B RTX 5080 Experiments Confirm KV q8_0 as Free Lunch, Q4_K_M Remains Optimal
Detailed follow-up benchmarking confirms that KV cache quantisation to q8_0 provides performance improvements without quality degradation, a "free lunch" optimisation for local deployments. The investigation validates Q4_K_M as the optimal quantisation level and demonstrates that proper batch configuration flags can achieve 74.7 tokens/second on consumer RTX 5080 hardware.
These experiments directly address real deployment questions: which quantisation strategies preserve quality while maximising speed, and how much performance can be extracted from mid-range consumer GPUs. The 7% performance improvement from optimised configuration flags shows that inference speed is not purely hardware-bound but also depends on runtime settings.
For practitioners deploying on RTX 5080 or similar hardware, these results provide empirically validated configuration recipes for balancing quality and speed, cutting down on trial-and-error experimentation.
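For a rough sense of why a q8_0 KV cache is cheap to hold, the sketch below compares its estimated size against f16. It assumes a standard attention stack in which every layer stores a K and a V tensor, and it uses the ggml q8_0 block layout (32 int8 values plus one 2-byte scale). The layer count, head count, head size, and context length are hypothetical placeholders, not the real Qwen3.5-35B configuration, and the summary above does not report memory figures, so treat the numbers as illustrative only.

```python
# Bytes per cached element for two ggml cache types.
# f16:  2 bytes per value.
# q8_0: blocks of 32 int8 values plus one 2-byte f16 scale = 34 bytes per 32 values.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate K+V cache size, assuming every layer keeps a full K and V tensor."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical dimensions for illustration only (assumption, not from the source).
dims = dict(n_layers=48, n_kv_heads=4, head_dim=128, n_ctx=32768)

f16_size = kv_cache_bytes(cache_type="f16", **dims)
q8_size = kv_cache_bytes(cache_type="q8_0", **dims)

print(f"f16:   {f16_size / 2**30:.2f} GiB")
print(f"q8_0:  {q8_size / 2**30:.2f} GiB")
print(f"ratio: {q8_size / f16_size:.3f}")
```

Whatever the real dimensions are, the ratio depends only on the block format: under these assumptions q8_0 needs a little over half the bytes of f16 per cached token. Models that mix in non-standard attention layers would need a different element count.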
Read the full article on r/LocalLLaMA.
Source: r/LocalLLaMA · Relevance: 9/10