Qwen3.5-35B RTX 5080 Experiments Confirm KV q8_0 as Free Lunch, Q4_K_M Remains Optimal

Publisher: r/LocalLLaMA

Detailed follow-up benchmarking confirms that quantising the KV cache to q8_0 improves performance without degrading quality, making it a "free lunch" optimisation for local deployments. The investigation also validates Q4_K_M as the optimal weight quantisation level and shows that well-chosen batch configuration flags can reach 74.7 tokens/second on a consumer RTX 5080.
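
As a rough illustration of why a q8_0 KV cache is attractive, the sketch below compares cache size at f16 and q8_0. It relies on the GGML q8_0 block layout (32 int8 values plus one fp16 scale, 34 bytes per block). The model dimensions and the `kv_cache_bytes` helper are hypothetical placeholders, not the real Qwen3.5-35B configuration and not figures from the post.

```python
# q8_0 stores each block of 32 values as 32 int8 values plus one fp16 scale.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 32 * 1 + 2   # 34 bytes per block
F16_BYTES_PER_VALUE = 2


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value):
    """Bytes needed to hold keys and values for n_ctx tokens."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return values_per_token * n_ctx * bytes_per_value


# Hypothetical dimensions for illustration only, NOT the real model config.
dims = dict(n_layers=40, n_kv_heads=4, head_dim=128, n_ctx=32768)

f16_bytes = kv_cache_bytes(**dims, bytes_per_value=F16_BYTES_PER_VALUE)
q8_bytes = kv_cache_bytes(
    **dims, bytes_per_value=Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES
)

print(f"f16 : {f16_bytes / 2**30:.2f} GiB")
print(f"q8_0: {q8_bytes / 2**30:.2f} GiB ({q8_bytes / f16_bytes:.1%} of f16)")
```

Whatever the real dimensions are, the ratio depends only on the block layout: q8_0 needs 34/64, or about 53%, of the f16 cache size.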

These experiments directly address real deployment questions: which quantisation strategies preserve quality while maximising speed, and how much performance can be extracted from consumer GPUs. The 7% performance improvement from optimised configuration flags shows that inference speed is not purely hardware-bound but also depends on runtime settings.
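
To put the 7% figure in context, the arithmetic below backs out the implied baseline. It assumes that 74.7 tokens/second is the post-optimisation result and that the 7% gain is measured relative to the unoptimised run on the same hardware. The summary states neither explicitly, so treat the output as an estimate.

```python
optimised_tps = 74.7   # tokens/second reported in the post
relative_gain = 0.07   # 7% improvement attributed to configuration flags

# Assumption: the 7% is relative to the unoptimised run on the same hardware.
baseline_tps = optimised_tps / (1 + relative_gain)
gained_tps = optimised_tps - baseline_tps

print(f"implied baseline: {baseline_tps:.1f} tok/s")
print(f"gain from flags : {gained_tps:.1f} tok/s")
```

Under these assumptions the flags account for roughly 5 tokens/second on top of a baseline just under 70.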

For practitioners deploying on an RTX 5080 or similar hardware, these results provide empirically validated configuration recipes for balancing quality and speed, cutting down on trial-and-error experimentation.

Read the full article on r/LocalLLaMA.


Source: r/LocalLLaMA · Relevance: 9/10