110 Tokens/Second on RTX 4070 Super with Qwen 3.6 35B
Recent benchmarking data shows that an RTX 4070 Super can reach 110 tokens per second when running the Qwen 3.6 35B model, a result that supports the viability of consumer-grade GPUs for practical local LLM deployment. At this speed, real-time inference workloads become feasible for individual developers and small teams without enterprise-scale hardware.
The 35B model size represents a sweet spot for local deployment: it offers better reasoning and instruction-following capability than 7B models while remaining manageable on mid-range consumer GPUs through quantization. At 110 tok/s, a typical 500-token generation takes about 4.5 seconds (500 / 110), not counting prompt processing, which keeps these models practical for interactive applications.
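The latency figure above is simple division, and a minimal sketch makes it easy to check for other response lengths. The function name `generation_seconds` and the sample lengths are illustrative assumptions, not part of the benchmark; the estimate covers token generation only and ignores prompt processing and time to first token.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Estimate decode time for num_tokens at a steady generation rate.

    Hypothetical helper for illustration: ignores prompt processing
    and any startup delay before the first token.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Throughput reported in the article for the RTX 4070 Super.
THROUGHPUT_TOK_S = 110.0

# Sample response lengths (assumed values, chosen for illustration).
for n in (100, 500, 1000):
    print(f"{n:>5} tokens: {generation_seconds(n, THROUGHPUT_TOK_S):.2f} s")
```

At the reported rate, a 500-token response comes out at roughly 4.5 seconds, consistent with the sub-5-second figure above. Real-world latency will be somewhat higher once prompt length is factored in.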
This benchmark underscores why the local LLM community continues to focus on open 30-40B parameter models and optimization techniques like quantization: they deliver production-quality performance on hardware that practitioners can actually afford.
Source: Google News