NAS System Achieves 18 tok/s with 80B LLM Using Only Integrated Graphics

1 min read

A member of the r/LocalLLaMA community reports running an 80B-parameter LLM at 18 tokens per second on a NAS system using only integrated graphics, with no discrete GPU required. The result suggests that usable local LLM inference is possible on systems not traditionally designed for AI workloads.

The result is notable because it cuts against conventional wisdom about the hardware needed for large-model inference. Decode speed on systems like this is typically limited by memory bandwidth, so 18 tok/s from an 80B model on integrated graphics almost certainly depends on aggressive quantization and, most likely, a mixture-of-experts architecture in which only a few billion parameters are active per token. Whatever the exact recipe, the setup shows that substantial LLM throughput can be extracted from modest hardware, opening the door to cost-effective local deployment.
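As a rough sanity check, decode throughput on a bandwidth-bound system is approximately memory bandwidth divided by the bytes of weights read per token. The sketch below walks through that arithmetic; the bandwidth, quantization, and active-parameter figures are illustrative assumptions, not numbers from the post:

```python
# Back-of-envelope decode-speed estimate for a bandwidth-bound system.
# Every figure below is an illustrative assumption, not data from the post.

GB = 1e9

def max_decode_speed(bandwidth_gb_s: float, bytes_per_token: float) -> float:
    """Upper bound on tokens/s when decoding is memory-bandwidth bound:
    each generated token requires streaming the active weights once."""
    return bandwidth_gb_s * GB / bytes_per_token

bandwidth = 90.0  # assumed dual-channel DDR5 system: ~90 GB/s

# Dense 80B model at ~4.5 bits/weight reads ~45 GB per token.
dense_bytes = 80e9 * 4.5 / 8
print(f"dense 80B:       {max_decode_speed(bandwidth, dense_bytes):5.1f} tok/s")  # ~2.0

# MoE with ~3B active params/token at the same quantization reads ~1.7 GB.
moe_bytes = 3e9 * 4.5 / 8
print(f"MoE (3B active): {max_decode_speed(bandwidth, moe_bytes):5.1f} tok/s")  # ~53.3
```

Under these assumptions a dense 80B model tops out around 2 tok/s, while a sparse model leaves comfortable headroom above the reported 18 tok/s, which is why the mixture-of-experts reading is the plausible one.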

This development is especially valuable for practitioners looking to consolidate AI capabilities into existing infrastructure. Running large models on NAS hardware lets users combine storage, networking, and AI inference in a single box, reducing complexity and cost while keeping throughput high enough for interactive local LLM use.
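The post does not spell out the software stack, but a common route to iGPU inference is llama.cpp built with its Vulkan backend. A minimal sketch using the llama-cpp-python bindings, with a hypothetical model path and placeholder settings (the wheel must be compiled with Vulkan support, e.g. `CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python`):

```python
# Minimal sketch: running a quantized GGUF model on an integrated GPU via
# llama.cpp's Vulkan backend. Path and settings are placeholders, not
# details from the original post.
import time

from llama_cpp import Llama

llm = Llama(
    model_path="/volume1/models/model-q4_k_m.gguf",  # hypothetical NAS path
    n_gpu_layers=-1,  # offload all layers to the (integrated) GPU
    n_ctx=4096,       # context window, sized to fit shared iGPU memory
)

start = time.perf_counter()
out = llm("Explain what a NAS is in one sentence.", max_tokens=128)
elapsed = time.perf_counter() - start

print(out["choices"][0]["text"])
print(f'{out["usage"]["completion_tokens"] / elapsed:.1f} tok/s')
```

Measuring tokens per second this way (completion tokens over wall-clock time) is the simplest comparable benchmark, though it slightly understates raw decode speed because it includes prompt processing.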


Source: r/LocalLLaMA · Relevance: 8/10