DIY NAS Achieves 18 tok/s on 80B LLM Using Integrated Graphics
A hobbyist has successfully run an 80-billion-parameter AI model on a custom-built NAS using only integrated graphics, achieving 18 tokens per second. The project shows that a deliberately over-specified network-attached storage box can handle demanding AI inference without a discrete GPU.

From Media Server to AI Powerhouse: How One Hobbyist's NAS Runs a Massive Language Model
By Tech Investigations Desk
In a demonstration of hardware ingenuity, a technology enthusiast has successfully configured a custom-built Network-Attached Storage (NAS) system to run an 80-billion-parameter large language model (LLM) at usable speeds, leveraging only the system's integrated graphics processor (iGPU). The project, detailed in a technical forum, challenges conventional wisdom that such demanding AI tasks require expensive, discrete GPUs or specialized hardware.
The Dual-Purpose Build
According to a detailed post on the r/LocalLLaMA subreddit, the user's primary motivation was consolidation. "I needed a NAS. I also wanted to mess around with local LLMs. And I really didn't want to explain to my wife why I needed a second box," the user, known online as BetaOp9, wrote. The solution was to over-specify a single system from the outset.
The core hardware, as reported in the source, consists of a Minisforum N5 Pro mini-PC equipped with an AMD Ryzen AI 9 HX PRO 370 processor. This chip features 12 cores, 24 threads, and an RDNA 3.5 integrated GPU with 16 compute units. The builder paired this with 96GB of DDR5 RAM and a substantial storage array: five 26TB hard drives in a RAIDZ2 configuration, offering roughly 70TB of usable space, alongside high-speed NVMe drives for metadata.
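The capacity figure checks out with simple arithmetic: RAIDZ2 reserves two drives' worth of space for parity, leaving three of the five 26TB drives for data. The short calculation below is our own illustration, not part of the original post, and ignores ZFS metadata and reserved-space overhead.

```python
# Back-of-the-envelope check of the quoted RAIDZ2 capacity (not from the
# original post). RAIDZ2 spends two drives' worth of space on parity.
drives = 5
drive_tb = 26            # marketed terabytes (decimal, 10^12 bytes)
parity_drives = 2        # RAIDZ2 tolerates two simultaneous drive failures

data_tb = (drives - parity_drives) * drive_tb    # 78 TB of raw data space
data_tib = data_tb * 10**12 / 2**40              # what ZFS tools would report, in TiB

print(f"{data_tb} TB raw data capacity, roughly {data_tib:.1f} TiB before ZFS overhead")
```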
The system runs TrueNAS SCALE and typically handles a full media stack—including Jellyfin for streaming, Sonarr, Radarr, and various download clients—proving its worth as a NAS before the AI experiment began.
The AI Inference Breakthrough
The experiment focused on running Qwen3-Coder-Next, an 80-billion-parameter Mixture of Experts (MoE) model. Using the llama.cpp framework with a Vulkan backend and a quantized (Q4_K_M) version of the model, the user began with a sluggish 3 tokens per second (tok/s) on CPU-only inference.
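The original post does not reproduce the exact commands, but a comparable CPU-only baseline can be sketched with the llama-cpp-python bindings to llama.cpp. The model path and prompt below are placeholders, not the builder's actual files; the quantization matches the Q4_K_M file described in the post.

```python
# Minimal CPU-only baseline, assuming the llama-cpp-python bindings and a
# local Q4_K_M GGUF of the model. Paths and prompt are hypothetical.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-coder-next-80b-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=0,      # CPU-only: nothing offloaded to the iGPU yet
    n_ctx=4096,
    n_threads=12,        # one thread per physical core on the HX 370
)

start = time.time()
out = llm("Write a Python function that reverses a string.", max_tokens=128)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```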
Through iterative tuning, performance climbed. A key breakthrough, according to the source, was realizing that a common command-line flag (--no-mmap) was causing critical memory allocation issues on the Unified Memory Architecture (UMA), where the CPU and iGPU share the same RAM pool. Removing this flag allowed the full model—all 49 layers—to load into a 46GB Vulkan buffer successfully.
Further gains came from enabling flash attention, a memory-efficient attention mechanism. The final configuration, as documented by the user, achieves up to 18 tok/s for text generation and 53.8 tok/s for prompt processing, all while the NAS continues its primary duties, such as streaming media to other household devices.
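Again, the precise invocation is not given in the source, but the same bindings expose each of the settings the builder describes: memory mapping left at its default (the effect of dropping --no-mmap), every layer offloaded to the iGPU, and flash attention switched on. The sketch assumes a Vulkan-enabled build of llama.cpp / llama-cpp-python and a recent version that exposes the flash_attn option.

```python
# Sketch of the final configuration described in the post, assuming a
# Vulkan-enabled build of llama-cpp-python (the default pip wheels are
# CPU-only; recent versions build with CMAKE_ARGS="-DGGML_VULKAN=on").
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-coder-next-80b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,     # offload all layers to the iGPU (all 49 in the post)
    use_mmap=True,       # the default; forcing --no-mmap is what broke UMA allocation
    flash_attn=True,     # memory-efficient attention, the final speed bump
    n_ctx=8192,
)

print(llm("Explain RAIDZ2 in one sentence.", max_tokens=64)["choices"][0]["text"])
```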
Performance Context and Future Potential
The builder provided a point of comparison, noting that an Apple Mac Mini M4 Pro with 64GB of unified memory achieves roughly 20-25 tok/s on the same model, benefiting from Apple's high-bandwidth memory architecture. The NAS build's value proposition, however, lies in its consolidation. "I'm not trying to dunk on the Mac at all," the user clarified. "Just saying I didn't have to buy one AND a NAS."
Significant headroom for improvement remains. The user identified several untapped optimizations, including speculative decoding (which could potentially double or triple effective speed), memory profile tuning, and further Vulkan backend refinements. Notably, the Qwen3-Coder-Next model was originally designed for DeltaNet, a linear attention mechanism that scales better with long context. An active effort to implement this in llama.cpp could yield substantial performance gains, especially for extended conversations.
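Speculative decoding deserves a brief aside, since it is the largest remaining lever: a small, fast draft model guesses several tokens ahead, and the large model merely verifies those guesses in a single pass, which is why throughput can multiply without changing the greedy-decoded output. The toy sketch below, with stand-in "models" over a handful of words, illustrates the draft-and-verify loop; it is a generic illustration of the technique, not code from the project or from llama.cpp.

```python
# Toy illustration of greedy speculative decoding: a cheap draft model guesses
# k tokens ahead and the expensive target model verifies them in one pass.
# The "models" here are simple stand-ins, not real LLMs.

def draft_next(context: list[str]) -> str:
    """Cheap draft model: usually right, occasionally guesses wrong."""
    canned = {"the": "cat", "cat": "sat", "sat": "on", "on": "a", "a": "mat"}
    return canned.get(context[-1], "<eos>")

def target_next(context: list[str]) -> str:
    """Expensive target model: the ground truth for greedy decoding."""
    canned = {"the": "cat", "cat": "sat", "sat": "on", "on": "the", "a": "mat"}
    return canned.get(context[-1], "<eos>")

def speculative_generate(prompt: list[str], k: int = 4, max_tokens: int = 8) -> list[str]:
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens and out[-1] != "<eos>":
        # 1) The draft model proposes k tokens cheaply.
        draft, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) The target model verifies: keep the longest matching prefix,
        #    then emit the target's own token at the first mismatch.
        ctx = list(out)
        for tok in draft:
            expected = target_next(ctx)
            if expected != tok:
                out.append(expected)          # target overrides the bad guess
                break
            out.append(tok)
            ctx.append(tok)
        else:
            out.append(target_next(ctx))      # all k accepted: one bonus token
    return out[:len(prompt) + max_tokens]

print(speculative_generate(["the"], max_tokens=6))
```

Because the large model only confirms or corrects each drafted token, the output matches plain greedy decoding; the speedup comes from verifying several tokens per expensive pass instead of generating one at a time.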
A Niche but Illustrative Trend
This project highlights a growing trend among tech enthusiasts: pushing general-purpose consumer hardware beyond its intended use cases. While comprehensive professional NAS reviews, such as those published on Zhihu, focus on reliability, features, and ease of use for traditional storage and media tasks, this experiment explores the outer limits of what such a platform can do.
The builder is clear that this isn't a recommendation for everyone. "I'm not saying everyone should overbuild their NAS for an LLM machine or that this was even a good idea," they wrote. However, for tinkerers already planning a high-end NAS build who are curious about local AI, the project serves as a compelling proof-of-concept. It demonstrates that with careful component selection and software tuning, a single, powerful system can effectively serve as both a robust data hub and a capable platform for experimenting with the frontier of generative AI.
The journey from 3 to 18 tokens per second was, in the user's words, "one discovery away from quitting" at several steps. The final result proves that for dedicated hobbyists, the line between a storage appliance and an AI workstation is becoming increasingly blurred.


