LLM in a Flash: Run Qwen 397B Locally on Mac M3 Max

Run Qwen 397B on Mac M3 Max (2026): LLM in a Flash with Apple MLX & 48GB RAM

In 2026, researcher Dan Woods shattered the myth that massive LLMs require cloud GPUs. Combining the weight-streaming technique from Apple’s LLM in a Flash research with Apple’s MLX framework, he ran the 397B-parameter Qwen 3.5 model on a MacBook Pro M3 Max with just 48GB of RAM. The approach relies on model quantization, the Mixture-of-Experts (MoE) architecture, and on-demand memory streaming to get past the usual hardware limits.

How Apple MLX Enables Memory Streaming

Apple’s 2023 LLM in a Flash research introduced the core idea: keep model weights on flash storage and stream them from SSD into RAM on demand. Unlike traditional inference, which loads the entire model into DRAM, this approach favors large contiguous flash reads, reducing latency and memory pressure. Building on MLX, Woods adapted it to Qwen 3.5’s MoE structure, loading only the experts each token needs and cutting live memory usage from 209GB to under 5.5GB.
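The on-demand loading described above can be illustrated with a minimal sketch. It is not Woods’s code or the MLX API: the file layout, the tiny expert size, and the names write_expert_file, load_expert, and stream_experts are assumptions made for illustration. Each expert is stored as one contiguous block, so fetching it is a single ranged read from a memory-mapped file.

```python
import mmap
import os
import struct
import tempfile

EXPERT_FLOATS = 4                 # hypothetical: floats per expert (tiny, for illustration)
EXPERT_BYTES = EXPERT_FLOATS * 4  # each value is stored as a 4-byte float32


def write_expert_file(path, num_experts):
    """Write each expert as one contiguous block so a single read fetches it."""
    with open(path, "wb") as f:
        for expert_id in range(num_experts):
            values = [float(expert_id)] * EXPERT_FLOATS
            f.write(struct.pack(f"<{EXPERT_FLOATS}f", *values))


def load_expert(mm, expert_id):
    """Read one expert's contiguous byte range from the memory-mapped file."""
    start = expert_id * EXPERT_BYTES
    raw = mm[start:start + EXPERT_BYTES]
    return list(struct.unpack(f"<{EXPERT_FLOATS}f", raw))


def stream_experts(path, expert_ids):
    """Return only the requested experts; all other experts stay on disk."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return {e: load_expert(mm, e) for e in expert_ids}


def demo():
    """Build a toy 8-expert file, then fetch just experts 2 and 5."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "experts.bin")
        write_expert_file(path, num_experts=8)
        return stream_experts(path, [2, 5])
```

Calling demo() touches only two of the eight stored experts. A real engine would add caching and overlap reads with compute, which this sketch leaves out.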

Step-by-Step: Quantizing Qwen 397B to 2-Bit

Woods applied 2-bit quantization to the model’s expert weights, roughly halving their footprint compared to 4-bit. Critical components like embeddings and routing matrices retained 4-bit precision to preserve output quality. Using automated experimentation via Claude Code and Karpathy’s method, he iterated 90 times to find the best balance between speed, accuracy, and memory. The result: near-identical performance to the 4-bit model, with only minor degradation once fewer than 3 experts are activated per token.
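To make the storage arithmetic concrete, here is a minimal sketch of 2-bit quantization with a per-group scale and minimum. The source does not describe the exact scheme Woods used, so the affine mapping and the function names below are assumptions. The point is that four 2-bit codes fit in one byte, which is half the space of 4-bit codes.

```python
def quantize_2bit(values):
    """Map floats to 4 levels (codes 0..3) using a per-group scale and minimum."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # fall back to 1.0 when every value is identical
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def pack_2bit(codes):
    """Pack four 2-bit codes into each byte, lowest bits first."""
    padded = codes + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack_2bit(packed, count):
    """Recover the first `count` 2-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:count]


def dequantize(codes, scale, lo):
    """Approximate the original floats from codes, scale, and minimum."""
    return [lo + c * scale for c in codes]


def packed_bytes(num_weights, bits):
    """Bytes needed for the codes alone, ignoring scale and minimum overhead."""
    return num_weights * bits // 8
```

For example, packed_bytes(1024, 2) is 256 while packed_bytes(1024, 4) is 512. The per-group scale and minimum add a small overhead that this count ignores.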

Why Mixture-of-Experts Reduces RAM Usage

Traditional dense LLMs load all parameters into memory. Qwen 3.5’s MoE architecture splits the model into 128 experts, activating only 4–10 per token. LLM in a Flash exploits this sparsity: only active experts are streamed from SSD. This reduces peak RAM demand by over 95% (from 209GB to about 5.5GB), making 397B parameters feasible on consumer hardware. Unlike the dense Llama 3 70B, which needs 80GB+ of RAM in this comparison, Qwen 397B fits on this machine because of its sparse design.
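The sparsity argument can be sketched as top-k routing: a router scores every expert, and only the k best are loaded and evaluated. This is a toy illustration, not Qwen’s actual router. Normalizing with a softmax over the selected experts is an assumption, and load_expert is a hypothetical callback standing in for the SSD fetch.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, router_logits, k, load_expert):
    """Weighted sum over the outputs of only the k selected experts."""
    output = 0.0
    loaded = []
    for expert_id, weight in top_k_route(router_logits, k):
        expert_fn = load_expert(expert_id)  # the only point where weights are fetched
        loaded.append(expert_id)
        output += weight * expert_fn(x)
    return output, loaded


def ram_reduction(full_gb, resident_gb):
    """Fraction of memory saved by keeping only `resident_gb` in RAM."""
    return 1.0 - resident_gb / full_gb
```

With the article’s figures, ram_reduction(209, 5.5) comes out a little above 0.97, consistent with the "over 95%" claim.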

Performance Comparison: Qwen 397B vs. Llama 3 70B on M3 Max

Model	Params	RAM Usage	Quantization	Throughput (tokens/sec)	Offline Capable
Qwen 3.5-397B-A17B	397B	5.5GB	2-bit (experts), 4-bit (embed)	18.2	Yes
Llama 3 70B	70B	82GB	4-bit	12.1	No (on M3 Max)

The AI Co-Authorship Debate

The accompanying 22-page research paper was largely drafted by Claude, raising ethical questions: is AI a tool or co-author? Woods transparently disclosed its use, positioning AI as an accelerator—not a replacement—for human insight. This transparency sets a new standard for AI-assisted research in 2026.

LLM in a Flash isn’t just a demo—it’s a paradigm shift. With Apple MLX, quantization, and MoE architectures, powerful AI now runs offline, on battery, and in privacy-sensitive environments. From secure enterprise deployments to field researchers in remote zones, the future of LLMs is local, lean, and lightning-fast.

Sources: Apple LLM in a Flash paper (2023) • Qwen 3.5 on Hugging Face • danveloper/flash-moe