Run Qwen 397B on Mac M3 Max (2026): LLM in a Flash with Apple MLX & 48GB RAM
A groundbreaking experiment has successfully run the 397B-parameter Qwen model locally on a MacBook Pro M3 Max using Apple's 'LLM in a Flash' technique, achieving 5.5+ tokens per second despite memory constraints.

In 2026, researcher Dan Woods shattered the myth that massive LLMs require cloud GPUs. Using LLM in a Flash and Apple’s MLX framework, he successfully ran the 397B-parameter Qwen 3.5 model on a MacBook Pro M3 Max—with just 48GB of RAM. This breakthrough leverages model quantization, Mixture-of-Experts (MoE) architecture, and intelligent memory streaming to bypass traditional hardware limits.
How LLM in a Flash and Apple MLX Enable Memory Streaming
Apple's 2023 "LLM in a Flash" research introduced an approach in which model weights are streamed from SSD into RAM on demand. Instead of loading the entire model into DRAM, as traditional inference does, the technique arranges weights so they can be fetched in large contiguous flash reads, which reduces latency and memory pressure. Woods adapted this to Qwen 3.5's MoE structure using the MLX framework, loading only the experts each token needs and cutting live memory usage from 209GB to under 5.5GB.
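The streaming idea can be illustrated with a small sketch. This is not Woods's code and does not use MLX: it is a toy example, using only the Python standard library, in which each "expert" is a contiguous block of float32 values in one file and only the requested blocks are read through a memory map. The file layout, the names `write_toy_experts` and `read_experts`, and the sizes are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

FLOATS_PER_EXPERT = 4                      # toy size; real experts hold millions of weights
BYTES_PER_EXPERT = FLOATS_PER_EXPERT * 4   # float32 = 4 bytes each


def write_toy_experts(path, num_experts):
    """Write num_experts contiguous float32 blocks; expert i is filled with the value i."""
    with open(path, "wb") as f:
        for i in range(num_experts):
            values = [float(i)] * FLOATS_PER_EXPERT
            f.write(struct.pack(f"<{FLOATS_PER_EXPERT}f", *values))


def read_experts(path, expert_ids):
    """Read only the requested experts, each as one contiguous slice of the file."""
    loaded = {}
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in expert_ids:
            start = i * BYTES_PER_EXPERT
            block = mm[start:start + BYTES_PER_EXPERT]  # touches only this expert's bytes
            loaded[i] = list(struct.unpack(f"<{FLOATS_PER_EXPERT}f", block))
    return loaded


def demo():
    """Store 8 toy experts on disk, then load just experts 2 and 5."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "experts.bin")
        write_toy_experts(path, num_experts=8)
        return read_experts(path, expert_ids=[2, 5])
```

Because every expert occupies one contiguous byte range, each load is a single sequential read, which is the access pattern flash storage handles best.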
Step-by-Step: Quantizing Qwen 397B to 2-Bit
Woods applied 2-bit quantization to the model's expert weights, roughly halving their footprint compared to 4-bit. Critical components like embeddings and routing matrices retained 4-bit precision to preserve output quality. Using automated experimentation driven by Claude Code and a variant of Andrej Karpathy's autoresearch pattern, he ran 90 iterations to find the best balance between speed, accuracy, and memory. The result: near-identical performance to 4-bit models, with only minor degradation when fewer than 3 experts are activated per token.
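To make the 2-bit idea concrete, here is a minimal sketch of group-wise 2-bit quantization in plain Python. It is an illustrative scheme, not the one Woods or MLX uses: each group of weights stores a minimum and a scale, every weight becomes a code from 0 to 3, and four codes are packed into one byte. The function names and the group size of 4 are assumptions chosen to keep the example small.

```python
def quantize_2bit(weights, group_size=4):
    """Quantize floats to 2-bit codes (0..3) with a per-group minimum and scale."""
    if group_size % 4 or len(weights) % group_size:
        raise ValueError("group_size must be a multiple of 4 and divide len(weights)")
    packed, params = bytearray(), []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 3 or 1.0          # 3 is the largest 2-bit code; avoid a zero scale
        params.append((lo, scale))
        codes = [min(3, max(0, round((w - lo) / scale))) for w in group]
        for i in range(0, len(codes), 4):     # pack four 2-bit codes into one byte
            byte = 0
            for j, code in enumerate(codes[i:i + 4]):
                byte |= code << (2 * j)
            packed.append(byte)
    return bytes(packed), params


def dequantize_2bit(packed, params, group_size=4):
    """Unpack the 2-bit codes and map each one back to lo + code * scale."""
    codes = [(byte >> (2 * j)) & 0b11 for byte in packed for j in range(4)]
    return [params[i // group_size][0] + code * params[i // group_size][1]
            for i, code in enumerate(codes)]
```

Eight weights pack into 2 bytes here, against 4 bytes at 4-bit, before counting the per-group minimum and scale. The reconstruction error for each weight is at most half of its group's scale, which is why smaller groups trade extra metadata for better accuracy.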
Why Mixture-of-Experts Reduces RAM Usage
Traditional dense LLMs load all parameters into memory. Qwen 3.5's MoE architecture splits the model into 128 experts, activating only 4–10 per token. LLM in a Flash exploits this sparsity: only active experts are streamed from SSD. This reduces peak RAM demand by over 95%, making 397B parameters feasible on consumer hardware. A dense model such as Llama 3 70B, listed at 80GB+ of RAM in the comparison below, needs every weight for every token, whereas Qwen 397B touches only a small fraction of its weights per token thanks to its sparse design.
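The routing step behind this sparsity can be sketched in a few lines. The example below is a generic top-k router, assuming a softmax renormalised over the selected experts. That is one common variant and not necessarily how Qwen 3.5 computes its gate weights, and the function names are hypothetical.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)            # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def active_fraction(num_experts, k):
    """Fraction of expert weights that must be resident for a single token."""
    return k / num_experts
```

With the figures quoted above, `active_fraction(128, 4)` gives 0.03125, so about 3% of the expert weights are needed for any one token. The IDs returned by `route_top_k` are the only experts that have to be fetched from SSD for that token.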
Performance Comparison: Qwen 397B vs. Llama 3 70B on M3 Max
| Model | Params | RAM Usage | Quantization | Throughput (tokens/sec) | Offline Capable |
|---|---|---|---|---|---|
| Qwen 3.5-397B-A17B | 397B | 5.5GB | 2-bit (experts), 4-bit (embeddings, routing) | 5.5+ | Yes |
| Llama 3 70B | 70B | 82GB | 4-bit | 12.1 | No (on M3 Max) |
The AI Co-Authorship Debate
The accompanying 22-page research paper was largely drafted by Claude, raising ethical questions: is AI a tool or co-author? Woods transparently disclosed its use, positioning AI as an accelerator—not a replacement—for human insight. This transparency sets a new standard for AI-assisted research in 2026.
LLM in a Flash isn’t just a demo—it’s a paradigm shift. With Apple MLX, quantization, and MoE architectures, powerful AI now runs offline, on battery, and in privacy-sensitive environments. From secure enterprise deployments to field researchers in remote zones, the future of LLMs is local, lean, and lightning-fast.


