FlashLM v6 'SUPERNOVA': Ternary CPU-Only Model Shatters Speed Barriers Without Attention

FlashLM v6 "SUPERNOVA," a new language model from outside the mainstream of academic AI research, demonstrates that coherent text generation is possible without GPUs, attention mechanisms, or convolutional layers. Developed by a student researcher with no access to dedicated hardware, the 4.1-million-parameter model generates about 3,500 tokens per second on a modest two-core CPU, using only 16MB of RAM on a free cloud notebook. The innovation lies not in scale but in architecture: a P-RCSM (Parallel-Recursive Compositional State Machines) design that replaces traditional transformer components with ternary linear operations and memory slots, challenging long-held assumptions about what coherent text generation requires.

According to the developer’s detailed GitHub and Hugging Face release, FlashLM v6 drops both attention, the backbone of language models since the transformer appeared in 2017, and convolution. Instead, it relies on three core components: a MultiScaleLinearBank that replaces convolutions with parallel ternary linear projections across temporal shifts; a HierarchicalStateGate that decouples slow-planning and fast-execution states using a compact 32-dimensional planner; and a SlotMemoryAttention mechanism that, despite its name, reads from a fixed set of eight learned memory slots through a single batched matrix multiplication, eliminating sequential memory reads. All components use only F.linear calls and element-wise operations, which run on the CPU through optimized BLAS libraries. Notably, 81% of the model’s weights are ternary (-1, 0, +1), which reduces memory footprint and computational load while maintaining performance.
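
To make the ternary idea concrete, here is a minimal Python sketch, not the developer's code. It assumes an absmean rounding rule (the scheme used by BitNet b1.58; the quantizer FlashLM actually uses is not described here), and the helper names `ternarize` and `ternary_linear` are hypothetical. It shows why a projection with weights in (-1, 0, +1) needs only additions and subtractions, followed by one multiplication by a shared scale.

```python
def ternarize(weights):
    """Quantize a weight matrix (list of rows) to {-1, 0, +1} plus one scale.

    Assumption: absmean rule. Divide by the mean absolute weight,
    round to the nearest integer, then clip to [-1, 1].
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) or 1.0  # fall back to 1.0 for all-zero weights
    ternary = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return ternary, scale


def ternary_linear(x, ternary, scale):
    """Compute y = scale * (W_ternary @ x) without multiplying by any weight."""
    out = []
    for row in ternary:
        acc = 0.0
        for xi, t in zip(x, row):
            if t == 1:
                acc += xi      # weight +1: add the input
            elif t == -1:
                acc -= xi      # weight -1: subtract the input
            # weight 0: skip the input entirely
        out.append(acc * scale)
    return out


if __name__ == "__main__":
    full_precision = [[0.9, -0.05, -1.1], [0.1, 0.8, 0.0]]
    w_t, s = ternarize(full_precision)
    print(w_t, s)
    print(ternary_linear([2.0, 3.0, 5.0], w_t, s))
```

In a PyTorch model the same projection would normally go through `F.linear` on the dequantized weights so that BLAS handles it; the loop above only illustrates the arithmetic that ternary weights make possible.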

Training was conducted entirely on a free Deepnote instance with two CPU threads and 5GB of RAM, using only 31 million tokens from the TinyStories dataset. Despite the minimal data and hardware, the model reached a validation perplexity of 14.0, beating its predecessor FlashLM v4 (PPL 15.05) while delivering 2.4x the throughput. The speed gains were dramatic: an early version built on Conv1d layers ran at just 13 tokens per second because of a PyTorch 2.1.2 bug that crippled CPU performance. Upgrading to PyTorch 2.5.1+ and replacing all convolutions with linear layers raised speed to 3,500 tok/s, roughly a 270x improvement. The lesson the developer draws is that, on CPUs, BLAS-backed linear operations can far outpace convolution layers, though part of this particular gap reflects the framework bug rather than convolutions as such.
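
The figures above can be checked with a few lines of arithmetic. The sketch below assumes the reported perplexities are per-token values computed with the natural logarithm, which is the usual convention but is not stated in the release; the function names are illustrative.

```python
import math


def loss_from_perplexity(ppl):
    """Mean cross-entropy in nats per token, assuming PPL = exp(loss)."""
    return math.log(ppl)


def speedup(new_tok_per_s, old_tok_per_s):
    """Throughput ratio between two measurements."""
    return new_tok_per_s / old_tok_per_s


if __name__ == "__main__":
    # Validation perplexities quoted in the article.
    print(f"v6 loss: {loss_from_perplexity(14.0):.3f} nats/token")
    print(f"v4 loss: {loss_from_perplexity(15.05):.3f} nats/token")
    # Conv1d build (13 tok/s) versus the linear-only build (3,500 tok/s).
    print(f"speedup: {speedup(3500, 13):.0f}x")
```

Under that assumption, the drop from PPL 15.05 to 14.0 corresponds to a reduction of about 0.07 nats per token, and the throughput ratio works out to about 269, which the article rounds to 270x.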

The implications extend beyond toy story generation. The developer explicitly frames the model as a proof of concept for lightweight, latency-critical AI applications: draft-token generation for speculative decoding alongside large GPU models, routing in Mixture-of-Experts systems, or deployment on smartphones and microcontrollers. At just 800KB when quantized, FlashLM v6 fits entirely within the L2 cache of many modern CPUs, which suggests potential for native C inference with AVX2 optimizations, a path the developer is actively exploring.
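
One way a ternary model can approach that size is base-3 packing, sketched below in Python. This is an illustration under assumptions, not the release's storage format: the article does not say how the 800KB figure is reached, and the helper names are hypothetical. Because 3^5 = 243 fits in one byte, five ternary weights can share a byte, for 1.6 bits per weight.

```python
def pack_trits(trits):
    """Pack values from {-1, 0, +1} five per byte as base-3 digits."""
    packed = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        # Fold from the last trit so the first one lands in the lowest digit.
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)   # map -1, 0, +1 to digits 0, 1, 2
        packed.append(byte)             # at most 242, so it fits in a byte
    return bytes(packed)


def unpack_trits(packed, count):
    """Inverse of pack_trits; count trims the padding in the last byte."""
    trits = []
    for byte in packed:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]


def packed_size_bytes(n_weights):
    """Bytes needed to store n_weights ternary values at five per byte."""
    return (n_weights + 4) // 5


if __name__ == "__main__":
    sample = [1, -1, 0, 0, 1, -1, 1]
    assert unpack_trits(pack_trits(sample), len(sample)) == sample
    total_params = 4_100_000
    ternary_params = round(total_params * 0.81)  # 81% ternary, per the article
    print(packed_size_bytes(ternary_params), "bytes for the ternary weights")
    print(packed_size_bytes(total_params), "bytes if every weight were ternary")
```

By this estimate the ternary 81% of the weights would take about 664KB, and all 4.1 million parameters would take about 820KB if they were stored the same way, which is in the neighborhood of the quoted figure. How the remaining 19% of weights are actually stored is not specified.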

The model’s current performance is constrained by dataset size and architecture scale: the reasoning components (d_reason=64, d_planner=32) are small. Even so, the architecture shows promise for scaling. The developer plans to test P-RCSM on larger datasets and on models exceeding 10M parameters, and is already exploring code generation through a new "Nano-Coder" series. MIT-licensed code and weights are publicly available on GitHub and Hugging Face, inviting collaboration from researchers and engineers seeking efficient alternatives to transformer-based systems.

FlashLM v6 doesn’t aim to replace GPT-4 or Llama 3. Instead, it reimagines what’s possible under extreme resource constraints, suggesting that efficiency, not just scale, can drive innovation in AI. In an era where AI models grow ever more energy-intensive, this student-led project may be a harbinger of a new class of lightweight, sustainable language systems.

AI-Powered Content

Sources: connorjdavis.substack.com • www.reddit.com
