FlashLM v6 'SUPERNOVA': Ternary CPU-Only Model Shatters Speed Barriers Without Attention
A student developer has unveiled FlashLM v6 'SUPERNOVA', a 4.1M-parameter language model that generates 3,500 tokens per second on a 2-thread CPU without any attention or convolution layers. Its P-RCSM architecture, built on ternary weights and CPU-friendly linear operations, could prove useful for edge AI and speculative decoding.

A new language model, FlashLM v6 "SUPERNOVA," has emerged from the sidelines of academic AI research, demonstrating that coherent language generation is possible without GPUs, attention mechanisms, or convolutional layers. Developed by a student researcher with no access to dedicated hardware, the 4.1-million-parameter model reaches 3,500 tokens per second on a modest two-thread CPU, using only 16MB of RAM and a free cloud notebook. The innovation lies not in scale, but in architecture: a novel P-RCSM (Parallel-Recursive Compositional State Machines) design that replaces traditional transformer components with ternary linear operations and memory slots, challenging long-held assumptions about what’s required for coherent text generation.
According to the developer’s detailed GitHub and Hugging Face release, FlashLM v6 drops both attention, the backbone of mainstream language models since the transformer was introduced in 2017, and convolution. Instead, it relies on three core components: a MultiScaleLinearBank that replaces convolutions with parallel ternary linear projections across temporal shifts; a HierarchicalStateGate that decouples slow-planning and fast-execution states using a compact 32-dimensional planner; and a SlotMemoryAttention module that, rather than attending over previous tokens, reads from a fixed set of eight learned memory slots via a single batched matrix multiplication, eliminating sequential memory reads. All components use only F.linear calls and element-wise operations, which run on CPU through BLAS libraries. Notably, 81% of the model’s weights are ternary (-1, 0, +1), which reduces the memory footprint and computational load.
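To make the ternary-weight idea concrete, here is a minimal sketch in plain Python. It assumes an "absmean" quantization rule (scale by the mean absolute weight, round, clip), which is borrowed from BitNet b1.58 and is not confirmed as the scheme FlashLM v6 uses. The function names `ternarize` and `ternary_matvec` are hypothetical, and the release itself is described as using PyTorch's F.linear rather than loops like these.

```python
def ternarize(weights):
    """Map a 2-D list of floats to {-1, 0, +1} plus one float scale.

    Assumption: absmean rounding, not necessarily FlashLM's own scheme.
    """
    flat = [abs(w) for row in weights for w in row]
    # Fall back to 1.0 if every weight is zero, to avoid dividing by zero.
    scale = sum(flat) / len(flat) or 1.0
    ternary = [
        [max(-1, min(1, round(w / scale))) for w in row]
        for row in weights
    ]
    return ternary, scale


def ternary_matvec(ternary, scale, x):
    """Compute scale * (W_ternary @ x) using only additions and subtractions."""
    out = []
    for row in ternary:
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
            # t == 0 contributes nothing, so the input is skipped entirely.
        out.append(acc * scale)
    return out


if __name__ == "__main__":
    w = [[0.9, -0.05, -1.1], [0.02, 0.6, -0.7]]
    t, s = ternarize(w)
    print(t, s)
    print(ternary_matvec(t, s, [1.0, 2.0, 3.0]))
```

The point of the sketch is that a ternary matrix-vector product needs no multiplications inside the inner loop: each weight either adds the input, subtracts it, or skips it, and a single scale is applied at the end.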
Training was conducted entirely on a free Deepnote instance with two CPU threads and 5GB of RAM, using only 31 million tokens from the TinyStories dataset. Despite the minimal data and hardware, the model achieved a validation perplexity of 14.0, improving on its predecessor FlashLM v4 (PPL 15.05) while delivering 2.4x the throughput. The speed gains were dramatic: an early version using Conv1d layers ran at just 13 tokens per second, which the developer attributes to a PyTorch 2.1.2 bug that crippled CPU performance. Upgrading to PyTorch 2.5.1+ and replacing all convolutions with linear layers raised speed to 3,500 tok/s, a roughly 270x improvement. The developer's takeaway is that, in this setup, BLAS-backed linear operations ran far faster on CPU than the convolution path; because the slow baseline was affected by a bug, the figure should not be read as a general comparison between linear layers and well-tuned convolutions.
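The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the article; the helper names are illustrative, and the conversion from perplexity to loss assumes perplexity is the exponential of the mean per-token cross-entropy in nats, which is the usual definition but is not stated in the release.

```python
import math


def speedup(new_tok_per_s, old_tok_per_s):
    """Ratio of two throughputs."""
    return new_tok_per_s / old_tok_per_s


def ms_per_token(tok_per_s):
    """Average wall-clock time per generated token, in milliseconds."""
    return 1000.0 / tok_per_s


def nats_per_token(perplexity):
    """Assuming PPL = exp(mean cross-entropy), the loss is ln(PPL)."""
    return math.log(perplexity)


if __name__ == "__main__":
    print(f"speedup over the Conv1d version: {speedup(3500, 13):.1f}x")
    print(f"latency at 3,500 tok/s: {ms_per_token(3500):.3f} ms/token")
    print(f"v6 loss: {nats_per_token(14.0):.3f} nats/token")
    print(f"v4 loss: {nats_per_token(15.05):.3f} nats/token")
```

The ratio 3,500 / 13 comes out to about 269, which the article rounds to 270x, and the perplexity drop from 15.05 to 14.0 corresponds to roughly 0.07 nats per token under the stated assumption.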
The implications extend beyond toy story generation. The developer explicitly frames this as a proof-of-concept for lightweight, latency-critical AI applications: draft token generation for speculative decoding alongside large GPU models, routing in Mixture-of-Experts systems, or deployment on smartphones and microcontrollers. With a total model size of just 800KB when quantized, FlashLM v6 can fit entirely within the L2 cache of many modern CPUs, which suggests potential for native C inference with AVX2 optimizations, a path the developer is actively exploring.
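As a rough sanity check on the 800KB figure, the sketch below packs ternary values five to a byte, which works because 3^5 = 243 fits in the 256 values of one byte. The release does not describe its quantized file format, so the packing scheme, the function names, and the simplification that all 4.1M parameters are stored as ternary are assumptions made here for illustration (the article states that 81% of weights are ternary, so the real format must handle the remainder differently).

```python
def pack_trits(trits):
    """Pack values from {-1, 0, +1} into bytes, five per byte (base 3)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        value = 0
        # The first trit of each chunk becomes the least significant digit.
        # A short final chunk leaves the high digits at 0; unpack_trits
        # discards them through its `count` argument.
        for t in reversed(trits[i:i + 5]):
            value = value * 3 + (t + 1)
        out.append(value)
    return bytes(out)


def unpack_trits(data, count):
    """Inverse of pack_trits; `count` is the original number of trits."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]


def packed_size_bytes(num_weights):
    """Bytes needed at five trits per byte (ceiling division)."""
    return -(-num_weights // 5)


if __name__ == "__main__":
    sample = [1, 0, -1, -1, 1, 0, 1]
    packed = pack_trits(sample)
    assert unpack_trits(packed, len(sample)) == sample
    size = packed_size_bytes(4_100_000)
    print(size, "bytes =", round(size / 1024, 1), "KiB")
```

Under these assumptions, 4.1 million weights occupy 820,000 bytes, or about 801 KiB, which is in line with the reported size.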
The model’s current performance is constrained by dataset size and architecture scale: the reasoning components are small (d_reason=64, d_planner=32). Even so, the architecture shows promise for scaling. The developer plans to test P-RCSM on larger datasets and on models exceeding 10M parameters, and is already exploring code generation via a new "Nano-Coder" series. MIT-licensed code and weights are publicly available on GitHub and Hugging Face, inviting collaboration from researchers and engineers seeking efficient alternatives to transformer-based systems.
FlashLM v6 doesn’t aim to replace GPT-4 or Llama 3. Instead, it reimagines what is possible under extreme resource constraints and suggests that efficiency, not just scale, can drive innovation in AI. In an era where AI models grow ever more energy-intensive, this student-led project may be a harbinger of a new class of lightweight, sustainable language systems.


