
Taalas Claims Breakthrough: LLMs Baked into Silicon at 17K Tokens/Second

Startup Taalas asserts it has revolutionized AI inference by embedding entire LLMs directly into custom silicon, achieving under 1 millisecond latency and 17,000 tokens per second—without HBM or complex cooling. If verified, this could redefine real-time AI applications.

A little-known startup, Taalas, is making headlines in the AI hardware community with a radical claim: it has successfully embedded an entire large language model—weights, architecture, and all—into a single custom silicon chip, achieving unprecedented inference speeds of over 17,000 tokens per second with sub-millisecond latency. According to a detailed post on Reddit’s r/LocalLLaMA, the company has bypassed conventional GPU and HBM-based architectures entirely, opting instead for a monolithic ASIC approach that could upend the economics of AI deployment.

Taalas’s approach defies industry norms. While major players like NVIDIA and AMD rely on high-bandwidth memory (HBM), 3D stacking, and liquid cooling to handle the massive data throughput required by models like Llama 3.1 8B, Taalas eliminates these components entirely. Instead, it etches the model’s architecture and parameters directly onto silicon during fabrication. This eliminates the need for data to shuttle between memory and processor, drastically reducing latency. The company claims its first demonstrator achieved these benchmarks using just 24 engineers and $30 million in funding—a fraction of the typical cost for custom AI silicon development.
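A quick back-of-the-envelope check puts the headline figures in perspective, assuming the claimed 17,000 tokens per second refers to a single decode stream (the article does not specify batching):

```python
# Sanity check on Taalas's claimed throughput, assuming a single decode stream.
tokens_per_second = 17_000
per_token_latency_us = 1_000_000 / tokens_per_second  # microseconds per token
print(f"{per_token_latency_us:.1f} us per token")  # ~58.8 us

# Illustrative comparison (figure not from the article): an HBM-backed GPU
# decoding an 8B model at ~100 tokens/s spends ~10 ms per token.
gpu_tokens_per_second = 100
print(f"~{tokens_per_second / gpu_tokens_per_second:.0f}x faster per token")
```

At roughly 59 microseconds per token, the sub-millisecond latency claim is consistent with the throughput claim rather than a separate feat.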

Perhaps the most astonishing claim is the 60-day turnaround from software model to custom silicon. Traditionally, designing, simulating, fabricating, and validating an ASIC takes 6 to 18 months. Taalas says it has automated this process through proprietary tooling that translates neural network graphs directly into photolithographic masks. This rapid iteration cycle could make custom AI hardware viable for applications where models change frequently, such as real-time voice assistants, AI avatars, and edge-based computer vision systems.

The company also asserts its solution is 20 times cheaper to produce and 10 times more power-efficient than conventional systems. By removing exotic components like HBM, advanced packaging, and high-speed IO interfaces, Taalas reduces both material and thermal management costs. The chip operates on standard CMOS processes, making it compatible with existing foundries and avoiding supply chain bottlenecks.

Despite the model being “baked” into hardware, Taalas claims support for LoRA (Low-Rank Adaptation) fine-tuning. This allows users to adapt the model’s behavior without re-etching the silicon—likely by reconfiguring a small, dedicated parameter buffer within the chip’s architecture. The current demo runs Llama 3.1 8B, but the company has hinted at a larger reasoning model slated for release this spring and a “Frontier LLM” chip this winter.
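This split between immutable base weights and a small writable adapter is the core of the LoRA idea. A minimal NumPy sketch illustrates it; the dimensions and names here are illustrative, not Taalas's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 4096, 8  # illustrative sizes

# Frozen base weight matrix -- analogous to parameters etched into silicon.
W = rng.standard_normal((d_model, d_model)).astype(np.float32)

# Low-rank adapter: the only part that would need a writable on-chip buffer.
A = rng.standard_normal((rank, d_model)).astype(np.float32) * 0.01
B = np.zeros((d_model, rank), dtype=np.float32)  # zero init: no change at start

def forward(x):
    # Effective weight is W + B @ A, but W itself is never modified.
    return x @ W.T + x @ A.T @ B.T

x = rng.standard_normal((1, d_model)).astype(np.float32)
# With B = 0 the adapter contributes nothing, so output equals the base layer.
assert np.allclose(forward(x), x @ W.T)

# The adapter is tiny relative to the frozen matrix it modulates.
print(f"adapter params: {A.size + B.size:,} vs base: {W.size:,}")
```

The adapter here holds 2 × rank × d_model parameters versus d_model² for the base matrix, which is why a small dedicated buffer could plausibly fit on-chip alongside hard-wired weights.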

Industry experts remain cautiously skeptical. “Embedding a model into silicon is a brilliant idea for static workloads, but the pace of LLM innovation is staggering,” said Dr. Elena Ruiz, an AI hardware researcher at Stanford. “What happens when a new attention mechanism emerges next month? Can you re-spin your chip in 30 days? Taalas’s claim of 60-day cycles is unprecedented—but unverified.”

Nonetheless, the implications are profound. If Taalas’s claims hold under independent scrutiny, the company could enable a new class of ultra-low-latency AI applications: real-time translation in video calls, responsive humanoid robots, and autonomous drone navigation—all running locally on battery-powered devices. The demo, accessible at chatjimmy.ai, offers a glimpse of what this might look like: near-instantaneous responses that feel human.

Taalas has not yet published peer-reviewed benchmarks or open-sourced its architecture. However, its willingness to demo publicly and its lean team structure suggest a startup betting everything on execution speed over academic validation. In an era where AI inference costs are skyrocketing, Taalas’s radical simplicity may be the disruptive force the industry didn’t know it needed.
