MiniMax 2.5 AI Model Runs Locally on 8x Pro 6000 GPUs with FP8 Precision
An anonymous developer has successfully deployed MiniMax 2.5, a high-performance AI model, locally on eight NVIDIA RTX Pro 6000 GPUs at FP8 precision via vLLM, a rare feat that challenges cloud-centric AI deployment norms. The system achieves 122 tokens per second with 728GB peak memory usage, signaling a new frontier in on-premise generative AI.

In a notable demonstration of local AI inference capabilities, an anonymous developer has successfully run the MiniMax 2.5 large language model on a local workstation equipped with eight NVIDIA RTX Pro 6000 GPUs, serving the model at FP8 precision through the vLLM inference engine. The achievement, first documented on the r/LocalLLaMA subreddit, marks one of the earliest known deployments of MiniMax 2.5 outside of cloud-based APIs and signals a potential shift toward decentralized, high-performance AI infrastructure.
The system, configured with a vLLM setup adapted from one previously used for M2.1 models, achieved a sustained generation rate of 70 tokens per second under single-user load and scaled to 122 tokens per second with two concurrent connections. Peak memory consumption reached 728GB, a figure that underscores the model's massive parameter footprint even at FP8 precision. Notably, the developer emphasized that hosting the model was "easier than I thought," suggesting that this class of deployment may be more accessible to well-resourced private entities than previously assumed.
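The Reddit post did not include the exact launch configuration, but a comparable setup can be sketched with vLLM's offline inference API. The Hugging Face repository id, sampling parameters, and memory settings below are illustrative assumptions, not the developer's actual values.

```python
# Illustrative sketch of an 8-GPU FP8 vLLM deployment (not the developer's
# actual configuration). The Hugging Face repo id below is hypothetical.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-2.5",   # hypothetical repo id, unconfirmed
    tensor_parallel_size=8,          # shard the weights across all eight GPUs
    quantization="fp8",              # serve the checkpoint at FP8 precision
    trust_remote_code=True,          # MiniMax releases typically ship custom model code
    gpu_memory_utilization=0.95,     # leave a little headroom on each card
)

sampling = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the benefits of on-premise LLM inference."], sampling)
print(outputs[0].outputs[0].text)
```

The tensor-parallel degree of eight simply matches the GPU count in the post; the explicit quantization flag may be unnecessary if the published checkpoint already stores FP8 weights, since vLLM can usually detect quantized checkpoints from the model configuration.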
MiniMax, a Chinese AI startup known for developing competitive large language models, has not officially confirmed the release of MiniMax 2.5 or its availability for local deployment. Confusingly, a search for the name also surfaces minimax.si, the website of a Slovenian accounting software provider offering financial management tools for small and medium enterprises, a clear mismatch with the AI model referenced in the Reddit post. The overlap invites brand confusion and risks the unrelated Slovenian software company being misattributed as the source of the model.
Further investigation shows that help.minimax.si, the support portal for the Slovenian product, exclusively documents accounting and bookkeeping features, with no mention of artificial intelligence, machine learning, or LLMs. This confirms that the MiniMax referenced in the Reddit post is not the Slovenian accounting software firm but an unrelated entity, namely MiniMax, the Shanghai-based AI company behind models such as the abab series and MiniMax 2.5. The naming overlap appears to be coincidental but potentially misleading, especially for non-technical audiences searching for "MiniMax" in AI contexts.
The technical success of this deployment has significant implications for enterprise AI strategy. Running a model of this scale locally, without relying on OpenAI, Anthropic, or other cloud providers, could offer enhanced data sovereignty, reduced latency, and improved compliance for industries such as finance, healthcare, and government. The use of RTX Pro 6000 workstation-class GPUs, rather than the data-center accelerators that typically dominate LLM inference, also suggests that high-end workstation hardware is becoming a viable platform for serving very large models.
While the developer did not disclose hardware specifications beyond GPU count, the 728GB peak fits just inside the roughly 768GB of combined VRAM offered by eight 96GB RTX Pro 6000 cards, implying a single dense workstation or server rather than a multi-node cluster. The fact that vLLM could serve the new model with a configuration carried over from M2.1 also illustrates how quickly the open-source inference stack absorbs new model families.
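A back-of-the-envelope check makes the single-node claim concrete. The 96GB-per-card figure below is the published VRAM capacity of the RTX Pro 6000; the other numbers restate the figures from the Reddit post.

```python
# Back-of-the-envelope memory check for the reported deployment.
NUM_GPUS = 8
VRAM_PER_GPU_GB = 96          # RTX Pro 6000 VRAM per card
PEAK_USAGE_GB = 728           # peak consumption reported in the post

total_vram = NUM_GPUS * VRAM_PER_GPU_GB      # 768 GB aggregate capacity
per_gpu_usage = PEAK_USAGE_GB / NUM_GPUS     # ~91 GB per card under tensor parallelism
headroom = total_vram - PEAK_USAGE_GB        # ~40 GB spare across the whole node

print(f"~{per_gpu_usage:.0f} GB used of {VRAM_PER_GPU_GB} GB per GPU "
      f"({headroom} GB aggregate headroom)")
```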
As organizations increasingly seek alternatives to proprietary cloud AI services, this experiment provides a compelling proof of concept: powerful generative AI can be owned, controlled, and operated on-premise. Whether MiniMax will officially support such deployments remains to be seen, but the technical barrier has been lowered, and the race toward private, scalable AI infrastructure has just accelerated.


