MangoBoost Pushes AI Scalability in MLPerf Training v5.1: First-Ever Multi-Node MI325X Results Highlight AMD's Scaling Capabilities
November 13, 2025
by MangoBoost


MangoBoost continues to define the cutting edge of AI scalability with our latest MLPerf Training v5.1 submission. Building on our track record, this round establishes several new industry firsts, including the first MLPerf multi-node results for the AMD Instinct™ MI325X GPU.
In close partnership with AMD, our submission this round features multi-node results on both the AMD Instinct™ MI300X and AMD Instinct™ MI325X GPUs, demonstrating significant performance gains and the power of our Mango LLMBoost™ Training solution on AMD hardware.
🔥 Multi-Node MI325X Submission: MangoBoost submissions in this round feature the new AMD Instinct™ MI325X in multi-node configurations, showing strong scalability and proving our stack is ready for next-generation AI accelerators.
🏆 Exclusive AMD Multi-Node Leadership: Continuing our strong multi-node scaling performance on AMD GPUs from MLPerf Training v5.0, MangoBoost's multi-node results in this round prove our unique scalability expertise, an achievement driven by software intelligence, optimized ROCm™ integration, and deep collaboration with AMD to push the frontiers of large-scale LLM training.
📈 Proven Performance Gains: Our platform keeps improving, delivering a 4% performance gain on the MI300X (vs. our v5.0 submission) and a 4% gain on the MI325X (vs. AMD's v5.0 submission).
🤝 Partner Collaboration: Expanding the Open Ecosystem: This round marks our first-ever joint Training submission on AMD GPUs with our partner, Supermicro, demonstrating the growing momentum of AMD's open, partner-driven ecosystem.
⚡ GPUBoost™ RDMA Acceleration: Our MI300X multi-node setup was accelerated by our Mango GPUBoost™ solution, enabling highly efficient, large-scale training over RoCEv2 networks.
In MLPerf Training v5.1, MangoBoost has set a critical new milestone. Our submission represents the first-ever public demonstration of multi-node scalability for the AMD Instinct™ MI325X GPU.
By combining software intelligence, optimized ROCm™ integration, and deep collaboration with AMD, we continue to push the frontiers of performance, scalability, and cost efficiency for large-scale LLM training. This achievement validates that our Mango LLMBoost™ software stack is optimized and ready to harness the power of next-generation hardware from day one.
In addition to this milestone, we improved performance across all submitted AMD GPUs, delivering a 4% gain on the MI300X compared to our own 5.0 submission and a 4% gain on the MI325X compared to AMD's 5.0 submission. This demonstrates the continuous optimization and maturity of our LLMBoost platform.
This milestone was also recognized by industry leaders:
Mangoboost has collaborated with us at AMD across several rounds of competitive MI Instinct MLPerf submissions for both training and inference workloads. Through its LLMBoost stack — with rich features such as autotuning, parallelism, memory management, and optimal scheduling — Mangoboost has extended the capabilities of our ROCm™ AI software stack and system solutions for LLM and GenAI training and inference. In this MLPerf 5.1 round, using GPUBoost and LLMBoost, the team successfully delivered competitive AMD Instinct™ MI325X and MI300X submissions. Congratulations to Team Mangoboost! — Meena Arunachalam, Fellow and Director, AI Performance Design Engineering, AMD
Powering these record-setting results is MangoBoost’s end-to-end AI platform, which co-designs software and hardware for maximum efficiency.
Mango LLMBoost™ AI Enterprise MLOps Software: Our core software solution for AI training and fine-tuning. It features patent-pending technology for auto-tuning, model parallelism, batch scheduling, and memory management; a generic sketch of the kind of distributed-training setup it automates follows this list.
Mango GPUBoost™ RoCEv2 RDMA solution: Our DPU-based hardware acceleration, which enables customizable, low-latency, and high-throughput communication across nodes.
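To make the parallelism and memory-management concepts above concrete, here is a minimal, generic PyTorch sketch of fully sharded data parallelism, the kind of distributed-training plumbing that a stack like LLMBoost configures automatically. It uses only standard torch.distributed APIs; it is illustrative and is not the LLMBoost interface itself.

```python
# Generic PyTorch FSDP sketch: sharded model parallelism and memory
# management of the kind an MLOps stack automates. Illustrative only;
# this does not use the LLMBoost API.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # Rank and world size are supplied by the launcher (e.g., torchrun).
    dist.init_process_group(backend="nccl")  # RCCL on AMD ROCm builds
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Transformer(d_model=512, num_encoder_layers=6).cuda()
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # trading extra communication for a much smaller per-GPU memory footprint.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    src = torch.randn(10, 8, 512, device="cuda")  # (seq, batch, embed)
    tgt = torch.randn(10, 8, 512, device="cuda")
    loss = model(src, tgt).pow(2).mean()
    loss.backward()   # gradients are reduce-scattered across ranks
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```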

Figure 1: MangoBoost's end-to-end stack, showing how the LLMBoost software platform and the GPUBoost hardware acceleration work together over AMD Instinct™ hardware with RDMA-enabled interconnects to accelerate training
Multi-node training isn’t just about adding more GPUs—it's about ensuring those GPUs communicate efficiently and reliably across nodes. That’s where MangoBoost’s GPUBoost™ RDMA comes in, as depicted in Figure 2.
Built on a custom RoCEv2-enabled DPU architecture, Mango GPUBoost™ delivers line-rate performance and scalability. It is UEC-ready and offers significant advantages over standard RNICs: unlike other RoCEv2 RNICs, it provides customized congestion control and load balancing, which ensure stable, high-throughput communication even under heavy, large-scale AI workloads.
Key Features of Mango GPUBoost™ RDMA:
UEC-ready, up to 3.2Tbps RDMA (RoCEv2) throughput in total (up to 400Gbps per NIC)
Packet spray, programmable congestion control, and selective re-transmission
Configuration-free RoCEv2: Enables large-scale RoCEv2 deployment without requiring switch configuration or additional congestion control setup
Peer-to-peer communication between RNIC and GPU (i.e., ROCm™ RDMA support)
Workload-specific optimization hooks using RoCEv2 extended headers
High connection scalability: Maintains line-rate throughput at thousands of QPs—3.4× higher than standard RNICs under heavy load
MangoBoost’s RDMA stack ensures that your training isn’t bottlenecked by the network—one of the biggest pain points in distributed deep learning today.
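As a concrete illustration of what network-level acceleration buys you, the sketch below times the all-reduce collective that dominates gradient synchronization in multi-node training. It uses only standard torch.distributed collectives (NCCL on NVIDIA, RCCL on AMD ROCm); nothing in it is GPUBoost-specific, and the LOCAL_RANK variable is the standard torchrun export.

```python
# Minimal all-reduce timing sketch: gradient synchronization is the
# collective that an RDMA fabric such as a RoCEv2 path accelerates.
# Generic torch.distributed code; nothing here is GPUBoost-specific.
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # launched via torchrun or srun
rank = dist.get_rank()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

# 1 GiB of fp32 "gradients", a realistic per-step all-reduce payload
# for large LLM training.
buf = torch.randn(256 * 1024 * 1024, device="cuda")

for _ in range(5):                     # warm-up iterations
    dist.all_reduce(buf)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(buf)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / iters

if rank == 0:
    gib = buf.numel() * buf.element_size() / 2**30
    print(f"all_reduce of {gib:.1f} GiB took {elapsed * 1000:.1f} ms/iter")
dist.destroy_process_group()
```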
Mango LLMBoost™ is designed to be seamless and flexible, freeing ML developers from hardware or model lock-in.
🧠 Model-Agnostic by Design
Our platform is rigorously tested to support popular open models and multi-modal models right out of the box, including:
✅ Dense models like Llama, Qwen, and DeepSeek
✅ Sparse Mixture-of-Experts models like Mixtral
✅ Multi-modal models like Whisper and Llava
✅ LoRA and PEFT-style fine-tuning variants
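As an example of the LoRA and PEFT-style fine-tuning listed above, here is a minimal sketch using the Hugging Face peft library. It illustrates the workload class rather than the LLMBoost interface, and the model name is just an example.

```python
# Minimal LoRA fine-tuning setup with Hugging Face peft: low-rank adapters
# are attached to the attention projections while the base weights stay
# frozen. Illustrates the workload class, not the LLMBoost interface;
# the model name is an example.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically <1% of weights are trainable
```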
⚙️ Hardware-Flexible Architecture
LLMBoost is optimized for a diverse range of GPUs from both NVIDIA and AMD, allowing you to make the best hardware choices based on price, performance, or availability.
☁️ Cloud & On-Premise Deployment Ready
LLMBoost enables developers to effortlessly scale from a single-GPU proof of concept to large-scale, multi-GPU deployments with one-line deployment and robust API integration.
Public cloud: Available on AWS, Azure, and GCP
On-premise clusters: Full support for Kubernetes & SLURM-compatible environments
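For SLURM-managed clusters like those above, a common pattern is to wire SLURM's environment into the torch.distributed rendezvous, as in the sketch below. The SLURM_PROCID and SLURM_NTASKS variables are standard SLURM exports; this is a generic pattern, not LLMBoost's own launcher.

```python
# Generic pattern for initializing torch.distributed under SLURM: SLURM
# exports each task's global rank and the total task count, which we
# forward to the NCCL/RCCL rendezvous. Illustrative sketch only, not
# LLMBoost's launcher.
import os
import torch.distributed as dist

def init_from_slurm(master_addr: str, master_port: int = 29500):
    rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])  # total tasks across nodes
    os.environ["MASTER_ADDR"] = master_addr       # hostname/IP of the first node
    os.environ["MASTER_PORT"] = str(master_port)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, world_size
```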
MangoBoost’s LLMBoost platform is a comprehensive GenAI solution that excels at both training and inference. Our leadership was also highlighted in the recent MLPerf Inference v5.1 round, where we achieved record-breaking performance and several industry firsts.
Read more about our groundbreaking inference results in our MLPerf Inference v5.1 Blog Post.
Key Inference v5.1 Highlights:
Record-Breaking Throughput: Achieved the highest MLPerf inference performance for Llama2-70B (169K tok/s in the closed division and 648K tok/s in the open division).
First-Ever Heterogeneous Deployment: Demonstrated near-linear performance scaling across multi-architecture clusters combining AMD MI300X and MI325X GPUs.
Broad Partner Collaboration: Co-submissions with AMD, Dell, and Supermicro validated LLMBoost across diverse hardware platforms.
Beyond MLPerf, LLMBoost delivers breakthrough results on diverse workloads: up to 186× faster than Ollama and 4× faster than vLLM on models like Llama4-Scout MoE, and up to 43.5× faster than vLLM on Qwen2.5 multi-modal workloads.

Beyond our LLMBoost software, MangoBoost offers a suite of DPU-powered hardware acceleration solutions for AI and cloud infrastructure:
Mango GPUBoost™: Provides RDMA acceleration for multi-node training and inference via RoCEv2, as showcased in this round.
Mango NetworkBoost™: Offloads web-serving and TCP stacks to dramatically reduce CPU utilization.
Mango StorageBoost™: Delivers high-performance NVMe initiator and target solutions for scalable AI storage systems.
This is just the beginning. Our R&D team is already implementing next-generation communication optimizations, hybrid parallelism strategies, and topology-aware scheduling to push multi-node performance even further.
Best of all, anyone can replicate our results with ease by downloading our Docker image and running a single command.
If your organization is building large-scale AI models or exploring alternative GPU architectures, MangoBoost is your partner in unlocking full-stack efficiency and scalability.
📩 To learn more or request a demo, contact us at contact@mangoboost.io