MangoBoost Sets a New Standard for Multi-Node Llama2-70B-LoRA Training on AMD Instinct™ MI300X GPUs!
June 04, 2025
by MangoBoost
MangoBoost, a provider of cutting-edge system solutions designed to maximize AI data center efficiency, has set a new industry benchmark with its latest MLPerf Training v5.0 submission. The company’s Mango LLMBoost™ AI Enterprise MLOps Software together with Mango GPUBoost™ RoCEv2 RDMA solution has demonstrated unparalleled performance on AMD Instinct™ MI300X GPUs, delivering the first-ever MLPerf multi-node results on AMD GPUs.
At the heart of this achievement is MangoBoost's integrated platform, illustrated in Figure 1: the Mango LLMBoost™ AI Enterprise MLOps Software layered over AMD MI300X hardware, with the Mango GPUBoost™ RoCEv2 RDMA solution providing the inter-node interconnect. Together, these components enable predictable, high-speed LLM training on MI300X GPU infrastructure.
Figure 1: LLMBoost software stack layered over AMD MI300X hardware with RDMA-enabled interconnects
Figure 2: MLPerf Training performance comparison of Llama2-70B-LoRA on AMD MI300X and NVIDIA H100 GPUs. The plot includes the latest H100 submissions running Llama2-70B-LoRA on 1, 2, and 4 nodes, alongside MangoBoost's MI300X results. The trendlines show similar scaling behavior, highlighting the competitiveness of the MI300X in multi-node training.
MangoBoost successfully scaled Llama2-70B-LoRA fine-tuning across four AMD Instinct™ MI300X nodes (32 GPUs), powered by our LLMBoost™ Enterprise MLOps Software and GPUBoost™ RDMA solution, which features a custom RoCEv2-enabled DPU hardware layer. This submission marks the first-ever multi-node MLPerf Training result on AMD Instinct™ GPUs, proving that high-performance, large-scale AI training is now possible beyond vendor-locked platforms.
As shown in Figure 2, our AMD MI300X runs—spanning 8, 16, and 32 GPUs—exhibit scaling behavior comparable to prior NVIDIA H100 submissions in MLPerf Training v4.1. The trendlines validate that MI300X can deliver competitive multi-node performance backed by our software-hardware co-design.
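MangoBoost's production training stack is proprietary, but the shape of such a run can be sketched with open-source tools. The following is a minimal, illustrative sketch assuming PyTorch launched via torchrun together with the Hugging Face transformers and peft libraries; the model name and LoRA hyperparameters here are placeholders, not our submission's configuration:

    # lora_finetune.py: minimal multi-node LoRA fine-tuning sketch (illustrative
    # only; not MangoBoost's submission code). Assumes torchrun sets RANK,
    # LOCAL_RANK, and WORLD_SIZE for every process.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    dist.init_process_group(backend="nccl")      # resolves to RCCL on ROCm builds
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)            # ROCm PyTorch reuses the torch.cuda API

    # Freeze the base model and train only low-rank adapters, which keeps the
    # gradient volume exchanged between nodes small.
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16).to(local_rank)
    lora = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
                      lora_dropout=0.1, target_modules=["q_proj", "v_proj"])
    model = DDP(get_peft_model(base, lora), device_ids=[local_rank])
    # ... standard loop: forward, loss, backward, optimizer step ...

A four-node, 32-GPU run of a script like this would be launched on each node with something like: torchrun --nnodes=4 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 lora_finetune.py.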
This milestone was also recognized by industry leaders:
"I'm excited to see MangoBoost's first MLPerf Training results, pairing their LLMBoost AI Enterprise MLOps software with their RoCEv2-based GPUBoost DPU hardware to unlock the full power of AMD GPUs, demonstrated by their scalable performance from a single-node MI300X to 2- and 4-node MI300X results on Llama2-70B LoRA. Their results underscore that a well-optimized software stack is critical to fully harness the capabilities of modern AI accelerators."
— David Kanter, Founder, Head of MLPerf, MLCommons
"We congratulate Mangoboost on their MLPerf 5.0 training results on AMD GPUs and are excited to continue our collaboration with them to unleash the full power of AMD GPUs. In this MLperf Training submission, MangoBoost has achieved a key milestone in demonstrating training results on AMD GPUs across 4 nodes (32 GPUs). This showcases how the AMD Instinct™ MI300X GPUs and ROCm™ software stack synergize with MangoBoost's LLMBoost™ AI Enterprise software and GPUBoost™ RoCEv2 NIC."
— Meena Arunachalam, Fellow, AI Performance Design Engineering, AMD
As our first multi-node MLPerf Training submission on AMD Instinct™ MI300X, this result already demonstrates strong performance and scalability. With continued tuning and deeper hardware-software co-optimization, we expect even stronger results in future rounds.
Figure 3: Near-Linear Training Scalability of Llama2-7B and Llama3.1-8B across multiple GPU configurations
MangoBoost has benchmarked additional large language models—including Llama2-7B and Llama3.1-8B—to demonstrate the real-world scalability of its LLMBoost™ platform. Unlike our MLPerf Training v5.0 submission, these results are from internal full-training benchmarks, and are not part of the MLPerf suite, as MLPerf currently does not support these two models.
As shown in Figure 3, both models exhibit near-linear scaling as GPU count increases, confirming that our MLOps software and RDMA communication stack can efficiently orchestrate training across a variety of cluster sizes and workloads.
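"Near-linear" can be made precise as scaling efficiency: the measured speedup divided by the ideal (linear) speedup at a given GPU count. A small sketch of the arithmetic, using hypothetical placeholder throughputs rather than our measured results:

    # Scaling efficiency = measured speedup / ideal (linear) speedup.
    # All throughput numbers below are hypothetical placeholders.
    baseline_gpus, baseline_tput = 8, 1000.0     # tokens/s on a single 8-GPU node
    multi_node_tput = {16: 1950.0, 32: 3800.0}   # tokens/s at larger scales

    for gpus, tput in multi_node_tput.items():
        ideal = baseline_tput * gpus / baseline_gpus
        print(f"{gpus} GPUs: {tput / ideal:.1%} scaling efficiency")
    # 16 GPUs: 97.5%, 32 GPUs: 95.0% with these placeholder numbers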
These results further demonstrate that MangoBoost’s LLMBoost™ platform is production-ready, scalable, and optimized for modern AI training workloads—regardless of model size or structure.
Figure 4: A multi-node cluster with Mango GPUBoost™ RDMA providing hardware acceleration for ROCm RDMA peer-to-peer communication.
Multi-node training isn’t just about adding more GPUs—it's about ensuring those GPUs communicate efficiently and reliably across nodes. That’s where MangoBoost’s GPUBoost™ RDMA comes in as depicted in Figure 4.
Built on a custom RoCEv2-enabled DPU architecture, Mango GPUBoost™ delivers line-rate throughput, hardware congestion control, and scalability that outpace standard RDMA NICs on the market. This ensures that your training isn't bottlenecked by the network, one of the biggest pain points in distributed deep learning today.
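The GPUBoost™ DPU itself is configured below the application layer, but the way a training job attaches to a RoCEv2 fabric can be illustrated with the standard NCCL-style environment variables that RCCL, AMD's collectives library, honors. The device and interface names below are placeholders for whatever your cluster exposes:

    # Sketch: steering RCCL collectives onto a RoCEv2 RDMA fabric before the
    # process group is created. Names are cluster-specific placeholders.
    import os

    os.environ["NCCL_IB_HCA"] = "rdma0,rdma1"    # RDMA devices to use (placeholders)
    os.environ["NCCL_IB_GID_INDEX"] = "3"        # GID index commonly mapped to RoCEv2
    os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # interface for bootstrap traffic (placeholder)
    # With RDMA in the path, GPU buffers move NIC-to-NIC without bouncing
    # through host memory, keeping collectives at line rate.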
MangoBoost’s ground-breaking MLPerf performance was made possible through deep collaboration with our partner AMD and full integration with the AMD ROCm software stack, unlocking the full potential of MI300X GPUs with industry-leading compute density and massive memory bandwidth.
Together, AMD's ROCm platform and MangoBoost's LLMBoost stack deliver an AI training and fine-tuning solution that is fast, scalable, and easy to deploy, whether on a single node or across a multi-node cluster. MangoBoost's GPUBoost RoCEv2 RDMA solutions further push the performance and scalability of the GPUs, enabling training across multiple MI300X nodes.
Modern AI infrastructure demands flexibility—not lock-in. MangoBoost’s LLMBoost AI Enterprise MLOps Software is built to support enterprise-grade LLM training wherever you need it, whether that’s on public cloud, private datacenters, or hybrid environments.
MangoBoost supports a broad range of open and custom LLM architectures out of the box, from the Llama 2 and Llama 3.1 families benchmarked above to adapter-based fine-tuning methods such as LoRA.
Whether you're training from scratch or fine-tuning with adapters, MangoBoost’s stack adapts to your model with minimal configuration.
Our full-stack platform is optimized for multiple GPU backends, including AMD Instinct™ and NVIDIA GPUs.
This flexibility ensures you’re not tied to a single vendor and can make hardware choices based on price, performance, or availability.
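One concrete enabler of this portability: PyTorch's ROCm build deliberately exposes AMD GPUs through the same torch.cuda namespace used for NVIDIA GPUs, so a single training script can run unmodified on either backend. A minimal check:

    # The same PyTorch code path serves both vendors: ROCm builds reuse the
    # torch.cuda API, and torch.version.hip is set only on ROCm.
    import torch

    if torch.cuda.is_available():
        device = torch.cuda.get_device_name(0)   # e.g. an MI300X or an H100
        backend = "ROCm" if torch.version.hip else "CUDA"
        print(f"Training on {device} via {backend}")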
MangoBoost offers seamless deployment in the public cloud, in private datacenters, and in hybrid environments.
MangoBoost enables turnkey LLM training at scale—any model, any hardware, any environment.
Beyond the LLMBoost software and GPUBoost RDMA solutions, MangoBoost offers hardware acceleration solutions based on Data Processing Units (DPUs) for AI and cloud infrastructure.
This is just the beginning. Our R&D team is already implementing next-generation communication optimizations, hybrid parallelism strategies, topology-aware scheduling, and application-specific hardware acceleration to push multi-node performance even further.
If your organization is building large-scale AI models or exploring alternative GPU architectures, MangoBoost is your partner in unlocking full-stack efficiency and scalability.
To learn more or request a demo, contact us at contact@mangoboost.io.
Let us help you deploy high-performance, cost-efficient, and vendor-flexible AI infrastructure—powered by MangoBoost’s advanced MLOps software and communication stack.
Disclaimer
The performance claims in this document are based on an internal cluster environment. Actual performance may vary depending on the server configuration. Software and workloads used in performance tests may have been optimized for performance only on MangoBoost products. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. Results based on pre-production systems and components, as well as results that have been estimated or simulated using the MangoBoost reference platform, are provided for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations. Statements in this document that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. MangoBoost does not guarantee any specific outcome. Nothing contained herein is, or shall be relied upon as, a promise or representation or warranty as to future performance of MangoBoost or any MangoBoost product. The information contained herein shall not be deemed to expand in any way the scope or effect of any representations or warranties contained in the definitive agreement for MangoBoost products.
The information contained herein may not be reproduced in whole or in part without prior written consent of MangoBoost. The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. MangoBoost assumes no obligation to update or otherwise correct or revise this information and MangoBoost reserves the right to make changes to the content hereof from time to time without any notice. Nothing contained herein is intended by MangoBoost, nor should it be relied upon, as a promise or a representation as to the future.
MANGOBOOST MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
© 2025 MangoBoost, Inc. All rights reserved.