MangoBoost Sets a New Standard for Multi-Node Llama2-70B-LoRA Training on AMD Instinct™ MI300X GPUs!
June 04, 2025
by MangoBoost
MangoBoost, a provider of cutting-edge system solutions designed to maximize AI data center efficiency, has set a new industry benchmark with its latest MLPerf Training v5.0 submission. The company’s Mango LLMBoost™ AI Enterprise MLOps Software together with Mango GPUBoost™ RoCEv2 RDMA solution has demonstrated unparalleled performance on AMD Instinct™ MI300X GPUs, delivering the first-ever MLPerf multi-node results on AMD GPUs.
At the heart of this achievement is MangoBoost's integrated platform, illustrated in Figure 1: the Mango LLMBoost™ AI Enterprise MLOps Software layered over AMD MI300X hardware, with the Mango GPUBoost™ RoCEv2 RDMA solution providing the inter-node interconnect. Together, these components enable predictable, high-speed LLM training on MI300X GPU infrastructure.
Figure 1: LLMBoost software stack layered over AMD MI300X hardware with RDMA-enabled interconnects
Figure 2: MLPerf Training performance comparison of Llama2-70B-LoRA on AMD MI300X and NVIDIA H100 GPUs. The plot includes the latest H100 submissions running Llama2-70B-LoRA on 1, 2, and 4 nodes, alongside MangoBoost's MI300X results. The trendlines show similar scaling behavior, highlighting the competitiveness of the MI300X in multi-node training.
MangoBoost successfully scaled Llama2-70B-LoRA fine-tuning across four AMD Instinct™ MI300X nodes (32 GPUs), powered by our LLMBoost™ Enterprise MLOps Software and GPUBoost™ RDMA solution, which features a custom RoCEv2-enabled DPU hardware layer. This submission marks the first-ever multi-node MLPerf Training result on AMD Instinct™ GPUs, proving that high-performance, large-scale AI training is now possible beyond vendor-locked platforms.
As shown in Figure 2, our AMD MI300X runs—spanning 8, 16, and 32 GPUs—exhibit scaling behavior comparable to prior NVIDIA H100 submissions in MLPerf Training v4.1. The trendlines validate that MI300X can deliver competitive multi-node performance backed by our software-hardware co-design.
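MangoBoost's production training stack is proprietary, but the shape of such a run can be sketched with open-source tools. The following is a minimal, illustrative sketch assuming PyTorch launched via torchrun together with the Hugging Face transformers and peft libraries; the model name and LoRA hyperparameters here are placeholders, not our submission's configuration:

    # lora_finetune.py: minimal multi-node LoRA fine-tuning sketch (illustrative
    # only; not MangoBoost's submission code). Assumes torchrun sets RANK,
    # LOCAL_RANK, and WORLD_SIZE for every process.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    dist.init_process_group(backend="nccl")      # resolves to RCCL on ROCm builds
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)            # ROCm PyTorch reuses the torch.cuda API

    # Freeze the base model and train only low-rank adapters, which keeps the
    # gradient volume exchanged between nodes small.
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16).to(local_rank)
    lora = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
                      lora_dropout=0.1, target_modules=["q_proj", "v_proj"])
    model = DDP(get_peft_model(base, lora), device_ids=[local_rank])
    # ... standard loop: forward, loss, backward, optimizer step ...

A four-node, 32-GPU run of a script like this would be launched on each node with something like: torchrun --nnodes=4 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 lora_finetune.py.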
This milestone was also recognized by industry leaders:
"I'm excited to see MangoBoost's first MLPerf Training results, pairing their LLMBoost AI Enterprise MLOps software with their RoCEv2-based GPUBoost DPU hardware to unlock the full power of AMD GPUs, demonstrated by their scalable performance from a single-node MI300X to 2- and 4-node MI300X results on Llama2-70B LoRA. Their results underscore that a well-optimized software stack is critical to fully harness the capabilities of modern AI accelerators."
— David Kanter, Founder, Head of MLPerf, MLCommons
"We congratulate Mangoboost on their MLPerf 5.0 training results on AMD GPUs and are excited to continue our collaboration with them to unleash the full power of AMD GPUs. In this MLperf Training submission, MangoBoost has achieved a key milestone in demonstrating training results on AMD GPUs across 4 nodes (32 GPUs). This showcases how the AMD Instinct™ MI300X GPUs and ROCm™ software stack synergize with MangoBoost's LLMBoost™ AI Enterprise software and GPUBoost™ RoCEv2 NIC."
— Meena Arunachalam, Fellow, AI Performance Design Engineering, AMD
As our first multi-node MLPerf Training submission on AMD Instinct™ MI300X, this result already demonstrates strong performance and scalability. With continued tuning and deeper hardware-software co-optimization, we expect even stronger results in future rounds.
Figure 3: Near-Linear Training Scalability of Llama2-7B and Llama3.1-8B across multiple GPU configurations
MangoBoost has benchmarked additional large language models—including Llama2-7B and Llama3.1-8B—to demonstrate the real-world scalability of its LLMBoost™ platform. Unlike our MLPerf Training v5.0 submission, these results are from internal full-training benchmarks, and are not part of the MLPerf suite, as MLPerf currently does not support these two models.
As shown in Figure 3, both models exhibit near-linear scaling as GPU count increases, confirming that our MLOps software and RDMA communication stack can efficiently orchestrate training across a variety of cluster sizes and workloads.
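"Near-linear" can be made precise as scaling efficiency: the measured speedup divided by the ideal (linear) speedup at a given GPU count. A small sketch of the arithmetic, using hypothetical placeholder throughputs rather than our measured results:

    # Scaling efficiency = measured speedup / ideal (linear) speedup.
    # All throughput numbers below are hypothetical placeholders.
    baseline_gpus, baseline_tput = 8, 1000.0     # tokens/s on a single 8-GPU node
    multi_node_tput = {16: 1950.0, 32: 3800.0}   # tokens/s at larger scales

    for gpus, tput in multi_node_tput.items():
        ideal = baseline_tput * gpus / baseline_gpus
        print(f"{gpus} GPUs: {tput / ideal:.1%} scaling efficiency")
    # 16 GPUs: 97.5%, 32 GPUs: 95.0% with these placeholder numbers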
These results further demonstrate that MangoBoost’s LLMBoost™ platform is production-ready, scalable, and optimized for modern AI training workloads—regardless of model size or structure.
Figure 4: A multi-node cluster with Mango GPUBoost™ RDMA providing hardware acceleration for ROCm RDMA peer-to-peer communication.
Multi-node training isn’t just about adding more GPUs—it's about ensuring those GPUs communicate efficiently and reliably across nodes. That’s where MangoBoost’s GPUBoost™ RDMA comes in as depicted in Figure 4.
Built on a custom RoCEv2-enabled DPU architecture, Mango GPUBoost™ delivers line-rate throughput, hardware congestion control, and scalability that outpace standard RDMA NICs on the market. This ensures that your training isn't bottlenecked by the network, one of the biggest pain points in distributed deep learning today.
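The GPUBoost™ DPU itself is configured below the application layer, but the way a training job attaches to a RoCEv2 fabric can be illustrated with the standard NCCL-style environment variables that RCCL, AMD's collectives library, honors. The device and interface names below are placeholders for whatever your cluster exposes:

    # Sketch: steering RCCL collectives onto a RoCEv2 RDMA fabric before the
    # process group is created. Names are cluster-specific placeholders.
    import os

    os.environ["NCCL_IB_HCA"] = "rdma0,rdma1"    # RDMA devices to use (placeholders)
    os.environ["NCCL_IB_GID_INDEX"] = "3"        # GID index commonly mapped to RoCEv2
    os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # interface for bootstrap traffic (placeholder)
    # With RDMA in the path, GPU buffers move NIC-to-NIC without bouncing
    # through host memory, keeping collectives at line rate.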
MangoBoost’s ground-breaking MLPerf performance was made possible through deep collaboration with our partner AMD and full integration with the AMD ROCm software stack, unlocking the full potential of MI300X GPUs with industry-leading compute density and massive memory bandwidth.
Together, AMD's ROCm platform and MangoBoost's LLMBoost stack deliver an AI training and fine-tuning solution that is fast, scalable, and easy to deploy, whether on a single node or across a multi-node cluster. MangoBoost's GPUBoost RoCEv2 RDMA solutions further push the performance and scalability of the GPUs, enabling training across multiple MI300X nodes.
Modern AI infrastructure demands flexibility—not lock-in. MangoBoost’s LLMBoost AI Enterprise MLOps Software is built to support enterprise-grade LLM training wherever you need it, whether that’s on public cloud, private datacenters, or hybrid environments.
MangoBoost supports a broad range of open and custom LLM architectures out of the box, from the Llama 2 and Llama 3.1 families benchmarked above to adapter-based fine-tuning methods such as LoRA.
Whether you're training from scratch or fine-tuning with adapters, MangoBoost’s stack adapts to your model with minimal configuration.
Our full-stack platform is optimized for multiple GPU backends, including AMD Instinct™ and NVIDIA GPUs.
This flexibility ensures you’re not tied to a single vendor and can make hardware choices based on price, performance, or availability.
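One concrete enabler of this portability: PyTorch's ROCm build deliberately exposes AMD GPUs through the same torch.cuda namespace used for NVIDIA GPUs, so a single training script can run unmodified on either backend. A minimal check:

    # The same PyTorch code path serves both vendors: ROCm builds reuse the
    # torch.cuda API, and torch.version.hip is set only on ROCm.
    import torch

    if torch.cuda.is_available():
        device = torch.cuda.get_device_name(0)   # e.g. an MI300X or an H100
        backend = "ROCm" if torch.version.hip else "CUDA"
        print(f"Training on {device} via {backend}")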
MangoBoost offers seamless deployment in the public cloud, in private datacenters, and in hybrid environments.
MangoBoost enables turnkey LLM training at scale—any model, any hardware, any environment.
Beyond the LLMBoost software and GPUBoost RDMA solutions, MangoBoost offers hardware acceleration solutions based on Data Processing Units (DPUs) for AI and cloud infrastructure.
This is just the beginning. Our R&D team is already implementing next-generation communication optimizations, hybrid parallelism strategies, topology-aware scheduling, and application-specific hardware acceleration to push multi-node performance even further.
If your organization is building large-scale AI models or exploring alternative GPU architectures, MangoBoost is your partner in unlocking full-stack efficiency and scalability.
To learn more or request a demo, contact us at contact@mangoboost.io.
Let us help you deploy high-performance, cost-efficient, and vendor-flexible AI infrastructure—powered by MangoBoost’s advanced MLOps software and communication stack.
Disclaimer
The performance claims in this document are based on an internal cluster environment. Actual performance may vary depending on the server configuration. Software and workloads used in performance tests may have been optimized for performance only on MangoBoost products. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. Results based on pre-production systems and components, as well as results that have been estimated or simulated using the MangoBoost reference platform, are provided for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations. Statements in this document that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. MangoBoost does not guarantee any specific outcome. Nothing contained herein is, or shall be relied upon as, a promise or representation or warranty as to future performance of MangoBoost or any MangoBoost product. The information contained herein shall not be deemed to expand in any way the scope or effect of any representations or warranties contained in the definitive agreement for MangoBoost products.
The information contained herein may not be reproduced in whole or in part without prior written consent of MangoBoost. The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. MangoBoost assumes no obligation to update or otherwise correct or revise this information and MangoBoost reserves the right to make changes to the content hereof from time to time without any notice. Nothing contained herein is intended by MangoBoost, nor should it be relied upon, as a promise or a representation as to the future.
MANGOBOOST MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
© 2025 MangoBoost, Inc. All rights reserved.