Reduce ML training costs with Amazon SageMaker HyperPod
In this post, we explore the challenges of large-scale frontier model training, focusing on hardware failures and the benefits of Amazon SageMaker HyperPod, a resilient solution that minimizes disruptions, enhances efficiency, and reduces training costs.

Training a frontier model is highly compute-intensive, requiring a distributed system of hundreds, or thousands, of accelerated instances running for several weeks or months to complete a single job. For example, pre-training the Llama 3 70B model with 15 trillion training tokens took 6.5 million H100 GPU hours. On 256 Amazon EC2 P5 instances (p5.48xlarge, each with 8 NVIDIA H100 GPUs), this would take approximately 132 days.
Distributed training workloads run in a synchronous manner because each training step requires all participating instances to complete their calculations before the model can advance to the next step. This means that if a single instance fails, the entire job stops. As cluster sizes grow, the likelihood of failure increases due to the sheer number of hardware components involved. Each hardware failure can result in wasted GPU hours and requires valuable engineering time to identify and resolve the issue, making the system prone to downtime that can disrupt progress and delay completion. To assess system reliability, engineering teams often rely on key metrics such as mean time between failures (MTBF), which measures the average operational time between hardware failures and serves as a valuable indicator of system robustness.
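To make the synchronous dependency concrete, the following minimal sketch shows a data-parallel training step in which every rank must join the gradient all-reduce before any rank can proceed; if one instance has failed, the rest of the cluster stalls at that collective. This is an illustrative PyTorch example (assuming a job launched with torchrun; the model, optimizer, and data are placeholders), not code tied to any specific framework setup.

```python
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    """One synchronous data-parallel step: all ranks must finish their backward
    pass and join the all-reduce before any rank can start the next step."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Collective call: blocks until every rank contributes its gradients.
            # If a single instance has failed, every healthy rank waits here.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()
    return loss.item()

# Typically called once per process after launching with torchrun:
# dist.init_process_group(backend="nccl")
```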
Instance failure rate
To understand the typical MTBF for large-scale frontier model training, it helps to first understand instance failure rates by reviewing three noteworthy examples:
- When training OPT-175B on 992 A100 GPUs, Meta AI encountered significant hardware reliability challenges. Across 2 months, the team managed 35 manual restarts and cycled over 100 hosts due to hardware issues, and automated systems triggered more than 70 restarts. Operating 124 instances (each with 8 GPUs) continuously over 1,440 hours, Meta accumulated a total of 178,560 instance-hours. The observed failure rate during this period was around 0.0588% per instance-hour, underscoring the reliability hurdles in training large frontier models at this scale.
- During the training of Llama 3.1 405B on 16,000 H100 GPUs (2,000 instances, each with 8 GPUs), a total of 417 unscheduled hardware failures occurred during a 54-day period. This translates to an effective failure rate of about 0.0161% per instance-hour.
- MPT-7B was trained on 1 trillion tokens over the course of 9.5 days on 440 A100-40GB GPUs (55 instances). During this period, the training job experienced four hardware failures, resulting in an effective failure rate of approximately 0.0319% per instance-hour.
Based on these examples, it’s realistic to expect a per-instance failure rate of roughly 0.02%–0.06% per hour during large-scale distributed training.
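As a quick sanity check, the sketch below recomputes each per-instance-hour failure rate from the counts quoted above (observed failures divided by total instance-hours).

```python
def failure_rate_per_instance_hour(failures, instances, hours):
    """Observed failures divided by total instance-hours, as a percentage."""
    return 100 * failures / (instances * hours)

# OPT-175B: ~105 restarts (35 manual + 70+ automated) on 124 instances over ~1,440 hours
print(failure_rate_per_instance_hour(105, 124, 1440))      # ~0.0588
# Llama 3.1 405B: 417 failures on 2,000 instances over 54 days
print(failure_rate_per_instance_hour(417, 2000, 54 * 24))  # ~0.0161
# MPT-7B: 4 failures on 55 instances over 9.5 days
print(failure_rate_per_instance_hour(4, 55, 9.5 * 24))     # ~0.0319
```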
Larger clusters, more failures, smaller MTBF
As cluster size increases, so does the number of components that can fail, resulting in a lower MTBF. The following table illustrates how the MTBF (in hours) changes with the number of instances in a cluster and the assumed failure rate for each instance. For example, with a 0.04% per-hour failure rate per instance, a 512-instance system is expected to experience a failure approximately every 5 hours.
| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 2500 | 1250 | 625 | 313 | 157 | 79 | 40 | 20 |
| 0.02% | 1250 | 625 | 313 | 157 | 79 | 40 | 20 | 10 |
| 0.04% | 625 | 313 | 157 | 79 | 40 | 20 | 10 | 5 |
| 0.08% | 313 | 157 | 79 | 40 | 20 | 10 | 5 | 3 |
Table 1: MTBF (in hours) as a function of cluster size (columns, number of instances) and assumed per-instance failure rate (rows)
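Table 1 can be reproduced by treating instance failures as independent, so the cluster-level MTBF is roughly the inverse of the cluster-wide failure rate. The following sketch is an approximation under that independence assumption.

```python
import math

def cluster_mtbf_hours(instances, per_instance_hourly_rate):
    """Approximate MTBF assuming independent failures: the cluster experiences
    (instances * rate) failures per hour on average."""
    return 1.0 / (instances * per_instance_hourly_rate)

cluster_sizes = [4, 8, 16, 32, 64, 128, 256, 512]
failure_rates = [0.0001, 0.0002, 0.0004, 0.0008]  # 0.01% to 0.08% per instance-hour

for rate in failure_rates:
    row = [math.ceil(cluster_mtbf_hours(n, rate)) for n in cluster_sizes]
    print(f"{rate:.2%}", row)
# 512 instances at a 0.04% rate -> an expected failure roughly every 5 hours, as in Table 1
```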
What happens after a failure?
In a perfect world without failures, the training job proceeds as shown in the following graph: progress accumulates linearly until the job completes.

Figure 1: Training is linear in a perfect world without failures, since there are no interruptions to completion.
However, as previously noted, hardware failures are inevitable. Troubleshooting these failures typically involves several steps:
- Root cause analysis (mean time to detect) – Identifying hardware failures as the root cause of training interruptions can be time-consuming, especially in complex systems with multiple potential failure points. The time taken to determine the root cause is referred to as mean time to detect (MTTD).
- Hardware repair or replacement (mean time to replace) – Sometimes, a simple instance restart resolves the issue. At other times, the instance must be replaced, which can involve logistical delays, especially if specialized components aren’t readily available. If a replacement instance isn’t on hand when a GPU fails, the system must wait for one to become available. Common distributed training approaches, such as PyTorch FSDP, don’t support redistributing the workload among the remaining instances, so training can’t simply continue with fewer nodes.
- System recovery and resumption (mean time to restart) – After resolving hardware issues and replacing the instance, additional time is needed to restore it to its previous state. The new instance must match the original configuration, and the entire cluster must load the model weights from the latest saved checkpoint.
Each failure incurs engineering effort to identify its root cause. When hardware issues arise, diagnostics confirm the problem and isolate the faulty instance, pausing the training job and increasing downtime. The impact of these failures, and the troubleshooting steps that follow each one, are illustrated in the following figure and can be empirically measured for large distributed training jobs.

Figure 2: Impact of failures on a distributed training run. Once a failure occurs, time (idle GPUs) is spent on detecting (MTTD), replacing (mean time to replace), and restarting (mean time to restart) the training run, often wasting time and expensive resources.
In a scenario where a distributed training job is running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with n reserved instances and an Auto Scaling group set to maintain a minimum of n instances, a hardware issue such as a GPU failure can cause the job to fail. The affected instance will be marked as Unhealthy by a Kubernetes health monitor such as Node Problem Detector, and Amazon EKS will attempt to reschedule the training pods to healthy instances. If no instances have sufficient resources, the pods remain in a Pending state, and because the instance count is limited to n, no new instance will be automatically provisioned.
In such cases, the failed job must be manually identified through pod logs or the Kubernetes API and deleted. The failed instance also needs to be isolated and terminated manually, either through the AWS Management Console, AWS Command Line Interface (AWS CLI), or tools like kubectl or eksctl. To restore cluster capacity, the user must increase the cluster size by modifying the Auto Scaling group or updating the instance group. After the new instance is provisioned, bootstrapped, and added to the cluster, the training job must be restarted manually. If checkpointing is enabled, the job can resume from the last saved state. The overall downtime depends on the time required to provision a new instance and restart the job by rescheduling the pods.
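For illustration only, the following sketch shows what those manual steps can look like when scripted with the Kubernetes Python client and boto3. The job name, node name, instance ID, and Auto Scaling group name are hypothetical placeholders, and the exact sequence varies by setup.

```python
# Illustrative sketch of the manual recovery steps described above (assumes
# kubeconfig and AWS credentials are already configured; all names are placeholders).
import boto3
from kubernetes import client, config

FAILED_NODE = "ip-10-0-1-23.ec2.internal"   # placeholder Kubernetes node name
FAILED_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder EC2 instance ID
TRAINING_JOB = "llm-pretrain"               # placeholder Kubernetes Job name
ASG_NAME = "training-cluster-asg"           # placeholder Auto Scaling group name

config.load_kube_config()
core = client.CoreV1Api()
batch = client.BatchV1Api()

# 1. Delete the failed training job so its pods are cleaned up.
batch.delete_namespaced_job(
    name=TRAINING_JOB, namespace="default", propagation_policy="Foreground"
)

# 2. Cordon the unhealthy node so nothing new is scheduled on it, then remove it.
core.patch_node(FAILED_NODE, {"spec": {"unschedulable": True}})
core.delete_node(FAILED_NODE)

# 3. Terminate the faulty instance and restore the Auto Scaling group capacity.
boto3.client("ec2").terminate_instances(InstanceIds=[FAILED_INSTANCE_ID])
boto3.client("autoscaling").set_desired_capacity(
    AutoScalingGroupName=ASG_NAME, DesiredCapacity=256  # example cluster size
)

# 4. After the replacement node joins and reports Ready, resubmit the training
#    job so it resumes from the latest checkpoint (not shown here).
```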
Faster failure detection (shorter MTTD), shorter replacement times (shorter MTTR), and rapid resumption will all contribute to reducing total training time. Automating these processes with minimal user intervention is a key advantage of Amazon SageMaker HyperPod.
Amazon SageMaker HyperPod resilient training infrastructure
SageMaker HyperPod is a compute environment optimized for large-scale frontier model training. This means users can build resilient clusters for machine learning (ML) workloads and develop or fine-tune state-of-the-art frontier models, as demonstrated by organizations such as Luma Labs and Perplexity AI. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation alleviates the need for manual management, which means customers can train in distributed settings for weeks or months with minimal disruption. The benefits are particularly significant for customers deploying many instances (greater than 16) in a cluster.
Frontier model builders can further enhance model performance using built-in ML tools within SageMaker HyperPod. They can use Amazon SageMaker AI with MLflow to create, manage, and track ML experiments, or use Amazon SageMaker AI with TensorBoard to visualize model architecture and address convergence issues. Additionally, integrating with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana provides deeper insights into cluster performance, health, and utilization, ultimately saving valuable development time. The following figure compares the downtime of an infrastructure system using SageMaker HyperPod versus one without SageMaker HyperPod.

Figure 3: Comparing the downtime chart from Figure 2 with downtime on SageMaker HyperPod. When a failure occurs, it is detected automatically by HyperPod agents, and the instance is replaced in the background. Training is also resumed from the latest checkpoint.
SageMaker HyperPod reduces the downtime per hardware failure by automatically detecting hardware issues. When these issues are detected, SageMaker HyperPod automatically replaces the faulty node(s) and resumes your training job from the latest checkpoint, assuming that checkpoints are written.
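Automatic resumption assumes the training loop writes checkpoints regularly and can load the most recent one at startup. The following minimal PyTorch sketch illustrates that pattern; the checkpoint directory is an example path on a shared filesystem and is not HyperPod-specific.

```python
import glob
import os
import torch

CKPT_DIR = "/fsx/checkpoints"  # example path on a shared filesystem

def save_checkpoint(model, optimizer, step):
    """Persist model and optimizer state so a replaced node can resume from here."""
    path = os.path.join(CKPT_DIR, f"step-{step:08d}.pt")
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        path,
    )

def load_latest_checkpoint(model, optimizer):
    """Resume from the most recent checkpoint if one exists; otherwise start at step 0."""
    checkpoints = sorted(glob.glob(os.path.join(CKPT_DIR, "step-*.pt")))
    if not checkpoints:
        return 0
    state = torch.load(checkpoints[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```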
To evaluate this, we conducted experiments on SageMaker HyperPod using different cluster sizes of p5.48xlarge instances. The following table shows empirical measurements of the time to detect, replace, and resume by cluster size, reported at the 90th percentile (P90); that is, 90% of observed failures were handled within these times.
| Cluster size (number of instances) | P90 time to detect (seconds) | P90 time to replace (seconds) | P90 time to resume (seconds) | Total downtime per failure (seconds) | Total downtime per failure (minutes) |
|---|---|---|---|---|---|
| 16 | 83 | 912 | 1212 | 2207 | 36.8 |
| 64 | 90 | 963 | 1320 | 2373 | 39.6 |
| 256 | 89 | 903 | 1398 | 2390 | 39.8 |
| 1024 | 80 | 981 | 1440 | 2501 | 41.7 |
Table 2: P90 time to detect, replace, and resume (in seconds) on clusters of different sizes
As shown, the time to replace an instance is largely independent of cluster size. For a cluster of 256 p5.48xlarge instances training the Meta Llama 3.1 70B model with a batch size of 8, replacing an instance takes about 940 seconds (or 15.7 minutes). After replacement, the new instance must install additional packages using lifecycle scripts and run deep health checks before reading from the latest saved checkpoint. When it’s operational, the training job resumes from the most recent checkpoint, minimizing progress loss despite the interruption. For a 256-instance cluster, it took us about 2,390 seconds (about 40 minutes) to automatically resume the training job after each failure.
Without SageMaker HyperPod, when a GPU failure occurs during a training job, the time it takes to resume training can vary widely depending on the infrastructure and processes in place. With proper checkpointing, automated job orchestration, and efficient hardware provisioning, the resume time can be reduced. However, without these optimizations, the impact can be much more severe. Empirical evidence from customer experiences, including a leading open source frontier model provider, a top large language model (LLM) startup, an AI company specializing in enterprise frontier models, and a cutting-edge scientific research institute, indicates that without SageMaker HyperPod, the total downtime per GPU failure averages approximately 280 minutes. SageMaker HyperPod therefore saves about 240 minutes (about 4 hours) of downtime per failure:
| Metric | Without SageMaker HyperPod (in minutes) | With SageMaker HyperPod (in minutes) |
|---|---|---|
| Mean time to root-cause | 10 | 1.5 |
| Mean time to replace | 240 | 15 |
| Mean time to resume | 30 | 23.5 |
| Total downtime per failure | 280 | 40 |
Table 3: Typical downtime per failure, in minutes, with and without SageMaker HyperPod (following the steps described in the section “What happens after a failure?”)
Quantifying the downtime savings
Depending on the frequency of failures, we can calculate the time to train and the cost savings of using SageMaker HyperPod. To illustrate this calculation, we assume a total downtime of 40 minutes per failure with SageMaker HyperPod compared to 280 minutes without it (as previously explained). Additionally, for this calculation, let’s assume a training job requiring 10 million GPU hours on H100 instances, running on a 256-instance P5 cluster.
Although the actual overhead (in hours) depends on the size of the training job, the relative overhead remains constant. The benefits of SageMaker HyperPod in reducing total training time are shown in the following table. For example, in a 256-instance cluster with a failure rate of 0.05%, SageMaker HyperPod reduces total training time by 32%.
| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 0% | 0% | 1% | 1% | 2% | 5% | 9% | 17% |
| 0.02% | 0% | 1% | 1% | 2% | 5% | 9% | 17% | 28% |
| 0.05% | 1% | 2% | 3% | 6% | 11% | 20% | 32% | 48% |
| 0.07% | 1% | 2% | 4% | 8% | 15% | 25% | 40% | 55% |
Table 4: Percentage of total training time saved by SageMaker HyperPod, by cluster size (columns, number of instances) and per-instance failure rate (rows), compared to a P5 cluster of comparable size without SageMaker HyperPod
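These percentages can be approximated directly from the per-failure downtimes in Table 3. The sketch below assumes failures arrive at the per-instance rate over the ideal training time and that each failure adds the corresponding downtime; it reproduces Table 4 to within a percentage point of rounding.

```python
def training_time_saved(instances, hourly_rate, downtime_without_h=280 / 60,
                        downtime_hyperpod_h=40 / 60):
    """Fraction of total training time saved by SageMaker HyperPod.

    Total time is modeled as ideal time plus (expected failures * downtime),
    with expected failures = instances * rate per hour of ideal training."""
    failures_per_ideal_hour = instances * hourly_rate
    relative_time_without = 1 + failures_per_ideal_hour * downtime_without_h
    relative_time_hyperpod = 1 + failures_per_ideal_hour * downtime_hyperpod_h
    return (relative_time_without - relative_time_hyperpod) / relative_time_without

for rate in (0.0001, 0.0002, 0.0005, 0.0007):
    row = [f"{training_time_saved(n, rate):.0%}" for n in (4, 8, 16, 32, 64, 128, 256, 512)]
    print(f"{rate:.2%}", row)
# Example: 256 instances at a 0.05% failure rate -> ~32%, matching Table 4
```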
To translate this into actual savings, for a training job requiring 10 million GPU hours on a 256-instance cluster, SageMaker HyperPod saves 104 days of training time. As a result, customers can reduce time-to-market by 3.5 months. Without SageMaker HyperPod, the total time to train would be approximately 325 days, 121 of which are just spent on isolating and mitigating hardware issues. The following table shows the time to train benefits.
| Parameter | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Additional time to fix per failure (hours) | 4 |
| Days lost due to hardware issues (with SageMaker HyperPod) | 17 |
| Days lost due to hardware issues (without SageMaker HyperPod) | 121 |
| Time to train with SageMaker HyperPod (days) | 221 |
| Time to train without SageMaker HyperPod (days) | 325 |
| SageMaker HyperPod improvement | 32% |
| Time saved with SageMaker HyperPod (days) | 104 |
Table 5: Benefits presented by SageMaker HyperPod for a training run requiring 10 million GPU hours and a 256-instance cluster. SageMaker HyperPod saves 104 days of training time overall, resulting in a faster time to market (by 3.5 months!)
For the same example, we can estimate the total cost savings using:
Expected number of failures = (Number of instances) × (Failure rate per instance per hour) × (24 hours per day) × (Ideal training time in days)
Days lost due to hardware issues = (Expected number of failures) × (Downtime per failure in hours) ÷ (24 hours per day)
The following table shows the cost-to-train benefits.
| Parameter | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Time saved with SageMaker HyperPod (days) | 104 |
| Cost per GPU per hour | $5 |
| Total cost saving with SageMaker HyperPod | $25,559,040 |
Table 6: Cost-to-train benefits, using the calculation described above, for a training run requiring 10 million GPU hours on 256 GPU-based instances with an assumed failure rate of 0.05% per instance per hour
A training job requiring 10 million GPU hours and 104 additional days of resolving hardware issues results in significant idle cluster time. Assuming a GPU cost of $5 per hour (equivalent to the price of P5 instances on Capacity Blocks for ML), the total cost savings with SageMaker HyperPod amounts to $25,559,040.
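As a check on the arithmetic in Tables 5 and 6, the following sketch recomputes the days lost, time to train, and cost savings from the assumptions stated above.

```python
GPU_HOURS_NEEDED = 10_000_000
INSTANCES = 256
GPUS_PER_INSTANCE = 8
FAILURE_RATE = 0.0005            # 0.05% per instance per hour
DOWNTIME_WITHOUT_HP_HOURS = 280 / 60
DOWNTIME_WITH_HP_HOURS = 40 / 60
COST_PER_GPU_HOUR = 5.0

ideal_days = GPU_HOURS_NEEDED / (INSTANCES * GPUS_PER_INSTANCE) / 24    # ~203.5 days
expected_failures = INSTANCES * FAILURE_RATE * 24 * ideal_days          # ~625 failures

days_lost_without = expected_failures * DOWNTIME_WITHOUT_HP_HOURS / 24  # ~121.5 days
days_lost_with = expected_failures * DOWNTIME_WITH_HP_HOURS / 24        # ~17.4 days

time_without = ideal_days + days_lost_without                           # ~325 days
time_with = ideal_days + days_lost_with                                 # ~221 days
days_saved = time_without - time_with                                   # ~104 days

cost_saved = days_saved * 24 * INSTANCES * GPUS_PER_INSTANCE * COST_PER_GPU_HOUR
# ~$25.6M; the post rounds the saving to 104 whole days, which gives $25,559,040
print(round(days_saved), f"${cost_saved:,.0f}")
```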
Summary
Training frontier models is a complex, resource-intensive process that is particularly vulnerable to hardware failures. In this post, we explored the instance failure rate, which can range from roughly 0.02% to 0.07% per instance per hour during large-scale distributed training. As cluster size grows, the likelihood of failures increases and the MTBF decreases. We also examined what happens after a failure, including root cause analysis, hardware repair or replacement, and system recovery and resumption.
Next, we examined Amazon SageMaker HyperPod—a purpose-built, fully resilient cluster for frontier model training. By incorporating robust fault-tolerance mechanisms and automated health monitoring, SageMaker HyperPod minimizes disruptions caused by hardware issues. This not only streamlines the training process but also enhances the reliability and efficiency of model development, enabling faster and more effective innovation delivery. The benefits are measurable and correlate with both cluster size and failure rate. For a 256-instance cluster with a 0.05% per-instance-per-hour failure rate, SageMaker HyperPod reduces total training time by 32%, resulting in an approximate savings of $25.6 million in total training costs.
By addressing the reliability challenges of frontier model training, SageMaker HyperPod allows ML teams to focus on model innovation rather than infrastructure management. Organizations can now conduct long training runs with confidence, knowing that hardware failures will be automatically detected and resolved with minimal disruption to their ML workloads. Get started with Amazon SageMaker HyperPod.
Special thanks to Roy Allela, Senior AI/ML Specialist Solutions Architect, for his support on the launch of this post.
About the Authors
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Trevor Harvey is a Principal Specialist in generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.