Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK
In this post, we demonstrate how to use the new Amazon SageMaker HyperPod CLI and SDK to streamline the process of training and deploying large AI models through practical examples of distributed training using Fully Sharded Data Parallel (FSDP) and model deployment for inference. The tools provide simplified workflows through straightforward commands for common tasks, while offering flexible development options through the SDK for more complex requirements, along with comprehensive observability features and production-ready deployment capabilities.

Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn’t be complex for data scientists and machine learning (ML) practitioners. The newly released command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod simplify how you can use the service’s distributed training and inference capabilities.
The SageMaker HyperPod CLI provides data scientists with an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for quick experimentation and iteration.
For more advanced use cases requiring fine-grained control, the SageMaker HyperPod SDK enables programmatic access to customize your ML workflows. Developers can use the SDK’s Python interface to precisely configure training and deployment parameters while maintaining the simplicity of working with familiar Python objects.
In this post, we demonstrate how to use both the CLI and SDK to train and deploy large language models (LLMs) on SageMaker HyperPod. We walk through practical examples of distributed training using Fully Sharded Data Parallel (FSDP) and model deployment for inference, showcasing how these tools streamline the development of production-ready generative AI applications.
Prerequisites
To follow the examples in this post, you must have the following prerequisites:
- An AWS account with access to SageMaker HyperPod, Amazon Simple Storage Service (Amazon S3), and Amazon FSx for Lustre.
- A local environment (either your local machine or a cloud-based compute environment) from which to run the SageMaker HyperPod CLI commands, configured as follows:
- An operating system based on Linux or macOS.
- Python 3.8, 3.9, 3.10, or 3.11 installed.
- The AWS Command Line Interface (AWS CLI) configured with the appropriate credentials to use the aforementioned services.
- A SageMaker HyperPod cluster orchestrated through Amazon Elastic Kubernetes Service (Amazon EKS), running with an instance group configured with 8 ml.g5.8xlarge instances. For more information on how to create and configure a new SageMaker HyperPod cluster, refer to Creating a SageMaker HyperPod cluster with Amazon EKS orchestration.
- An FSx for Lustre persistent volume claim (PVC) to store checkpoints. This can be created either at cluster creation time or separately.
Because the use cases that we demonstrate are about training and deploying LLMs with the SageMaker HyperPod CLI and SDK, you must also install the following Kubernetes operators in the cluster:
- HyperPod training operator – For installation instructions, see Installing the training operator.
- HyperPod inference operator – For installation instructions, see Setting up your HyperPod clusters for model deployment and the corresponding notebook.
Install the SageMaker HyperPod CLI
First, you must install the latest version of the SageMaker HyperPod CLI and SDK (the examples in this post are based on version 3.1.0). From the local environment, run the following command (you can also install in a Python virtual environment):
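For example, the installation can look like the following (the package name and minimum version are from this post; the virtual environment is optional):

```shell
# Optional: isolate the installation in a Python virtual environment
python3 -m venv hyperpod-env && source hyperpod-env/bin/activate

# Install (or upgrade to) the latest HyperPod CLI and SDK
pip install --upgrade "sagemaker-hyperpod>=3.1.0"
```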
This command sets up the tools needed to interact with SageMaker HyperPod clusters. For an existing installation, make sure you have the latest version of the package installed (sagemaker-hyperpod>=3.1.0) to be able to use the relevant set of features. To verify that the CLI is installed correctly, you can run the hyp command and check the output:
The output will be similar to the following, and includes instructions on how to use the CLI:
For more information on CLI usage and the available commands and respective parameters, refer to the CLI reference documentation.
Set the cluster context
The SageMaker HyperPod CLI and SDK use the Kubernetes API to interact with the cluster. Therefore, make sure the underlying Kubernetes Python client is configured to execute API calls against your cluster by setting the cluster context.
Use the CLI to list the clusters available in your AWS account:
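A sketch of listing the clusters and then setting the context (subcommand names as per the CLI reference documentation; ml-cluster is the example cluster name used in this post):

```shell
# List the SageMaker HyperPod clusters in the configured AWS account and Region
hyp list-cluster

# Point the underlying Kubernetes client at the target cluster
hyp set-cluster-context --cluster-name ml-cluster
```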
Set the cluster context specifying the cluster name as input (in our case, we use ml-cluster as the cluster name):
Train models with the SageMaker HyperPod CLI and SDK
The SageMaker HyperPod CLI provides a straightforward way to submit PyTorch model training and fine-tuning jobs to a SageMaker HyperPod cluster. In the following example, we schedule a Meta Llama 3.1 8B model training job with FSDP.
The CLI executes training using the HyperPodPyTorchJob Kubernetes custom resource, which is implemented by the HyperPod training operator that needs to be installed in the cluster, as discussed in the prerequisites section.
First, clone the awsome-distributed-training repository and create the Docker image that you will use for the training job:
Then, log in to Amazon Elastic Container Registry (Amazon ECR) to pull the base image and build the new container:
The Dockerfile in the awsome-distributed-training repository referenced in the preceding code already contains the HyperPod elastic agent, which orchestrates the lifecycles of training workers on each container and communicates with the HyperPod training operator. If you’re using a different Dockerfile, install the HyperPod elastic agent following the instructions in HyperPod elastic agent.
Next, create a new registry for your training image if needed and push the built image to it:
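A sketch of the clone, build, and push flow (the repository directory, account ID, Region, and repository name are placeholders to adapt to your environment):

```shell
# Clone the repository containing the FSDP training example
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training   # change into the FSDP example directory for your repo version

AWS_ACCOUNT=123456789012   # placeholder account ID
AWS_REGION=us-west-2       # placeholder Region
REPO=fsdp-training         # placeholder repository name

# Log in to ECR, create the repository if needed, then build and push the image
aws ecr get-login-password --region "$AWS_REGION" | \
  docker login --username AWS --password-stdin "$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com"
aws ecr create-repository --repository-name "$REPO" --region "$AWS_REGION" || true
docker build -t "$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:latest" .
docker push "$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:latest"
```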
After you have successfully created the Docker image, you can submit the training job using the SageMaker HyperPod CLI.
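The submission might look like the following sketch. The hyp create hyp-pytorch-job command and the --command, --environment, --max-retry, and --volume arguments are described in this post; the image URI, environment values, volume claim name, and the exact list/dictionary syntax for the argument values are illustrative assumptions (see the CLI reference documentation for the authoritative formats):

```shell
# Illustrative values only; run `hyp create hyp-pytorch-job --help` for the full argument list
hyp create hyp-pytorch-job \
  --job-name fsdp-llama3-1-8b \
  --image <account-id>.dkr.ecr.<region>.amazonaws.com/fsdp-training:latest \
  --command '[hyperpodrun]' \
  --environment '{"LOGLEVEL": "INFO"}' \
  --max-retry 3 \
  --volume name=fsx,type=pvc,mount_path=/fsx,claim_name=fsx-claim
```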
Internally, the SageMaker HyperPod CLI will use the Kubernetes Python client to build a HyperPodPyTorchJob custom resource and then create it on the Kubernetes cluster.
You can modify the CLI command for other Meta Llama configurations by changing the --args values to the desired arguments; examples can be found in the Kubernetes manifests in the awsome-distributed-training repository.
In the given configuration, the training job will write checkpoints to /fsx/checkpoints on the FSx for Lustre PVC.
The hyp create hyp-pytorch-job command supports additional arguments, which can be discovered by running the following:
The preceding example code contains the following relevant arguments:
- --command and --args offer flexibility in setting the command to be executed in the container. The command executed is hyperpodrun, implemented by the HyperPod elastic agent that is installed in the training container. The HyperPod elastic agent extends PyTorch’s ElasticAgent and manages the communication of the various workers with the HyperPod training operator. For more information, refer to HyperPod elastic agent.
- --environment defines environment variables and customizes the training execution.
- --max-retry indicates the maximum number of restarts at the process level that will be attempted by the HyperPod training operator. For more information, refer to Using the training operator to run jobs.
- --volume is used to map persistent or ephemeral volumes to the container.
If successful, the command will output the following:
You can observe the status of the training job through the CLI. Running hyp list hyp-pytorch-job will show the status first as Created and then as Running after the containers have been started:
To list the pods that are created by this training job, run the following command:
You can observe the logs of one of the training pods that get spawned by running the following command:
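Using kubectl directly, listing the pods and streaming their logs might look like the following (the job name fsdp-llama3-1-8b is the example from this post; the pod name is a placeholder):

```shell
# List the pods created for the training job
kubectl get pods | grep fsdp-llama3-1-8b

# Stream the logs of one of the training pods
kubectl logs -f <pod-name>
```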
We elaborate on more advanced debugging and observability features at the end of this section.
Alternatively, if you prefer a programmatic experience and more advanced customization options, you can submit the training job using the SageMaker HyperPod Python SDK. For more information, refer to the SDK reference documentation. The following code will yield the equivalent training job submission to the preceding CLI example:
Debugging training jobs
In addition to monitoring the training pod logs as described earlier, there are several other useful ways of debugging training jobs:
- Submit the training job with the --debug True flag, which will print the Kubernetes YAML to the console when the job starts so it can be inspected.
- List training jobs by running hyp list hyp-pytorch-job.
- Describe a training job by running hyp describe hyp-pytorch-job --job-name fsdp-llama3-1-8b.
- Run hyp get-monitoring --grafana and hyp get-monitoring --prometheus to get the Grafana dashboard and Prometheus workspace URLs, respectively, to view cluster and job metrics.
- Run kubectl exec -it <pod-name> -- nvtop to run nvtop for visibility into GPU utilization. You can open an interactive shell by running kubectl exec -it <pod-name> -- /bin/bash.
- Run kubectl get pods -n aws-hyperpod | grep hp-training-controller-manager to find the training operator controller pod name, and run kubectl logs -n aws-hyperpod <controller-pod-name> to view the training operator logs.
Deploy models with the SageMaker HyperPod CLI and SDK
The SageMaker HyperPod CLI provides commands to quickly deploy models to your SageMaker HyperPod cluster for inference. You can deploy both foundation models (FMs) available on Amazon SageMaker JumpStart and custom models whose artifacts are stored on Amazon S3 or FSx for Lustre file systems.
This functionality automatically deploys the chosen model to the SageMaker HyperPod cluster through Kubernetes custom resources, which are implemented by the HyperPod inference operator that needs to be installed in the cluster, as discussed in the prerequisites section. Optionally, you can also automatically create a SageMaker inference endpoint and an Application Load Balancer (ALB), which can be used to invoke the model directly through HTTPS calls with a generated TLS certificate.
Deploy SageMaker JumpStart models
You can deploy an FM that is available on SageMaker JumpStart with the following command:
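A sketch of the deployment command (flag names are those described in this post; the model ID, endpoint name, and S3 URI are illustrative placeholders):

```shell
# Placeholder values; the model is the DeepSeek-R1-distilled Qwen 1.5B from this post's example
hyp create hyp-jumpstart-endpoint \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-jumpstart-deepseek \
  --tls-certificate-output-s3-uri s3://<bucket-name>/tls/ \
  --namespace default
```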
The preceding code includes the following parameters:
- --model-id is the model ID in the SageMaker JumpStart model hub. In this example, we deploy a DeepSeek-R1-distilled version of Qwen 1.5B, which is available on SageMaker JumpStart.
- --instance-type is the target instance type in your SageMaker HyperPod cluster where you want to deploy the model. This instance type must be supported by the chosen model.
- --endpoint-name is the name that the SageMaker inference endpoint will have. This name must be unique. SageMaker inference endpoint creation is optional.
- --tls-certificate-output-s3-uri is the S3 location where the TLS certificate for the ALB will be stored. It can be used to directly invoke the model through HTTPS. You can use S3 buckets that are accessible by the HyperPod inference operator IAM role.
- --namespace is the Kubernetes namespace the model will be deployed to. The default value is set to default.
The CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which can be viewed by running the following command:
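For instance (assuming the standard --help flag of the CLI):

```shell
# Show all supported arguments, including auto scaling configuration
hyp create hyp-jumpstart-endpoint --help
```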
If successful, the command will output the following:
After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which can be observed through the CLI. Running hyp list hyp-jumpstart-endpoint will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
To get additional visibility into the deployment pod, run the following commands to find the pod name and view the corresponding logs:
The output will look similar to the following:
You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:
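The invocation might look like the following sketch (the endpoint name matches the earlier placeholder; the payload shape depends on the model server and is an assumption):

```shell
hyp invoke hyp-jumpstart-endpoint \
  --endpoint-name endpoint-jumpstart-deepseek \
  --body '{"inputs": "What is Amazon SageMaker HyperPod?"}'
```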
You will get an output similar to the following:
Alternatively, if you prefer a programmatic experience and advanced customization options, you can use the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:
Deploy custom models
You can also use the CLI to deploy custom models with model artifacts stored on either Amazon S3 or FSx for Lustre. This is useful for models that have been fine-tuned on custom data. You must provide the storage location of the model artifacts, as well as a container image for inference that is compatible with the model artifacts and SageMaker inference endpoints. In the following example, we deploy a TinyLlama 1.1B model from Amazon S3 using the DJL Large Model Inference container image.
In preparation, download the model artifacts locally and push them to an S3 bucket:
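One way to stage the artifacts (the Hugging Face repository ID TinyLlama/TinyLlama-1.1B-Chat-v1.0, the local directory, and the S3 bucket/prefix are illustrative assumptions):

```shell
# Download the TinyLlama 1.1B model artifacts from the Hugging Face Hub
pip install -U "huggingface_hub[cli]"
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./tinyllama-1.1b

# Push the artifacts to an S3 bucket accessible by the HyperPod inference operator
aws s3 sync ./tinyllama-1.1b s3://<bucket-name>/models/tinyllama-1.1b/
```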
Now you can deploy the model with the following command:
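A sketch of the custom deployment (flag names other than --image-uri are those described in this post; --image-uri, the bucket, Region, and the DJL Large Model Inference image URI are assumptions to adapt to your environment):

```shell
# Placeholder values; the model artifacts were staged under models/tinyllama-1.1b/ in S3
hyp create hyp-custom-endpoint \
  --model-name tinyllama-1-1b \
  --model-source-type s3 \
  --model-location models/tinyllama-1.1b/ \
  --s3-bucket-name <bucket-name> \
  --s3-region us-west-2 \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-custom-tinyllama \
  --image-uri <djl-large-model-inference-image-uri>
```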
The preceding code contains the following key parameters:
- --model-name is the name of the model that will be created in SageMaker.
- --model-source-type specifies either fsx or s3 for the location of the model artifacts.
- --model-location specifies the prefix or folder where the model artifacts are located.
- --s3-bucket-name and --s3-region specify the S3 bucket name and AWS Region, respectively.
- --instance-type, --endpoint-name, --namespace, and --tls-certificate behave the same as for the deployment of SageMaker JumpStart models.
Similar to SageMaker JumpStart model deployment, the CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which you can view by running the following command:
If successful, the command will output the following:
After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which you can observe through the CLI. Running hyp list hyp-custom-endpoint will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
To get additional visibility into the deployment pod, run the following commands to find the pod name and view the corresponding logs:
The output will look similar to the following:
You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:
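The invocation follows the same pattern as for SageMaker JumpStart endpoints (the endpoint name matches the earlier placeholder; the payload shape is an assumption):

```shell
hyp invoke hyp-custom-endpoint \
  --endpoint-name endpoint-custom-tinyllama \
  --body '{"inputs": "What is Amazon SageMaker HyperPod?"}'
```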
You will get an output similar to the following:
Alternatively, you can deploy using the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:
Debugging inference deployments
In addition to monitoring the inference pod logs, there are several other useful ways of debugging inference deployments:
- Run hyp get-operator-logs --since-hours 0.5 for custom and SageMaker JumpStart deployments, respectively, to access the inference operator logs.
- Use the hyp list and hyp describe commands (with the endpoint --name) to check deployment status and details.
- Run hyp get-monitoring --grafana and hyp get-monitoring --prometheus to get the Grafana dashboard and Prometheus workspace URLs, respectively, to view inference metrics as well.
- Run kubectl exec -it <pod-name> -- nvtop to run nvtop for visibility into GPU utilization. You can open an interactive shell by running kubectl exec -it <pod-name> -- /bin/bash.
For more information on the inference deployment features in SageMaker HyperPod, see Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle and Deploying models on Amazon SageMaker HyperPod.
Clean up
To delete the training job from the corresponding example, use the following CLI command:
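For example (the job name is the one used in this post's training example):

```shell
hyp delete hyp-pytorch-job --job-name fsdp-llama3-1-8b
```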
To delete the model deployments from the inference example, use the following CLI commands for SageMaker JumpStart and custom model deployments, respectively:
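A sketch of the two deletions (the --name flag and the endpoint names are assumptions matching the earlier placeholders):

```shell
# Delete the SageMaker JumpStart model deployment
hyp delete hyp-jumpstart-endpoint --name endpoint-jumpstart-deepseek

# Delete the custom model deployment
hyp delete hyp-custom-endpoint --name endpoint-custom-tinyllama
```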
To avoid incurring ongoing costs for the instances running in your cluster, you can scale down or delete the instances.
Conclusion
The new SageMaker HyperPod CLI and SDK can significantly streamline the process of training and deploying large-scale AI models. Through the examples in this post, we’ve demonstrated how these tools provide the following benefits:
Getting started with these tools is as simple as installing the sagemaker-hyperpod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.
For more information about SageMaker HyperPod and these development tools, refer to the SageMaker HyperPod CLI and SDK documentation or explore the example notebooks.
About the authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles in Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Masters of Science in Financial Engineering, both from New York University.
Nicolas Jourdan is a Specialist Solutions Architect at AWS, where he helps customers unlock the full potential of AI and ML in the cloud. He holds a PhD in Engineering from TU Darmstadt in Germany, where his research focused on the reliability, concept drift detection, and MLOps of industrial ML applications. Nicolas has extensive hands-on experience across industries, including autonomous driving, drones, and manufacturing, having worked in roles ranging from research scientist to engineering manager. He has contributed to award-winning research, holds patents in object detection and anomaly detection, and is passionate about applying cutting-edge AI to solve complex real-world problems.