Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK
In this post, we demonstrate how to use the new Amazon SageMaker HyperPod CLI and SDK to streamline the process of training and deploying large AI models through practical examples of distributed training using Fully Sharded Data Parallel (FSDP) and model deployment for inference. The tools provide simplified workflows through straightforward commands for common tasks, while offering flexible development options through the SDK for more complex requirements, along with comprehensive observability features and production-ready deployment capabilities.

Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn’t be complex for data scientists and machine learning (ML) practitioners. The newly released command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod simplify how you can use the service’s distributed training and inference capabilities.
The SageMaker HyperPod CLI provides data scientists with an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for quick experimentation and iteration.
For more advanced use cases requiring fine-grained control, the SageMaker HyperPod SDK enables programmatic access to customize your ML workflows. Developers can use the SDK’s Python interface to precisely configure training and deployment parameters while maintaining the simplicity of working with familiar Python objects.
In this post, we demonstrate how to use both the CLI and SDK to train and deploy large language models (LLMs) on SageMaker HyperPod. We walk through practical examples of distributed training using Fully Sharded Data Parallel (FSDP) and model deployment for inference, showcasing how these tools streamline the development of production-ready generative AI applications.
Prerequisites
To follow the examples in this post, you must have the following prerequisites:
- An AWS account with access to SageMaker HyperPod, Amazon Simple Storage Service (Amazon S3), and Amazon FSx for Lustre.
- A local environment (either your local machine or a cloud-based compute environment) from which to run the SageMaker HyperPod CLI commands, configured as follows:
- An operating system based on Linux or macOS.
- Python 3.8, 3.9, 3.10, or 3.11 installed.
- The AWS Command Line Interface (AWS CLI) configured with the appropriate credentials to use the aforementioned services.
- A SageMaker HyperPod cluster orchestrated through Amazon Elastic Kubernetes Service (Amazon EKS), running with an instance group configured with 8 ml.g5.8xlarge instances. For more information on how to create and configure a new SageMaker HyperPod cluster, refer to Creating a SageMaker HyperPod cluster with Amazon EKS orchestration.
- An FSx for Lustre persistent volume claim (PVC) to store checkpoints. This can be created either at cluster creation time or separately.
Because the use cases that we demonstrate are about training and deploying LLMs with the SageMaker HyperPod CLI and SDK, you must also install the following Kubernetes operators in the cluster:
- HyperPod training operator – For installation instructions, see Installing the training operator.
- HyperPod inference operator – For installation instructions, see Setting up your HyperPod clusters for model deployment and the corresponding notebook.
Install the SageMaker HyperPod CLI
First, you must install the latest version of the SageMaker HyperPod CLI and SDK (the examples in this post are based on version 3.1.0). From the local environment, run the following command (you can also install in a Python virtual environment):
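For example, the installation can look like the following (the package name and minimum version are from this post; the virtual environment is optional):

```shell
# Optional: isolate the installation in a Python virtual environment
python3 -m venv hyperpod-env && source hyperpod-env/bin/activate

# Install (or upgrade to) the latest HyperPod CLI and SDK
pip install --upgrade "sagemaker-hyperpod>=3.1.0"
```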
This command sets up the tools needed to interact with SageMaker HyperPod clusters. For an existing installation, make sure you have the latest version of the package installed (sagemaker-hyperpod>=3.1.0) to be able to use the relevant set of features. To verify that the CLI is installed correctly, you can run the hyp command and check the output:
The output will be similar to the following, and includes instructions on how to use the CLI:
For more information on CLI usage and the available commands and respective parameters, refer to the CLI reference documentation.
Set the cluster context
The SageMaker HyperPod CLI and SDK use the Kubernetes API to interact with the cluster. Therefore, make sure the underlying Kubernetes Python client is configured to execute API calls against your cluster by setting the cluster context.
Use the CLI to list the clusters available in your AWS account:
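A sketch of listing the clusters and then setting the context (subcommand names as per the CLI reference documentation; ml-cluster is the example cluster name used in this post):

```shell
# List the SageMaker HyperPod clusters in the configured AWS account and Region
hyp list-cluster

# Point the underlying Kubernetes client at the target cluster
hyp set-cluster-context --cluster-name ml-cluster
```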
Set the cluster context specifying the cluster name as input (in our case, we use ml-cluster as the cluster name):
Train models with the SageMaker HyperPod CLI and SDK
The SageMaker HyperPod CLI provides a straightforward way to submit PyTorch model training and fine-tuning jobs to a SageMaker HyperPod cluster. In the following example, we schedule a Meta Llama 3.1 8B model training job with FSDP.
The CLI executes training using the HyperPodPyTorchJob Kubernetes custom resource, which is implemented by the HyperPod training operator that needs to be installed in the cluster, as discussed in the prerequisites section.
First, clone the awsome-distributed-training repository and create the Docker image that you will use for the training job:
Then, log in to Amazon Elastic Container Registry (Amazon ECR) to pull the base image and build the new container:
The Dockerfile in the awsome-distributed-training repository referenced in the preceding code already contains the HyperPod elastic agent, which orchestrates the lifecycles of training workers on each container and communicates with the HyperPod training operator. If you’re using a different Dockerfile, install the HyperPod elastic agent following the instructions in HyperPod elastic agent.
Next, create a new registry for your training image if needed and push the built image to it:
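A sketch of the clone, build, and push flow (the repository directory, account ID, Region, and repository name are placeholders to adapt to your environment):

```shell
# Clone the repository containing the FSDP training example
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training   # change into the FSDP example directory for your repo version

AWS_ACCOUNT=123456789012   # placeholder account ID
AWS_REGION=us-west-2       # placeholder Region
REPO=fsdp-training         # placeholder repository name

# Log in to ECR, create the repository if needed, then build and push the image
aws ecr get-login-password --region "$AWS_REGION" | \
  docker login --username AWS --password-stdin "$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com"
aws ecr create-repository --repository-name "$REPO" --region "$AWS_REGION" || true
docker build -t "$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:latest" .
docker push "$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:latest"
```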
After you have successfully created the Docker image, you can submit the training job using the SageMaker HyperPod CLI.
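The submission might look like the following sketch. The hyp create hyp-pytorch-job command and the --command, --environment, --max-retry, and --volume arguments are described in this post; the image URI, environment values, volume claim name, and the exact list/dictionary syntax for the argument values are illustrative assumptions (see the CLI reference documentation for the authoritative formats):

```shell
# Illustrative values only; run `hyp create hyp-pytorch-job --help` for the full argument list
hyp create hyp-pytorch-job \
  --job-name fsdp-llama3-1-8b \
  --image <account-id>.dkr.ecr.<region>.amazonaws.com/fsdp-training:latest \
  --command '[hyperpodrun]' \
  --environment '{"LOGLEVEL": "INFO"}' \
  --max-retry 3 \
  --volume name=fsx,type=pvc,mount_path=/fsx,claim_name=fsx-claim
```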
Internally, the SageMaker HyperPod CLI will use the Kubernetes Python client to build a HyperPodPyTorchJob custom resource and then create it on the Kubernetes cluster.
You can modify the CLI command for other Meta Llama configurations by changing the --args values to the desired arguments; examples can be found in the Kubernetes manifests in the awsome-distributed-training repository.
In the given configuration, the training job will write checkpoints to /fsx/checkpoints on the FSx for Lustre PVC.
The hyp create hyp-pytorch-job command supports additional arguments, which can be discovered by running the following:
The preceding example code contains the following relevant arguments:
- --command and --args offer flexibility in setting the command to be executed in the container. The command executed is hyperpodrun, implemented by the HyperPod elastic agent that is installed in the training container. The HyperPod elastic agent extends PyTorch’s ElasticAgent and manages the communication of the various workers with the HyperPod training operator. For more information, refer to HyperPod elastic agent.
- --environment defines environment variables and customizes the training execution.
- --max-retry indicates the maximum number of restarts at the process level that will be attempted by the HyperPod training operator. For more information, refer to Using the training operator to run jobs.
- --volume is used to map persistent or ephemeral volumes to the container.
If successful, the command will output the following:
You can observe the status of the training job through the CLI. Running hyp list hyp-pytorch-job will show the status first as Created and then as Running after the containers have been started:
To list the pods that are created by this training job, run the following command:
You can observe the logs of one of the training pods that get spawned by running the following command:
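Using kubectl directly, listing the pods and streaming their logs might look like the following (the job name fsdp-llama3-1-8b is the example from this post; the pod name is a placeholder):

```shell
# List the pods created for the training job
kubectl get pods | grep fsdp-llama3-1-8b

# Stream the logs of one of the training pods
kubectl logs -f <pod-name>
```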
We elaborate on more advanced debugging and observability features at the end of this section.
Alternatively, if you prefer a programmatic experience and more advanced customization options, you can submit the training job using the SageMaker HyperPod Python SDK. For more information, refer to the SDK reference documentation. The following code will yield the equivalent training job submission to the preceding CLI example:
Debugging training jobs
In addition to monitoring the training pod logs as described earlier, there are several other useful ways of debugging training jobs:
- Submit the training job with the --debug True flag, which will print the Kubernetes YAML to the console when the job starts so it can be inspected.
- List training jobs by running hyp list hyp-pytorch-job.
- Describe a training job by running hyp describe hyp-pytorch-job --job-name fsdp-llama3-1-8b.
- Run hyp get-monitoring --grafana and hyp get-monitoring --prometheus to get the Grafana dashboard and Prometheus workspace URLs, respectively, to view cluster and job metrics.
- Run kubectl exec -it <pod-name> -- nvtop to run nvtop for visibility into GPU utilization. You can open an interactive shell by running kubectl exec -it <pod-name> -- /bin/bash.
- Run kubectl get pods -n aws-hyperpod | grep hp-training-controller-manager to find the training operator controller pod name, and run kubectl logs -n aws-hyperpod <controller-pod-name> to view the training operator logs.
Deploy models with the SageMaker HyperPod CLI and SDK
The SageMaker HyperPod CLI provides commands to quickly deploy models to your SageMaker HyperPod cluster for inference. You can deploy both foundation models (FMs) available on Amazon SageMaker JumpStart and custom models whose artifacts are stored on Amazon S3 or FSx for Lustre file systems.
This functionality automatically deploys the chosen model to the SageMaker HyperPod cluster through Kubernetes custom resources, which are implemented by the HyperPod inference operator that needs to be installed in the cluster, as discussed in the prerequisites section. Optionally, you can also automatically create a SageMaker inference endpoint and an Application Load Balancer (ALB), which can be used to invoke the model directly through HTTPS calls with a generated TLS certificate.
Deploy SageMaker JumpStart models
You can deploy an FM that is available on SageMaker JumpStart with the following command:
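A sketch of the deployment command (flag names are those described in this post; the model ID, endpoint name, and S3 URI are illustrative placeholders):

```shell
# Placeholder values; the model is the DeepSeek-R1-distilled Qwen 1.5B from this post's example
hyp create hyp-jumpstart-endpoint \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-jumpstart-deepseek \
  --tls-certificate-output-s3-uri s3://<bucket-name>/tls/ \
  --namespace default
```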
The preceding code includes the following parameters:
- --model-id is the model ID in the SageMaker JumpStart model hub. In this example, we deploy a DeepSeek-R1-distilled version of Qwen 1.5B, which is available on SageMaker JumpStart.
- --instance-type is the target instance type in your SageMaker HyperPod cluster where you want to deploy the model. This instance type must be supported by the chosen model.
- --endpoint-name is the name that the SageMaker inference endpoint will have. This name must be unique. SageMaker inference endpoint creation is optional.
- --tls-certificate-output-s3-uri is the S3 location where the TLS certificate for the ALB will be stored. It can be used to directly invoke the model through HTTPS. You can use S3 buckets that are accessible by the HyperPod inference operator IAM role.
- --namespace is the Kubernetes namespace the model will be deployed to. The default value is set to default.
The CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which can be viewed by running the following command:
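For instance (assuming the standard --help flag of the CLI):

```shell
# Show all supported arguments, including auto scaling configuration
hyp create hyp-jumpstart-endpoint --help
```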
If successful, the command will output the following:
After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which can be observed through the CLI. Running hyp list hyp-jumpstart-endpoint will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
To get additional visibility into the deployment pod, run the following commands to find the pod name and view the corresponding logs:
The output will look similar to the following:
You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:
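The invocation might look like the following sketch (the endpoint name matches the earlier placeholder; the payload shape depends on the model server and is an assumption):

```shell
hyp invoke hyp-jumpstart-endpoint \
  --endpoint-name endpoint-jumpstart-deepseek \
  --body '{"inputs": "What is Amazon SageMaker HyperPod?"}'
```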
You will get an output similar to the following:
Alternatively, if you prefer a programmatic experience and advanced customization options, you can use the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:
Deploy custom models
You can also use the CLI to deploy custom models with model artifacts stored on either Amazon S3 or FSx for Lustre. This is useful for models that have been fine-tuned on custom data. You must provide the storage location of the model artifacts, as well as a container image for inference that is compatible with the model artifacts and SageMaker inference endpoints. In the following example, we deploy a TinyLlama 1.1B model from Amazon S3 using the DJL Large Model Inference container image.
In preparation, download the model artifacts locally and push them to an S3 bucket:
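One way to stage the artifacts (the Hugging Face repository ID TinyLlama/TinyLlama-1.1B-Chat-v1.0, the local directory, and the S3 bucket/prefix are illustrative assumptions):

```shell
# Download the TinyLlama 1.1B model artifacts from the Hugging Face Hub
pip install -U "huggingface_hub[cli]"
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./tinyllama-1.1b

# Push the artifacts to an S3 bucket accessible by the HyperPod inference operator
aws s3 sync ./tinyllama-1.1b s3://<bucket-name>/models/tinyllama-1.1b/
```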
Now you can deploy the model with the following command:
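A sketch of the custom deployment (flag names other than --image-uri are those described in this post; --image-uri, the bucket, Region, and the DJL Large Model Inference image URI are assumptions to adapt to your environment):

```shell
# Placeholder values; the model artifacts were staged under models/tinyllama-1.1b/ in S3
hyp create hyp-custom-endpoint \
  --model-name tinyllama-1-1b \
  --model-source-type s3 \
  --model-location models/tinyllama-1.1b/ \
  --s3-bucket-name <bucket-name> \
  --s3-region us-west-2 \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-custom-tinyllama \
  --image-uri <djl-large-model-inference-image-uri>
```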
The preceding code contains the following key parameters:
- --model-name is the name of the model that will be created in SageMaker.
- --model-source-type specifies either fsx or s3 for the location of the model artifacts.
- --model-location specifies the prefix or folder where the model artifacts are located.
- --s3-bucket-name and --s3-region specify the S3 bucket name and AWS Region, respectively.
- --instance-type, --endpoint-name, --namespace, and --tls-certificate behave the same as for the deployment of SageMaker JumpStart models.
Similar to SageMaker JumpStart model deployment, the CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which you can view by running the following command:
If successful, the command will output the following:
After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which you can observe through the CLI. Running hyp list hyp-custom-endpoint will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
To get additional visibility into the deployment pod, run the following commands to find the pod name and view the corresponding logs:
The output will look similar to the following:
You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:
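The invocation follows the same pattern as for SageMaker JumpStart endpoints (the endpoint name matches the earlier placeholder; the payload shape is an assumption):

```shell
hyp invoke hyp-custom-endpoint \
  --endpoint-name endpoint-custom-tinyllama \
  --body '{"inputs": "What is Amazon SageMaker HyperPod?"}'
```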
You will get an output similar to the following:
Alternatively, you can deploy using the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:
Debugging inference deployments
In addition to monitoring the inference pod logs, there are several other useful ways of debugging inference deployments:
- Run hyp get-operator-logs --since-hours 0.5 for custom and SageMaker JumpStart deployments, respectively, to access the inference operator logs.
- Use the hyp list and hyp describe commands (with the endpoint --name) to check deployment status and details.
- Run hyp get-monitoring --grafana and hyp get-monitoring --prometheus to get the Grafana dashboard and Prometheus workspace URLs, respectively, to view inference metrics as well.
- Run kubectl exec -it <pod-name> -- nvtop to run nvtop for visibility into GPU utilization. You can open an interactive shell by running kubectl exec -it <pod-name> -- /bin/bash.
For more information on the inference deployment features in SageMaker HyperPod, see Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle and Deploying models on Amazon SageMaker HyperPod.
Clean up
To delete the training job from the corresponding example, use the following CLI command:
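For example (the job name is the one used in this post's training example):

```shell
hyp delete hyp-pytorch-job --job-name fsdp-llama3-1-8b
```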
To delete the model deployments from the inference example, use the following CLI commands for SageMaker JumpStart and custom model deployments, respectively:
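A sketch of the two deletions (the --name flag and the endpoint names are assumptions matching the earlier placeholders):

```shell
# Delete the SageMaker JumpStart model deployment
hyp delete hyp-jumpstart-endpoint --name endpoint-jumpstart-deepseek

# Delete the custom model deployment
hyp delete hyp-custom-endpoint --name endpoint-custom-tinyllama
```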
To avoid incurring ongoing costs for the instances running in your cluster, you can scale down or delete the instances.
Conclusion
The new SageMaker HyperPod CLI and SDK can significantly streamline the process of training and deploying large-scale AI models. Through the examples in this post, we’ve demonstrated how these tools provide the following benefits:
Getting started with these tools is as simple as installing the sagemaker-hyperpod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.
For more information about SageMaker HyperPod and these development tools, refer to the SageMaker HyperPod CLI and SDK documentation or explore the example notebooks.
About the authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles in Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Masters of Science in Financial Engineering, both from New York University.
Nicolas Jourdan is a Specialist Solutions Architect at AWS, where he helps customers unlock the full potential of AI and ML in the cloud. He holds a PhD in Engineering from TU Darmstadt in Germany, where his research focused on the reliability, concept drift detection, and MLOps of industrial ML applications. Nicolas has extensive hands-on experience across industries, including autonomous driving, drones, and manufacturing, having worked in roles ranging from research scientist to engineering manager. He has contributed to award-winning research, holds patents in object detection and anomaly detection, and is passionate about applying cutting-edge AI to solve complex real-world problems.