Track LLM model evaluation using Amazon SageMaker managed MLflow and FMEval


Evaluating large language models (LLMs) is crucial as LLM-based systems become increasingly powerful and relevant in our society. Rigorous testing allows us to understand an LLM’s capabilities, limitations, and potential biases, and provide actionable feedback to identify and mitigate risk. Furthermore, evaluation processes are important not only for LLMs, but are becoming essential for assessing prompt template quality, input data quality, and ultimately, the entire application stack. As LLMs take on more significant roles in areas like healthcare, education, and decision support, robust evaluation frameworks are vital for building trust and realizing the technology’s potential while mitigating risks.

Developers interested in using LLMs should prioritize a comprehensive evaluation process for several reasons. First, it assesses the model’s suitability for specific use cases, because performance can vary significantly across different tasks and domains. Evaluations are also a fundamental tool during application development to validate the quality of prompt templates. This process makes sure that solutions align with the company’s quality standards and policy guidelines before deploying them to production. Evaluating at regular intervals also allows organizations to stay informed about the latest advancements and make informed decisions about upgrading or switching models. Moreover, a thorough evaluation framework helps companies address potential risks when using LLMs, such as data privacy concerns, regulatory compliance issues, and reputational risk from inappropriate outputs. By investing in robust evaluation practices, companies can maximize the benefits of LLMs while maintaining responsible AI implementation and minimizing potential drawbacks.

To support robust generative AI application development, it’s essential to keep track of models, prompt templates, and datasets used throughout the process. This record-keeping allows developers and researchers to maintain consistency, reproduce results, and iterate on their work effectively. By documenting the specific model versions, fine-tuning parameters, and prompt engineering techniques employed, teams can better understand the factors contributing to their AI system’s performance. Similarly, maintaining detailed information about the datasets used for training and evaluation helps identify potential biases and limitations in the model’s knowledge base. This comprehensive approach to tracking key components not only facilitates collaboration among team members but also enables more accurate comparisons between different iterations of the AI application. Ultimately, this systematic approach to managing models, prompts, and datasets contributes to the development of more reliable and transparent generative AI applications.

In this post, we show how to use FMEval and Amazon SageMaker to programmatically evaluate LLMs. FMEval is an open source LLM evaluation library, designed to provide data scientists and machine learning (ML) engineers with a code-first experience to evaluate LLMs for various aspects, including accuracy, toxicity, fairness, robustness, and efficiency. In this post, we focus only on the quality and responsibility aspects of model evaluation, but the same approach can be extended by using other libraries for evaluating performance and cost, such as LLMeter and FMBench, or richer quality evaluation capabilities like those provided by Amazon Bedrock Evaluations.

SageMaker is a data, analytics, and AI/ML platform, which we will use in conjunction with FMEval to streamline the evaluation process. We specifically focus on SageMaker with MLflow. MLflow is an open source platform for managing the end-to-end ML lifecycle, including experimentation, reproducibility, and deployment. The managed MLflow in SageMaker simplifies the deployment and operation of tracking servers, and offers seamless integration with other AWS services, making it straightforward to track experiments, package code into reproducible runs, and share and deploy models.

By combining FMEval’s evaluation capabilities with SageMaker with MLflow, you can create a robust, scalable, and reproducible workflow for assessing LLM performance. This approach can enable you to systematically evaluate models, track results, and make data-driven decisions in your generative AI development process.

Using FMEval for model evaluation

FMEval is an open-source library for evaluating foundation models (FMs). It consists of three main components:

  • Data config – Specifies the dataset location and its structure.
  • Model runner – Composes the input, invokes your model, and extracts the output. Thanks to this construct, you can evaluate any LLM by configuring the model runner according to your model.
  • Evaluation algorithm – Computes evaluation metrics from the model outputs. Different algorithms expose different metrics and configuration options.

The library provides pre-built components for both Amazon Bedrock and Amazon SageMaker JumpStart, and you can create custom ones by inheriting from the base core component. The library supports various evaluation scenarios, including pre-computed model outputs and on-the-fly inference. FMEval offers flexibility in dataset handling, model integration, and algorithm implementation. Refer to Evaluate large language models for quality and responsibility or the Evaluating Large Language Models with fmeval paper to dive deeper into FMEval, or see the official GitHub repository.
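
To see how these pieces fit together, the following minimal sketch wires a data config, a Bedrock model runner, and the summarization accuracy algorithm into a single evaluate() call. Treat it as an illustration under assumptions: the dataset path and field names are hypothetical, and you can swap in any Bedrock model ID you have access to.

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# 1) Data config: where the dataset lives and which fields hold inputs and reference outputs
data_config = DataConfig(
    dataset_name="my_dataset",                      # hypothetical dataset
    dataset_uri="datasets/my_dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)

# 2) Model runner: composes the request payload and extracts the answer from the response
model_runner = BedrockModelRunner(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    content_template='{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 256,'
                     ' "messages": [{"role": "user", "content": [{"type": "text", "text": $prompt}]}]}',
    output="content[0].text",
)

# 3) Evaluation algorithm: runs the dataset through the model and computes the metrics
eval_algo = SummarizationAccuracy()
eval_outputs = eval_algo.evaluate(
    model=model_runner,
    dataset_config=data_config,
    prompt_template="Summarize the following text in one sentence: $model_input",
    save=True,
)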

Using SageMaker with MLflow to track experiments

The fully managed MLflow capability on SageMaker is built around three core components:

  • MLflow tracking server – This component can be quickly set up through the Amazon SageMaker Studio interface or using the API for more granular configurations. It functions as a standalone HTTP server that provides various REST API endpoints for monitoring, recording, and visualizing experiment runs. This allows you to keep track of your ML experiments.
  • MLflow metadata backend – This crucial part of the tracking server is responsible for storing all the essential information about your experiments. It keeps records of experiment names, run identifiers, parameter settings, performance metrics, tags, and locations of artifacts. This comprehensive data storage makes sure that you can effectively manage and analyze your ML projects.
  • MLflow artifact repository – This component serves as a storage space for all the files and objects generated during your ML experiments. These can include trained models, datasets, log files, and visualizations. The repository uses an Amazon Simple Storage Service (Amazon S3) bucket within your AWS account, making sure that your artifacts are stored securely and remain under your control.
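
To use a managed tracking server from your code, you point the MLflow client at the tracking server ARN. The following is a minimal connection sketch, assuming a placeholder ARN and that the mlflow and sagemaker-mlflow packages are installed in your environment.

import mlflow

# Placeholder ARN: replace with the ARN of your SageMaker managed MLflow tracking server
tracking_server_arn = "arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>"

# With the sagemaker-mlflow plugin installed (pip install mlflow sagemaker-mlflow),
# the standard MLflow client APIs work against the managed tracking server
mlflow.set_tracking_uri(tracking_server_arn)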

The following diagram depicts the different components and where they run within AWS.

Code walkthrough

You can follow the full sample code from the GitHub repository.

Prerequisites

You must have the following prerequisites:

Refer to the documentation on best practices for AWS Identity and Access Management (IAM) policies for SageMaker, MLflow, and Amazon Bedrock to set up permissions for the SageMaker execution role. Remember to always follow the principle of least privilege.

Evaluate a model and log to MLflow

We provide two sample notebooks to evaluate models hosted in Amazon Bedrock (Bedrock.ipynb) and models deployed to SageMaker Hosting using SageMaker JumpStart (JumpStart.ipynb). The workflow implemented in these two notebooks is essentially the same, although a few differences are noteworthy:

  • Models hosted in Amazon Bedrock can be consumed directly using an API without any setup, providing a “serverless” experience, whereas models in SageMaker JumpStart require the user first to deploy the models. Although deploying models through SageMaker JumpStart is a straightforward operation, the user is responsible for managing the lifecycle of the endpoint.
  • ModelRunners implementations differ. FMEval provides native implementations for both Amazon Bedrock, using the BedrockModelRunner class, and SageMaker JumpStart, using the JumpStartModelRunner class. We discuss the main differences in the following section.

ModelRunner definition

For BedrockModelRunner, we need to find the model’s content_template. We can find this information conveniently on the Amazon Bedrock console in the API request sample section, by looking at the value of the body field. The following example is the content template for Anthropic’s Claude 3 Haiku:

from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

output_jmespath = "content[0].text"
content_template = """{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 512,
  "temperature": 0.5,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": $prompt
        }
      ]
    }
  ]
}"""

model_runner = BedrockModelRunner(
    model_id=model_id,
    output=output_jmespath,
    content_template=content_template,
)

For JumpStartModelRunner, we need to find the model_id and model_version. This information can be retrieved directly using the get_model_info_from_endpoint(endpoint_name=endpoint_name) utility provided by the SageMaker Python SDK, where endpoint_name is the name of the SageMaker endpoint where the SageMaker JumpStart model is hosted. See the following code example:

from sagemaker.jumpstart.session_utils import get_model_info_from_endpoint
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner

model_id, model_version, _, _, _ = get_model_info_from_endpoint(endpoint_name=endpoint_name)

model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
)

DataConfig definition

For each model runner, we want to evaluate three categories: Summarization, Factual Knowledge, and Toxicity. For each of these categories, we prepare a DataConfig object for the appropriate dataset. The following example shows only the DataConfig for the Summarization category:

from pathlib import Path

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig

dataset_path = Path("datasets")

dataset_uri_summarization = dataset_path / "gigaword_sample.jsonl"
if not dataset_uri_summarization.is_file():
    print("ERROR - please make sure the file, gigaword_sample.jsonl, exists.")

data_config_summarization = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri=dataset_uri_summarization.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)
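
Each line of the JSON Lines dataset is a standalone record whose field names must match the model_input_location and target_output_location values in the DataConfig. The record below is a hypothetical example that only illustrates the expected schema; it is not taken from the actual gigaword sample.

import json

# Hypothetical record showing the schema expected by the DataConfig above:
# "document" is the model input, "summary" is the reference output
example_record = {
    "document": "Japan's benchmark Nikkei index rose 1.2 percent on Monday as exporters gained on a weaker yen.",
    "summary": "Nikkei rises 1.2 percent.",
}
print(json.dumps(example_record))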

Evaluation sets definition

We can now create an evaluation set for each algorithm we want to use in our test. For the Summarization evaluation set, provide your own prompt according to the input signature identified earlier. fmeval uses $model_input as a placeholder to get the input from your evaluation dataset. See the following code:

from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

summarization_prompt = "Summarize the following text in one sentence: $model_input"

summarization_accuracy = SummarizationAccuracy()

evaluation_set_summarization = EvaluationSet(
  data_config_summarization,
  summarization_accuracy,
  summarization_prompt,
)
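
The factual knowledge and toxicity evaluation sets referenced in the next snippet are built the same way. The following sketch shows one possible construction; it assumes DataConfig objects named data_config_factual and data_config_toxicity prepared as shown earlier, and uses the EvaluationSet helper defined in the sample’s utils.py. Treat the prompts and the answer delimiter as illustrative values.

from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

# Factual knowledge: the target output can list several acceptable answers,
# separated by a delimiter (here "<OR>")
factual_knowledge = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
evaluation_set_factual = EvaluationSet(
    data_config_factual,
    factual_knowledge,
    "$model_input",
)

# Toxicity: scores the model's own generations, so the prompt simply forwards the input
toxicity = Toxicity(ToxicityConfig())
evaluation_set_toxicity = EvaluationSet(
    data_config_toxicity,
    toxicity,
    "$model_input",
)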

We are now ready to group the evaluation sets:

evaluation_list = [
    evaluation_set_summarization,
    evaluation_set_factual,
    evaluation_set_toxicity,
]

Evaluate and log to MLflow

We set up the MLflow experiment used to track the evaluations. We then create a new run for each model, and run all the evaluations for that model within that run, so that the metrics all appear together. We use the model_id as the run name to make it straightforward to identify this run as part of a larger experiment, and run the evaluation using the run_evaluation_sets() function defined in utils.py. See the following code:

run_name = f"{model_id}"

experiment_name = "fmeval-mlflow-simple-runs"
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name) as run:
    run_evaluation_sets(model_runner, evaluation_list)
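
Because every model gets its own run inside the same experiment, comparing several models is just a loop over their runners. The dictionary below is hypothetical; the pattern of one run per model is what matters.

# Hypothetical mapping of model IDs to ModelRunner instances configured as shown earlier
model_runners = {
    "anthropic.claude-3-haiku-20240307-v1:0": claude_model_runner,
    "meta.llama3-8b-instruct-v1:0": llama_model_runner,
}

for model_id, runner in model_runners.items():
    # One MLflow run per model, so all metrics for that model appear together
    with mlflow.start_run(run_name=model_id):
        run_evaluation_sets(runner, evaluation_list)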

It is up to the user to decide how best to organize the results in MLflow. A second possible approach is to use nested runs. The sample notebooks implement both approaches to help you decide which one best fits your needs.

experiment_name = "fmeval-mlflow-nested-runs"
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name, nested=True) as run:
    run_evaluation_sets_nested(model_runner, evaluation_list)

Run evaluations

Tracking the evaluation process involves storing information about three aspects:

  • The input dataset
  • The parameters of the model being evaluated
  • The scores for each evaluation

We provide a helper library (fmeval_mlflow) to abstract the logging of these aspects to MLflow, streamlining the interaction with the tracking server. For the information we want to store, we can refer to the following three functions:

  • log_input_dataset(data_config: DataConfig | list[DataConfig]) – Log one or more input datasets to MLflow for evaluation purposes
  • log_runner_parameters(model_runner: ModelRunner, custom_parameters_map: dict | None = None, model_id: str | None = None,) – Log the parameters associated with a given ModelRunner instance to MLflow
  • log_metrics(eval_output: list[EvalOutput], log_eval_output_artifact: bool = False) – Log metrics and artifacts for a list of EvalOutput instances to MLflow
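
To illustrate how these helpers work together, the following sketch shows what a function like run_evaluation_sets() might look like inside an MLflow run. This is not the actual implementation in utils.py; the EvaluationSet attribute names and the fmeval_mlflow import path are assumptions made for the sake of the example.

from fmeval_mlflow import log_input_dataset, log_runner_parameters, log_metrics

def run_evaluation_sets_sketch(model_runner, evaluation_list):
    """Illustrative only: evaluate each set and log datasets, parameters, and metrics."""
    # Log the input datasets used by the evaluations in this run
    log_input_dataset([ev.data_config for ev in evaluation_list])
    # Record how the model runner is configured (model ID, endpoint, templates, ...)
    log_runner_parameters(model_runner)
    for ev in evaluation_list:
        # Run the evaluation algorithm against the model through its runner ...
        eval_output = ev.algorithm.evaluate(
            model=model_runner,
            dataset_config=ev.data_config,
            prompt_template=ev.prompt_template,
        )
        # ... and push the resulting metrics (and optionally the raw outputs) to MLflow
        log_metrics(eval_output, log_eval_output_artifact=True)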

When the evaluations are complete, we can analyze the results directly in the MLflow UI for a first visual assessment.

In the following screenshots, we show the visualization differences between logging using simple runs and nested runs.

You might want to create your own custom visualizations. For example, spider plots are often used to make visual comparisons across multiple metrics. In the notebook compare_models.ipynb, we provide an example of how to use metrics stored in MLflow to generate such plots, which ultimately can also be stored in MLflow as part of your experiments. The following screenshots show some example visualizations.
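
If you prefer to build such comparisons programmatically, the metrics can be pulled from the tracking server into a pandas DataFrame. The sketch below assumes the simple-runs experiment name used earlier in this post.

import mlflow

# Pull all runs of the experiment (one per model) into a pandas DataFrame;
# metric columns are prefixed with "metrics." by mlflow.search_runs
runs_df = mlflow.search_runs(experiment_names=["fmeval-mlflow-simple-runs"])

metric_columns = [c for c in runs_df.columns if c.startswith("metrics.")]
comparison = runs_df.set_index("tags.mlflow.runName")[metric_columns]
print(comparison)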

Clean up

Once created, an MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop the tracking servers when they are not in use to save costs or delete them using the API or SageMaker Studio UI. For more details on pricing, see Amazon SageMaker pricing.

Similarly, if you deployed a model using SageMaker, endpoints are priced by deployed infrastructure time rather than by requests. You can avoid unnecessary charges by deleting your endpoints when you’re done with the evaluation.
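
The following is a minimal cleanup sketch using boto3; the tracking server name and endpoint name are placeholders for your own resources.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Stop the MLflow tracking server when it is not in use (or delete it entirely)
sagemaker_client.stop_mlflow_tracking_server(TrackingServerName="my-tracking-server")
# sagemaker_client.delete_mlflow_tracking_server(TrackingServerName="my-tracking-server")

# Delete the SageMaker endpoint (and its configuration) created for the JumpStart model
sagemaker_client.delete_endpoint(EndpointName="my-jumpstart-endpoint")
sagemaker_client.delete_endpoint_config(EndpointConfigName="my-jumpstart-endpoint")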

Conclusion

In this post, we demonstrated how to create an evaluation framework for LLMs by combining SageMaker managed MLflow with FMEval. This integration provides a comprehensive solution for tracking and evaluating LLM performance across different aspects including accuracy, toxicity, and factual knowledge.

To enhance your evaluation journey, you can explore the following:

  • Get started with FMEval and SageMaker managed MLflow by following our code examples in the provided GitHub repository
  • Implement systematic evaluation practices in your LLM development workflow using the demonstrated approach
  • Use MLflow’s tracking capabilities to maintain detailed records of your evaluations, making your LLM development process more transparent and reproducible
  • Explore different evaluation metrics and datasets available in FMEval to comprehensively assess your LLM applications

By adopting these practices, you can build more reliable and trustworthy LLM applications while maintaining a clear record of your evaluation process and results.


About the authors

Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunications Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Dr. Alessandro Cerè is a GenAI Evaluation Specialist and Solutions Architect at AWS. He assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations. Bringing a unique perspective to the field of AI, Alessandro has a background in quantum physics and research experience in quantum communications and quantum memories. In his spare time, he pursues his passion for landscape and underwater photography.
