Evaluate models or RAG systems using Amazon Bedrock Evaluations – Now generally available

Organizations deploying generative AI applications need robust ways to evaluate their performance and reliability. When we launched LLM-as-a-judge (LLMaaJ) and Retrieval Augmented Generation (RAG) evaluation capabilities in public preview at AWS re:Invent 2024, customers used them to assess their foundation models (FMs) and generative AI applications, but asked for more flexibility beyond Amazon Bedrock models and knowledge bases. Today, we’re excited to announce the general availability of these evaluation features in Amazon Bedrock Evaluations, along with significant enhancements that make them fully environment-agnostic.

The general availability release introduces “bring your own inference responses” capabilities for both RAG evaluation and model evaluation. This means you can now evaluate a RAG system or model—whether running on Amazon Bedrock, other cloud providers, or on premises—as long as you provide evaluation data in the required format. You can even evaluate the final responses of a fully built application. We’ve also added new citation metrics to the already-powerful RAG evaluation suite, including citation precision and citation coverage, to help you better assess how accurately your RAG system uses retrieved information. These metrics provide deeper insights into whether responses are properly grounded in the source documents and whether they use the most relevant retrieved passages.

In this post, we explore these new features in detail, showing you how to evaluate both RAG systems and models with practical examples. We demonstrate how to use the comparison capabilities to benchmark different implementations and make data-driven decisions about your AI deployments. Whether you’re using Amazon Bedrock or other environments for your generative AI applications, these evaluation tools can help you optimize performance and promote quality across your entire generative AI portfolio.

What’s new: RAG evaluation

In this section, we discuss what’s new with RAG evaluation, including the introduction of Bring Your Own Inference responses for evaluating external RAG systems and new citation metrics that measure precision and coverage of RAG citations.

Bring Your Own Inference responses

The RAG evaluation capabilities have been significantly expanded to allow evaluation of outputs from RAG systems hosted anywhere through the evaluation environment in Amazon Bedrock. With Bring Your Own Inference (BYOI) responses, you can now evaluate retrieval and generation results from a variety of sources, including other FM providers, custom-built RAG systems, or deployed open-weights solutions, by providing the outputs in the required format. Additionally, new citation metrics, which join our previously released quality and responsible AI metrics, provide deeper insights into how well RAG systems use their knowledge bases and source documents.

When evaluating retrieval results from a knowledge base, we allow one knowledge base to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn currently has a prompt and referenceResponses. When using BYOI, you must provide your retrieval results in a new field called output (this field is required for BYOI jobs and is not used for non-BYOI jobs). You must provide a knowledgeBaseIdentifier for every output. Additionally, for every retrieval result you bring, you can provide a name and additional metadata in the form of key-value pairs. The following is the input JSONL format for RAG evaluation jobs (Retrieve). Fields marked with ? are optional.

In the public preview period that ended March 20, 2025, the following dataset format contained a key called referenceContexts. This field has now been changed to referenceResponses to align with the Retrieve and Generate evaluation. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages or chunks retrieved from the knowledge base.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": string
          }
        ]
      },
      "referenceResponses"?: [
        {
          "content": [
            {
              "text": string
            }
          ]
        }
      ],
       "output"?: {
        "knowledgeBaseIdentifier": string
        "retrievedResults": {
          "retrievalResults": [
            {
              "name"?: string
              "content": {
                "text": string
              },
              "metadata"?: {
                [key: string]: string
              }
            }
          ]
        }
      }
    }
  ]
}
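
For illustration, the following is a hypothetical single record in this format; the question, reference answer, passage, and metadata are placeholders invented for this example, and the knowledge base identifier simply reuses the third-party-RAG name from later in this post. In the JSONL dataset, each such record appears on its own line.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [{ "text": "What year was the company incorporated?" }]
      },
      "referenceResponses": [
        { "content": [{ "text": "The company was incorporated in 1994." }] }
      ],
      "output": {
        "knowledgeBaseIdentifier": "third-party-RAG",
        "retrievedResults": {
          "retrievalResults": [
            {
              "name": "filing-chunk-1",
              "content": { "text": "The registrant was incorporated in 1994 in the State of Washington." },
              "metadata": { "source": "10-K" }
            }
          ]
        }
      }
    }
  ]
}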

When evaluating retrieval and generation results from a knowledge base or RAG system, we allow one knowledge base (RAG system) to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn has a prompt and referenceResponses. When using BYOI, you must provide the generated text, retrieved passages (that is, the retrieval results), and citations for the generated text in a new field called output (this field is required for BYOI jobs and is not used for non-BYOI jobs). The citations field is used by the citation metrics. If you don’t have citations, you can enter dummy data in those fields and make sure not to select the citation precision or citation coverage metrics for the evaluation job. You must also provide a knowledgeBaseIdentifier for every output you bring. This knowledgeBaseIdentifier should be the same for all lines in the dataset and also at the job level. The following is the input JSONL format for RAG evaluation jobs (Retrieve and Generate). Fields marked with ? are optional.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": string
          }
        ]
      },
      "referenceResponses"?: [
        {
          "content": [
            {
              "text": string
            }
          ]
        }
      ],
      "output"?: {
        "text": string
        "modelIdentifier"?: string
        "knowledgeBaseIdentifier": string
        "retrievedPassages": {
          "retrievalResults": [
            {
              "name"?: string
              "content": {
                "text": string
              },
              "metadata"?: {
                [key: string]: string
              }
            }
          ]
        },
        "citations": [
          {
            "generatedResponsePart": {
              "textResponsePart": {
                "span": {
                  "start": number
                  "end": number
                  
                },
                "text": string
              }
            },
            "retrievedReferences": [
              {
                "name"?: string
                "content": {
                  "text": string
                },
                "metadata"?: {
                  [key: string]: string
                }
              }
            ]
          }
        ]
      }
    }
  ]
}

Citation metrics

Citation precision measures how accurately a RAG system cites its sources by evaluating whether the cited passages actually contain information used in the response. This metric helps identify when responses include unnecessary or irrelevant citations—a common issue where models might cite passages that weren’t actually used to generate the response content. The metric produces a score between 0–1, where 1 indicates perfect citation precision (all cited passages were relevant and used in the response), and 0 indicates complete imprecision (none of the cited passages should have been used). This helps teams identify and fix cases where their RAG systems are citing passages indiscriminately rather than selectively citing only the passages that contributed to the response.

Citation coverage evaluates how well a response’s content is supported by its citations, focusing on whether all information derived from retrieved passages has been properly cited. This metric works by comparing the faithfulness of the response to both the cited passages and the full set of retrieved passages, then calculating their ratio. In cases where the response is completely unfaithful to the retrieved passages (for example, hallucination), the metric indicates that citation coverage assessment isn’t applicable. The metric provides a score from 0–1, where scores closer to 0 indicate that the model failed to cite relevant supporting passages, and scores closer to 1 suggest proper citation of information from the retrieved passages. Importantly, a perfect score of 1 doesn’t necessarily mean all information in the response is cited—rather, it indicates that all information that could have been cited (based on the retrieved passages) was properly cited. This helps teams identify when their RAG systems are missing opportunities to cite relevant source material.
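
To make the ratio concrete, the following is a minimal conceptual sketch of how such a coverage score could be computed. It is not the service’s implementation; faithfulness_score is a hypothetical stand-in for a judge-model call that returns a 0–1 faithfulness rating of a response against a set of passages.

from typing import Callable, List, Optional

def citation_coverage(
    response: str,
    cited_passages: List[str],
    retrieved_passages: List[str],
    faithfulness_score: Callable[[str, List[str]], float],
) -> Optional[float]:
    # Conceptual sketch (not the Amazon Bedrock implementation): compare faithfulness
    # of the response to the cited passages versus the full set of retrieved passages.
    faith_to_retrieved = faithfulness_score(response, retrieved_passages)
    if faith_to_retrieved == 0:
        # The response is unfaithful to everything retrieved (for example, hallucination),
        # so citation coverage is not applicable.
        return None
    faith_to_cited = faithfulness_score(response, cited_passages)
    # Closer to 1: the cited passages account for most of the supported information.
    # Closer to 0: relevant supporting passages went uncited.
    return min(faith_to_cited / faith_to_retrieved, 1.0)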

What’s new: Model evaluation

In this section, we discuss what’s new with model evaluation, including expanded Bring Your Own Inference capabilities that allow you to evaluate model responses from external providers and deployed solutions through both LLMaaJ and human evaluation workflows.

Bring Your Own Inference responses

The model evaluation capabilities have been significantly expanded to allow evaluation of outputs from models hosted anywhere through the Amazon Bedrock evaluation environment. With BYOI responses, you can now evaluate model responses from other FM providers or deployed solutions by providing the outputs in the required format. BYOI is available for both LLMaaJ and human evaluation workflows.

You don’t have to limit these evaluations to foundation models. Because BYOI evaluation takes in the prompt and the output, it can evaluate the final response of a full application if you choose to bring that into your dataset.

When using LLMaaJ, only one model can be evaluated per evaluation job. Consequently, you must provide just a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLMaaJ with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
     "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

When using human evaluation, up to two models can be evaluated per evaluation job. Consequently, you can provide up to two entries in the modelResponses list, with each response requiring a unique identifier. The modelResponses field is required for BYOI jobs but not needed for non-BYOI jobs. The following is the input JSONL format for human evaluation with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        },
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}
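
For example, a hypothetical record comparing responses from two models might look like the following; the prompt, responses, and model identifiers are placeholders for this illustration.

{
    "prompt": "Summarize the key benefits of solar energy in two sentences.",
    "referenceResponse": "Solar energy is renewable and lowers electricity costs over time. It also produces no direct emissions during operation.",
    "category": "Summarization",
    "modelResponses": [
        {
            "response": "Solar power is a renewable source that reduces long-term energy bills. It generates electricity without producing direct greenhouse gas emissions.",
            "modelIdentifier": "model-a"
        },
        {
            "response": "Solar energy never runs out and can cut your power costs. Panels also produce no emissions while operating.",
            "modelIdentifier": "model-b"
        }
    ]
}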

Feature overview

The LLMaaJ evaluation workflow with BYOI enables you to systematically assess model outputs from your preferred source using Amazon Bedrock model evaluation capabilities. The process follows a straightforward flow, starting with selecting an evaluator model and configuring BYOI, then choosing appropriate metrics, uploading your evaluation datasets, running the assessment, and finally analyzing the detailed results. This comprehensive evaluation pipeline (as illustrated in the following diagram) helps provide consistent quality across your AI implementations, regardless of where they are deployed.

For RAG system evaluation, the workflow incorporates additional components and metrics specific to context retrieval analysis. The process begins with choosing an evaluator model, then selecting between retrieval-only or retrieve-and-generate evaluation modes. With BYOI support, you can evaluate RAG outputs from any source while using powerful LLMaaJ metrics to assess the quality of your retrievals or full end-to-end retrieve-and-generate flow of your RAG system. This end-to-end evaluation framework (see the following diagram) provides deep insights into how effectively your RAG system uses its retrieved context and generates accurate, well-supported responses.

Prerequisites

To use the LLMaaJ model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

Dataset description and preparation

For demonstrating an LLMaaJ evaluation job with BYOI, we created a shopping math problems dataset with a third-party model. The dataset contains 30 discount calculation problems, each with a prompt, reference answer, and model response following the required BYOI format. Each record uses the "third-party-model" identifier in the modelResponses array, enabling the LLMaaJ evaluator to assess response quality against your chosen metrics.

The dataset we used for a RAG evaluation job with BYOI was created using Amazon’s 10-K SEC filing information. Each record in the dataset contains questions about Amazon’s corporate information (such as SEC file numbers, operating segments, and fiscal year reporting), reference responses, and third-party model-generated answers with their corresponding retrieved passages and citations. The dataset follows the required BYOI format, with each record using "third-party-RAG" as the knowledge base identifier. This structure enables evaluation of both retrieve and generate quality metrics and citation accuracy, demonstrating how organizations can assess RAG systems regardless of where they’re deployed.

Start an LLMaaJ evaluation job with BYOI using the Amazon Bedrock console

In this first example, we use the prompt and inference responses dataset published in our amazon-bedrock-samples repository. You can use LLMaaJ with BYOI responses on Amazon Bedrock model evaluation to assess model performance through a user-friendly AWS Management Console interface. Follow these steps to start an evaluation job:

  1. On the Amazon Bedrock console, choose Inference and Assessment in the navigation pane and then choose Evaluations.
  2. On the Evaluations page, choose the Models tab.
  3. In the Model evaluations section, choose Create and choose Automatic: Model as a judge.
  4. On the Create automatic evaluation page, enter a name and description and under Evaluation model, select a model. This model will be used as a judge to evaluate your inference responses.
  5. Optionally, choose Tags, and create your own tags.
  6. Under Inference source, choose Bring your own inference responses as the inference source.
  7. For Source name, enter a name that is the same as the modelIdentifier in your prompt and inference response dataset.

    {
        "prompt": "If books cost \\$10.21 each and you buy 15 of them with a 20% discount, how much will you pay in total?",
        "referenceResponse": "The total price will be \\$122.52. Original price: \\$153.15 minus 20% discount (\\$30.630000000000003)",
        "category": "Shopping Math",
        "modelResponses": [{
            "response": "To determine the total cost of buying 15 books with a 20% discount, we can follow these steps:\n\n1. Calculate the total cost of the books before the discount.\n2. Determine the discount amount.\n3. Subtract the discount amount from the total cost before the discount.\n\n**Step 1: Calculate the total cost of the books before the discount.**\n\nThe cost of one book is \\$10.21. Therefore, the cost of 15 books is:\n\\[ 15 \\times 10.21 = 153.15 \\]\n\n**Step 2: Determine the discount amount.**\n\nThe discount is 20%, so we need to find 20% of \\$153.15. To do this, we multiply \\$153.15 by 0.20:\n\\[ 153.15 \\times 0.20 = 30.63 \\]\n\n**Step 3: Subtract the discount amount from the total cost before the discount.**\n\nThe total cost after the discount is:\n\\[ 153.15 - 30.63 = 122.52 \\]\n\nTherefore, the total amount you will pay is:\n\\[ \\boxed{122.52} \\]",
            "modelIdentifier": "third-party-model"
        }]
    }
  8. Under Metrics, select the metrics you want to use to evaluate the model response (such as helpfulness, correctness, faithfulness, relevance, and harmfulness).
  9. Under Datasets, for Choose a prompt dataset and Evaluation results, enter the S3 URIs directly or choose Browse S3 to select them.
  10. Under Amazon Bedrock IAM role – Permissions, select or create an IAM service role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, and the models being used in the job. If you create a new IAM role in the evaluation setup, the service will automatically give the role the proper permissions for the job.
  11. Specify the folder in the S3 bucket where the results of the model evaluation will be stored and choose Create.

When you choose Create, an input dataset validator checks that the format of your input prompt dataset is correct and helps you fix any formatting errors.
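
If you want to catch formatting problems before uploading your dataset to Amazon S3, a quick local check along the following lines can help. This is a simple sketch based on the LLMaaJ BYOI format described earlier, not the built-in validator, and the file name is a placeholder.

    import json

    REQUIRED_RESPONSE_KEYS = {"response", "modelIdentifier"}

    def check_llmaaj_byoi_dataset(path: str) -> None:
        """Lightweight local format check for an LLMaaJ BYOI JSONL dataset (sketch only)."""
        with open(path, "r", encoding="utf-8") as f:
            for line_number, line in enumerate(f, start=1):
                record = json.loads(line)  # raises a ValueError if the line is not valid JSON
                assert "prompt" in record, f"line {line_number}: missing 'prompt'"
                responses = record.get("modelResponses", [])
                assert len(responses) == 1, f"line {line_number}: LLMaaJ BYOI expects exactly one model response"
                for model_response in responses:
                    missing = REQUIRED_RESPONSE_KEYS - model_response.keys()
                    assert not missing, f"line {line_number}: missing {missing}"
        print("Basic format checks passed.")

    check_llmaaj_byoi_dataset("shopping_math_byoi.jsonl")  # hypothetical file name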

You will see that the evaluation job status is In Progress. Wait for the job status to change to Complete. This may take minutes or hours, depending on the length of your prompts and responses and on whether you used a few prompts or hundreds.

  12. When the job is complete, choose the job to see its details and metrics summary.
  13. To view generation metrics details, scroll down in the model evaluation report and choose one of the individual metrics (like helpfulness or correctness) to see its detailed breakdown.
  14. To see each record’s prompt input, generation output (from bring your own inference responses), ground truth (which is optional), and individual scores, choose a metric and choose Prompt details.
  15. Hover over an individual score to view its detailed explanation.

Start an LLMaaJ evaluation job with BYOI using the Python SDK and APIs

To use the Python SDK for creating an LLMaaJ model evaluation job with your own inference responses, use the following steps.

  1. Set up the required configurations:
    import boto3
    import json
    import random
    from datetime import datetime
    import botocore
    
    # AWS Configuration
    REGION = "us-east-1"
    ROLE_ARN = "arn:aws:iam:::role/"
    BUCKET_NAME = ""
    PREFIX = ""
    dataset_custom_name = "" # without the ".jsonl extension
    
    # Initialize AWS clients
    bedrock_client = boto3.client('bedrock', region_name=REGION)
    s3_client = boto3.client('s3', region_name=REGION)
  2. Use the following function to create an LLMaaJ model evaluation job with your own inference responses. The precomputedInferenceSource parameter is used when you’re bringing your own pre-generated inference responses. The inferenceSourceIdentifier value must match the model identifier you used in your evaluation dataset. This identifier represents the model that generated the responses you want to evaluate.
    def create_llm_judge_evaluation(
        client,
        job_name: str,
        role_arn: str,
        input_s3_uri: str,
        output_s3_uri: str,
        evaluator_model_id: str,
        dataset_name: str = None,
        task_type: str = "General" # must be General for LLMaaJ
    ):    
        # All available LLM-as-judge metrics
        llm_judge_metrics = [
            "Builtin.Correctness",
            "Builtin.Completeness", 
            "Builtin.Faithfulness",
            "Builtin.Helpfulness",
            "Builtin.Coherence",
            "Builtin.Relevance",
            "Builtin.FollowingInstructions",
            "Builtin.ProfessionalStyleAndTone",
            "Builtin.Harmfulness",
            "Builtin.Stereotyping",
            "Builtin.Refusal"
        ]
    
        # Configure dataset
        dataset_config = {
            "name": dataset_name or "CustomDataset",
            "datasetLocation": {
                "s3Uri": input_s3_uri
            }
        }
    
        try:
            response = client.create_evaluation_job(
                jobName=job_name,
                roleArn=role_arn,
                applicationType="ModelEvaluation",
                evaluationConfig={
                    "automated": {
                        "datasetMetricConfigs": [
                            {
                                "taskType": task_type,
                                "dataset": dataset_config,
                                "metricNames": llm_judge_metrics
                            }
                        ],
                        "evaluatorModelConfig": {
                            "bedrockEvaluatorModels": [
                                {
                                    "modelIdentifier": evaluator_model_id
                                }
                            ]
                        }
                    }
                },
                inferenceConfig={
                    "models": [
                        {
                            "precomputedInferenceSource": {
                                "inferenceSourceIdentifier": "third-party-model"
                            }
                        }
                    ]
                },
                outputDataConfig={
                    "s3Uri": output_s3_uri
                }
            )
            return response
            
        except Exception as e:
            print(f"Error creating evaluation job: {str(e)}")
            raise
  3. Use the following code to create an LLMaaJ evaluation job, specifying the evaluator model and configuring all necessary parameters such as input dataset location, output path, and task type. The job configuration creates an evaluation task that will assess your model inference responses, with results being stored in the designated S3 bucket for later analysis.
    # Job Configuration
    evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
    job_name = f"llmaaj-third-party-model-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    # S3 Paths
    input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
    output_path = f"s3://{BUCKET_NAME}/{PREFIX}"
    
    # Create evaluation job
    try:
        llm_as_judge_response = create_llm_judge_evaluation(
            client=bedrock_client,
            job_name=job_name,
            role_arn=ROLE_ARN,
            input_s3_uri=input_data,
            output_s3_uri=output_path,
            evaluator_model_id=evaluator_model,
            task_type="General"
        )
        print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
    except Exception as e:
        print(f"✗ Failed to create evaluation job: {str(e)}")
        raise
  4. Monitor the progress of your evaluation job:
    # Get job ARN based on job type
    evaluation_job_arn = llm_as_judge_response['jobArn']
    
    # Check job status
    check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn) 
    print(f"Job Status: {check_status['status']}")

Start a RAG evaluation job with BYOI using the Amazon Bedrock console

When dealing with RAG systems, it’s important to evaluate both the quality of retrieval on its own and the quality of retrieval and generation together. We already provide several important metrics for those workflows, and now you can also evaluate the citation quality of your RAG systems. You can include citations when using Amazon Bedrock Knowledge Bases. In this section, we explore how you can use a RAG evaluation job to surface two important citation metrics. If your custom RAG system uses citations, you can plug your citation information into your input dataset with BYOI and use the citation precision and citation coverage metrics.

A best practice is to use the two citation metrics, citation precision and citation coverage, together to get a complete view of the citation quality of your RAG system.

To get started using the console, complete the following steps:

  1. On the Amazon Bedrock console, choose Inference and Assessment in the navigation pane and then choose Evaluations.
  2. Choose the RAG tab.
  3. Choose Create. This will take you to a new page where you can set up your RAG evaluation job.
  4. Under Evaluation details, enter a name and description and choose an Evaluator model. In the example, we chose Claude 3 Haiku for demonstration, but we provide a choice of several judge models for flexibility. This model will be used as a judge to evaluate your inference responses.
  5. Under Inference source, choose Bring your own inference responses as the inference source.
  6. For Source name, enter a name that is the same as the knowledgeBaseIdentifier in your prompt and inference response dataset. For example, the following is a snippet of one record in our evaluation dataset:
    {
        "conversationTurns": [{
            "prompt": {
                "content": [{
                    "text": "What is Amazon's SEC file number?"
                }]
            },
            "referenceResponses": [{
                "content": [{
                    "text": "Amazon's SEC file number is 000-22513."
                }]
            }],
            "output": {
                "text": "Amazon's SEC file number is 000-22513.",
                "modelIdentifier": "third-party-model",
                "knowledgeBaseIdentifier": "third-party-RAG",
                "retrievedPassages": {
                    "retrievalResults": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }, {
                        "content": {
                            "text": "AMAZON.COM, INC. (Exact name of registrant as specified in its charter)"
                        }
                    }]
                },
                "citations": [{
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 0,
                                "end": 11
                            },
                            "text": "Amazon's SEC"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "UNITED STATESSECURITIES AND EXCHANGE COMMISSION"
                        }
                    }]
                }, {
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 12,
                                "end": 22
                            },
                            "text": "file number"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }]
                }, {
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 23,
                                "end": 33
                            },
                            "text": "is 000-22513"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }]
                }]
            }
        }]
    }

Therefore, we use the source name third-party-RAG.

  7. Under Metrics, select the citation metrics Citation precision and Citation coverage. You can also select other metrics.
  8. Under Dataset and evaluation results S3 location, choose Browse S3 to enter the S3 URI for the evaluation input file, and specify the output location.
  9. Under Amazon Bedrock IAM role – Permissions, create a new service role or use an existing one.
  10. Choose Create.

If you see an error for the S3 URI, such as “Your S3 bucket does not have the required CORS settings,” you might need to edit the CORS settings on the bucket that contains your data. For more information, see Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets.

You can monitor your job; the status will be In progress while the job is running.

  11. When the job status is Completed, you can follow the link on a completed job to review results.

The results include a Metric summary section. In this particular example, both citation precision and citation coverage are high.

You can also set metric criteria to make sure the RAG system is performing up to your standards, and view example conversations below each metric.

Start a RAG evaluation job with BYOI using the Python SDK and APIs

To use the Python SDK for creating a RAG evaluation job with your own inference responses, follow these steps (or refer to our example notebook):

  1. Set up the required configurations, which should include your model identifier for the evaluator, IAM role with appropriate permissions, S3 paths for input data containing your inference responses, and output location for results:
    import boto3
    from datetime import datetime
    
    # Configure knowledge base and model settings
    evaluator_model = ""
    role_arn = "arn:aws:iam:::role/"
    BUCKET_NAME = ""
    PREFIX = ""
    RAG_dataset_custom_name = "" # without the ".jsonl" extension
    
    # Specify S3 locations
    input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{RAG_dataset_custom_name}.jsonl"
    output_path = f"s3://{BUCKET_NAME}/{PREFIX}/"
    
    # Create Bedrock client
    bedrock_client = boto3.client('bedrock')
  2. Use the following function to create a RAG evaluation job with your own pre-generated retrieval and generation responses. The precomputedRagSourceConfig parameter is used when you’re bringing your own pre-generated RAG responses for evaluation. The ragSourceIdentifier value must match the identifier you used in your evaluation dataset. This identifier represents the RAG system that produced the responses you want to evaluate. The code sets up an automated evaluation that will assess various dimensions of your RAG system’s performance, including correctness, completeness, helpfulness, logical coherence, faithfulness, and citation quality against the provided dataset.
    retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    retrieve_generate_job = bedrock_client.create_evaluation_job(
        jobName=retrieve_generate_job_name,
        jobDescription="Evaluate retrieval and generation",
        roleArn=role_arn,
        applicationType="RagEvaluation",
        inferenceConfig={
            "ragConfigs": [
                {
                    "precomputedRagSourceConfig": {
                        "retrieveAndGenerateSourceConfig": {
                            "ragSourceIdentifier": "third-party-RAG"  # Replace with your identifier
                        }
                    }
                }
            ]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",  
                    "dataset": {
                        "name": "RagDataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Helpfulness",
                        "Builtin.LogicalCoherence",
                        "Builtin.Faithfulness",
                        "Builtin.CitationPrecision",
                        "Builtin.CitationCoverage"
                    ]
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                }
            }
        }
    )
  3. After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions.
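
For example, continuing with the client and variables defined in the previous steps, a short status check and a listing of the result objects might look like the following. The assumption that results appear as objects under your output prefix is illustrative; the exact file layout is determined by the service.

    # Check the status of the RAG evaluation job
    job_status = bedrock_client.get_evaluation_job(
        jobIdentifier=retrieve_generate_job['jobArn']
    )['status']
    print(f"Job status: {job_status}")

    # After the job completes, list the result objects written under the output prefix
    s3_client = boto3.client('s3')
    response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=PREFIX)
    for obj in response.get('Contents', []):
        print(obj['Key'])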

Clean up

To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of this post.
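
For example, if you created a dedicated S3 bucket for this walkthrough, a minimal cleanup in Python might look like the following. The bucket name is a placeholder, and this deletes every object in the bucket, so only run it against a bucket that contains nothing you want to keep.

import boto3

BUCKET_NAME = "your-evaluation-bucket"  # placeholder: the bucket used for this walkthrough

s3 = boto3.resource("s3")
bucket = s3.Bucket(BUCKET_NAME)

# Delete all objects (and object versions, if versioning is enabled), then the bucket itself
bucket.object_versions.delete()
bucket.objects.all().delete()
bucket.delete()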

Conclusion

The general availability release of LLM-as-a-judge and RAG evaluation on Amazon Bedrock brings environment-agnostic “bring your own inference” capabilities, allowing organizations to evaluate RAG systems or models regardless of where they run. New citation metrics, added to the RAG evaluation suite, enhance information quality assessment by measuring citation precision and citation coverage, enabling data-driven comparisons across different implementations.

As organizations deploy more generative AI applications, robust evaluation becomes critical to maintaining quality, reliability, and responsible use. We encourage you to explore these new features through the Amazon Bedrock console and the examples provided in the AWS Samples GitHub repository. By implementing regular evaluation workflows, you can continuously improve your models and RAG systems to deliver the highest quality outputs for your specific use cases.

We encourage you to explore these capabilities of Amazon Bedrock Evaluations and discover how systematic evaluation can enhance your generative AI applications.


About the authors

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
