Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval
In this post, we discuss best practices for working with Foundation Model Evaluations Library (FMEval) in ground truth curation and metric interpretation for evaluating question answering applications for factual knowledge and quality.
Generative artificial intelligence (AI) applications powered by large language models (LLMs) are rapidly gaining traction for question answering use cases. From internal knowledge bases for customer support to external conversational AI assistants, these applications use LLMs to provide human-like responses to natural language queries. However, building and deploying such assistants with responsible AI best practices requires a robust ground truth and evaluation framework to make sure they meet quality standards and user experience expectations, as well as clear evaluation interpretation guidelines to make the quality and responsibility of these systems intelligible to business decision-makers.
This post focuses on evaluating and interpreting metrics using FMEval for question answering in a generative AI application. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify, providing standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, refer to Evaluate large language models for quality and responsibility.
In this post, we discuss best practices for working with FMEval in ground truth curation and metric interpretation for evaluating question answering applications for factual knowledge and quality. Ground truth data in AI refers to data that is known to be true, representing the expected outcome for the system being modeled. By providing a true expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Ground truth curation and metric interpretation are tightly coupled, and the implementation of the evaluation metric must inform ground truth curation to achieve best results. By following these guidelines, data scientists can quantify the user experience delivered by their generative AI pipelines and communicate meaning to business stakeholders, facilitating ready comparisons across different architectures, such as Retrieval Augmented Generation (RAG) pipelines, off-the-shelf or fine-tuned LLMs, or agentic solutions.
Solution overview
We use an example ground truth dataset (referred to as the golden dataset, shown in the following table) of 10 question-answer-fact triplets. Each triplet describes a fact, and an encapsulation of the fact as a question-answer pair to emulate an ideal response, derived from a knowledge source document. We used Amazon’s Q2 2023 10Q report as the source document from the SEC’s public EDGAR dataset to create 10 question-answer-fact triplets. The 10Q report contains details on company financials and operations over the Q2 2023 business quarter. The golden dataset applies the ground truth curation best practices discussed in this post for most questions, but not all, to demonstrate the downstream impact of ground truth curation on metric results.
Question | Answer | Fact |
Who is Andrew R. Jassy? | Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc. | Chief Executive Officer of Amazon |
What were Amazon’s total net sales for the second quarter of 2023? | Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. | 134.4 billion |
Where is Amazon’s principal office located? | Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. | 410 Terry Avenue North |
What was Amazon’s operating income for the six months ended June 30, 2023? | Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion. | 12.5 billion |
When did Amazon acquire One Medical? | Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired. | Feb 22 2023 |
What was a key challenge faced by Amazon’s business in the second quarter of 2023? | Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023. | foreign exchange rates |
What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023? | Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion. | 50.1 billion |
What were Amazon’s AWS sales for the second quarter of 2023? | Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion. | 22.1 billion |
As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold? | As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock. | 158 million |
How many shares of common stock were outstanding as of July 21, 2023? | There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023. | 10317750796 |
We generated responses from three generative AI RAG pipelines (anonymized as Pipeline1, Pipeline2, Pipeline3, as shown in the following figure) and calculated factual knowledge and QA accuracy metrics, evaluating them against the golden dataset. The fact key of the triplet is used for the Factual Knowledge metric ground truth, and the answer key is used for the QA Accuracy metric ground truth. With this, factual knowledge is measured against the fact key, and the ideal user experience in terms of style and conciseness is measured against the question-answer pairs.
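As a minimal sketch of how such triplets could be stored, the following Python snippet serializes two rows of the golden dataset as JSON Lines and reads them back. The key names (question, answer, fact) and the helper functions are assumptions made for illustration; they are not a format required by FMEval.

```python
import json

# Two triplets copied from the golden dataset table. The key names
# ("question", "answer", "fact") are assumptions made for this sketch.
golden_dataset = [
    {
        "question": "What were Amazon's AWS sales for the second quarter of 2023?",
        "answer": "Amazon's AWS sales for the second quarter of 2023 were $22.1 billion.",
        "fact": "22.1 billion",
    },
    {
        "question": "As of June 30, 2023, how many shares of Rivian's Class A common stock did Amazon hold?",
        "answer": "As of June 30, 2023, Amazon held 158 million shares of Rivian's Class A common stock.",
        "fact": "158 million",
    },
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(golden_dataset)
records = from_jsonl(jsonl_text)

# Each metric reads a different key of the same record.
factual_knowledge_targets = [record["fact"] for record in records]
qa_accuracy_targets = [record["answer"] for record in records]
print(factual_knowledge_targets)
print(qa_accuracy_targets)
```

Keeping the fact and the answer in the same record makes it easier to review both ground truths for a question together.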
Evaluation for question answering in a generative AI application
A generative AI pipeline can have many subcomponents, such as a RAG pipeline. RAG is a methodology to improve the accuracy of LLM responses answering a user query by retrieving and inserting relevant domain knowledge into the language model prompt. RAG quality depends on the configurations of the retriever (chunking, indexing) and generator (LLM selection and hyperparameters, prompt), as illustrated in the following figure. Tuning chunking and indexing in the retriever makes sure the correct content is available in the LLM prompt for generation. The chunk size and chunk splitting method, as well as the means of embedding and ranking relevant document chunks as vectors in the knowledge store, impact whether the actual answer to the query is ultimately inserted in the prompt. In the generator, selecting an appropriate LLM to run the prompt, and tuning its hyperparameters and prompt template, all control how the retrieved information is interpreted for the response. With this, when a final response from a RAG pipeline is evaluated, the preceding components may be adjusted to improve response quality.
Alternatively, question answering can be powered by a fine-tuned LLM, or through an agentic approach. Although we demonstrate the evaluation of final responses from RAG pipelines, the final responses from a generative AI pipeline for question answering can be similarly evaluated because the prerequisites are a golden dataset and the generative answers. With this approach, changes in the generative output due to different generative AI pipeline architectures can be evaluated to inform the best design choices (comparing RAG and knowledge retrieval agents, comparing LLMs used for generation, retrievers, chunking, prompts, and so on).
Although evaluating each sub-component of a generative AI pipeline is important in development and troubleshooting, business decisions rely on having an end-to-end, side-by-side data view, quantifying how a given generative AI pipeline will perform in terms of user experience. With this, business stakeholders can understand expected quality changes in terms of end-user experience by switching LLMs, and adhere to legal and compliance requirements, such as ISO42001 AI Ethics. There are further financial benefits to realize; for example, quantifying expected quality changes on internal datasets when switching a development LLM to a cheaper, lightweight LLM in production. The overall evaluation process for the benefit of decision-makers is outlined in the following figure. In this post, we focus our discussion on ground truth curation, evaluation, and interpreting evaluation scores for entire question answering generative AI pipelines using FMEval to enable data-driven decision-making on quality.
A useful mental model for ground truth curation and improvement of a golden dataset is a flywheel, as shown in the following figure. The ground truth experimentation process involves querying your generative AI pipeline with the initial golden dataset questions and evaluating the responses against initial golden answers using FMEval. Then, the quality of the golden dataset must be reviewed by a judge. The judge review of the golden dataset quality accelerates the flywheel towards an ever-improving golden dataset. The judge role in the workflow can be assumed by another LLM to enable scaling against established, domain-specific criteria for high-quality ground truth. Maintaining a human-in-the-loop component to the judge function remains essential to sample and verify results, as well as to increase the quality bar with increasing task complexity. Improvement to the golden dataset fosters improvement to the quality of the evaluation metrics, until sufficient measurement accuracy in the flywheel is met by the judge, using the established criteria for quality. To learn more about AWS offerings on human review of generations and data labeling, such as Amazon Augmented AI (Amazon A2I) and Amazon SageMaker Ground Truth Plus, refer to Using Amazon Augmented AI for Human Review and High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus. When using LLMs as a judge, make sure to apply prompt safety best practices.
However, to conduct reviews of golden dataset quality as part of the ground truth experiment flywheel, human reviewers must understand the evaluation metric implementation and its coupling to ground truth curation.
FMEval metrics for question answering in a generative AI application
The Factual Knowledge and QA Accuracy metrics from FMEval provide a way to evaluate custom question answering datasets against ground truth. For a full list of metrics implemented with FMEval, refer to Using prompt datasets and available evaluation dimensions in model evaluation jobs.
Factual Knowledge
The Factual Knowledge metric evaluates whether the generated response contains factual information present in the ground truth answer. It is a binary (0 or 1) score based on a string match. Factual knowledge also reports a quasi-exact string match which performs matching after normalization. For simplicity, we focus on the exact match Factual Knowledge score in this post.
For each golden question:
- 0 indicates the lowercased factual ground truth is not present in the model response
- 1 indicates the lowercased factual ground truth is present in the response
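The scoring rule above can be sketched in a few lines of Python. This is an illustrative re-implementation of the described string match, not FMEval's source code, and the function name is hypothetical. The responses are the example responses used in this post.

```python
def factual_knowledge_score(model_response: str, golden_fact: str) -> int:
    """Return 1 if the lowercased golden fact occurs in the lowercased response, else 0."""
    return int(golden_fact.lower() in model_response.lower())


golden_fact = "10,317,750,796"
correct = (
    "Based on the documents provided, Amazon had 10,317,750,796 shares "
    "of common stock outstanding as of July 21, 2023."
)
hallucinated = (
    "Based on the documents provided, Amazon had 22,003,237,746 shares "
    "of common stock outstanding as of July 21, 2023."
)

print(factual_knowledge_score(correct, golden_fact))        # 1
print(factual_knowledge_score(hallucinated, golden_fact))   # 0
# A plain substring match is sensitive to formatting: the same number
# written without thousands separators is not detected.
print(factual_knowledge_score(correct, "10317750796"))      # 0
```

The last call shows why the exact written form of the golden fact matters when curating ground truth for a string-match metric.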
QA Accuracy
The QA Accuracy metric measures a model’s question answering accuracy by comparing its generated answers against ground truth answers. The metrics are computed by string matching true positive, false positive, and false negative word matches between QA ground truth answers and generated answers.
It includes several sub-metrics:
- Recall Over Words – Scores from 0 (worst) to 1 (best), measuring how much of the QA ground truth is contained in the model output
- Precision Over Words – Scores from 0 (worst) to 1 (best), measuring how many words in the model output match the QA ground truth
- F1 Over Words – The harmonic mean of precision and recall, providing a balanced score from 0 to 1
- Exact Match – Binary 0 or 1, indicating if the model output exactly matches the QA ground truth
- Quasi Exact Match – Similar to Exact Match, but with normalization (lowercasing and removing articles)
Because QA Accuracy metrics are calculated on an exact match basis (for more details, see Accuracy), they may be less reliable for questions where the answer can be rephrased without modifying its meaning. To mitigate this, we propose applying Factual Knowledge as the assessment of factual correctness, motivating the use of a dedicated factual ground truth with minimal word expression, together with QA Accuracy as a measure of idealized user experience in terms of response verbosity and style. We elaborate on these concepts later in this post. The BERTScore is also computed as part of QA Accuracy, which provides a measure of semantic match quality against the ground truth.
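To make the word-level sub-metrics concrete, the following sketch computes recall, precision, and F1 over sets of normalized words, plus the two exact match variants. It is a simplified illustration under stated assumptions (unique words, punctuation deleted, articles removed), and the function names are hypothetical; refer to the FMEval source for the reference implementation. The golden answer and both responses are taken from this post.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, delete punctuation, remove articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def word_scores(model_output: str, ground_truth: str) -> dict:
    """Recall, precision, and F1 over the sets of normalized words."""
    predicted = set(normalize(model_output).split())
    expected = set(normalize(ground_truth).split())
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


def exact_match(model_output: str, ground_truth: str) -> int:
    return int(model_output == ground_truth)


def quasi_exact_match(model_output: str, ground_truth: str) -> int:
    return int(normalize(model_output) == normalize(ground_truth))


golden_answer = (
    "There were 10,317,750,796 shares of Amazon's common stock "
    "outstanding as of July 21, 2023."
)
concise = "As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding."
verbose = (
    "Based on the documents provided, Amazon had 10,317,750,796 shares of common stock "
    "outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly "
    "report for the quarter ending June 30, 2023, it states: '10,317,750,796 shares of "
    "common stock, par value $0.01 per share, outstanding as of July 21, 2023' Therefore, "
    "the number of shares of Amazon common stock outstanding as of July 21, 2023 was "
    "10,317,750,796 according to this statement."
)

concise_scores = word_scores(concise, golden_answer)
verbose_scores = word_scores(verbose, golden_answer)
print({name: round(value, 3) for name, value in concise_scores.items()})
print({name: round(value, 3) for name, value in verbose_scores.items()})
print(exact_match(concise, golden_answer), quasi_exact_match(concise, golden_answer))
```

Under these assumptions, the concise response reaches full precision because every word it contains also appears in the golden answer, whereas the verbose response is penalized on precision for its extra words even though it states the same fact.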
Proposed ground truth curation best practices for question answering with FMEval
In this section, we share best practices for curating your ground truth for question answering with FMEval.
Understanding the Factual Knowledge metric calculation
A factual knowledge score is a binary measure of whether a real-world fact was correctly retrieved by the generative AI pipeline. A score of 0 indicates the lowercased expected answer is not part of the model response, whereas 1 indicates it is. Where there is more than one acceptable answer, and any one of them is considered correct, apply a logical OR. A logical AND can also be configured for cases where the factual material encompasses multiple distinct entities. In the present examples, we demonstrate a logical OR, using the <OR> delimiter. Fact detection is useful for assessing hallucination in a generative AI pipeline. The two sample responses in the following table illustrate fact detection. The first example correctly states the fact in the example response and receives a score of 1.0. The second example hallucinates a number instead of stating the fact and receives a score of 0.0.
In the following example, we highlight the importance of units in ground truth for Factual Knowledge string matching. The golden question and golden fact represent Amazon’s total net sales for the second quarter of 2023.
The first response hallucinates the fact, using units of billions, and correctly receives a score of 0.0. The second response correctly represents the fact, in units of millions. Both units should be represented in the golden fact. The third response was unable to answer the question, flagging a potential issue with the information retrieval step.
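The effect of representing both units can be sketched as follows. The helper below is hypothetical and only illustrates the idea of matching any (logical OR) or all (logical AND) of several acceptable expressions of a fact. The combined golden fact string and the delimiter argument are assumptions for this example; the three responses are the ones discussed above.

```python
def fact_score(model_response: str, golden_fact: str,
               delimiter: str = "<OR>", require_all: bool = False) -> int:
    """Score 1 if any (or, with require_all=True, every) alternative occurs in the response."""
    alternatives = [part.strip().lower() for part in golden_fact.split(delimiter) if part.strip()]
    if not alternatives:
        return 0
    response = model_response.lower()
    matches = [alternative in response for alternative in alternatives]
    return int(all(matches) if require_all else any(matches))


# Assumed golden facts: one with a single unit, one listing both units.
single_unit_fact = "134.4 billion"
both_units_fact = "134.4 billion<OR>134,383 million"

hallucinated = "Amazon's total net sales for the second quarter of 2023 were $170.0 billion."
in_millions = "The total consolidated net sales for Q2 2023 were $134,383 million according to this report."
no_answer = (
    "Sorry, the provided context does not include any information about Amazon's "
    "total net sales for the second quarter of 2023. Would you like to ask another question?"
)

print(fact_score(in_millions, single_unit_fact))   # 0: correct value, but unit not in the golden fact
print(fact_score(in_millions, both_units_fact))    # 1
print(fact_score(hallucinated, both_units_fact))   # 0
print(fact_score(no_answer, both_units_fact))      # 0
```

With only one unit in the golden fact, a correct response expressed in millions would be scored as a miss, which is why each acceptable written form of the fact should be listed.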
Factual knowledge scores are a useful flag for challenges in the generative AI pipeline such as hallucination or information retrieval problems. Factual knowledge scores can be curated in the form of a Factual Knowledge Report for human review, as shown in the following table, to visualize pipeline quality in terms of fact detection side by side.
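As a rough sketch of how such a report can be summarized, the following snippet aggregates the per-question scores from the Factual Knowledge Report in this post into a detection rate per pipeline, and lists the questions that no pipeline answered with the golden fact. The variable names and the choice of summary statistics are assumptions for illustration.

```python
# Fact detection scores copied from the Factual Knowledge Report table:
# (user question, Pipeline 1, Pipeline 2, Pipeline 3)
report = [
    ("As of June 30, 2023, how many shares of Rivian's Class A common stock did Amazon hold?", 1, 1, 1),
    ("How many shares of common stock were outstanding as of July 21, 2023?", 1, 1, 1),
    ("What was Amazon's operating income for the six months ended June 30, 2023?", 1, 1, 1),
    ("What was Amazon's total cash, cash equivalents and restricted cash as of June 30, 2023?", 1, 0, 0),
    ("What was a key challenge faced by Amazon's business in the second quarter of 2023?", 0, 0, 0),
    ("What were Amazon's AWS sales for the second quarter of 2023?", 1, 0, 0),
    ("What were Amazon's total net sales for the second quarter of 2023?", 1, 0, 0),
    ("When did Amazon acquire One Medical?", 1, 0, 1),
    ("Where is Amazon's principal office located?", 0, 0, 0),
    ("Who is Andrew R. Jassy?", 1, 1, 1),
]
pipelines = ["Pipeline 1", "Pipeline 2", "Pipeline 3"]

# Share of golden facts detected by each pipeline.
detection_rate = {
    name: sum(row[index + 1] for row in report) / len(report)
    for index, name in enumerate(pipelines)
}

# Questions where no pipeline stated the golden fact. These are candidates
# for human review of both the pipelines and the golden fact itself.
missed_by_all = [row[0] for row in report if not any(row[1:])]

print(detection_rate)
for question in missed_by_all:
    print("Review:", question)
```

A question missed by every pipeline does not necessarily indicate three failing pipelines; it can also point to a golden question or golden fact that needs to be revisited.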
Consider the impact of string matching between your ground truth and LLM responses when curating ground truth for Factual Knowledge. Best practices for curation in consideration of string matching are the following:
We use the following question-answer pair to demonstrate how FMEval metrics are calculated, and how this informs best practices in QA ground truth curation.
In calculating QA Accuracy metrics, the responses and ground truth are first normalized (lowercase, remove punctuation, remove articles, remove excess whitespace). Then, true positive, false positive, and false negative matches are computed between the LLM response and the ground truth. QA Accuracy metrics returned by FMEval include recall, precision, and F1. By assessing exact matching, the Exact Match and Quasi-Exact Match metrics are also returned. A detailed walkthrough of the calculation and scores is shown in the following tables.
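The arithmetic behind recall, precision, and F1 can be sketched from the match counts alone. The snippet below uses the true positive, false positive, and false negative counts reported in this post's walkthrough (11, 7, and 3) and the standard definitions of the three scores; the function name is hypothetical.

```python
def scores_from_counts(tp: int, fp: int, fn: int) -> dict:
    """Standard recall, precision, and F1 from word match counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Equivalent to the harmonic mean 2 * P * R / (P + R).
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Counts from the walkthrough: 11 matched words, 7 extra words in the
# response, 3 ground truth words missing from the response.
walkthrough = scores_from_counts(tp=11, fp=7, fn=3)
for name, value in walkthrough.items():
    print(f"{name}: {value:.4f}")
```

Rounded to three decimals, these values agree with the Recall Over Words, Precision Over Words, and F1 scores listed for the walkthrough example.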
The first table illustrates the accuracy metric calculation mechanism.
Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”
The following table lists the accuracy scores.
The following are best practices for interpreting QA accuracy scores:
Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”
Consider the impact of true positive, false positive, and false negative matches between your golden answer and LLM responses when curating your ground truth for QA Accuracy. Best practices for curation in consideration of string matching are as follows:
Curating appropriate ground truth and interpreting evaluation metrics in a feedback loop is crucial for effective business decision-making when deploying generative AI pipelines for question answering.
There were several key takeaways from this experiment:
In this post, we outlined best practices for ground truth curation and metric interpretation when evaluating generative AI question answering using FMEval. We demonstrated how to curate ground truth question-answer-fact triplets in consideration of the Factual Knowledge and QA Accuracy metrics calculated by FMEval. To validate our approach, we curated a golden dataset of 10 question-answer-fact triplets from Amazon’s Q2 2023 10Q report. We generated responses from three anonymized generative AI pipelines and calculated QA Accuracy and Factual Knowledge metrics.
Our primary findings emphasize that ground truth curation and metric interpretation are tightly coupled. Ground truth should be curated with the measurement approach in mind, and metrics can update the ground truth during golden dataset development. We further recommend curating separate ground truths for QA accuracy and factual knowledge, particularly emphasizing setting a desired level of verbosity according to user experience goals, and setting golden questions with unambiguous interpretations. Closeness and conciseness to ground truth are valid interpretations of FMEval recall and precision metrics, and factual knowledge scores can be used to detect hallucinations. Ultimately, the quantification of the expected user experience in the form of a golden dataset for pipeline evaluation with FMEval supports business decision-making, such as choosing between pipeline options, projecting quality changes from development to production, and adhering to legal and compliance requirements.
Whether you are building an internal application, a customer-facing virtual assistant, or exploring the potential of generative AI for your business, this post can help you use FMEval to make sure your projects meet the highest standards of quality and responsibility. We encourage you to adopt these best practices and start evaluating your generative AI question answering pipelines with the FMEval toolkit today.
Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.
Rahul Jani is a Data Architect with AWS Professional Services. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He is specialized in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.
Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.
Andrei Ivanovic is a Data Scientist with AWS Professional Services, with experience delivering internal and external solutions in generative AI, AI/ML, time series forecasting, and geospatial data science. Andrei has a Master’s in CS from the University of Toronto, where he was a researcher at the intersection of deep learning, robotics, and autonomous driving. Outside of work, he enjoys literature, film, strength training, and spending time with loved ones.
Golden Question | “How many shares of common stock were outstanding as of July 21, 2023?” |
Golden Fact | 10,317,750,796 |
Metric | Example Response | Score | Calculation Approach |
Factual Knowledge | “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.” | 1.0 | String match to golden fact |
Factual Knowledge | “Based on the documents provided, Amazon had 22,003,237,746 shares of common stock outstanding as of July 21, 2023.” | 0.0 | String match to golden fact |
Golden Question | “What were Amazon’s total net sales for the second quarter of 2023?” |
Golden Fact | 134.4 billion |
Metric | Example Response | Score | Calculation Approach |
Factual Knowledge | “Amazon’s total net sales for the second quarter of 2023 were $170.0 billion.” | 0.0 | String match to golden fact |
Factual Knowledge | “The total consolidated net sales for Q2 2023 were $134,383 million according to this report.” | 1.0 | String match to golden fact |
Factual Knowledge | “Sorry, the provided context does not include any information about Amazon’s total net sales for the second quarter of 2023. Would you like to ask another question?” | 0.0 | String match to golden fact |
Interpreting Factual Knowledge scores
User Question | QA Ground Truth | Factual Ground Truth | Pipeline 1 | Pipeline 2 | Pipeline 3 |
As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold? | As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock. | 158 million | 1 | 1 | 1 |
How many shares of common stock were outstanding as of July 21, 2023? | There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023. | 10317750796 | 1 | 1 | 1 |
What was Amazon’s operating income for the six months ended June 30, 2023? | Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion. | 12.5 billion | 1 | 1 | 1 |
What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023? | Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion. | 50.1 billion | 1 | 0 | 0 |
What was a key challenge faced by Amazon’s business in the second quarter of 2023? | Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023. | foreign exchange rates | 0 | 0 | 0 |
What were Amazon’s AWS sales for the second quarter of 2023? | Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion. | 22.1 billion | 1 | 0 | 0 |
What were Amazon’s total net sales for the second quarter of 2023? | Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. | 134.4 billion | 1 | 0 | 0 |
When did Amazon acquire One Medical? | Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired. | Feb 22 2023 | 1 | 0 | 1 |
Where is Amazon’s principal office located? | Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. | 410 Terry Avenue North | 0 | 0 | 0 |
Who is Andrew R. Jassy? | Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc. | Chief Executive Officer of Amazon | 1 | 1 | 1 |
Curating Factual Knowledge ground truth
Understanding QA Accuracy metric calculation
Golden Question | “How many shares of common stock were outstanding as of July 21, 2023?” |
Golden Answer | “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” |
Example Response | “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.” |
Metric | Definition | Score |
True Positive (TP) | The number of words in the model output that are also contained in the ground truth. | 11 |
False Positive (FP) | The number of words in the model output that are not contained in the ground truth. | 7 |
False Negative (FN) | The number of words that are missing from the model output, but are included in the ground truth. | 3 |
Metric | Score | Calculation Approach |
Recall Over Words | 0.786 | TP / (TP + FN) = 11 / 14 |
Precision Over Words | 0.611 | TP / (TP + FP) = 11 / 18 |
F1 | 0.688 | 2 × Precision × Recall / (Precision + Recall) |
Exact Match | 0.0 | (Non-normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer. |
Quasi-Exact Match | 0.0 | (Normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer. |
Interpreting QA Accuracy scores
Interpretation | Question | Curated Ground Truth | High Closeness to Ground Truth | Low Closeness to Ground Truth |
Interpreting Closeness to Ground Truth Scores | “How many shares of common stock were outstanding as of July 21, 2023?” | “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” | “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” 0.923 | “Sorry, I do not have access to documents containing common stock information about Amazon.” 0.111 |
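Interpretation | Question | Curated Ground Truth | High Conciseness to Ground Truth | Low Conciseness to Ground Truth |
Interpreting Conciseness to Ground Truth | “How many shares of common stock were outstanding as of July 21, 2023?” | “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” | “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” 1.0 | “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.” 0.238 |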
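Interpretation | Question | Curated Ground Truth | High Combined Closeness x Conciseness | Low Combined Closeness x Conciseness |
Interpreting Closeness and Conciseness to Ground Truth | “How many shares of common stock were outstanding as of July 21, 2023?” | “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” | “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” 0.96 | “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.” 0.364 |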
Interpretation | QA Ground Truth | Factual Ground Truth | Factual Knowledge | Recall Score | LLM response |
Hallucination detection | Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. | 134.4 billion | 0 | 0.92 | Amazon’s total net sales for the second quarter of 2023 were $170.0 billion. |
Detect false positive facts | There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023. | 10317750796<OR>10,317,750,796 | 1.0 | 0.0 | Document ID: 10317750796 |
Correct answer, expressed in different words to ground truth question-answer-fact | Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. | 410 Terry Avenue North | 0 | 0.54 | Amazon’s principal office is located in Seattle, Washington. |
Curating QA Accuracy ground truth
Key takeaways
Conclusion
About the Authors