Generate training data and cost-effectively train categorical models with Amazon Bedrock
In this post, we explore how you can use Amazon Bedrock to generate high-quality categorical ground truth data, which is crucial for training machine learning (ML) models in a cost-sensitive environment. Generative AI solutions can play an invaluable role during the model development phase by simplifying training and test data creation for multiclass classification supervised learning use cases. We dive deep into this process, showing how to use XML tags to structure the prompt and guide Amazon Bedrock in generating a balanced labeled dataset with high accuracy. We also showcase a real-world example of predicting the root cause category for support cases. This use case, solvable through ML, can enable support teams to better understand customer needs and optimize response strategies.

Business challenge
The exploration and methodology described in this post address two key challenges: the cost of generating a ground truth dataset for multiclass classification use cases can be prohibitive, and conventional approaches and synthetic dataset creation strategies fall short in generating balanced classes and meeting the desired performance parameters for real-world use cases.
Ground truth data generation is expensive and time consuming
Ground truth annotation needs to be accurate and consistent, often requiring significant time and expertise to make sure the dataset is balanced, diverse, and large enough for model training and testing. For a multiclass classification problem such as support case root cause categorization, this challenge is compounded many times over.
Let’s say the task at hand is to predict the root cause categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry) for customer support cases. Based on our experiments using best-in-class supervised learning algorithms available in AutoGluon, we arrived at a training dataset size of 3,000 samples per category to attain an accuracy of 90%. This requirement translates into a time and effort investment from trained personnel, who could be support engineers or other technical staff, to review tens of thousands of support cases to arrive at an even distribution of 3,000 per category. With each support case and its related correspondence averaging 5 minutes of review and assessment from a human labeler, this translates into 1,500 hours (5 minutes x 18,000 support cases) of work, or 188 days assuming an 8-hour workday. Besides the time spent on review and labeling, there is an upfront investment in training the labelers so that the work, split between 10 or more labelers, remains consistent. To break this down further, a ground truth labeling campaign split between 10 labelers would require close to 4 weeks to label 18,000 cases if the labelers spend 40 hours a week on the exercise.
Not only is such an extended and effort-intensive campaign expensive, but it can cause inconsistent labeling for categories every time the labeler puts aside the task and resumes it later. The exercise also doesn’t guarantee a balanced labeled ground truth dataset because some root cause categories such as Customer Education could be far more common than Feature Request or Software Defect, thereby extending the campaign.
Conventional techniques to get balanced classes or synthetic data generation have shortfalls
A balanced labeled dataset is critical for a multiclass classification use case to mitigate bias and make sure the model learns to accurately classify all classes, rather than favoring the majority class. If the dataset is imbalanced, with one or more classes having significantly fewer instances than others, the model might struggle to learn the patterns and features associated with the minority classes, leading to poor performance and biased predictions. This issue is particularly problematic in applications where accurate classification of minority classes is critical, such as medical diagnoses, fraud detection, or root cause categorization. For the use case of labeling the support root cause categories, it’s often harder to source examples for categories such as Software Defect, Feature Request, and Documentation Improvement for labeling than it is for Customer Education. This results in an imbalanced class distribution for training and test datasets.
To address this challenge, various techniques can be employed, including oversampling the minority classes, undersampling the majority classes, using ensemble methods that combine multiple classifiers trained on different subsets of the data, or synthetic data generation to augment minority classes. However, the ideal approach for achieving optimal performance is to start with a balanced and highly accurate labeled dataset for ground truth training.
Although oversampling for minority classes means extended and expensive data labeling with humans who review the support cases, synthetic data generation to augment the minority classes poses its own challenges. For the multiclass classification problem to label support case data, synthetic data generation can quickly result in overfitting. This is because it can be difficult to synthesize real-world examples of technical case correspondences that contain complex content related to software configuration, implementation guidance, documentation references, technical troubleshooting, and the like.
Because ground truth labeling is expensive and synthetic data generation isn’t an option for use cases such as root cause prediction, the effort to train a model is often put aside. This results in a missed opportunity to review the root cause trends that can guide investment in the right areas such as education for customers, documentation improvement, or other efforts to reduce the case volume and improve customer experience.
Solution overview
The preceding section discussed why conventional ground truth data generation techniques aren’t viable for certain supervised learning use cases and fall short in training a highly accurate model to predict the support case root cause in our example. Let’s look at how generative AI can help solve this problem.
Generative AI supports key use cases such as content creation, summarization, code generation, creative applications, data augmentation, natural language processing, scientific research, and many others. Amazon Bedrock is well-suited for this data augmentation exercise to generate high-quality ground truth data. Using highly tuned and custom tailored prompts with examples and techniques discussed in the following sections, support teams can pass the anonymized support case correspondence to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock or other available large language models (LLMs) to predict the root cause label for a support case from one of the many categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry). After achieving the desired accuracy, you can use this ground truth data in an ML pipeline with automated machine learning (AutoML) tools such as AutoGluon to train a model and inference the support cases.
Checking LLM accuracy for ground truth data
To evaluate an LLM for the task of category labeling, the process begins by determining if labeled data is available. If labeled data exists, the next step is to check if the model’s use case produces discrete outcomes. Where discrete outcomes with labeled data exist, standard ML methods such as precision, recall, or other classic ML metrics can be used. These metrics provide high precision but are limited to specific use cases due to limited ground truth data.
If the use case doesn’t yield discrete outputs, task-specific metrics are more appropriate. These include metrics such as ROUGE or cosine similarity for text similarity, and specific benchmarks for assessing toxicity (Detoxify), prompt stereotyping (cross-entropy loss), or factual knowledge (HELM, LAMA).
If labeled data is unavailable, the next question is whether the testing process should be automated. The automation decision depends on the cost-accuracy trade-off, because higher accuracy comes at a higher cost. For cases where automation is not required, human-in-the-loop (HIL) approaches can be used. This involves manual evaluation based on predefined assessment rules (for example, ground truth), yielding high evaluation precision, but it is often time-consuming and costly.
When automation is preferred, using another LLM to assess outputs can be effective. Here, a reliable LLM can be instructed to rate generated outputs, providing automated scores and explanations. However, the precision of this method depends on the reliability of the chosen LLM. Each path represents a tailored approach based on the availability of labeled data and the need for automation, allowing for flexibility in assessing a wide range of FM applications.
The following figure illustrates an FM evaluation workflow.
For this use case, if a historic collection of 10,000 or more support cases labeled using Amazon SageMaker Ground Truth with HIL is available, it can be used to evaluate the accuracy of the LLM prediction. The key goal for generating new ground truth data using Amazon Bedrock should be to augment that collection, increasing its diversity and the training data size so that AutoGluon training arrives at a performant model that can be used for the final inference or root cause prediction. In the following sections, we explain how to take an incremental and measured approach to improve Anthropic’s Claude 3.5 Sonnet prediction accuracy through prompt engineering.
Prompt engineering for FM accuracy and consistency
Prompt engineering is the art and science of designing a prompt to get an LLM to produce the desired output. We suggest consulting LLM prompt engineering documentation, such as Anthropic prompt engineering, before you start experimenting. In experiments conducted without a finely tuned and optimized prompt, we observed low accuracy rates of less than 60%. In the following sections, we provide a detailed explanation of how to construct your first prompt, and then gradually improve it to consistently achieve over 90% accuracy.
Designing the prompt
Before starting any scaled use of generative AI, you should have the following in place:
- A clear definition of the problem you are trying to solve along with the end goal.
- A way to test the model’s output for accuracy. A thumbs up/down review to determine accuracy, combined with a comparison against the 10,000-case dataset labeled through SageMaker Ground Truth, is well suited for this exercise.
- A defined success criterion for how accurate the model needs to be.
It’s helpful to think of an LLM as a new employee who is very well read, but knows nothing about your culture, your norms, what you are trying to do, or why you are trying to do it. The LLM’s performance will depend on how precisely you can explain what you want. How would a skilled manager handle a very smart, but new and inexperienced employee? The manager would provide contextual background, explain the problem, explain the rules they should apply when analyzing the problem, and give some examples of what good looks like along with why it is good. Later, if they saw the employee making mistakes, they might try to simplify the problem and provide constructive feedback by giving examples of what not to do, and why. One difference is that an employee would understand the job they are being hired for, so we need to explicitly tell the LLM to assume the persona of a support employee.
Prerequisites
To follow along with this post, set up Amazon SageMaker Studio to run Python in a notebook and interact with Amazon Bedrock. You also need the appropriate permissions to access Amazon Bedrock models.
Set up SageMaker Studio
Complete the following steps to set up SageMaker Studio:
- On the SageMaker console, choose Studio under Applications and IDEs in the navigation pane.
- Create a new SageMaker Studio instance if you haven’t already.
- If prompted, set up a user profile for SageMaker Studio by providing a user name and specifying AWS Identity and Access Management (IAM) permissions.
- Open a SageMaker Studio notebook:
- Choose JupyterLab.
- Create a private JupyterLab space.
- Configure the space (set the instance type to ml.m5.large for optimal performance).
- Launch the space.
- On the File menu, choose New and Notebook to create a new notebook.
- Configure SageMaker to meet your security and compliance objectives. Refer to Configure security in Amazon SageMaker AI for details.
Set up permissions for Amazon Bedrock access
Make sure you have the following permissions:
- IAM role with Amazon Bedrock permissions – Make sure that your SageMaker Studio execution role has the necessary permissions to access Amazon Bedrock. Attach the AmazonBedrockFullAccess policy or a custom policy with specific Amazon Bedrock permissions to your IAM role.
- AWS SDKs and authentication – Verify that your AWS credentials (usually from the SageMaker role) have Amazon Bedrock access. Refer to Getting started with the API to set up your environment to make Amazon Bedrock requests through the AWS API.
- Model access – Grant permission to use Anthropic’s Claude 3.5 Sonnet. For instructions, see Add or remove access to Amazon Bedrock foundation models.
Test the code using the native inference API for Anthropic’s Claude
The following code uses the native inference API to send a text message to Anthropic’s Claude. The Python code invokes the Amazon Bedrock Runtime service:
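The following is a minimal sketch of such a call, assuming Anthropic’s Claude 3.5 Sonnet is enabled in your account and Region; the model ID and Region shown are assumptions you should adjust to your environment:

```python
import json
import boto3

# Create an Amazon Bedrock Runtime client (assumes AWS credentials and Region are configured)
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID for Anthropic's Claude 3.5 Sonnet (verify the ID available in your Region)
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def invoke_claude(prompt: str, max_tokens: int = 1024) -> str:
    """Send a text prompt to Claude using the native Anthropic Messages request format."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    })
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=body)
    result = json.loads(response["body"].read())
    # The Messages API returns a list of content blocks; take the first text block
    return result["content"][0]["text"]

print(invoke_claude("Hello, Claude. Briefly introduce yourself."))
```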
Construct the initial prompt
We demonstrate the approach for the specific use case of root cause prediction with a goal of achieving 90% accuracy. Start by creating a prompt similar to the one you would give to humans in natural language. This can be a simple description of each root cause label and why you would choose it, how to interpret the case correspondences, and how to analyze the case and choose the corresponding root cause label, along with examples for every category. Ask the model to also provide its reasoning so you can understand how it reached certain decisions. It can be especially interesting to understand the reasoning behind decisions you don’t agree with. See the following example code:
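A minimal sketch of what such an initial prompt might look like; the category descriptions are illustrative rather than the exact prompt we used, and it reuses the invoke_claude helper from the earlier sketch with an assumed anonymized_case_text variable holding the case correspondence:

```python
# Illustrative initial prompt (assumed structure, not the exact production prompt)
initial_prompt_template = """
You will be classifying a customer support case into exactly one root cause category.

The categories are:
- Customer Education: the customer needed guidance; the product worked as designed.
- Feature Request: the customer asked for functionality that does not exist today.
- Software Defect: the product did not work as documented and a fix or workaround was needed.
- Documentation Improvement: the documentation was missing, unclear, or incorrect.
- Security Awareness: the case concerned security best practices or potential exposure.
- Billing Inquiry: the case concerned charges, invoices, or pricing.

Read the case subject and the full correspondence below, then:
1. Choose the single most appropriate category.
2. Explain your reasoning for the choice.

<case>
{case_correspondence}
</case>
"""

# anonymized_case_text is assumed to hold the anonymized case correspondence
prompt = initial_prompt_template.format(case_correspondence=anonymized_case_text)
response_text = invoke_claude(prompt)
print(response_text)
```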
Analyze the results
We recommend taking a small sample (for example, 150) of random cases, running them through Anthropic’s Claude 3.5 Sonnet using the initial prompt, and manually checking the initial results. You can load the input data and model output into Excel and add the following columns for analysis (a sketch for computing accuracy from these columns follows the list):
- Claude Label – A calculated column with Anthropic’s Claude’s category
- Label – True category after reviewing each case and selecting a specific root cause category to compare with the model’s prediction and derive an accuracy measurement
- Close Call – 1 or 0 so that you can take numerical averages
- Notes – For cases where there was something noteworthy about the case or inaccurate categorizations
- Claude Correct – A calculated column (0 or 1) based on whether our category matched the model’s output category
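As referenced above, the following is a small sketch for deriving accuracy once the reviewed spreadsheet is exported; the file name and column names mirror the list above but are assumptions you should adapt:

```python
import pandas as pd

# Assumed export of the manual review spreadsheet (illustrative file name and columns)
df = pd.read_csv("claude_labels_review.csv")

# Claude Correct: 1 when the human-reviewed label matches the model's category
df["Claude Correct"] = (df["Claude Label"] == df["Label"]).astype(int)

accuracy = df["Claude Correct"].mean()
close_call_rate = df["Close Call"].mean()
print(f"Accuracy: {accuracy:.1%}, close calls: {close_call_rate:.1%}")

# Inspect the misses to understand where the prompt needs work
misses = df[df["Claude Correct"] == 0][["Claude Label", "Label", "Notes"]]
print(misses.head(20))
```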
Although the first run is expected to have accuracy too low for the prompt to be used to generate ground truth data, the reasoning will help you understand why Anthropic’s Claude mislabeled the cases. In our example, many of the misses fell into the following categories, and the accuracy was only 61%:
- Cases where Anthropic’s Claude categorized Customer Education cases as Software Defect because it interpreted the support agent instructions to reconfigure something as a workaround for a Software Defect.
- Cases where users asked questions about billing that Anthropic’s Claude categorized as Customer Education. Although billing questions could also be Customer Education cases, we wanted these to be categorized as the more specific Billing Inquiry category. Likewise, although Security Awareness cases are also Customer Education, we wanted to categorize these as the more specific Security Awareness category.
Iterate on the prompt and make changes
Providing the LLM explicit instructions on correcting these errors should result in a major boost in accuracy. We tested the following adjustments with Anthropic’s Claude:
- We defined and assigned a persona with background information for the LLM: “You are a Support Agent and an expert on the enterprise application software. You will be classifying customer cases into categories…”
- We ordered the categories from more deterministic and well-defined to less specific and instructed Anthropic’s Claude to evaluate the categories in the order they appear in the prompt.
- Following the suggestion in the Anthropic documentation, we used XML tags and enclosed the root cause categories in lightweight XML (not a formal XML document), with elements delimited by tags. It’s ideal to create a categories node with a separate sub-node for each category. Each category node should consist of the category name, a description, and what the output should look like, and the categories should be delimited by begin and end tags.
- We created a good examples node with at least one good example for every category. Each good example consisted of the example, the classification, and the reasoning, introduced in the prompt with a line such as “Here are some good examples with reasoning:” (see the combined sketch after this list).
- We created a bad examples node with examples of where the LLM miscategorized previous cases. The bad examples node has the same set of fields as the good examples (example data, classification, and explanation), but the explanation describes the error. In the prompt, this node is introduced with a line such as “Here are some examples for wrong classification with reasoning:” (also shown in the sketch after this list).
- We also added instructions for how to format the output (shown at the end of the sketch after this list).
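The following is a condensed, illustrative sketch of how these adjustments could be laid out in the prompt; the persona wording, tag names, and examples are representative assumptions, not the exact prompt used in our experiment:

```python
# Illustrative prompt fragments (tag names and examples are assumptions, not the exact prompt)
refined_prompt = """
You are a Support Agent and an expert on the enterprise application software.
You will be classifying customer cases into categories. Evaluate the categories
in the order they appear below and choose the first one that applies.

<categories>
  <category>
    <name>Billing Inquiry</name>
    <description>The case concerns charges, invoices, or pricing.</description>
    <output>Billing Inquiry</output>
  </category>
  <!-- ... one <category> node per remaining root cause, ordered from most to least specific ... -->
</categories>

Here are some good examples with reasoning:
<good_examples>
  <example>
    <case>Customer asks how to configure SSO; the agent points to the setup guide and the customer confirms success.</case>
    <classification>Customer Education</classification>
    <reasoning>The product worked as designed; the customer needed guidance.</reasoning>
  </example>
</good_examples>

Here are some examples for wrong classification with reasoning:
<bad_examples>
  <example>
    <case>Agent suggests reconfiguring a setting and the issue is resolved.</case>
    <classification>Software Defect</classification>
    <reasoning>Incorrect: a configuration change is not a workaround for a defect; this is Customer Education.</reasoning>
  </example>
</bad_examples>

Format the output as:
<output>
  <category>chosen category name</category>
  <reasoning>brief explanation</reasoning>
</output>
"""
```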
Test with the new prompt
The preceding approach should result in improved prediction accuracy. In our experiment, we saw 84% accuracy with the new prompt, and the output was consistent and more straightforward to parse. Anthropic’s Claude followed the suggested output format in almost all cases. We wrote code to fix errors such as unexpected tags in the output and to drop responses that could not be parsed.
The following is the code to parse the output:
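A minimal sketch of such parsing logic, assuming the model follows the output format from the prompt sketch shown earlier; responses that can’t be parsed return None so they can be dropped:

```python
import re
from typing import Optional

VALID_CATEGORIES = {
    "Customer Education", "Feature Request", "Software Defect",
    "Documentation Improvement", "Security Awareness", "Billing Inquiry",
}

def parse_model_output(response_text: str) -> Optional[dict]:
    """Extract the category and reasoning from the tagged model output."""
    # Tolerate minor formatting issues such as stray whitespace or extra surrounding tags
    category_match = re.search(r"<category>\s*(.*?)\s*</category>", response_text, re.DOTALL)
    reasoning_match = re.search(r"<reasoning>\s*(.*?)\s*</reasoning>", response_text, re.DOTALL)

    if not category_match:
        return None

    category = category_match.group(1).strip()
    if category not in VALID_CATEGORIES:
        # Drop responses with unexpected or malformed categories
        return None

    return {
        "category": category,
        "reasoning": reasoning_match.group(1).strip() if reasoning_match else "",
    }
```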
Most mislabeled cases were close calls or had very similar traits. For example, when a customer described a problem, the support agent suggested possible solutions and asked for logs in order to troubleshoot. However, the customer self-resolved the case and so the resolution details weren’t conclusive. For this scenario, the root cause prediction was inaccurate. In our experiment, Anthropic’s Claude labeled these cases as Software Defects, but the most likely scenario is that the customer figured it out for themselves and never followed up.
Continued fine-tuning of the prompt to adjust examples and include such scenarios incrementally can help to get over 90% prediction accuracy, as we confirmed with our experimentation. The following code is an example of how to adjust the prompt and add a few more bad examples:
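A small sketch of how the prompt could be extended with an additional bad example for this close-call scenario, assuming it builds on the refined_prompt string from the earlier sketch; the example text is illustrative:

```python
# Additional bad example for the self-resolved, inconclusive-resolution scenario (illustrative)
additional_bad_examples = """
  <example>
    <case>Customer reports an error; the agent asks for logs to troubleshoot; the customer
    self-resolves and closes the case without sharing resolution details.</case>
    <classification>Software Defect</classification>
    <reasoning>Incorrect: there is no evidence of a defect. When the resolution is
    inconclusive and the customer self-resolved, classify as Customer Education.</reasoning>
  </example>
"""

# Insert the new example just before the closing tag of the bad examples node
refined_prompt = refined_prompt.replace(
    "</bad_examples>", additional_bad_examples + "\n</bad_examples>"
)
```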
With the preceding adjustments and refinement to the prompt, we consistently obtained over 90% accuracy and noted that a few miscategorized cases were close calls where humans chose multiple categories including the one Anthropic’s Claude chose. See the appendix at the end of this post for the final prompt.
Run batch inference at scale with AutoGluon Multimodal
As illustrated in the previous sections, by crafting a well-defined and tailored prompt, Amazon Bedrock can help automate the generation of ground truth data with balanced categories. This ground truth data is necessary to train the supervised learning model for a multiclass classification use case. We suggest taking advantage of the preprocessing capabilities of SageMaker to further refine the fields, encoding them into a format that’s optimal for model ingestion. The manifest files can be set up as the catalyst, triggering an AWS Lambda function that sets the entire SageMaker pipeline into action. This end-to-end process seamlessly handles data inference and stores the results in Amazon Simple Storage Service (Amazon S3). We recommend AutoGluon Multimodal for training and prediction, and deploying a model for a batch inference pipeline to predict the root cause for new or updated support cases at scale on a daily cadence.
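A minimal sketch of training and batch prediction with AutoGluon’s MultiModalPredictor on the generated ground truth; the file and column names are assumptions for illustration:

```python
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Assumed ground truth generated with Amazon Bedrock (illustrative file and column names)
train_df = pd.read_csv("ground_truth_cases.csv")      # columns: case_text, root_cause
new_cases_df = pd.read_csv("new_support_cases.csv")   # column: case_text

# Train a multiclass text classifier on the balanced labeled dataset
predictor = MultiModalPredictor(label="root_cause", path="root_cause_model")
predictor.fit(train_data=train_df, time_limit=3600)

# Batch inference for new or updated support cases
new_cases_df["predicted_root_cause"] = predictor.predict(new_cases_df)
new_cases_df.to_csv("predicted_root_causes.csv", index=False)
```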
Clean up
To prevent unnecessary expenses, it’s essential to properly decommission all provisioned resources. This cleanup process involves stopping notebook instances and deleting JupyterLab spaces, SageMaker domains, the S3 bucket, the IAM role, and associated user profiles. Refer to Clean up Amazon SageMaker notebook instance resources for details.
Conclusion
This post explored how Amazon Bedrock and advanced prompt engineering can generate high-quality labeled data for training ML models. Specifically, we focused on a use case of predicting the root cause category for customer support cases, a multiclass classification problem. Traditional approaches to generating labeled data for such problems are often prohibitively expensive, time-consuming, and prone to class imbalances. Amazon Bedrock, guided by XML prompt engineering, demonstrated the ability to generate balanced labeled datasets, at a lower cost, with over 90% accuracy for the experiment, and can help overcome labeling challenges for training categorical models for real-world use cases.
The following are our key takeaways:
- Generative AI can simplify labeled data generation for complex multiclass classification problems
- Prompt engineering is crucial for guiding LLMs to achieve desired outputs accurately
- An iterative approach, incorporating good/bad examples and specific instructions, can significantly improve model performance
- The generated labeled data can be integrated into ML pipelines for scalable inference and prediction using AutoML multimodal supervised learning algorithms for batch inference
Review your ground truth training costs, including the time and effort for HIL labeling and the associated service costs, and do a comparative analysis with Amazon Bedrock to plan your next categorical model training at scale.
Appendix
The following code is the final prompt:
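The following is an abbreviated, illustrative skeleton of how such a final prompt can be assembled from the elements discussed in this post; the variables categories_xml, good_examples_xml, and bad_examples_xml are assumed to hold the XML fragments shown earlier, and the wording is an assumption rather than the exact production prompt:

```python
# Abbreviated skeleton assembling the final prompt (illustrative, not the exact production prompt)
final_prompt_template = f"""
You are a Support Agent and an expert on the enterprise application software.
You will be classifying customer cases into categories. Evaluate the categories
in the order they appear and choose the first one that applies.

{categories_xml}

Here are some good examples with reasoning:
{good_examples_xml}

Here are some examples for wrong classification with reasoning:
{bad_examples_xml}

Format the output as:
<output>
  <category>chosen category name</category>
  <reasoning>brief explanation</reasoning>
</output>

<case>
{{case_correspondence}}
</case>
"""
```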
About the Authors
Sumeet Kumar is a Sr. Enterprise Support Manager at AWS leading the technical and strategic advisory team of TAM builders for automotive and manufacturing customers. He has diverse support operations experience and is passionate about creating innovative solutions using AI/ML.
Andy Brand is a Principal Technical Account Manager at AWS, where he helps education customers develop secure, performant, and cost-effective cloud solutions. With over 40 years of experience building, operating, and supporting enterprise software, he has a proven track record of addressing complex challenges.
Tom Coombs is a Principal Technical Account Manager at AWS, based in Switzerland. In Tom’s role, he helps enterprise AWS customers operate effectively in the cloud. From a development background, he specializes in machine learning and sustainability.
Ramu Ponugumati is a Sr. Technical Account Manager and a specialist in analytics and AI/ML at AWS. He works with enterprise customers to modernize and cost optimize workloads, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, playing badminton, and hiking.