Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 1
In this post, we show you how to create accurate and reliable agents. Agents helps you accelerate generative AI application development by orchestrating multistep tasks. Agents use the reasoning capability of foundation models (FMs) to break down user-requested tasks into multiple steps.
Building intelligent agents that can accurately understand and respond to user queries is a complex undertaking that requires careful planning and execution across multiple stages. Whether you are developing a customer service chatbot or a virtual assistant, there are numerous considerations to keep in mind, from defining the agent’s scope and capabilities to architecting a robust and scalable infrastructure.
This two-part series explores best practices for building generative AI applications using Amazon Bedrock Agents. Agents helps you accelerate generative AI application development by orchestrating multistep tasks. Agents use the reasoning capability of foundation models (FMs) to break down user-requested tasks into multiple steps. In addition, they use the developer-provided instruction to create an orchestration plan and then carry out the plan by invoking company APIs and accessing knowledge bases using Retrieval Augmented Generation (RAG) to provide an answer to the user’s request.
In Part 1, we focus on creating accurate and reliable agents. Part 2 discusses architectural considerations and development lifecycle practices.
Laying the groundwork: Collecting ground truth data
The foundation of any successful agent is high-quality ground truth data—the accurate, real-world observations used as reference for benchmarks and evaluating the performance of a model, algorithm, or system. For an agent application, before you start building, it’s crucial to collect a set of ground truth interactions or conversations that will drive the entire agent lifecycle. This data provides a benchmark for expected agent behavior, including the interaction with existing APIs, knowledge bases, and guardrails connected with the agent. This enables accurate testing and evaluation and helps identify edge cases and potential pitfalls.
To build a robust ground truth dataset, focus on gathering diverse examples that cover various user intents and scenarios. Your dataset should include the input and expected output for both simple and complex interactions. It’s important to regularly update and expand your dataset as you learn more about user behavior. Ground your data in real customer interactions that reflect actual use cases but be sure to de-identify and anonymize the data.
The following table shows a subset of ground truth collected for a banking assistant agent that knows the check_account_balance
and book_appointments
APIs, uses the knowledge base bankingFAQ
, and is connected with the guardrail BlockInvestmentAdvice
. This example presents only a few of the queries used to test the agent, including the session information used from other systems and passed to the agent using sessionAttributes
. It’s also a best practice to collect any extra information that would be shared with the agent in a production system. That includes information shared with the action group execution function only (session attributes), information used to augment the prompt (prompt session attributes), and information used by the knowledge base (knowledge base configuration).
User Query | Session Attributes | Session prompt Attributes | Expected Response | API, Knowledge Bases and Guardrails invoked |
What is my account balance? | None | None | Could you please provide the number of the account that you would like to check the balance for? | None |
What is the balance for the account 1234? | user id 111 | None | Your balance is X | Action Group: check_account_balance(111, 1234) |
How can I open an account? | None | None | To open a new banking account with ANY_BANK Corporation, you need to gather the following information and documentation: 1. Valid government-issued ID (driver’s license, passport, etc.) 2. Proof of address (utility bill, lease agreement, etc.) 3. Social Security number or Individual Taxpayer Identification Number (ITIN) 4. Initial deposit amount (if required) You can then fill out an application online or in person at an ANY_BANK Corporation branch. Once approved, you can make your initial deposit and start using your new account. |
Knowledge Base: bankingFAQ |
Should I invest in bitcoins? | None | None | Sorry, your query violates our usage policies. We do not provide investment advices. To discuss the best investment advice for your current situation, please contact us on (XXX) XXX-XXXX and we will be happy to support you. | Guardrail: BlockInvestmentAdvice |
Could you make an appointment for tomorrow at 2pm? | user id 111 | Today: 09/03/2024 | Certainly! We’ve booked an appointment for you tomorrow, September 4th, 2024, at 2pm. Your appointment ID is XXXX. | Action Group: book_appointment(111, 09/04/2024) |
Defining scope and sample interactions
Now that you have your ground truth data, the next step is to clearly define the scope of each agent, including tasks it should and shouldn’t handle, and outline clear expected sample user interactions. This process involves identifying primary functions and capabilities, limitations and out-of-scope tasks, expected input formats and types, and desired output formats and styles.
For instance, when considering an HR assistant agent, a possible scope would be the following:
Primary functions:
– Provide information on company HR policies
– Assist with vacation requests and time-off management
– Answer basic payroll questions
Out of scope:
– Handling sensitive employee data
– Making hiring or firing decisions
– Providing legal advice
Expected inputs:
– Natural language queries about HR policies
– Requests for time-off or vacation information
– Basic payroll inquires
Desired outputs:
– Clear and concise responses to policy questions
– Step-by-step guidance for vacation requests
– Completion of tasks for book a new vacation, retrieve, edit and delete an existing request
– Referrals to appropriate HR personnel for complex issues
– Creation of an HR ticket for questions where the agent is not able to respond
By clearly defining your agent’s scope, you set clear boundaries and expectations, which will guide your development process and help create a focused, reliable AI agent.
Architecting your solution: Building small and focused agents that interact with each other
When it comes to agent architecture, the principle “divide and conquer” holds true. In our experience, it has proven to be more effective to build small, focused agents that interact with each other rather than a single large monolithic agent. This approach offers improved modularity and maintainability, straightforward testing and debugging, flexibility to use different FMs for specific tasks, and enhanced scalability and extensibility.
For example, consider an HR assistant that helps internal employees in an organization and a payroll team assistant that supports the employees of the payroll team. Both agents have common functionality such as answering payroll policy questions and scheduling meetings between employees. Although the functionalities are similar, they differ in scope and permissions. For instance, the HR assistant can only reply to questions based on the internally available knowledge, whereas the payroll agents can also handle confidential information only available for the payroll employees. Additionally, the HR agents can schedule meetings between employees and their assigned HR representative, whereas the payroll agent schedules meetings between the employees on their team. In a single-agent approach, those functionalities are handled in the agent itself, resulting in the duplication of the action groups available to each agent, as shown in the following figure.
In this scenario, when something changes in the meetings action group, the change needs to be propagated to the different agents. When applying the multi-agent collaboration best practice, the HR and payroll agents orchestrate smaller, task-focused agents that are focused on their own scope and have their own instructions. Meetings are now handled by an agent itself that is reused between the two agents, as shown in the following figure.
When a new functionality is added to the meeting assistant agent, the HR agent and payroll agent only need to be updated to handle those functionalities. This approach can also be automated in your applications to increase the scalability of your agentic solutions. The supervisor agents (HR and payroll agents) can set the tone of your application as well as define how each functionality (knowledge base or sub-agent) of the agent should be used. That includes enforcing knowledge base filters and parameter constraints as part of the agentic application.
Crafting the user experience: Planning agent tone and greetings
The personality of your agent sets the tone for the entire user interaction. Carefully planning the tone and greetings of your agent is crucial for creating a consistent and engaging user experience. Consider factors such as brand voice and personality, target audience preferences, formality level, and cultural sensitivity.
For instance, a formal HR assistant might be instructed to address users formally, using titles and last names, while maintaining a professional and courteous tone throughout the conversation. In contrast, a friendly IT support agent could use a casual, upbeat tone, addressing users by their first names and even incorporating appropriate emojis and tech-related jokes to keep the conversation light and engaging.
The following is an example prompt for a formal HR assistant:
The following is an example prompt for a friendly IT support agent:
Make sure your agent’s tone aligns with your brand identity and remains constant across different interactions. When collaborating between multiple agents, you should set the tone across the application and enforce it over the different sub-agents.
Maintaining clarity: Providing unambiguous instructions and definitions
Clear communication is the cornerstone of effective AI agents. When defining instructions, functions, and knowledge base interactions, strive for unambiguous language that leaves no room for misinterpretation. Use simple, direct language and provide specific examples for complex concepts. Define clear boundaries between similar functions and implement confirmation mechanisms for critical actions. Consider the following example of clear vs. ambiguous instructions.
The following is an example ambiguous prompt
The following is a clearer prompt:
By providing clear instructions, you reduce the chances of errors and make sure your agent behaves predictably and reliably.
The same advice is valid when defining the functions of your action groups. Avoid ambiguous function names and definitions and set clear descriptions for its parameters. The following figure shows how to change the name, description, and parameters of two functions in an action group to get the user details and information based on what is actually returned by the functions and the expected value formatting for the user ID.
Finally, the knowledge base instructions should clearily state what is available in the knowledge base and when to use it to answer user queries.
The following is an ambiguous prompt:
The following is a clearer prompt:
Using organizational knowledge: Integrating knowledge bases
To make sure you provide your agents with enterprise knowledge, integrate them with your organization’s existing knowledge bases. This allows your agents to use vast amounts of information and provide more accurate, context-aware responses. By accessing up-to-date organizational data, your agents can improve response accuracy and relevance, cite authoritative sources, and reduce the need for frequent model updates.
Complete the following steps when integrating a knowledge base with Amazon Bedrock:
- Index your documents into a vector database using Amazon Bedrock Knowledge Bases.
- Configure your agent to access the knowledge base during interactions.
- Implement citation mechanisms to reference source documents in responses.
Regularly update your knowledge base to make sure your agent has consistent access to the most current information. This can achieved by implementing event-based synchronization of your knowledge base data sources using the StartIngestionJob API and an Amazon EventBridge rule that is invoked periodically or based on updates of files in the knowledge base Amazon Simple Storage Service (Amazon S3) bucket.
Integrating Amazon Bedrock Knowledge Bases with your agent will allow you to add semantic search capabilities to your application. By using the knowledgeBaseConfigurations
field in your agent’s SessionState during the InvokeAgent request, you can control how your agent interacts with your knowledge base by setting the desired number of results and any necessary filters.
Defining success: Establishing evaluation criteria
To measure the effectiveness of your AI agent, it’s essential to define specific evaluation criteria. These metrics will help you assess performance, identify areas for improvement, and track progress over time.
Consider the following key evaluation metrics:
- Response accuracy – This metric measures how your responses compare to your ground truth data. It provides information such as if the answers are correct and if the agent shows good performance and high quality.
- Task completion rate – This measures the success rate of the agent. The core idea of this metric is to measure the percentage or proportion of the conversations or user interactions where the agent was able to successfully complete the requested tasks and fulfill the user’s intent.
- Latency or response time – This metric measures how long a task took to run and the response time. Essentially, it measures how quickly the agent can provide a response or output after receiving an input or query. You can also set intermediate metrics that measure how long each step of the agent trace takes to run to identify the steps that need to be optimized in your system.
- Conversation efficiency – These measures how efficiently the conversation was able to collect the required information.
- Engagement – These measures how well the agent can understand the user’s intent, provide relevant and natural responses, and maintain an engagement with back-and-forth conversational flow.
- Conversation coherence – This metric measures the logical progression and continuity between the responses. It checks if the context and relevance are kept during the session and if the appropriate pronouns and references are used.
Furthermore, you should define your use case-specific evaluation metrics that determine how well the agent is fulfilling the tasks for your use case. For instance, for the HR use case, a possible custom metric could be the number of tickets created, because those are created when the agent can’t answer the question by itself.
Implementing a robust evaluation process involves creating a comprehensive test dataset based on your ground truth data, developing automated evaluation scripts to measure quantitative metrics, implementing A/B testing to compare different agent versions or configurations, and establishing a regular cadence for human evaluation of qualitative factors. Evaluation is an ongoing process, so you should continuously refine your criteria and measurement methods as you learn more about your agent’s performance and user needs.
Using human evaluation
Although automated metrics are valuable, human evaluation plays a crucial role in assessing and improving your AI agent’s performance. Human evaluators can provide nuanced feedback on aspects that are difficult to quantify automatically, such as assessing natural language understanding and generation, evaluating the appropriateness of responses in context, identifying potential biases or ethical concerns, and providing insights into user experience and satisfaction.
To effectively use human evaluation, consider the following best practices:
- Create a diverse panel of evaluators representing different perspectives
- Develop clear evaluation guidelines and rubrics
- Use a mix of expert evaluators (such as subject matter experts) and representative end-users
- Collect quantitative ratings and qualitative feedback
- Regularly analyze evaluation results to identify trends and areas for improvement
Continuous improvement: Testing, iterating, and refining
Building an effective AI agent is an iterative process. Now that you have a working prototype, it’s crucial to test extensively, gather feedback, and continuously refine your agent’s performance. This process should include comprehensive testing using your ground truth dataset; real-world user testing with a beta group; analysis of agent logs and conversation traces; regular updates to instructions, function definitions, and prompts; and performance comparison across different FMs.
To achieve thorough testing, consider using AI to generate diverse test cases. The following is an example prompt for generating HR assistant test scenarios:
One of the best tools of the testing phase is the agent trace. The trace provides you with the prompts used by the agent in each step taken during the agent’s orchestration. It gives insights on the agent’s chain of thought and reasoning process. You can enable the trace in your InvokeAgent call during the test process and disable it after your agent has been validated.
The next step after collecting a ground truth dataset is to evaluate the agent’s behavior. You first need to define evaluation criteria for assessing the agent’s behavior. For the HR assistant example, you can create a test dataset that compares the results provided by your agent with the results obtained by directly querying the vacations database. You can then manually evaluate the agent behavior using human evaluation, or you can automate the evaluation using agent evaluation frameworks such as Agent Evaluation. If model invocation logging is enabled, Amazon Bedrock Agents will also give you Amazon CloudWatch logs. You can use those logs to validate your agent’s behavior, debug unexpected outputs, and adjust the agent accordingly.
The last step of the agent testing phase is to plan for A/B testing groups during the deployment stage. You should define different aspects of agent behavior, such as formal or informal HR assistant tone, that can be tested with a smaller set of your user group. You can then make different agent versions available for each group during initial deployments and evaluate the agent behavior for each group. Amazon Bedrock Agents has built-in versioning capabilities to help you with this key part of testing.
Conclusions
Following these best practices and continuously refining your approach can significantly contribute to your success in developing powerful, accurate, and user-oriented AI agents using Amazon Bedrock. In Part 2 of this series, we explore architectural considerations, security best practices, and strategies for scaling your AI agents in production environments.
By following these best practices, you can build secure, accurate, scalable, and responsible generative AI applications using Amazon Bedrock. For examples to get started, check out the Amazon Bedrock Agents GitHub repository.
To learn more about Amazon Bedrock Agents, you can get started with the Amazon Bedrock Workshop and the standalone Amazon Bedrock Agents Workshop, which provides a deeper dive. Additionally, check out the service introduction video from AWS re:Invent 2023.
About the Authors
Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.
Navneet Sabbineni is a Software Development Manager at AWS Bedrock. With over 9 years of industry experience as a software developer and manager, he has worked on building and maintaining scalable distributed services for AWS, including generative AI services like Amazon Bedrock Agents and conversational AI services like Amazon Lex. Outside of work, he enjoys traveling and exploring the Pacific Northwest with his family and friends.
Monica Sunkara is a Senior Applied Scientist at AWS, where she works on Amazon Bedrock Agents. With over 10 years of industry experience, including 6 years at AWS, Monica has contributed to various AI and ML initiatives such as Alexa Speech Recognition, Amazon Transcribe, and Amazon Lex ASR. Her work spans speech recognition, natural language processing, and large language models. Recently, she worked on adding function calling capabilities to Amazon Titan text models. Monica holds a degree from Cornell University, where she conducted research on object localization under the supervision of Prof. Andrew Gordon Wilson before joining Amazon in 2018.