Effectively use prompt caching on Amazon Bedrock

Prompt caching, now generally available on Amazon Bedrock with Anthropic’s Claude 3.5 Haiku and Claude 3.7 Sonnet, along with Nova Micro, Nova Lite, and Nova Pro models, lowers response latency by up to 85% and reduces costs up to 90% by caching frequently used prompts across multiple API calls.
With prompt caching, you can mark the specific contiguous portions of your prompts to be cached (known as a prompt prefix). When a request is made with the specified prompt prefix, the model processes the input and caches the internal state associated with the prefix. On subsequent requests with a matching prompt prefix, the model reads from the cache and skips the computation steps required to process the input tokens. This reduces the time to first token (TTFT) and makes more efficient use of hardware such that we can share the cost savings with you.
This post provides a detailed overview of the prompt caching feature on Amazon Bedrock and offers guidance on how to effectively use this feature to achieve improved latency and cost savings.
How prompt caching works
Large language model (LLM) processing is made up of two primary stages: input token processing and output token generation. The prompt caching feature on Amazon Bedrock optimizes the input token processing stage.
You can begin by marking the relevant portions of your prompt with cache checkpoints. The entire section of the prompt preceding the checkpoint then becomes the cached prompt prefix. As you send more requests with the same prompt prefix, marked by the cache checkpoint, the LLM will check if the prompt prefix is already stored in the cache. If a matching prefix is found, the LLM can read from the cache, allowing the input processing to resume from the last cached prefix. This saves the time and cost that would otherwise be spent recomputing the prompt prefix.
The prompt caching feature is model-specific. Review the supported models and the details on the minimum number of tokens per cache checkpoint and the maximum number of cache checkpoints per request.
Cache hits only occur when the exact prefix matches. To fully realize the benefits of prompt caching, it’s recommended to position static content such as instructions and examples at the beginning of the prompt. Dynamic content, including user-specific information, should be placed at the end of the prompt. This principle also extends to images and tools, which must remain identical across requests in order to enable caching.
The following diagram illustrates how cache hits work. A, B, C, D represent distinct portions of the prompt. A, B and C are marked as the prompt prefix. Cache hits occur when subsequent requests contain the same A, B, C prompt prefix.
When to use prompt caching
Prompt caching on Amazon Bedrock is recommended for workloads that reuse long context prompts across multiple API calls. It can reduce response latency by up to 85% and inference costs by up to 90%, making it well-suited for applications that use repetitive, long input context. To determine whether prompt caching is beneficial for your use case, estimate the number of tokens you plan to cache, the frequency of reuse, and the time between requests.
The following use cases are well-suited for prompt caching:
- Chat with document – By caching the document as input context on the first request, each user query becomes more efficient, enabling simpler architectures that avoid heavier solutions like vector databases.
- Coding assistants – Reusing long code files in prompts enables near real-time inline suggestions, eliminating much of the time spent reprocessing code files.
- Agentic workflows – Longer system prompts can be used to refine agent behavior without degrading the end-user experience. By caching the system prompts and complex tool definitions, the time to process each step in the agentic flow can be reduced.
- Few-shot learning – Prompts that include numerous high-quality examples and complex instructions, such as for customer service or technical troubleshooting, can benefit from prompt caching.
How to use prompt caching
When evaluating a use case for prompt caching, it’s crucial to categorize the components of a given prompt into two distinct groups: the static and repetitive portion, and the dynamic portion. The prompt template should adhere to the structure illustrated in the following figure.
You can create multiple cache checkpoints within a request, subject to model-specific limits. Each checkpoint should follow the same static portion, cache checkpoint, dynamic portion structure, as illustrated in the following figure.
Use case example
The “chat with document” use case, where the document is included in the prompt, is well-suited for prompt caching. In this example, the static portion of the prompt would comprise instructions on response formatting and the body of the document. The dynamic portion would be the user’s query, which changes with each request.
In this scenario, the static portions of the prompt should be marked as the prompt prefixes to enable prompt caching. The following code snippet demonstrates how to implement this approach using the Invoke Model API. Here we create two cache checkpoints in the request, one for the instructions and one for the document content, as illustrated in the following figure.
We embed the instructions, the document body, and the user’s question directly in the request.
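The sketch below shows one way to structure such a request, using boto3 with the Anthropic Messages request format. The instruction text, document body, question, AWS Region, and model ID are placeholders to adapt to your environment; the two cache_control markers create the two cache checkpoints.

```python
import json

import boto3

# Region and model ID are assumptions -- use the Region and model (or inference
# profile) enabled in your account.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"

instructions = (
    "You are an assistant that answers questions strictly from the supplied "
    "document and formats every answer as short bullet points."
)  # placeholder instructions
document_text = "<full text of the document goes here>"  # placeholder document body

request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "system": [
        # Cache checkpoint 1: the static instructions.
        {"type": "text", "text": instructions,
         "cache_control": {"type": "ephemeral"}},
        # Cache checkpoint 2: the document body.
        {"type": "text", "text": document_text,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        # Dynamic portion: the user's question changes on every request.
        {"role": "user",
         "content": [{"type": "text", "text": "What is the main topic of this document?"}]},
    ],
}

response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=json.dumps(request_body))
result = json.loads(response["body"].read())
print(result["usage"])
```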
In the response to the preceding code snippet, there is a usage section that provides metrics on the cache reads and writes. The following is the example response from the first model invocation:
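The response body follows the Anthropic Messages format and looks roughly like the following. The cache_creation_input_tokens value matches the count discussed next; the remaining fields and counts are illustrative.

```json
{
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "- The document discusses ..."}],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 10,
    "cache_creation_input_tokens": 37209,
    "cache_read_input_tokens": 0,
    "output_tokens": 148
  }
}
```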
The cache checkpoint has been successfully created with 37,209 tokens cached, as indicated by the cache_creation_input_tokens value, as illustrated in the following figure.
For the subsequent request, we can ask a different question:
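Continuing the sketch above, only the user message in the request changes (the question below is again a placeholder):

```python
# The cached prefix (instructions + document) is unchanged; only the dynamic
# user question is replaced.
request_body["messages"] = [
    {"role": "user",
     "content": [{"type": "text", "text": "What are the key takeaways from this document?"}]},
]

response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=json.dumps(request_body))
print(json.loads(response["body"].read())["usage"])
```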
The dynamic portion of the prompt has been changed, but the static portion and prompt prefixes remain the same. We can expect cache hits from the subsequent invocations. See the following code:
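The usage section of the second response reflects the cache hit. The cached and input token counts below are the ones from this walkthrough; the output token count is illustrative.

```json
{
  "usage": {
    "input_tokens": 10,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 37209,
    "output_tokens": 133
  }
}
```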
37,209 tokens are for the document and instructions read from the cache, and 10 input tokens are for the user query, as illustrated in the following figure.
Let’s change the document to a different blog post, but our instructions remain the same. We can expect cache hits for the instructions prompt prefix because it was positioned before the document body in our requests. See the following code:
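With a new document behind the same instructions, the usage section shows a cache read for the instructions prefix and a cache write for the new document body. The read and write counts below are the ones from this walkthrough; the other values are illustrative.

```json
{
  "usage": {
    "input_tokens": 10,
    "cache_creation_input_tokens": 37888,
    "cache_read_input_tokens": 1038,
    "output_tokens": 141
  }
}
```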
In the response, we can see 1,038 cache read tokens for the instructions and 37,888 cache write tokens for the new document content, as illustrated in the following figure.
Cost savings
When a cache hit happens, Amazon Bedrock passes along the compute savings to customers by giving a per-token discount on cached context. To calculate the potential cost savings, you should first understand your prompt caching usage pattern with cache write/read metrics in the Amazon Bedrock response. Then you can calculate your potential cost savings with price per 1,000 input tokens (cache write) and price per 1,000 input tokens (cache read). For more price details, see Amazon Bedrock pricing.
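As a rough illustration, the estimate might look like the following sketch. The per-1,000-token prices are placeholders, not published rates; substitute the model-specific prices from the Amazon Bedrock pricing page and the token counts from your own responses.

```python
# Placeholder prices per 1,000 input tokens -- replace with the rates for your
# model from the Amazon Bedrock pricing page.
PRICE_INPUT = 0.003          # standard input tokens
PRICE_CACHE_WRITE = 0.00375  # input tokens (cache write)
PRICE_CACHE_READ = 0.0003    # input tokens (cache read)

cached_tokens = 37_209  # tokens in the reused prompt prefix (from this walkthrough)
requests = 100          # requests that reuse the same prefix within the cache lifetime

cost_without_cache = requests * cached_tokens / 1000 * PRICE_INPUT
cost_with_cache = (
    cached_tokens / 1000 * PRICE_CACHE_WRITE                    # first request writes the cache
    + (requests - 1) * cached_tokens / 1000 * PRICE_CACHE_READ  # later requests read from it
)
savings = 1 - cost_with_cache / cost_without_cache
print(f"Estimated savings on the cached portion: {savings:.0%}")
```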
Latency benchmark
Prompt caching is optimized to improve the TTFT performance on repetitive prompts. Prompt caching is well-suited for conversational applications that involve multi-turn interactions, similar to chat playground experiences. It can also benefit use cases that require repeatedly referencing a large document.
However, prompt caching might be less effective for workloads that pair a lengthy 2,000-token system prompt with long, dynamically changing text afterwards. In such cases, the benefits of prompt caching might be limited.
We have published a notebook on how to use prompt caching and how to benchmark it in our GitHub repo. The benchmark results will depend on your use case, including the input token count, cached token count, and output token count.
Amazon Bedrock cross-Region inference
Prompt caching can be used in conjunction with cross-Region inference (CRIS). Cross-Region inference automatically selects the optimal AWS Region within your geography to serve your inference request, thereby maximizing available resources and model availability. At times of high demand, these optimizations may lead to increased cache writes.
Metrics and observability
Prompt caching observability is essential for optimizing cost savings and improving latency in applications using Amazon Bedrock. By monitoring key performance metrics, developers can achieve significant efficiency improvements—such as reducing TTFT by up to 85% and cutting costs by up to 90% for lengthy prompts. These metrics are pivotal because they enable developers to assess cache performance accurately and make strategic decisions regarding cache management.
Monitoring with Amazon Bedrock
Amazon Bedrock exposes cache performance data through the API response’s usage section, allowing developers to track essential metrics such as cache hit rates, token consumption (both read and write), and latency improvements. By using these insights, teams can effectively manage caching strategies to enhance application responsiveness and reduce operational costs.
Monitoring with Amazon CloudWatch
Amazon CloudWatch provides a robust platform for monitoring the health and performance of AWS services, including new automatic dashboards tailored specifically for Amazon Bedrock models. These dashboards offer quick access to key metrics and facilitate deeper insights into model performance.
To create custom observability dashboards, complete the following steps:
- On the CloudWatch console, create a new dashboard. For a full example, see Improve visibility into Amazon Bedrock usage and performance with Amazon CloudWatch.
- Choose CloudWatch as your data source and select Pie for the initial widget type (this can be adjusted later).
- Update the time range for metrics (such as 1 hour, 3 hours, or 1 day) to suit your monitoring needs.
- Select Bedrock under AWS namespaces.
- Enter “cache” in the search box to filter cache-related metrics.
- For the model, locate anthropic.claude-3-7-sonnet-20250219-v1:0, and select both CacheWriteInputTokenCount and CacheReadInputTokenCount.
- Choose Create widget and then Save to save your dashboard.
The following is a sample JSON configuration for creating this widget:
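The exact definition depends on your dashboard, but a widget along the following lines captures the two cache metrics for the model used in this post (the Region is an assumption):

```json
{
  "type": "metric",
  "width": 12,
  "height": 6,
  "properties": {
    "title": "Prompt caching - cache read vs. cache write tokens",
    "view": "pie",
    "region": "us-east-1",
    "stat": "Sum",
    "period": 3600,
    "metrics": [
      ["AWS/Bedrock", "CacheReadInputTokenCount", "ModelId", "anthropic.claude-3-7-sonnet-20250219-v1:0"],
      ["AWS/Bedrock", "CacheWriteInputTokenCount", "ModelId", "anthropic.claude-3-7-sonnet-20250219-v1:0"]
    ]
  }
}
```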
Understanding cache hit rates
Analyzing cache hit rates involves observing both CacheReadInputTokenCount and CacheWriteInputTokenCount. By summing these metrics over a defined period, developers can gain insights into the efficiency of their caching strategies. With the published model-specific price per 1,000 input tokens (cache write) and price per 1,000 input tokens (cache read) on the Amazon Bedrock pricing page, you can estimate the potential cost savings for your specific use case.
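For example, a small boto3 script along the following lines (the Region is an assumption) can sum the two CloudWatch metrics over the last day and report the share of prefix tokens served from the cache:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # Region is an assumption
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"
end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

def total_tokens(metric_name: str) -> float:
    """Sum a Bedrock token metric for the model over the chosen window."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName=metric_name,
        Dimensions=[{"Name": "ModelId", "Value": MODEL_ID}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in stats["Datapoints"])

reads = total_tokens("CacheReadInputTokenCount")
writes = total_tokens("CacheWriteInputTokenCount")
total = reads + writes
read_share = reads / total if total else 0.0
print(f"Cache read tokens: {reads:,.0f}")
print(f"Cache write tokens: {writes:,.0f}")
print(f"Share of prefix tokens served from cache: {read_share:.1%}")
```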
Conclusion
This post explored the prompt caching feature in Amazon Bedrock, demonstrating how it works, when to use it, and how to use it effectively. Carefully evaluate whether your use case will benefit from this feature: success depends on thoughtful prompt structuring, understanding the distinction between static and dynamic content, and selecting appropriate caching strategies for your specific needs. By using CloudWatch metrics to monitor cache performance and following the implementation patterns outlined in this post, you can build more efficient and cost-effective AI applications while maintaining high performance.
For more information about working with prompt caching on Amazon Bedrock, see Prompt caching for faster model inference.
About the authors
Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Amazon Bedrock security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.
Kosta Belz is a Senior Applied Scientist in the AWS Generative AI Innovation Center, where he helps customers design and build generative AI solutions to solve key business problems.
Sean Eichenberger is a Sr Product Manager at AWS.