Mastering Amazon Bedrock throttling and service availability: A comprehensive guide
In production generative AI applications, errors occur from time to time, and the most common are requests failing with 429 ThrottlingException and 503 ServiceUnavailableException errors. In a business application, these errors can originate from multiple layers of the architecture.
Most of these errors are retriable, but retries delay responses and degrade the user experience. In interactive AI applications, delayed responses disrupt a conversation’s natural flow, reduce user interest, and ultimately hinder the adoption of AI-powered solutions.
One of the most common challenges is many users and applications converging on a single model at the same time. Mastering these errors is the difference between a resilient application and frustrated users.
This post shows you how to implement robust error handling strategies that can help improve application reliability and user experience when using Amazon Bedrock. We dive deep into strategies for keeping application performance steady in the face of these errors. Whether you are building a brand-new application or operating a mature AI workload, you will find practical guidelines here for handling them.
Prerequisites
- AWS account with Amazon Bedrock access
- Python 3.x and boto3 installed
- Basic understanding of AWS services
- IAM Permissions: Ensure you have the following minimum permissions:
  - bedrock:InvokeModel or bedrock:InvokeModelWithResponseStream for your specific models
  - cloudwatch:PutMetricData, cloudwatch:PutMetricAlarm for monitoring
  - sns:Publish if using SNS notifications
- Follow the principle of least privilege – grant only the permissions needed for your use case
Example IAM policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-*"
    }
  ]
}
```
Note: This walkthrough uses AWS services that may incur charges, including Amazon CloudWatch for monitoring and Amazon SNS for notifications. See AWS pricing pages for details.
Quick Reference: 503 vs 429 Errors
The following table compares these two error types:
| Aspect | 503 ServiceUnavailable | 429 ThrottlingException |
|---|---|---|
| Primary Cause | Temporary service capacity issues, server failures | Exceeded account quotas (RPM/TPM) |
| Quota Related | Not Quota Related | Directly quota-related |
| Resolution Time | Transient, refreshes faster | Requires waiting for quota refresh |
| Retry Strategy | Immediate retry with exponential backoff | Must sync with 60-second quota cycle |
| User Action | Wait and retry, consider alternatives | Optimize request patterns, increase quotas |
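To make that distinction concrete in code, here is a minimal sketch that inspects the error code on a botocore ClientError and classifies it for downstream handling; the invoke_with_classification wrapper and its parameters are illustrative helpers, not a Bedrock API:

```python
from botocore.exceptions import ClientError


def classify_bedrock_error(error):
    """Map a botocore ClientError to a coarse handling category."""
    code = error.response["Error"]["Code"]
    if code == "ThrottlingException":
        return "quota"      # 429: respect RPM/TPM quotas, back off until the next window
    if code == "ServiceUnavailableException":
        return "capacity"   # 503: transient service-side issue; retry or fail over
    return "other"


def invoke_with_classification(bedrock_client, **request_params):
    """Hypothetical wrapper: invoke the model and classify any client error."""
    try:
        return bedrock_client.invoke_model(**request_params)
    except ClientError as e:
        category = classify_bedrock_error(e)
        # Dispatch to the matching retry/fallback strategy based on `category`
        raise
```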
Deep dive into 429 ThrottlingException
A 429 ThrottlingException means Amazon Bedrock is deliberately rejecting some of your requests to keep overall usage within the quotas you have configured or that are assigned by default. In practice, you will most often see three flavors of throttling: rate-based, token-based, and model-specific.
1. Rate-Based Throttling (RPM – Requests Per Minute)
Error Message:
ThrottlingException: Too many requests, please wait before trying again.
Or:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Too many requests, please wait before trying again
What this actually indicates
Rate-based throttling is triggered when the total number of Bedrock requests per minute to a given model and Region crosses the RPM quota for your account. The key detail is that this limit is enforced across all callers in your account, not per individual application or microservice.
Imagine a shared queue at a coffee shop: it does not matter which team is standing in line; the barista can only serve a fixed number of drinks per minute. As soon as more people join the queue than the barista can handle, some customers are told to wait or come back later. That “come back later” message is your 429.
Multi-application spike scenario
Suppose you have three production applications, all calling the same Bedrock model in the same Region:
- App A normally peaks around 50 requests per minute.
- App B also peaks around 50 rpm.
- App C usually runs at about 50 rpm during its own peak.
Ops has requested a quota of 150 RPM for this model, which seems reasonable since 50 + 50 + 50 = 150 and historical dashboards show that each app stays around its expected peak.
However, in reality your traffic is not perfectly flat. Maybe during a flash sale or a marketing campaign, App A briefly spikes to 60 rpm while B and C stay at 50. The combined total for that minute becomes 160 rpm, which is above your 150 rpm quota, and some requests start failing with ThrottlingException.
You can also get into trouble when the three apps shift upward at the same time over longer periods. Imagine a new pattern where peak traffic looks like this:
- App A: 75 rpm
- App B: 50 rpm
- App C: 50 rpm
Your new true peak is 175 rpm even though the original quota was sized for 150. In this situation, you will see 429 errors regularly during those peak windows, even if average daily traffic still looks “fine.”
Mitigation strategies
For rate-based throttling, the mitigation has two sides: client behavior and quota management.
On the client side:
- Implement request rate limiting to cap how many calls per second or per minute each application can send. SDK wrappers, middleware, or an API gateway can enforce per-app budgets so one noisy client does not starve the others (see the rate limiter sketch after this section).
- Use exponential backoff with jitter on 429 errors so that retries can become gradually less frequent and are de-synchronized across instances.
- Align retry windows with the quota refresh period: because RPM is enforced per 60-second window, retries that happen several seconds into the next minute are more likely to succeed.
On the quota side:
- Analyze CloudWatch metrics for each application to determine true peak RPM rather than relying on averages.
- Sum those peaks across the apps for the same model/Region, add a safety margin, and request an RPM increase through AWS Service Quotas if needed.
In the previous example, if App A peaks at 75 rpm and B and C peak at 50 rpm, you should plan for at least 175 rpm and realistically target something like 200 rpm to provide room for growth and unexpected bursts.
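Here is the rate limiter sketch referenced above: a minimal client-side, sliding-window request limiter. The 75 RPM budget mirrors the App A example and is an assumption to tune against your own quota:

```python
import time
from collections import deque


class RequestRateLimiter:
    """Caps client-side requests per rolling 60-second window."""

    def __init__(self, rpm_limit):
        self.rpm_limit = rpm_limit
        self.request_times = deque()

    def acquire(self):
        """Block until a request slot is available within the RPM budget."""
        while True:
            now = time.time()
            # Drop timestamps older than the 60-second window
            while self.request_times and self.request_times[0] < now - 60:
                self.request_times.popleft()
            if len(self.request_times) < self.rpm_limit:
                self.request_times.append(now)
                return
            # Sleep until the oldest request ages out of the window
            time.sleep(max(0.0, self.request_times[0] + 60 - now))


# Example: keep App A under a 75 RPM budget before each Bedrock call
limiter = RequestRateLimiter(rpm_limit=75)
```

In a multi-application setup, give each application its own budget so that the sum of the budgets stays below the shared account quota.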
2. Token-Based Throttling (TPM – Tokens Per Minute)
Error message:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Too many tokens, please wait before trying again.
Why token limits matter
Even if your request count is modest, a single large prompt or a model that produces long outputs can consume thousands of tokens at once. Token-based throttling occurs when the sum of input and output tokens processed per minute exceeds your account’s TPM quota for that model.
For example, an application that sends 10 requests per minute with 15,000 input tokens and 5,000 output tokens each is consuming roughly 200,000 tokens per minute, which may cross TPM thresholds far sooner than an application that sends 200 tiny prompts per minute.
What this looks like in practice
You may notice that your application runs smoothly under normal workloads, but suddenly starts failing when users paste large documents, upload long transcripts, or run bulk summarization jobs. These are symptoms that token throughput, not request frequency, is the bottleneck.
How to respond
To mitigate token-based throttling:
- Monitor token usage by tracking InputTokenCount and OutputTokenCount metrics and logs for your Bedrock invocations.
- Implement a token-aware rate limiter that maintains a sliding 60-second window of tokens consumed and only issues a new request if there is enough budget left.
- Break large tasks into smaller, sequential chunks so you spread token consumption over multiple minutes instead of exhausting the entire budget in one spike.
- Use streaming responses when appropriate; streaming often gives you more control over when to stop generation so you do not produce unnecessarily long outputs.
For consistently high-volume, token-intensive workloads, you should also evaluate requesting higher TPM quotas or using models with larger context windows and better throughput characteristics.
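One way to apply the chunking advice is to estimate tokens up front and split large inputs before invoking the model. The four-characters-per-token heuristic below is a rough approximation that varies by model and language, so treat both the estimate and the chunk size as assumptions:

```python
def estimate_tokens(text):
    """Very rough token estimate: ~4 characters per token (heuristic only)."""
    return max(1, len(text) // 4)


def split_into_chunks(text, max_tokens_per_chunk=4000):
    """Split text into pieces that stay under a per-request token budget."""
    max_chars = max_tokens_per_chunk * 4
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# Example: spread a bulk summarization job across chunks over time.
# summarize_chunk is a placeholder for your own Bedrock call, ideally paired
# with the token-aware limiter shown later in this post.
# summaries = [summarize_chunk(c) for c in split_into_chunks(long_transcript)]
```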
3. Model-Specific Throttling
Error message:
botocore.errorfactory.ThrottlingException: An error occurred (ThrottlingException) when calling the InvokeModel operation: Model anthropic.claude-haiku-4-5-20251001-v1:0 is currently overloaded. Please try again later.
What is happening behind the scenes
Model-specific throttling indicates that a particular model endpoint is experiencing heavy demand and is temporarily limiting additional traffic to keep latency and stability under control. In this case, your own quotas might not be the limiting factor; instead, the shared infrastructure for that model is temporarily saturated.
How to respond
One of the most effective approaches here is to design for graceful degradation rather than treating this as a hard failure.
- Implement model fallback: define a priority list of compatible models (for example, Sonnet → Haiku) and automatically route traffic to a secondary model if the primary is overloaded.
- Combine fallback with cross-Region inference so you can use the same model family in a nearby Region if one Region is temporarily constrained.
- Expose fallback behavior in your observability stack so you can know when your system is running in “degraded but functional” mode instead of silently masking problems.
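A minimal sketch of that fallback logic is shown below using the Converse API. The ordered model IDs are hypothetical placeholders; substitute models you actually have access to, and remember that prompts may need minor adjustments between model families:

```python
from botocore.exceptions import ClientError

# Ordered by preference; these IDs are hypothetical placeholders
FALLBACK_MODELS = [
    "anthropic.claude-sonnet-example-v1:0",   # primary (hypothetical ID)
    "anthropic.claude-haiku-example-v1:0",    # secondary (hypothetical ID)
]


def converse_with_fallback(bedrock_client, messages):
    """Try each model in priority order, falling back when one is throttled or overloaded."""
    last_error = None
    for model_id in FALLBACK_MODELS:
        try:
            return bedrock_client.converse(modelId=model_id, messages=messages)
        except ClientError as e:
            if e.response["Error"]["Code"] == "ThrottlingException":
                last_error = e
                continue   # this model is overloaded or throttled; try the next one
            raise
    raise last_error
```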
Implementing robust retry and rate limiting
Once you understand the types of throttling, the next step is to encode that knowledge into reusable client-side components.
Exponential backoff with jitter
Here’s a robust retry implementation that uses exponential backoff with jitter. This pattern is essential for handling throttling gracefully:
```python
import time
import random

from botocore.exceptions import ClientError


def bedrock_request_with_retry(bedrock_client, operation, **kwargs):
    """Secure retry implementation with sanitized logging."""
    max_retries = 5
    base_delay = 1
    max_delay = 60

    for attempt in range(max_retries):
        try:
            if operation == 'invoke_model':
                return bedrock_client.invoke_model(**kwargs)
            elif operation == 'converse':
                return bedrock_client.converse(**kwargs)
        except ClientError as e:
            # Security: Log error codes but not request/response bodies,
            # which may contain sensitive customer data
            if e.response['Error']['Code'] == 'ThrottlingException':
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff with jitter
                delay = min(base_delay * (2 ** attempt), max_delay)
                jitter = random.uniform(0, delay * 0.1)
                time.sleep(delay + jitter)
                continue
            else:
                raise
```
This pattern avoids hammering the service immediately after a throttling event and helps prevent many instances from retrying at the same exact moment.
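For example, assuming a shared bedrock-runtime client named bedrock_client (configured later in this post), you could call the Converse API through this helper; the model ID is a placeholder:

```python
# Hypothetical usage; replace the model ID with one you have access to
response = bedrock_request_with_retry(
    bedrock_client,
    'converse',
    modelId='your-model-id',
    messages=[{"role": "user", "content": [{"text": "Summarize our return policy."}]}],
)
```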
Token-Aware Rate Limiting
For token-based throttling, the following class maintains a sliding window of token usage and gives your caller a simple yes/no answer on whether it is safe to issue another request:
```python
import time
from collections import deque


class TokenAwareRateLimiter:
    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.token_usage = deque()

    def can_make_request(self, estimated_tokens):
        now = time.time()
        # Remove tokens older than 1 minute
        while self.token_usage and self.token_usage[0][0] < now - 60:
            self.token_usage.popleft()
        current_usage = sum(tokens for _, tokens in self.token_usage)
        return current_usage + estimated_tokens <= self.tpm_limit

    def record_usage(self, tokens_used):
        self.token_usage.append((time.time(), tokens_used))
```
In practice, you would estimate tokens before sending the request, call can_make_request, and only proceed when it returns True, then call record_usage after receiving the response.
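A hedged sketch of that flow, assuming a shared bedrock_client, a 200K TPM quota, and placeholder messages; with the Converse API, actual token counts come back in the response's usage field:

```python
import time

limiter = TokenAwareRateLimiter(tpm_limit=200_000)   # assumed 200K TPM quota

# estimated_tokens would come from your own prompt-size estimate plus expected output
estimated_tokens = 20_000
while not limiter.can_make_request(estimated_tokens):
    time.sleep(1)   # wait for older usage to age out of the 60-second window

response = bedrock_client.converse(modelId='your-model-id', messages=messages)  # placeholders

usage = response["usage"]   # the Converse API reports actual token counts here
limiter.record_usage(usage["inputTokens"] + usage["outputTokens"])
```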
Understanding 503 ServiceUnavailableException
A 503 ServiceUnavailableException tells you that Amazon Bedrock is temporarily unable to process your request, often due to capacity pressure, networking issues, or exhausted connection pools. Unlike 429, this is not about your quota; it is about the health or availability of the underlying service at that moment.
Connection Pool Exhaustion
What it looks like:
botocore.errorfactory.ServiceUnavailableException: An error occurred (ServiceUnavailableException) when calling the ConverseStream operation (reached max retries: 4): Too many connections, please wait before trying again.
In many real-world scenarios this error is caused not by Bedrock itself, but by how your client is configured:
- By default, the boto3 HTTP connection pool size is relatively small (for example, 10 connections), which can be quickly exhausted by highly concurrent workloads.
- Creating a new client for every request instead of reusing a single client per process or container can multiply the number of open connections unnecessarily.
To help fix this, share a single Bedrock client instance and increase the connection pool size:
```python
import boto3
from botocore.config import Config

# Security Best Practice: Never hardcode credentials.
# boto3 automatically uses credentials from:
#   1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
#   2. IAM role (recommended for EC2, Lambda, ECS)
#   3. AWS credentials file (~/.aws/credentials)
#   4. IAM roles for service accounts (recommended for EKS)

# Configure a larger connection pool for parallel execution
config = Config(
    max_pool_connections=50,   # Increase from the default of 10
    retries={'max_attempts': 3}
)

bedrock_client = boto3.client('bedrock-runtime', config=config)
```
This configuration allows more parallel requests through a single, well-tuned client instead of hitting client-side limits.
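For instance, a small worker pool that reuses this single client can fan out requests in parallel without exhausting connections; the worker count, model ID, and message_batches iterable below are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor


def invoke_one(messages):
    # Every worker reuses the same bedrock_client and its shared connection pool
    return bedrock_client.converse(modelId='your-model-id', messages=messages)


# 20 workers stay comfortably within max_pool_connections=50
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(invoke_one, message_batches))   # message_batches is a placeholder
```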
Temporary Service Resource Issues
What it looks like:
botocore.errorfactory.ServiceUnavailableException: An error occurred (ServiceUnavailableException) when calling the InvokeModel operation: Service temporarily unavailable, please try again.
In this case, the Bedrock service is signaling a transient capacity or infrastructure issue, often affecting on-demand models during demand spikes. Here you should treat the error as a temporary outage and focus on retrying smartly and failing over gracefully:
- Use exponential backoff retries, similar to your 429 handling, but with parameters tuned for slower recovery.
- Consider using cross-Region inference or different service tiers to help get more predictable capacity envelopes for your most critical workloads.
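One lightweight option is to lean on the SDK's built-in retry modes with more generous settings than the default; the attempt count below is a judgment call, not an official recommendation:

```python
import boto3
from botocore.config import Config

# Let the SDK apply client-side rate limiting and backoff on transient errors
resilient_config = Config(
    retries={
        'max_attempts': 8,    # more attempts than the default to ride out short 503 windows
        'mode': 'adaptive',   # adds client-side rate limiting on top of standard retries
    }
)

bedrock_client_503_tuned = boto3.client('bedrock-runtime', config=resilient_config)
```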
Advanced resilience strategies
When you operate mission-critical systems, simple retries are not enough; you also want to avoid making a bad situation worse.
Circuit Breaker Pattern
The circuit breaker pattern helps prevent your application from continuously calling a service that is already failing. Instead, it quickly flips into an “open” state after repeated failures, blocking new requests for a cooling-off period.
- CLOSED (Normal): Requests flow normally.
- OPEN (Failing): After repeated failures, new requests are rejected immediately, helping reduce pressure on the service and conserve client resources.
- HALF_OPEN (Testing): After a timeout, a small number of trial requests are allowed; if they succeed, the circuit closes again.
Why This Matters for Bedrock
When Bedrock returns 503 errors due to capacity issues, continuing to hammer the service with requests only makes things worse. The circuit breaker pattern helps:
- Reduce load on the struggling service, helping it recover faster
- Fail fast instead of wasting time on requests that will likely fail
- Provide automatic recovery by periodically testing if the service is healthy again
- Improve user experience by returning errors quickly rather than timing out
The following code implements this:
```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN


# Usage
circuit_breaker = CircuitBreaker()

def make_bedrock_request():
    return circuit_breaker.call(bedrock_client.invoke_model, **request_params)
```
Cross-Region Failover Strategy with CRIS
Amazon Bedrock cross-Region inference (CRIS) helps add another layer of resilience by giving you a managed way to route traffic across Regions.
- Global CRIS Profiles: route traffic across supported AWS commercial Regions, typically offering the best combination of throughput and cost (often around 10% savings).
- Geographic CRIS Profiles: confine traffic to a specific geography (for example, US-only, EU-only, or APAC-only) to help satisfy strict data residency or regulatory requirements.
For applications without data residency requirements, global CRIS offers enhanced performance, reliability, and cost efficiency.
From an architecture standpoint:
- For non-regulated workloads, using a global profile can significantly improve availability and absorb regional spikes.
- For regulated workloads, configure geographic profiles that align with your compliance boundaries, and document those decisions in your governance artifacts.
Amazon Bedrock automatically encrypts data in transit using TLS and does not store customer prompts or outputs by default; combine this with AWS CloudTrail logging to strengthen your compliance posture.
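Using CRIS from application code is mostly a matter of passing an inference profile ID instead of a plain model ID in the same InvokeModel or Converse calls. The sketch below uses an illustrative profile ID; geographic profiles typically carry a Region-group prefix (such as us., eu., or apac.), so check the inference profiles actually available in your account:

```python
# Same Converse call as before; only the identifier changes to a CRIS inference profile ID.
# The ID below is illustrative; list the inference profiles in your account for real IDs.
response = bedrock_client.converse(
    modelId="us.anthropic.claude-example-v1:0",   # hypothetical US geographic inference profile
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)
```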
Monitoring and Observability for 429 and 503 Errors
You cannot manage what you cannot see, so robust monitoring is essential when working with quota-driven errors and service availability. Comprehensive Amazon CloudWatch monitoring enables proactive error management and helps maintain application reliability.
Note: CloudWatch custom metrics, alarms, and dashboards incur charges based on usage. Review CloudWatch pricing for details.
Essential CloudWatch Metrics
Monitor these CloudWatch metrics (published in the AWS/Bedrock namespace):
- Invocations: Successful model invocations
- InvocationClientErrors: 4xx errors including throttling
- InvocationServerErrors: 5xx errors including service unavailability
- InvocationThrottles: 429 throttling errors
- InvocationLatency: Response times
- InputTokenCount/OutputTokenCount: Token usage for TPM monitoring
For better insight, create dashboards that:
- Separate 429 and 503 into different widgets so you can see whether a spike is quota-related or service-side.
- Break down metrics by ModelId and Region to find the specific models or Regions that are problematic.
- Show side-by-side comparisons of current traffic vs previous weeks to spot emerging trends before they become incidents.
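As a starting point for that kind of breakdown, the following sketch pulls hourly throttle counts for one model from the AWS/Bedrock namespace; the model ID is a placeholder:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')

end = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Bedrock',
    MetricName='InvocationThrottles',
    Dimensions=[{'Name': 'ModelId', 'Value': 'your-model-id'}],   # placeholder model ID
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,            # 5-minute buckets
    Statistics=['Sum'],
)

# Print throttle counts per 5-minute bucket in chronological order
for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], int(point['Sum']))
```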
Critical Alarms
Do not wait until users notice failures before you act. Configure CloudWatch alarms with Amazon SNS notifications based on thresholds such as:
For 429 Errors:
- A high number of throttling events in a 5-minute window.
- Consecutive periods with non-zero throttle counts, indicating sustained pressure.
- Quota utilization above a chosen threshold (for example, 80% of RPM/TPM).
For 503 Errors:
- Service success rate falling below your SLO (for example, less than 95% success over 10 minutes).
- Sudden spikes in 503 counts correlated with specific Regions or models.
- Signs of connection pool saturation in client-side metrics.
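Here is a hedged example of the throttling alarm wired to an SNS topic; the threshold, model ID, and topic ARN are assumptions to replace with your own values:

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='bedrock-throttling-high',
    AlarmDescription='High number of Bedrock ThrottlingExceptions in a 5-minute window',
    Namespace='AWS/Bedrock',
    MetricName='InvocationThrottles',
    Dimensions=[{'Name': 'ModelId', 'Value': 'your-model-id'}],   # placeholder model ID
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=50,                                   # tune to your traffic
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:bedrock-alerts'],   # placeholder topic ARN
)
```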
Alarm Configuration Best Practices
- Use Amazon Simple Notification Service (Amazon SNS) topics to route alerts to your team’s communication channels (Slack, PagerDuty, email)
- Set up different severity levels: Critical (immediate action), Warning (investigate soon), Info (trending issues)
- Configure alarm actions to trigger automated responses where appropriate
- Include detailed alarm descriptions with troubleshooting steps and runbook links
- Test your alarms regularly to make sure notifications are working correctly
- Do not include sensitive customer data in alarm messages
Log Analysis Queries
CloudWatch Logs Insights queries help you move from “we see errors” to “we understand patterns.” Examples include:
Find 429 error patterns:
```
fields @timestamp, @message
| filter @message like /ThrottlingException/
| stats count() by bin(5m)
| sort @timestamp desc
```
Analyze 503 error correlation with request volume:
```
fields @timestamp, @message
| filter @message like /ServiceUnavailableException/
| stats count() as error_count by bin(1m)
| sort @timestamp desc
```
Wrapping Up: Building Resilient Applications
We’ve covered a lot of ground in this post, so let’s bring it all together. Successfully handling Bedrock errors requires:
- Understand root causes: Distinguish quota limits (429) from capacity issues (503)
- Implement appropriate retries: Use exponential backoff with different parameters for each error type
- Design for scale: Use connection pooling, circuit breakers, and Cross-Region failover
- Monitor proactively: Set up comprehensive CloudWatch monitoring and alerting
- Plan for growth: Request quota increases and implement fallback strategies
Conclusion
Handling 429 ThrottlingException and 503 ServiceUnavailableException errors effectively is a crucial part of running production-grade generative AI workloads on Amazon Bedrock. By combining quota-aware design, intelligent retries, client-side resilience patterns, cross-Region strategies, and strong observability, you can keep your applications responsive even under unpredictable load.
As a next step, identify your most critical Bedrock workloads, enable the retry and rate-limiting patterns described here, and build dashboards and alarms that expose your real peaks rather than just averages. Over time, use real traffic data to refine quotas, fallback models, and regional deployments so your AI systems can remain both powerful and dependable as they scale.
For teams looking to accelerate incident resolution, consider enabling AWS DevOps Agent—an AI-powered agent that investigates Bedrock errors by correlating CloudWatch metrics, logs, and alarms just like an experienced DevOps engineer would. It learns your resource relationships, works with your observability tools and runbooks, and can significantly reduce mean time to resolution (MTTR) for 429 and 503 errors by automatically identifying root causes and suggesting remediation steps.
Learn More
- Amazon Bedrock Documentation
- Amazon Bedrock Quotas
- Cross-Region Inference
- Cross-Region Inference Security
- SNS Security
- AWS Logging Best Practices
- AWS Bedrock Security Best Practices
- AWS IAM Best Practices – Least Privilege
About the Authors