Improve Amazon Nova migration performance with data-aware prompt optimization

In the era of generative AI, new large language models (LLMs) are continually emerging, each with unique capabilities, architectures, and optimizations. Among these, Amazon Nova foundation models (FMs) deliver frontier intelligence and industry-leading cost-performance, available exclusively on Amazon Bedrock. Since their launch in 2024, generative AI practitioners, including teams within Amazon, have started transitioning their workloads from existing FMs to Amazon Nova models.
However, when transitioning between foundation models, the prompts created for your original model might not perform as well on Amazon Nova models without prompt engineering and optimization. Amazon Bedrock prompt optimization offers a tool to automatically optimize prompts for your specified target models (in this case, Amazon Nova models), converting your original prompts into Amazon Nova-style prompts. Additionally, a key challenge during migration is making sure that performance after the migration is at least as good as before it. To address this challenge, thorough model evaluation, benchmarking, and data-aware optimization are essential: compare the Amazon Nova model’s performance against the model used before the migration, then optimize the prompts on Amazon Nova to match or exceed the performance of the previous workload.
In this post, we present an LLM migration paradigm and architecture, including a continuous process of model evaluation, prompt generation using Amazon Bedrock, and data-aware optimization. The solution evaluates the model performance before migration and iteratively optimizes the Amazon Nova model prompts using user-provided dataset and objective metrics. We demonstrate successful migration to Amazon Nova for three LLM tasks: text summarization, multi-class text classification, and question-answering implemented by Retrieval Augmented Generation (RAG). We also discuss the lessons learned and best practices for you to implement the solution for your real-world use cases.
Migrating your generative AI workloads to Amazon Nova
Migrating the model in your generative AI workload to Amazon Nova requires a structured approach to achieve consistent or improved performance. This includes evaluating and benchmarking the old and new models, optimizing prompts on the new model, and testing and deploying the new model in production. In this section, we present a four-step workflow and a solution architecture, as shown in the following architecture diagram.
The workflow includes the following steps:
1. Evaluate the source model and collect key performance metrics based on your business use case, such as response accuracy, response format correctness, latency, and cost, to set a performance baseline as the migration target.
2. Automatically update the structure, instructions, and language of your prompts to adapt them to the Amazon Nova model for accurate, relevant, and faithful outputs. We discuss this in more detail in the next section.
3. Evaluate the optimized prompts on the migrated Amazon Nova model against the performance target defined in Step 1. You can run the optimization in Step 2 iteratively until the optimized prompts meet your business criteria.
4. Conduct A/B testing to validate the Amazon Nova model’s performance in your testing and production environments. When you’re satisfied, deploy the Amazon Nova model, settings, and prompts to production.
This four-step workflow needs to run continuously to adapt to variations in both the model and the data, driven by changes in business use cases. This continuous adaptation provides ongoing optimization and helps maximize overall model performance.
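For the evaluation in Steps 1 and 3, you can script baseline metric collection against both the source model and the Amazon Nova model. The following is a minimal sketch using the Amazon Bedrock Converse API; the model IDs, dataset, and scoring function are placeholders for your own use case, not part of the original solution.

```python
import time
import boto3

bedrock = boto3.client("bedrock-runtime")

def evaluate_model(model_id, prompts, references, score_fn):
    """Run a prompt set against one model and collect score, latency, and token usage."""
    scores, latencies, input_tokens, output_tokens = [], [], [], []
    for prompt, reference in zip(prompts, references):
        start = time.time()
        response = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 512, "temperature": 0.0},
        )
        latencies.append(time.time() - start)
        answer = response["output"]["message"]["content"][0]["text"]
        scores.append(score_fn(answer, reference))  # score_fn is your own metric
        input_tokens.append(response["usage"]["inputTokens"])
        output_tokens.append(response["usage"]["outputTokens"])
    return {
        "avg_score": sum(scores) / len(scores),
        "avg_latency_s": sum(latencies) / len(latencies),
        "total_input_tokens": sum(input_tokens),
        "total_output_tokens": sum(output_tokens),
    }

# Example comparison; the model IDs are assumptions and depend on your Region and access.
# baseline = evaluate_model("anthropic.claude-3-haiku-20240307-v1:0", prompts, refs, score_fn)
# candidate = evaluate_model("us.amazon.nova-lite-v1:0", prompts, refs, score_fn)
```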
Data-aware prompt optimization on Amazon Nova
In this section, we present a comprehensive optimization methodology with two steps. The first step is to use Amazon Bedrock prompt optimization to refine your prompt structure; the second is to use an innovative data-aware prompt optimization approach to further improve Amazon Nova model performance.
Amazon Bedrock prompt optimization
Amazon Bedrock provides a prompt optimization feature that rewrites prompts to improve performance for your use cases. Prompt optimization streamlines the way that AWS developers interact with FMs on Amazon Bedrock, automatically adapts the prompts to the selected models, and generates optimized prompts for better performance.
As the first step, you can use prompt optimization to adapt your prompt to Amazon Nova. By analyzing the prompt you provide, the feature interprets the task, system prompt, and instructions within the prompt, and automatically rewrites the prompt with Amazon Nova-specific formatting and appropriate words, phrases, and sentences. The following example shows how prompt optimization converts a typical prompt for a summarization task on Anthropic’s Claude 3 Haiku into a well-structured prompt for an Amazon Nova model, with sections that begin with markdown tags such as ### Task, ### Summarization Instructions, and ### Document to Summarize.
Model | Prompt |
Anthropic’s Claude 3 Haiku | Human: Act like you are an intelligent AI assistant. You are required to provide a summarization based on given document. Please use below instructions when generating the response. The document is provided in Please be brief and concise in your answer. Do not add any information that is not mentioned in the document. Do not provide any preamble and directly start with the summarization. Do not make up the answer, If you don’t know the answer, just say that I don’t know. |
Amazon Nova Lite with Amazon Bedrock prompt optimization | ### Task Your task is to summarize the given document enclosed in – Read the document carefully to understand its main points and key information. – Identify the core ideas, arguments, and supporting details presented in the document. – Synthesize the essential information into a clear and succinct summary. – Use your own words to paraphrase the key points – do not copy verbatim from the original text. – Omit any extraneous or redundant information not central to the main ideas. – Do not introduce new information or make up content not present in the original document. – If you cannot summarize the document due to lack of understanding, simply respond “I don’t know.” ### Document to Summarize |
We applied the preceding prompts to the Anthropic Claude 3 Haiku and Amazon Nova Lite models, respectively, using the public XSum dataset. Because the summarization task doesn’t have predefined ground truth, we designed an LLM judge to validate the summarization quality.
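The following is an illustrative sketch of such an LLM judge implemented with the Amazon Bedrock Converse API; the rubric wording, 1–100 scale, and judge model ID are assumptions, not the exact prompt used in our experiments.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative judge rubric; the wording and scoring scale are assumptions.
JUDGE_PROMPT = """You are an impartial judge evaluating the quality of a summary.

Source document:
{document}

Candidate summary:
{summary}

Rate the summary from 1 to 100 for faithfulness to the document, coverage of the
key points, and conciseness. Respond with only the numeric score."""

def judge_summary(document, summary, judge_model_id="us.amazon.nova-pro-v1:0"):
    """Ask a judge model to score one summary; returns a float in [1, 100]."""
    response = bedrock.converse(
        modelId=judge_model_id,
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(document=document, summary=summary)}],
        }],
        inferenceConfig={"temperature": 0.0, "maxTokens": 10},
    )
    return float(response["output"]["message"]["content"][0]["text"].strip())
```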
The experiment, using 80 data samples, shows that accuracy improves from 77.75% on Anthropic’s Claude 3 Haiku to 83.25% on the Amazon Nova Lite model with prompt optimization.
Data-aware optimization
Although Amazon Bedrock prompt optimization supports the basic needs of prompt engineering, other prompt optimization techniques are available to maximize LLM performance, such as Multi-Aspect Critique, Self-Reflection, Gradient Descent and Beam Search, and Meta Prompting. Specifically, we observed that users need to fine-tune their prompts against optimization objective metrics they define, such as ROUGE, BERT-F1, or an LLM judge score, using a dataset they provide. To meet these needs, we designed a data-aware optimization architecture as shown in the following diagram.
The data-aware optimization takes two inputs. The first input is the user-defined optimization objective metric; for the summarization task discussed in the previous section, you can use the BERT-F1 score or create your own LLM judge. The second input is a training dataset (DevSet) provided by the user to validate the response quality, for example, a summarization data sample in the following format.
Source Document | Summarization |
Officers searched properties in the Waterfront Park and Colonsay View areas of the city on Wednesday. Detectives said three firearms, ammunition and a five-figure sum of money were recovered. A 26-year-old man who was arrested and charged appeared at Edinburgh Sheriff Court on Thursday. | A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh. |
The data-aware optimization uses these two inputs to improve the prompt for better Amazon Nova response quality. In this work, we use the DSPy (Declarative Self-improving Python) optimizer for the data-aware optimization. DSPy is a widely used framework for programming language models. It offers algorithms for optimizing prompts for multiple LLM tasks, from simple classifiers and summarizers to sophisticated RAG pipelines. The dspy.MIPROv2 optimizer intelligently explores better natural language instructions for every prompt using the DevSet, to maximize the metrics you define.
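As a concrete illustration, the DevSet rows can be wrapped as dspy.Example objects and the optimization objective expressed as a standard DSPy metric function. The following sketch assumes a BERT-F1 objective computed with the open source bert-score package; the field names "document" and "summary" are illustrative choices, not prescribed by the solution.

```python
import dspy
from bert_score import score as bert_score

# Wrap each DevSet row (source document plus reference summary) as a dspy.Example.
devset = [
    dspy.Example(
        document="Officers searched properties in the Waterfront Park and Colonsay View areas ...",
        summary="A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.",
    ).with_inputs("document"),
    # ... remaining DevSet samples
]

def bert_f1_metric(example, prediction, trace=None):
    """DSPy metric: BERT-F1 between the generated summary and the reference summary."""
    _, _, f1 = bert_score([prediction.summary], [example.summary], lang="en")
    return f1.item()
```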
We applied the MIPROv2 optimizer on top of the results optimized by Amazon Bedrock in the previous section for better Amazon Nova performance. In the optimizer, we specify the number of instruction candidates in the generation space, use Bayesian optimization to effectively search over the space, and run it iteratively to generate instructions and few-shot examples for the prompt in each step:
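The following is a minimal sketch of this setup; it assumes the metric and DevSet defined earlier, and the Bedrock model string and summarizer program are illustrative assumptions.

```python
import dspy
from dspy.teleprompt import MIPROv2

# Point DSPy at the Amazon Nova Lite model on Amazon Bedrock.
# The model identifier string is an assumption; adjust it for your account and Region.
dspy.configure(lm=dspy.LM("bedrock/us.amazon.nova-lite-v1:0"))

# A simple summarization program; the signature string is illustrative.
summarizer = dspy.Predict("document -> summary")

# MIPROv2 explores candidate instructions with Bayesian optimization,
# scoring each candidate with the metric defined over the DevSet.
optimizer = MIPROv2(
    metric=bert_f1_metric,
    auto=None,           # use the manually specified search space below
    num_candidates=5,    # size of the instruction candidate space
    init_temperature=1.0,
)
```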
With the setting of num_candidates=5, the optimizer generates five candidate instructions.
We set other parameters for the optimization iteration, including the number of trials, the number of few-shot examples, and the batch size for the optimization process:
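A sketch of the corresponding compile call is shown below; the parameter values are illustrative, and the trainset is the DevSet built earlier.

```python
# Run the optimization loop; each trial evaluates an instruction/example combination
# on a mini-batch of the DevSet before the best candidate is scored on the full set.
optimized_summarizer = optimizer.compile(
    summarizer,
    trainset=devset,
    num_trials=10,              # number of optimization iterations
    max_labeled_demos=2,        # few-shot examples taken from labeled samples
    max_bootstrapped_demos=2,   # few-shot examples bootstrapped by the model
    minibatch=True,
    minibatch_size=25,          # batch size used during each trial
)
```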
When the optimization starts, MIPROv2 evaluates each instruction candidate by running inference on the LLM with the mini-batch of the testing dataset we provided and calculating the metrics we defined. After the loop is complete, the optimizer evaluates the best instruction using the full testing dataset and calculates the full evaluation score. Based on the iterations, the optimizer provides the improved instruction for the prompt.
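You can retrieve the selected instruction and few-shot examples from the compiled program; a minimal sketch, assuming the optimized_summarizer object from the earlier example, is as follows.

```python
# Inspect the instruction and few-shot demos selected by the optimizer.
for name, predictor in optimized_summarizer.named_predictors():
    print(f"{name} instruction:\n{predictor.signature.instructions}\n")
    print(f"{name} few-shot demos: {len(predictor.demos)}")
```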
Applying the optimized prompt, the summarization accuracy generated by the LLM judge on the Amazon Nova Lite model further improves from 83.25% to 87.75%.
We also applied the optimization process to other LLM tasks, including a multi-class text classification task and a question-answering task using RAG. In all the tasks, our approach optimized the migrated Amazon Nova model to outperform the Anthropic Claude Haiku and Meta Llama models used before migration. The following table and chart illustrate the optimization results.
Task | DevSet | Evaluation | Before Migration | After Migration (Amazon Bedrock Prompt Optimization) | After Migration (DSPy with Amazon Bedrock Prompt Optimization) |
Summarization (Anthropic Claude 3 Haiku to Amazon Nova Lite) | 80 samples | LLM Judge | 77.75 | 83.25 | 87.75 |
Classification (Meta Llama 3.2 3B to Amazon Nova Micro) | 80 samples | Accuracy | 81.25 | 81.25 | 87.5 |
QA-RAG (Anthropic Claude 3 Haiku to Amazon Nova Lite) | 50 samples | Semantic Similarity | 52.71 | 51.6 | 57.15 |
For the text classification use case, we optimized the Amazon Nova Micro model using 80 samples, using the accuracy metric to evaluate the optimization performance in each step. After seven iterations, the optimized prompt provides 87.5% accuracy, improved from 81.25% on the Meta Llama 3.2 3B model.
For the question-answering use case, we used 50 samples to optimize the prompt for an Amazon Nova Lite model in the RAG pipeline, and evaluated the performance using a semantic similarity score. The score measures the cosine similarity between the model’s answer and the ground truth answer. Compared to the testing data running on Anthropic’s Claude 3 Haiku, the optimizer improved the score from 52.71 to 57.15 after migration to the Amazon Nova Lite model and prompt optimization.
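As an illustration of this kind of metric, the following sketch computes a sentence-embedding cosine similarity between the generated answer and the ground truth; the embedding model choice and field names are assumptions, not necessarily those used in our experiments.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence embedding model works here; all-MiniLM-L6-v2 is an illustrative choice.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(example, prediction, trace=None):
    """DSPy-style metric: cosine similarity between predicted and reference answers."""
    embeddings = embedder.encode([prediction.answer, example.answer], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```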
You can find more details of these examples in the GitHub repository.
Lessons learned and best practices
Through the solution design, we have identified best practices that can help you properly configure your prompt optimization to maximize the metrics you specify for your use case:
- Your dataset for the optimizer should be high quality, relevant, and well-balanced, covering the data patterns, edge cases, and nuances of your use case to minimize biases.
- The metrics you define as the optimization target should be use case specific. For example, if your dataset has ground truth, you can use statistical and programmatic machine learning (ML) metrics such as accuracy and semantic similarity. If your dataset doesn’t include ground truth, a well-designed and human-aligned LLM judge can provide a reliable evaluation score for the optimizer.
- The optimizer runs with a number of prompt candidates (parameter dspy.num_candidates) and uses the evaluation metric you defined to select the optimal prompt as the output. Avoid setting too few candidates, which might miss opportunities for improvement. In the previous summarization example, we set five prompt candidates for optimizing over 80 training samples and achieved good optimization performance.
- The prompt candidates include a combination of prompt instructions and few-shot examples. You can specify the number of examples (parameter dspy.max_labeled_demos for examples from labeled samples, and parameter dspy.max_bootstrapped_demos for examples from unlabeled samples); we recommend setting the example number to no less than 2.
- The optimization runs in iterations (parameter dspy.num_trials); set enough iterations to refine prompts across different scenarios and performance metrics and to gradually enhance clarity, relevance, and adaptability. If you optimize both the instructions and the few-shot examples in the prompt, we recommend setting the iteration number to no less than 2, preferably between 5–10.
In your use case, if your prompt structure is complex, with chain-of-thought or tree-of-thought reasoning, long instructions in the system prompt, and multiple inputs in the user prompt, you can use a task-specific class to abstract the DSPy optimizer. The class helps encapsulate the optimization logic, standardize the prompt structure and optimization parameters, and allow straightforward implementation of different optimization strategies. The following is an example of such a class for a text classification task:
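This is a minimal sketch assuming the metric and DevSet already exist; the signature fields, class name, and parameter defaults are illustrative rather than the exact implementation from the repository.

```python
import dspy
from dspy.teleprompt import MIPROv2

class ClassificationSignature(dspy.Signature):
    """Classify the input text into exactly one of the allowed categories."""
    text = dspy.InputField(desc="Text to classify")
    category = dspy.OutputField(desc="One of the allowed category labels")

class TextClassificationOptimizer:
    """Encapsulates the DSPy program, optimization parameters, and compile step."""

    def __init__(self, metric, num_candidates=5, num_trials=10):
        self.program = dspy.Predict(ClassificationSignature)
        self.optimizer = MIPROv2(metric=metric, auto=None, num_candidates=num_candidates)
        self.num_trials = num_trials

    def optimize(self, devset):
        # Returns a compiled program with the optimized instruction and few-shot examples.
        return self.optimizer.compile(
            self.program,
            trainset=devset,
            num_trials=self.num_trials,
            max_labeled_demos=2,
            max_bootstrapped_demos=2,
        )
```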
Conclusion
In this post, we introduced a workflow and architecture for migrating your current generative AI workload to Amazon Nova models, and presented a comprehensive prompt optimization approach combining Amazon Bedrock prompt optimization with a data-aware prompt optimization methodology built on DSPy. The results on three LLM tasks demonstrate that Amazon Nova models, within their intelligence classes, can match or exceed pre-migration performance when prompts are improved by Amazon Bedrock prompt optimization after the model migration, and that performance is further enhanced by the data-aware prompt optimization methodology presented in this post.
The Python library and code examples are publicly available on GitHub. You can use this LLM migration method and prompt optimization solution to migrate your workloads to Amazon Nova, or apply them to other model migration processes.
About the Authors
Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Anupam Dewan is a Senior Solutions Architect with a passion for generative AI and its applications in real life. He and his team enable Amazon builders who build customer-facing applications using generative AI. He lives in the Seattle area, and outside of work he loves to go hiking and enjoy nature.
Shuai Wang is a Senior Applied Scientist and Manager at Amazon Bedrock, specializing in natural language processing, machine learning, large language modeling, and other related AI areas. Outside work, he enjoys sports, particularly basketball, and family activities.
Kashif Imran is a seasoned engineering and product leader with deep expertise in AI/ML, cloud architecture, and large-scale distributed systems. Currently a Senior Manager at AWS, Kashif leads teams driving innovation in generative AI and Cloud, partnering with strategic cloud customers to transform their businesses. Kashif holds dual master’s degrees in Computer Science and Telecommunications, and specializes in translating complex technical capabilities into measurable business value for enterprises.