2026-03-05
8 min read
You write your agent's instructions, test them, tweak a word, test again, and hope the change helped. There's an algorithm that does this better than you do — evolutionary optimization finds prompts you'd never write yourself.
You write a prompt. You test it. It's okay but not great — the agent misses edge cases, ignores context, or produces output that's technically correct but tonally wrong. So you tweak a sentence. Test again. Better on one example, worse on another. You rewrite the whole thing. Test again. Repeat until you're tired or the deadline hits, whichever comes first.
This is how most people build AI agents. It's also how most people season food before they learn to taste as they go — a pinch of this, a dash of that, hope for the best. It works, kind of. But it doesn't scale, it doesn't generalize, and it leaves performance on the table that you'll never find through manual iteration.
What if an algorithm could do this loop for you — and do it better?
Manual prompt engineering has a ceiling. Humans are good at writing instructions that sound clear to other humans, but LLMs don't process text the way we do. The prompt that reads best to you isn't necessarily the prompt that produces the best output from the model.
There are a few specific failure modes: a tweak that helps one example quietly hurts another; you only ever test a handful of the variants you could write; and you judge prompts by how clearly they read to you, not by how the model actually responds to them.
These aren't character flaws. They're the natural limitations of a human trying to search a vast space of possible text through trial and error. The space of possible instructions for even a simple agent is effectively infinite, and your ability to explore it is limited by time, patience, and cognitive bias.
Evolutionary prompt optimization borrows from genetic algorithms. Instead of you guessing at better prompts, an algorithm systematically explores the space:

1. Run the agent on a set of training examples with the current prompt.
2. Score each output with a critic.
3. Mutate the prompt: a reflection model reads the outputs, scores, and feedback, then proposes a revision.
4. Keep the revision if it scores better; otherwise keep the current best.

Repeat until the scores plateau or you hit an iteration limit.
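The loop itself is simple enough to sketch in a few lines. This is an illustration of the general evolutionary pattern, not gepa-adk's actual internals; the function names and the toy "prompt" (a plain number) are stand-ins:

```python
def evolve_loop(prompt, trainset, run_agent, score, reflect,
                max_iterations=5, patience=2):
    """A generic evolutionary loop: a sketch, not gepa-adk's internals."""
    best_prompt, best_score, stale = prompt, float("-inf"), 0
    for _ in range(max_iterations):
        # 1. Run the agent on every training example with the current prompt
        outputs = [run_agent(prompt, example) for example in trainset]
        # 2. Score each output with the critic
        scores = [score(output) for output in outputs]
        average = sum(scores) / len(scores)
        # Keep the candidate prompt only if it beats the current best
        if average > best_score:
            best_prompt, best_score, stale = prompt, average, 0
        else:
            stale += 1
            if stale >= patience:
                break  # scores plateaued
        # 3. Informed mutation: "reflection" proposes a revised prompt
        prompt = reflect(prompt, outputs, scores)
    return best_prompt, best_score

# Toy run: the "prompt" is just a number, and "reflection" nudges it upward.
result = evolve_loop(
    prompt=0,
    trainset=[1, 2, 3],
    run_agent=lambda p, example: p,        # output is the prompt value itself
    score=lambda output: min(output / 10, 1.0),
    reflect=lambda p, outputs, scores: p + 2,
)
```

In the real library, `run_agent` is an LLM call, `score` is the critic, and `reflect` is another LLM call; the control flow is the same shape.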
The key insight is step 3. The mutation isn't random — it's informed. A reflection model sees the agent's actual outputs, the scores, and the feedback, then proposes specific text changes to address the weaknesses it observed. It's less "random mutation" and more "systematic recipe refinement" — tasting every dish that comes out of the kitchen, identifying exactly what's off, and adjusting the technique accordingly.
This is the approach described in the GEPA paper (Genetic-Pareto prompt optimizer). GEPA treats prompt components as evolvable genes and uses LLM-powered reflection instead of random crossover. The result is an optimizer that can improve agent instructions in ways that surprise even the person who wrote the original prompt.
gepa-adk is a Python library that brings evolutionary optimization to Google ADK agents. You give it an agent, training examples, and a critic — it gives you back a better prompt.
Here's the simplest possible example. Start with a greeting agent that has a generic instruction:
```python
from google.adk.agents import LlmAgent
from gepa_adk import evolve_sync, EvolutionConfig, SimpleCriticOutput

# The agent to evolve — starts with a vague instruction
agent = LlmAgent(
    name="greeter",
    model="gemini-2.5-flash",
    instruction="Greet the user appropriately.",
)

# A critic that knows what "good" looks like
critic = LlmAgent(
    name="critic",
    model="gemini-2.5-flash",
    instruction="Score for formal, Dickens-style greetings. 0.0-1.0.",
    output_schema=SimpleCriticOutput,
)

# Training examples covering different social contexts
trainset = [
    {"input": "I am His Majesty, the King."},
    {"input": "I am your mother."},
    {"input": "I am a close friend."},
]

result = evolve_sync(agent, trainset, critic=critic)
print(f"Score: {result.original_score:.2f} -> {result.final_score:.2f}")
print(result.evolved_components["instruction"])
```
The original instruction — "Greet the user appropriately" — is five words. The evolved instruction might be three paragraphs of specific guidance about formality levels, period-appropriate language, honorific handling, and tonal variation. You would never write that prompt yourself. Not because you couldn't, but because you wouldn't think to be that specific in those particular ways.
The critic is what separates evolutionary optimization from "just running the prompt a bunch of times." Without a critic, you're scoring outputs manually or relying on the agent to self-assess (which is like asking the cook to rate their own soup — helpful but biased).
A critic agent is a separate LLM that evaluates outputs against explicit criteria. It returns a score and feedback:
```python
from pydantic import BaseModel, Field

class SimpleCriticOutput(BaseModel):
    score: float = Field(ge=0.0, le=1.0)  # 0.0 = worst, 1.0 = best
    feedback: str  # a specific diagnosis the reflection model can act on
```
The feedback is the ingredient that makes reflection work. The reflection model doesn't just see "this scored 0.4" — it sees "this scored 0.4 because the greeting was too casual for royalty and used modern slang." That specific diagnosis drives specific mutations.
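To make that concrete, here is a sketch of how scored outputs and critic feedback might be assembled into a prompt for the reflection model. The function name and template are illustrative, not gepa-adk's actual internals:

```python
def build_reflection_prompt(instruction, results):
    """Assemble outputs, scores, and critic feedback into a reflection prompt.

    results: list of (input, output, score, feedback) tuples.
    """
    lines = [f"Current instruction: {instruction}", "", "Observed results:"]
    for inp, out, score, feedback in results:
        lines.append(f"- input={inp!r} output={out!r} score={score:.2f}")
        lines.append(f"  critic: {feedback}")
    lines.append("")
    lines.append("Propose a revised instruction that addresses these weaknesses.")
    return "\n".join(lines)

reflection_prompt = build_reflection_prompt(
    "Greet the user appropriately.",
    [("I am His Majesty, the King.", "hey dude", 0.2,
      "Far too casual for royalty; uses modern slang.")],
)
```

The reflection model sees the failing input, the failing output, and the diagnosis side by side, which is exactly the context it needs to propose a targeted fix rather than a random rewrite.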
You can also use multi-dimensional scoring — rate outputs on clarity, accuracy, tone, and format independently. GEPA tracks a Pareto frontier across dimensions, so you don't have to collapse everything into a single number and lose information about what's actually improving.
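What "tracking a Pareto frontier" means in practice: a candidate is kept unless some other candidate beats it on every dimension at once. A toy sketch, with hypothetical dimension names:

```python
def dominates(a, b):
    """True if a is at least as good on every dimension and better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def pareto_frontier(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c)
                       for other in candidates if other is not c)]

# Hypothetical per-dimension critic scores for three candidate prompts
candidates = [
    {"clarity": 0.9, "tone": 0.4},   # best clarity
    {"clarity": 0.6, "tone": 0.8},   # best tone
    {"clarity": 0.5, "tone": 0.3},   # worse on both dimensions
]
front = pareto_frontier(candidates)  # first two survive; third is dominated
```

Averaging the dimensions would rank the first two candidates nearly equal and hide the fact that they are good at different things; the frontier keeps both alive as starting points for further mutation.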
What actually happens inside `evolve()`:
Iteration 1: The agent runs on all training examples with its original instruction. The critic scores each output. Average score: 0.35. The reflection model reads every output, every score, every piece of feedback. It proposes a new instruction that addresses the most common failure patterns.
Iteration 2: The agent runs again with the proposed instruction. Scores improve to 0.62. The reflection model sees what got better and what's still weak. It proposes another refinement.
Iteration 3: Scores hit 0.78. The reflection model notices diminishing returns and makes a smaller, more targeted adjustment.
Iteration 4: Scores reach 0.81. No improvement over the previous best after the patience window. Evolution stops.
The EvolutionConfig controls this process:
```python
config = EvolutionConfig(
    max_iterations=5,   # Upper bound on iterations
    patience=2,         # Stop if no improvement for N iterations
    reflection_model="gemini-2.5-flash",  # Model for generating mutations
)
```
Each iteration makes multiple LLM calls — the agent runs on every training example, the critic scores every output, and the reflection model analyzes everything. This is why gepa-adk recommends local models via Ollama for development. Evolution is compute-hungry by nature, and local inference keeps the iteration loop fast and free.
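Pointing the agent at a local Ollama model through ADK's LiteLLM wrapper might look like the sketch below. The wrapper import and the `ollama_chat/llama3.2` model name are assumptions; check your ADK version and use whatever model you have pulled locally:

```python
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm  # assumes ADK's LiteLLM integration

# Sketch: run the evolving agent against a local Ollama model.
local_agent = LlmAgent(
    name="greeter",
    model=LiteLlm(model="ollama_chat/llama3.2"),
    instruction="Greet the user appropriately.",
)
```

The same wrapper can back the critic and the reflection model, so the entire loop runs without a single API charge.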
Instructions are the most common target, but they're not the only evolvable component. The most interesting case is the multi-agent pipeline, where gepa-adk evolves the instructions of several agents together. In a pipeline, the instructions of one agent affect the inputs to the next. Evolving them together means the optimizer can find coordination patterns that no amount of individual prompt tuning would discover — like a kitchen brigade where the saucier and the grill cook learn to time their dishes together instead of each optimizing in isolation.
Evolutionary optimization isn't for every prompt. It shines when:

- The prompt will be reused enough to repay the optimization cost.
- Quality is measurable, by a critic or a metric, rather than purely a matter of taste.
- The task is stable enough that a training set stays representative.

It's less useful for one-off prompts, creative tasks without clear quality metrics, or situations where the "right answer" changes too frequently for a training set to be meaningful.
```
pip install gepa-adk
```