2026-02-02
6 min read
Why at-least-once delivery means your AI pipeline will process duplicates, and why idempotency is the only reliable fix.
If you run AI workloads through a task queue, duplicates are not a bug. They are a feature of the delivery guarantee you almost certainly chose. Understanding why they happen, and designing for them, is the difference between a pipeline that wastes money on redundant LLM calls and one that shrugs off failures gracefully.
Think of it like a busy kitchen. A ticket printer spits out orders, and line cooks pull them. If a cook drops a plate halfway through and nobody crosses the ticket off, the expeditor calls it again. The kitchen makes the dish twice. That's at-least-once delivery: better to send an extra plate to the pass than to lose an order entirely.
Most engineers want exactly-once delivery: every message processed one time, no more, no less. The problem is that exactly-once requires coordination between the broker and the worker that is extremely hard to achieve in practice. The broker needs to know the worker finished, and the worker needs to know the broker won't send the message again. In a distributed system where either side can crash, restart, or lose a network connection at any moment, that coordination breaks down.
What you actually get in most systems is one of two alternatives:
At-most-once. The broker hands a message out once and forgets it. If the worker crashes before finishing, the message is gone. No duplicates, but work can be silently lost.
At-least-once. The broker keeps redelivering until a worker acknowledges completion. Nothing is lost, but the same message can arrive more than once.
At-least-once is the default in SQS, Celery with acks_late=True, Google Cloud Tasks, and most durable workflow engines like Temporal. It's the pragmatic choice for pipelines where losing a task is worse than doing it twice.
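If you use Celery, that late acknowledgment is opt-in per task. A minimal sketch, assuming a Redis broker at a placeholder URL and a hypothetical embed_document task:

from celery import Celery

# Broker URL is a placeholder; any broker Celery supports will do.
app = Celery("pipeline", broker="redis://localhost:6379/0")

# acks_late=True defers the acknowledgment until the task body returns.
# If the worker dies mid-task, the message stays unacked and is redelivered,
# so this task has to tolerate running more than once.
@app.task(acks_late=True)
def embed_document(document_id: str) -> None:
    ...  # expensive embedding work goes here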
Duplicates are not some theoretical edge case. Three scenarios produce them regularly:
Worker crash mid-processing. A worker picks up a message, starts an expensive OCR extraction, and gets OOM-killed halfway through. The broker never received an acknowledgment, so it redelivers the message to another worker. Now two attempts exist: one partial (dead), one fresh.
Visibility timeout expiry. In SQS, when a worker receives a message, that message becomes invisible to other consumers for a configurable window (the visibility timeout). If your AI pipeline task takes longer than expected—say a large PDF triggers a slow embedding model—the timeout expires and SQS hands the message to a second worker while the first is still running.
Late acknowledgment. The worker finishes the job and sends an ack, but the network blips. The broker assumes the worker failed and redelivers. The work was already done, but the queue doesn't know that.
All three scenarios result in the same task being executed more than once.
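The visibility timeout in particular is something worker code can manage directly. A hedged sketch with boto3 against SQS, where the queue URL, the timing values, and the process function are placeholders: the worker extends the timeout while the slow step runs, and only deletes the message, which is the ack, once the work is done.

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ai-tasks"  # placeholder

def process(body: str) -> None:
    ...  # the slow AI pipeline step

resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    receipt = msg["ReceiptHandle"]

    # Push the visibility timeout out so SQS doesn't hand this message to a
    # second worker while the slow step is still running.
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL, ReceiptHandle=receipt, VisibilityTimeout=300
    )

    process(msg["Body"])

    # Deleting the message is the acknowledgment. Crash before this line and
    # SQS redelivers it -- the at-least-once guarantee doing its job.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt)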
Idempotency means that running an operation multiple times produces the same result as running it once. Back to the kitchen: if you've already plated the risotto and it's sitting under the heat lamp, you don't fire a second one just because the ticket printed again. You check the pass first. Same idea in code.
The most common pattern is an idempotency key—a unique identifier derived from the input that you check before doing expensive work:
def process_document(task: dict):
    idempotency_key = f"doc:{task['document_id']}:v{task['version']}"
    if store.exists(idempotency_key):
        return  # Already processed, skip

    chunks = split_and_embed(task["content"])  # Expensive LLM call
    vector_db.upsert(task["document_id"], chunks)
    store.set(idempotency_key, ttl=86400)  # Mark as done
The key is derived from the document ID and version so that a legitimate re-processing (new version) still runs, but a duplicate delivery of the same version gets caught. The TTL on the key keeps your idempotency store from growing forever.
This is not clever. It's a few lines of code. But it's the difference between paying for one embedding call and paying for three because a worker got slow under load.
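One caveat the snippet glosses over: exists-then-set is two operations, so two workers holding the same duplicate can both pass the check before either marks the key. If your idempotency store happens to be Redis, a single SET with NX closes that window. A sketch under that assumption, with split_and_embed and vector_db standing in for the same placeholders as above:

import redis

r = redis.Redis()  # assumes Redis as the idempotency store

def claim(idempotency_key: str, ttl: int = 86400) -> bool:
    # SET ... NX EX creates the key only if it doesn't already exist, in one
    # atomic operation, so only one worker can win the claim for a given key.
    return bool(r.set(idempotency_key, "claimed", nx=True, ex=ttl))

def process_document(task: dict) -> None:
    key = f"doc:{task['document_id']}:v{task['version']}"
    if not claim(key):
        return  # another delivery of this version got here first
    chunks = split_and_embed(task["content"])  # expensive LLM call
    vector_db.upsert(task["document_id"], chunks)

Claiming before the work runs is a trade-off: if the worker crashes after claiming, redeliveries are skipped until the TTL expires. One common mitigation is a short claim TTL plus a durable done marker written on success.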
When you combine at-least-once delivery with retry policies and a dead-letter queue, the message flow is: the broker delivers, the worker processes and acknowledges on success; on failure the message is redelivered and retried up to a configured limit; once the limit is hit, it moves to the dead-letter queue instead of going back into rotation.
The dead-letter queue (DLQ) is not where messages go to die. It's a throughput protection mechanism. Without it, a poison message—a malformed PDF, an input that consistently crashes your OCR service—cycles through the queue forever, consuming worker capacity and blocking healthy tasks behind it. The DLQ catches these after a configured number of retries and moves them aside so the rest of the pipeline keeps flowing. You review them later, fix the root cause, and replay if needed.
Every kitchen has an 86 board—the list of items that are cut from service because you're out of an ingredient or a piece of equipment is down. You don't keep trying to plate a dish you can't finish. You pull it and deal with it. A DLQ is your 86 board for tasks that can't be completed right now.
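Setting this up is typically a single queue attribute. For SQS, the redrive policy names the DLQ and the retry budget; a minimal boto3 sketch with placeholder queue URL and ARN:

import json
import boto3

sqs = boto3.client("sqs")

MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ai-tasks"  # placeholder
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:ai-tasks-dlq"  # placeholder

# After five failed receives, SQS moves the message to the DLQ instead of
# cycling it back into the main queue indefinitely.
sqs.set_queue_attributes(
    QueueUrl=MAIN_QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": DLQ_ARN, "maxReceiveCount": "5"}
        )
    },
)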
In a traditional web backend, retrying a database insert is cheap. In an AI pipeline, retries hit your wallet directly.
Consider a document ingestion pipeline: receive a PDF, OCR it, chunk it, generate embeddings, then extract structured fields with an LLM. If extraction fails and the task retries from the top, you're paying for OCR and embeddings again, and at scale those redundant calls add up to real money.
Idempotency keys at each stage let you skip work that already completed successfully. The OCR result is cached. The embeddings are already in the vector store. The retry only re-executes the extraction step that actually failed. This is sometimes called partial progress recovery, and it turns an expensive full retry into a cheap targeted one.
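A sketch of that stage-level pattern, reusing the article's abstract store and vector_db, with ocr, embed, and extract_fields as hypothetical stage functions and a store.get assumed alongside exists and set:

def run_stage(key: str, compute, ttl: int = 86400):
    # Return the cached result if this stage already completed on an earlier
    # attempt; otherwise do the work and cache it under the stage's key.
    cached = store.get(key)
    if cached is not None:
        return cached
    result = compute()
    store.set(key, result, ttl=ttl)
    return result

def ingest_document(task: dict):
    doc = f"doc:{task['document_id']}:v{task['version']}"

    # Each stage gets its own idempotency key, so a retry after a late failure
    # skips straight past the stages that already succeeded.
    text = run_stage(f"{doc}:ocr", lambda: ocr(task["content"]))
    chunks = run_stage(f"{doc}:embed", lambda: embed(text))
    fields = run_stage(f"{doc}:extract", lambda: extract_fields(text))

    vector_db.upsert(task["document_id"], chunks)
    return fields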