Retrieval-Augmented Generation (RAG) is the most common pattern for knowledge-grounded AI agents, but most teams track only the inference costs and miss the embedding side entirely.
The Two Cost Centers
Embedding Costs (Ingestion)
Every document you add to your knowledge base must be embedded. OpenAI's text-embedding-3-small costs $0.02 per million tokens. That sounds cheap until you're processing 100,000 documents averaging 2,000 tokens each: 200 million tokens, or $4.00 just for initial ingestion. Re-index after every update and that cost multiplies.
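The arithmetic above is easy to fold into a helper. A minimal sketch, using the published text-embedding-3-small price; the corpus size and average document length are the article's illustrative numbers:

```python
# Sketch: estimate the one-time cost of embedding a corpus.
EMBED_PRICE_PER_M_TOKENS = 0.02  # text-embedding-3-small, USD per 1M tokens

def embedding_cost(num_docs: int, avg_tokens_per_doc: int,
                   price_per_m: float = EMBED_PRICE_PER_M_TOKENS) -> float:
    """USD cost to embed the whole corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens * price_per_m / 1_000_000

print(embedding_cost(100_000, 2_000))  # the 100K-doc example above -> 4.0
```

The same function prices a re-index pass: just pass in the number of changed documents instead of the full corpus.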
Inference Costs (Query Time)
Each user query triggers three billable steps: (1) embed the query (roughly $0.000002 for a short query), (2) retrieve the most relevant chunks from the vector store, and (3) send the chunks plus the query to the LLM ($0.02-0.10 per query, depending on context size and model).
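Steps (1) and (3) can be combined into a per-query cost model. A sketch under assumed prices: the embedding rate is the published text-embedding-3-small price, while the LLM input/output rates and the token counts are hypothetical placeholders:

```python
# Sketch: per-query cost model for a RAG query (prices/sizes are assumptions).
EMBED_PRICE = 0.02 / 1_000_000   # USD per token, text-embedding-3-small
IN_PRICE = 3.00 / 1_000_000      # assumed LLM input price per token
OUT_PRICE = 15.00 / 1_000_000    # assumed LLM output price per token

def query_cost(query_tokens: int, context_tokens: int,
               response_tokens: int) -> float:
    embed = query_tokens * EMBED_PRICE                       # step 1
    llm_in = (context_tokens + query_tokens) * IN_PRICE      # step 3, input side
    llm_out = response_tokens * OUT_PRICE                    # step 3, output side
    return embed + llm_in + llm_out

# e.g. a 100-token query, 4,000 tokens of retrieved chunks, a 500-token answer
cost = query_cost(100, 4_000, 500)  # ~ $0.0198, inside the $0.02-0.10 range
```

Note how the embedding term is three to four orders of magnitude smaller than the LLM term, which is the whole point of the next section.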
Tracking Both in AgentBurn
```python
# Track embedding costs during ingestion
ingest_event(
    agent_id="rag-indexer",
    provider="openai",
    model="text-embedding-3-small",
    operation="embedding",
    input_tokens=token_count,
    cost_usd=token_count * 0.02 / 1_000_000,
)
```
```python
# Track inference costs at query time
ingest_event(
    agent_id="rag-query-agent",
    provider="anthropic",
    model="claude-sonnet-4-20250514",
    operation="llm_call",
    input_tokens=context_tokens + query_tokens,
    output_tokens=response_tokens,
    cost_usd=calculated_cost,
)
```
Illustrative Cost Breakdown
For a hypothetical RAG system processing 50K documents and serving 1,000 queries/day, based on published API pricing:
- Initial embedding: A few dollars (one-time) — embeddings are cheap per call
- Daily re-indexing (10% updates): Pennies per day
- Daily inference: The dominant cost — each query sends retrieved chunks to an LLM, and this scales linearly with query volume
- Key takeaway: Inference almost always dwarfs embedding costs at query volumes above a few hundred per day
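To make the takeaway concrete, here is a back-of-the-envelope comparison for the hypothetical system above (50K documents, 10% re-indexed daily, 1,000 queries/day). The embedding price is the published one; the LLM prices and per-query token counts are assumptions:

```python
# Sketch: daily embedding vs inference cost for the hypothetical system.
EMBED_PRICE = 0.02 / 1_000_000   # USD/token, text-embedding-3-small
IN_PRICE = 3.00 / 1_000_000      # assumed LLM input price per token
OUT_PRICE = 15.00 / 1_000_000    # assumed LLM output price per token

docs, avg_doc_tokens = 50_000, 2_000
queries_per_day = 1_000

# Daily re-indexing: 10% of the corpus re-embedded each day.
reindex_cost = docs * 0.10 * avg_doc_tokens * EMBED_PRICE

# Daily inference: assumed 4,100 input tokens and 500 output tokens per query.
per_query = 4_100 * IN_PRICE + 500 * OUT_PRICE
inference_cost = queries_per_day * per_query

print(f"re-index: ${reindex_cost:.2f}/day, inference: ${inference_cost:.2f}/day")
```

With these assumptions, re-indexing runs about $0.20/day while inference runs about $19.80/day, roughly a 100x gap, which matches the takeaway above.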
Optimization Strategies
- Reduce context window — Send 3 relevant chunks instead of 10. Each chunk you remove saves input tokens on every single query
- Use cheaper models for simple questions — Route factual lookups to Haiku/Flash, complex analysis to Sonnet
- Cache frequent queries — Many RAG systems see 30% query repetition
- Incremental indexing — Only re-embed changed documents, not the full corpus
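The caching strategy is the quickest win to prototype. A minimal in-memory sketch; a production system would likely use Redis with TTLs or semantic-similarity matching, and `expensive_rag_answer` is a hypothetical stand-in for the retrieve-plus-LLM pipeline:

```python
# Sketch: naive exact-match cache for repeated RAG queries.
from functools import lru_cache

@lru_cache(maxsize=10_000)
def expensive_rag_answer(normalized_query: str) -> str:
    # Placeholder for the real retrieve + LLM call (the costly part).
    return f"answer to: {normalized_query}"

def ask(query: str) -> str:
    # Normalizing (trim + lowercase) lifts the hit rate on near-duplicates.
    return expensive_rag_answer(query.strip().lower())

ask("What is our refund policy?")
ask("  what is our refund policy?")  # cache hit: same normalized key
```

At the 30% repetition rate mentioned above, even this exact-match cache removes roughly a third of inference spend for the repeated queries it catches.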
AgentBurn's provider breakdown immediately shows the embedding vs inference split, making it clear where optimization effort should focus.