Sigmoda Blog
Cutting LLM costs without hurting quality
LLM spend rarely climbs smoothly. It jumps. Someone ships a new route, adds a little more context, retries a bit more aggressively… and suddenly your cost chart looks like a staircase.
The fix isn’t “be more careful with prompts.” Treat it like performance work: measure where the cost is created, pick the biggest offender, change one thing, and verify you didn’t quietly break quality.
Start with a route‑level cost view
Provider dashboards are good for invoices, not day‑to‑day decisions. You want cost segmented by the surface area you actually ship: route, environment, model, and (optionally) customer tier.
- Cost per route per day (and a rolling 7-day).
- Model mix per route (to catch “temporary” upgrades).
- Tokens-in vs tokens-out distributions (waste lives in tokens-in).
- p50/p95 latency per route (cost and latency tend to move together).
Even a rough estimate is fine. Directionally correct beats perfect. What matters is that you can answer: which route is expensive, which model is doing it, and whether the bill is coming from tokens‑in, tokens‑out, or retries.
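If each call already gets logged with its route, model, and token counts, the roll-up can be a few lines of code rather than a BI project. A minimal sketch in TypeScript; the log shape, model names, and prices are all illustrative, so substitute your own:

// Sketch: roll per-call logs up into cost per route per day (field names and prices are illustrative).
type CallLog = {
  route: string;      // e.g. "support.reply"
  model: string;
  tokensIn: number;
  tokensOut: number;
  timestamp: string;  // ISO date-time
};

// Assumed per-1k-token prices; plug in your provider's real numbers.
const PRICE_PER_1K: Record<string, { inTok: number; outTok: number }> = {
  "small-model": { inTok: 0.0005, outTok: 0.0015 },
  "large-model": { inTok: 0.005, outTok: 0.015 },
};

function costPerRoutePerDay(logs: CallLog[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    const price = PRICE_PER_1K[log.model];
    if (!price) continue; // unknown model: skip rather than guess
    const cost = (log.tokensIn / 1000) * price.inTok + (log.tokensOut / 1000) * price.outTok;
    const key = `${log.route} ${log.timestamp.slice(0, 10)}`; // route + calendar day
    totals.set(key, (totals.get(key) ?? 0) + cost);
  }
  return totals;
}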
Pull the levers that actually move the bill
Most teams start by tweaking prompts. That’s usually the slowest way to save money. The big wins show up in three buckets: call volume, tokens, and model choice. Attack them in that order.
Call volume: fewer calls, fewer surprises
The cheapest call is the one you don’t make. Look for routes that fire automatically (page load, background jobs) and confirm they’re genuinely needed.
- Debounce: don’t call the model on every keystroke; call on submit.
- Batch: if one screen triggers 3 calls, see if it can be 1 call.
- De‑dupe retries: cap them and log when you retry; retries silently multiply spend (a retry-cap sketch follows this list).
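Retry caps are the easiest of the three to enforce in code. A minimal sketch of the cap-and-log idea; callModel here is a placeholder for however you invoke the provider:

// Sketch: cap retries per call and log each one, so retry spend shows up instead of multiplying silently.
async function callWithRetryCap<T>(
  route: string,
  callModel: () => Promise<T>,   // placeholder for your provider call
  maxRetries = 2
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await callModel();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;   // retry budget spent, give up
      console.warn(`retrying route=${route}, retry ${attempt + 1} of ${maxRetries}`);
    }
  }
  throw lastError;
}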
Tokens: spend less on context you don’t use
Tokens‑in is where waste hides: retrieval dumps, repeated boilerplate, entire documents shoved into context when you only needed one paragraph.
- Put a hard budget on retrieval (for example: cap added context at 1–2k tokens).
- Summarize long threads before adding them to the prompt.
- Stop re‑sending boilerplate. Version your system prompt and keep it short.
Fast win
If tokens‑in p95 is 6,000+, you’re almost certainly paying for context you don’t use. Add a context budget, log what gets dropped, and tighten from there.
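A minimal sketch of that context budget, assuming retrieved chunks arrive ranked best-first and using a crude characters-per-token estimate (swap in a real tokenizer if you have one):

// Sketch: hard-cap retrieved context at a token budget and log whatever gets dropped.
const roughTokens = (text: string) => Math.ceil(text.length / 4); // crude estimate, ~4 chars per token

function applyContextBudget(chunks: string[], budgetTokens = 1500): string[] {
  const kept: string[] = [];
  let dropped = 0;
  let used = 0;
  for (const chunk of chunks) {            // chunks assumed ranked best-first
    const size = roughTokens(chunk);
    if (used + size <= budgetTokens) {
      kept.push(chunk);
      used += size;
    } else {
      dropped++;
    }
  }
  if (dropped > 0) {
    // Logging what you drop tells you whether the budget is too tight or the retrieval too greedy.
    console.info(`context budget: kept ${kept.length} chunks (~${used} tokens), dropped ${dropped}`);
  }
  return kept;
}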
Model choice: downshift without gambling quality
Downshifting works when you’re explicit about quality boundaries. Don’t “just switch models.” Gate the downshift by route and add simple checks: watch for unusually long outputs, an elevated flagged rate, and user complaints/labels.
A safe pattern is “small model first, bigger model on escalation.” Escalation triggers can be simple: the small model produced a flagged output, the request is from a paid tier, or the route is high‑stakes (billing, compliance).
// Pseudocode: per-route model choice with explicit upgrade triggers
if (route === "support.reply") {
  model = "small-model";
  if (isVipUser || needsLongReasoning) model = "large-model";
}
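The escalation variant of the same idea: always try the small model, and pay for the large one only when the first answer trips a check. A minimal sketch; callModel and isFlagged are placeholders for your own client and quality check:

// Sketch: small model first, escalate to the large model only when the output is flagged.
// Placeholders: wire these to your own client and quality check.
declare function callModel(model: string, prompt: string): Promise<string>;
declare function isFlagged(output: string): boolean;

async function replyWithEscalation(prompt: string): Promise<string> {
  const draft = await callModel("small-model", prompt);
  if (!isFlagged(draft)) return draft;        // good enough, stop here
  console.info("escalating to large-model");  // escalations should show up in logs and cost reports
  return callModel("large-model", prompt);
}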
Cache repeats (carefully)
Caching is powerful, but only when requests are repeatable. Start with obvious repeats: docs Q&A, template generation, analysis of static text, or summarizing the same artifact over and over.
- Canonicalize inputs (trim whitespace, normalize JSON, stable ordering).
- Key on: route + model + prompt signature + relevant params (a key-building sketch follows this list).
- Set a TTL that matches reality (minutes for fast-changing data, days for static).
- Never cache user-private answers across users unless you include user scope in the key.
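A minimal sketch of key construction along those lines, using Node’s crypto for the prompt signature; the shape of the parts object is illustrative:

// Sketch: cache key = route + model + signature over canonicalized prompt and params (+ user scope when needed).
import { createHash } from "crypto";

type CacheKeyParts = {
  route: string;
  model: string;
  prompt: string;
  params?: Record<string, unknown>;
  userId?: string; // set this for user-private answers so they never leak across users
};

function cacheKey(parts: CacheKeyParts): string {
  const prompt = parts.prompt.trim().replace(/\s+/g, " ");   // canonicalize: trim + collapse whitespace
  const paramKeys = Object.keys(parts.params ?? {}).sort();  // stable top-level key order
  const params = JSON.stringify(parts.params ?? {}, paramKeys);
  const signature = createHash("sha256")
    .update([parts.route, parts.model, prompt, params, parts.userId ?? ""].join("|"))
    .digest("hex");
  return `${parts.route}:${parts.model}:${signature}`;
}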
Don’t let savings quietly wreck quality
Cost cuts can quietly degrade product quality. You want a feedback loop that surfaces problems quickly without needing a full eval team.
- Add a lightweight label flow (good/bad + note) for outputs on the high-cost routes.
- Review the top 10 most expensive prompts weekly (usually a handful dominate).
- Watch “flagged rate” and “p95 tokens_out” as leading indicators of regressions (a small sketch of both follows this list).
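Both indicators fall out of the same per-call logs used for the cost view. A minimal sketch of computing them for one route; the flagged field is whatever your label flow writes:

// Sketch: flagged rate and p95 tokens-out for a batch of calls on one route (fields are illustrative).
type CallOutcome = { flagged: boolean; tokensOut: number };

function routeHealth(calls: CallOutcome[]): { flaggedRate: number; p95TokensOut: number } {
  const flagged = calls.filter((c) => c.flagged).length;
  const sorted = calls.map((c) => c.tokensOut).sort((a, b) => a - b);
  const p95Index = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95));
  return {
    flaggedRate: calls.length > 0 ? flagged / calls.length : 0,
    p95TokensOut: sorted[p95Index] ?? 0,
  };
}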
Do this consistently and you’ll usually find meaningful savings without dramatic product changes. Most cost comes from waste: too much context, too many calls, and “temporary” model upgrades that became permanent.