Does prompt caching actually reduce token usage?

It reduces the cost of tokens that repeat across requests — a stable prefix (system prompt, tool definitions) gets cached and billed at a discount on subsequent calls, rather than reducing the raw token count itself.

Does a shorter system prompt really matter at scale?

Yes, because it's sent on every single request. A system prompt that's 200 tokens longer than it needs to be costs that much extra on every call, all day, across every user — it compounds in a way a single verbose user message doesn't.

How does MCP tool bloat waste tokens?

Every connected MCP server registers its tools' names, descriptions, and parameter schemas into the model's context before any tool is actually called. A server with dozens of verbose tool descriptions can cost real tokens sitting idle.

Do few-shot examples cost more than they're worth?

Often, past two or three. Each example is real tokens on every call; past the point where the model has clearly grasped the pattern, additional examples add cost without adding accuracy.

Is limiting max_tokens a real token-saving strategy?

It caps runaway output, which controls cost, but it doesn't reduce input tokens. Combine it with a system prompt that explicitly asks for concise output if verbosity is the actual problem.

Prompt Engineering to Reduce Token Usage (OpenAI API, MCP, and Beyond)

Prompt caching, tighter system prompts, leaner MCP tool schemas, and fewer few-shot examples — the levers that actually cut token spend at the API level.

Every token-saving conversation eventually lands on the same handful of levers, whichever provider or protocol you’re building on. None of them are exotic — the reason they’re worth writing down is that they’re easy to skip when you’re focused on getting the feature working at all, and by the time it’s in production, the waste is baked into every single call.

1. Use prompt caching for anything that repeats

If your API supports prompt caching, a stable prefix — system prompt, tool definitions, a long reference document you send on every call — gets cached and billed at a discount on subsequent requests instead of full price every time. This is close to free savings if your requests share a consistent prefix; it costs nothing except structuring your prompt so the stable part comes first and the variable part (the actual user question) comes last. OpenAI’s own prompt caching docs put the input-token savings at up to 90% for cached prefixes — worth reading if you’re deciding where to draw the stable/variable line.

2. Trim the system prompt like it’s on the clock

A system prompt is sent on every single request, which means every extra sentence in it is a recurring cost multiplied by call volume, not a one-time cost like a long user message. Cut instructions the model doesn’t need for this specific integration, collapse redundant phrasing, and move anything reference-like (a full style guide, a long list of edge cases) into a document you load conditionally rather than always.

Before: "You are a helpful and knowledgeable assistant who always
tries their best to provide accurate, thorough, and detailed
responses to help the user with whatever they need..."

After: "Answer concisely. Cite sources for factual claims."

3. Cap few-shot examples at what the pattern actually needs

Few-shot examples are genuinely useful for steering format and tone, but each one is real tokens on every call. Two or three well-chosen examples usually teach the pattern; a fifth or sixth example rarely adds accuracy; it just adds cost, forever, on every request.

4. Keep MCP tool schemas lean

Every MCP server you connect registers its tools’ names, descriptions, and parameter schemas into context before the model calls any of them. A server exposing twenty tools with verbose, paragraph-length descriptions is spending real tokens on capability the model might use once per session, if at all.

Two fixes: connect only the servers a given integration actually needs, and if you’re the one writing the MCP server, keep tool descriptions to what’s needed for the model to choose correctly — not full documentation. The MCP specification doesn’t mandate verbose descriptions; that’s a convention, not a requirement, and it’s the first thing worth trimming.

5. Constrain output with structured formats instead of hoping for brevity

Asking for JSON with a defined schema (or using your provider’s structured-output mode) tends to produce tighter, more predictable responses than an open-ended prompt hoping the model stays concise. It also avoids the retry cost of parsing a response that came back in the wrong shape.

6. Don’t repeat context the model already has

If you’re managing a multi-turn conversation yourself (rather than using the provider’s native chat state), watch for accidentally re-sending documents or context blocks that were already established earlier in the conversation. This is an easy one to introduce by accident when building a custom orchestration layer.

7. For vision-input use cases specifically, check whether text would do

Image inputs are convenient, but if the information in the image is something you could extract as text — a UI element’s styles and selector, a table’s actual values, a form’s field names — sending the text directly is usually both cheaper and more reliably parsed than asking the model to read it out of an image first.

This is the exact case UICuts was built around: instead of screenshotting a webpage element and letting a model infer the selector and computed styles from pixels, it captures that data directly and hands it over as structured text — skipping the image-interpretation step and the retry that often follows a wrong guess.

Key lessons learned

Caching rewards a stable, front-loaded prompt structure — put what repeats first, what varies last.
System prompts and tool schemas are recurring costs multiplied by call volume; user messages are one-time. Optimize the recurring ones first.
If the information is already text (or could be), sending it as text usually beats making the model extract it from an image.

These are the API-level version of tactics that show up throughout the rest of this blog with different names: Claude Code applies the same discipline through CLAUDE.md and session hygiene, Claude Skills apply it to conditionally-loaded instructions, and GitHub Copilot and OpenCode apply it at the editor/CLI level instead of the raw API. If you’re driving an agent through OpenClaw, the MCP-schema tactics above apply just as directly there.

If frontend UI feedback is anywhere in your pipeline, try UICuts free — it turns a screenshot’s worth of guessing into a paste-ready block of exact context.

Prompt Engineering to Reduce Token Usage (OpenAI API, MCP, and Beyond)

1. Use prompt caching for anything that repeats

2. Trim the system prompt like it’s on the clock

3. Cap few-shot examples at what the pattern actually needs

4. Keep MCP tool schemas lean

5. Constrain output with structured formats instead of hoping for brevity

6. Don’t repeat context the model already has

7. For vision-input use cases specifically, check whether text would do

Key lessons learned

Frequently asked

Keep reading

Less guessing.
Faster fixes!

Prompt Engineering to Reduce Token Usage (OpenAI API, MCP, and Beyond)

1. Use prompt caching for anything that repeats

2. Trim the system prompt like it’s on the clock

3. Cap few-shot examples at what the pattern actually needs

4. Keep MCP tool schemas lean

5. Constrain output with structured formats instead of hoping for brevity

6. Don’t repeat context the model already has

7. For vision-input use cases specifically, check whether text would do

Key lessons learned

Related reading

Frequently asked

Keep reading

Less guessing.Faster fixes!

Less guessing.
Faster fixes!