AI tokens, cache and LLM costs: a simple budget guide

21 May 2026 WG 6 min read

Introduction

Many companies still assess AI cost with the wrong question: “how much is the subscription?” For an internal tool, a SaaS product or an automation, the real question is different: how much do input tokens, output tokens, tool calls, cache, searches, failed calls and validations cost?

A subscription may look clear. An API bill can become confusing. Costs vary depending on the model, context size, output volume, cache availability, tools called and sometimes the processing region.

This article explains the basics for managing an AI budget without unnecessary jargon. It does not provide the real cost of WG products, because that cost must be proven by invoices, logs and consolidated live usage before any public announcement.

A token is not a word

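A token is the technical unit a model uses to read and generate text. It is not a word: a token can be a whole word, part of a word, a punctuation mark or a few characters of code. The token count for a given text therefore varies with the language, the punctuation, the amount of code and the characters used.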

To manage a budget, you need to separate three things:

| Element | What it means | Budget impact |
| --- | --- | --- |
| Input tokens | What you send to the model | Large prompts, files, history, tools |
| Output tokens | What the model generates | Long answers, articles, code, reports |
| Tool tokens | Tool definitions, results, search, files | Can grow quickly with agents |

The common trap is to look only at the user prompt. In a real workflow, the model also receives system instructions, project context, schemas, tools, search results and sometimes files or code.

Why output can cost more

Across several pricing grids reviewed, output tokens can cost more than input tokens depending on the provider and model. Prices change regularly, so the official pricing page must be checked on the day a budget is prepared.

The business principle remains useful even without freezing a number: a long output can cost significantly more than a short input when it is generated often.
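To make the principle concrete, here is a minimal sketch in Python. The two prices are placeholder values chosen only for illustration, not any provider's pricing, and the function name `call_cost` is hypothetical.

```python
# Hypothetical prices in USD per 1 million tokens (placeholders, not real pricing).
INPUT_PRICE_PER_MTOK = 1.00
OUTPUT_PRICE_PER_MTOK = 4.00


def call_cost(input_tokens: int, output_tokens: int) -> dict:
    """Return the input, output and total cost of a single call."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return {
        "input": input_cost,
        "output": output_cost,
        "total": input_cost + output_cost,
    }


# A short prompt that triggers a long report: the output side dominates.
short_in_long_out = call_cost(input_tokens=500, output_tokens=4_000)
print(short_in_long_out)
```

With these assumed prices, the output part of the call is many times larger than the input part. Multiplied by thousands of calls per month, the length of the answer becomes the main budget lever.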

Examples of outputs that can inflate costs:

  • Long articles in several variants.
  • Detailed audit reports.
  • Generated code with long explanations.
  • Agents that summarize every action instead of returning a short proof of the result.
  • Workflows that restart several times instead of correcting one step.

Cache can help, but it is not a magic wand

OpenAI documents prompt caching as an automatic mechanism on eligible prompts, with possible latency and cost benefits when identical prefixes are reused. The documentation also indicates that prompts must reach a minimum threshold and that static content should be placed at the beginning to maximize cache hits.

Anthropic also documents prompt caching, with distinct cache write and cache read operations and cache durations such as 5 minutes or 1 hour, depending on configuration and pricing.

The practical conclusion:

  • Cache mainly helps when the same stable context is reused.
  • It works better when stable instructions and documents are placed at the beginning.
  • It does not fix a poorly designed workflow.
  • It does not automatically reduce all costs.
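The ordering point can be illustrated with a small Python sketch. It is only a character-level approximation: real caches work on tokens and apply provider-specific thresholds. The constants, `build_prompt` and `shared_prefix_chars` are hypothetical names introduced for this example.

```python
import os

# Hypothetical stable context, reused on every request.
STABLE_INSTRUCTIONS = "You are a support assistant. Answer in three sentences."
STABLE_DOCUMENTS = "Refund policy: refunds are accepted within 30 days of purchase."


def build_prompt(user_request: str, stable_first: bool = True) -> str:
    """Assemble a prompt with the stable parts either first or last."""
    parts = [STABLE_INSTRUCTIONS, STABLE_DOCUMENTS]
    if stable_first:
        parts.append(user_request)      # variable part at the end
    else:
        parts.insert(0, user_request)   # variable part at the start
    return "\n\n".join(parts)


def shared_prefix_chars(prompt_a: str, prompt_b: str) -> int:
    """Count the leading characters two prompts have in common."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))


question_a = "Can I get a refund after 10 days?"
question_b = "How do I change my email address?"

# Stable content first: the whole shared context forms an identical prefix.
reusable = shared_prefix_chars(build_prompt(question_a), build_prompt(question_b))

# Variable content first: the prompts diverge from the first character.
wasted = shared_prefix_chars(
    build_prompt(question_a, stable_first=False),
    build_prompt(question_b, stable_first=False),
)
print(reusable, wasted)
```

When the variable request comes first, the two prompts share no prefix at all, so a prefix-based cache has nothing to reuse even though most of the content is identical.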

Internal mini-calculator

To estimate an AI workflow, use a simple table with these fields.

| Field | Internal example | How to use it |
| --- | --- | --- |
| Model | To fill in | Input/output pricing to recheck |
| Average input | 20,000 tokens | Prompt + context + tools |
| Average output | 3,000 tokens | Final answer or report |
| Cache rate | 0%, 50%, 80% | Hypothesis to prove through usage |
| Number of calls | 1,000/month | Real volume or scenario |
| Tool cost | Web search, file search, shell, etc. | Depends on provider |
| Total cost | Calculated | Do not publish without proof |

This calculator should not display any official cost until live data has been consolidated.
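As a starting point, the fields can be turned into a few lines of Python. This is a sketch under stated assumptions: every price is a placeholder to replace with the provider's current pricing, cache write surcharges are ignored, and the class name `WorkflowEstimate` and its attributes are hypothetical. The result is a scenario, not an official cost.

```python
from dataclasses import dataclass


@dataclass
class WorkflowEstimate:
    avg_input_tokens: int
    avg_output_tokens: int
    calls_per_month: int
    cache_rate: float                   # share of input tokens read from cache (hypothesis)
    input_price_per_mtok: float         # placeholder price, USD per 1M tokens
    cached_input_price_per_mtok: float  # placeholder price, USD per 1M tokens
    output_price_per_mtok: float        # placeholder price, USD per 1M tokens
    tool_cost_per_call: float = 0.0     # depends on provider

    def __post_init__(self) -> None:
        if not 0.0 <= self.cache_rate <= 1.0:
            raise ValueError("cache_rate must be between 0 and 1")

    def monthly_cost(self) -> float:
        """Estimated monthly cost; output tokens are never discounted by cache."""
        cached = self.avg_input_tokens * self.cache_rate
        uncached = self.avg_input_tokens - cached
        token_cost = (
            uncached * self.input_price_per_mtok
            + cached * self.cached_input_price_per_mtok
            + self.avg_output_tokens * self.output_price_per_mtok
        ) / 1_000_000
        return (token_cost + self.tool_cost_per_call) * self.calls_per_month


# The three cache-rate scenarios from the table, with placeholder prices.
scenarios = {
    rate: WorkflowEstimate(
        avg_input_tokens=20_000,
        avg_output_tokens=3_000,
        calls_per_month=1_000,
        cache_rate=rate,
        input_price_per_mtok=1.00,
        cached_input_price_per_mtok=0.10,
        output_price_per_mtok=4.00,
    ).monthly_cost()
    for rate in (0.0, 0.5, 0.8)
}
for rate, cost in scenarios.items():
    print(f"cache rate {rate:.0%}: {cost:.2f} per month (scenario, not an invoice)")
```

Running the scenarios side by side shows how much of the estimate depends on the cache-rate hypothesis, which is why that rate has to be proven through logged usage.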

Errors that make the bill explode

  1. Sending all context on every request without separating stable and variable parts.
  2. Asking for long answers by default.
  3. Letting an agent multiply searches without a retrieval budget.
  4. Using a model that is too powerful for a simple task.
  5. Not logging input_tokens, output_tokens and cached tokens.
  6. Confusing a one-off test with monthly cost at real volume.

The solution is not always to choose the cheapest model. The solution is to choose the right model level, the right context, the right output format and the right verification.
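For error 3, a retrieval budget can be as simple as a counter that the agent loop must consult before each search. The sketch below is one possible shape, not a feature of any agent framework; `RetrievalBudget`, `RetrievalBudgetExceeded` and `spend` are hypothetical names.

```python
class RetrievalBudgetExceeded(Exception):
    """Raised when an agent asks for more searches than its budget allows."""


class RetrievalBudget:
    def __init__(self, max_searches: int) -> None:
        self.max_searches = max_searches
        self.used = 0

    def spend(self) -> None:
        """Record one search, or refuse it if the budget is already used up."""
        if self.used >= self.max_searches:
            raise RetrievalBudgetExceeded(
                f"search budget of {self.max_searches} exhausted"
            )
        self.used += 1


# Usage: call spend() before every search the agent wants to run.
budget = RetrievalBudget(max_searches=2)
budget.spend()
budget.spend()
try:
    budget.spend()
except RetrievalBudgetExceeded as error:
    print(f"Stopping retrieval: {error}")
```

The agent then has to answer with what it already retrieved, or escalate, instead of silently adding search and token costs.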

WG control method

A clean AI workflow should store at least:

  • The model used.
  • The number of calls.
  • Input/output tokens.
  • Cached tokens when available.
  • Tools called.
  • Failure or regeneration rate.
  • The business value of the result.

Without this base, you are not managing an AI system. You are consuming a black box.
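A minimal sketch of such a log, assuming one record per call. The field names mirror the list above but are hypothetical; they would need to be mapped to whatever usage fields the provider's API actually returns.

```python
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0                      # 0 when the provider reports none
    tools_called: list[str] = field(default_factory=list)
    failed: bool = False                        # failure or regeneration
    business_value: str = ""                    # free-text note on the result


def summarize(records: list[CallRecord]) -> dict:
    """Aggregate call records into the figures needed for a budget review."""
    calls = len(records)
    failures = sum(1 for record in records if record.failed)
    return {
        "calls": calls,
        "input_tokens": sum(record.input_tokens for record in records),
        "output_tokens": sum(record.output_tokens for record in records),
        "cached_tokens": sum(record.cached_tokens for record in records),
        "tool_calls": sum(len(record.tools_called) for record in records),
        "failure_rate": failures / calls if calls else 0.0,
    }


# Example with made-up records and a placeholder model name.
log = [
    CallRecord("model-a", 20_000, 3_000, cached_tokens=16_000, tools_called=["web_search"]),
    CallRecord("model-a", 18_000, 2_500, failed=True),
]
print(summarize(log))
```

Stored per call and summarized per month, these records are what turns a cache-rate or cost hypothesis into a figure that can be checked against the invoice.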

Verified official sources

Sources rechecked on 2026-05-20 before publication.

Note: AI prices, model names and features can change. Official sources must be rechecked before any budget or technical decision.

WG

Web development and SEO expert at Web Generation Agency. Since 2007, nearly 20 years of experience building high-performance websites and delivering natural search engine optimization.
