
Your Claude Code Quota Is Burning 20x Faster Than You Think

A community investigation intercepted 17,610 API calls and found that invisible thinking tokens are the dominant quota cost in Claude Code. Six confirmed bugs, four still unfixed.


You're midway through a refactor, Claude Code is humming along, and then: rate limited. Again. After barely an hour.

You're not imagining it. A community investigation just dropped hard numbers: invisible "thinking tokens" are the dominant cost in your Claude Code quota, and you can't see or control them.

What the proxy revealed

A developer built cc-relay, a transparent monitoring proxy that intercepts Claude Code's API calls using the standard ANTHROPIC_BASE_URL environment variable. It logs every request and response without modifying any behavior.
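The interception itself needs no patching: the Messages API returns per-request token accounting in a `usage` object, so a logging proxy only has to record it. A minimal sketch of the per-request summary such a proxy might compute (the field names follow Anthropic's public API; the sample numbers are illustrative, not from the investigation):

```python
def summarize_usage(usage: dict) -> dict:
    """Summarize the `usage` block of one logged API response.

    The Anthropic Messages API reports input_tokens, output_tokens,
    and the cache counters cache_read_input_tokens /
    cache_creation_input_tokens on every response.
    """
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total_input = read + created + fresh
    return {
        "total_input": total_input,
        "output": usage.get("output_tokens", 0),
        # Share of the prompt served from cache -- the number that
        # collapsed to ~36% in the regressed client versions.
        "cache_hit_rate": read / total_input if total_input else 0.0,
    }

# A healthy request: almost the entire prompt read back from cache.
print(summarize_usage({
    "input_tokens": 1200,
    "cache_read_input_tokens": 95000,
    "cache_creation_input_tokens": 1800,
    "output_tokens": 900,
}))
```

Aggregating this one number across requests is enough to spot both the cache regressions and the visible-vs-hidden token gap described below.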

Across 17,610 logged requests over one week:

  • Your 5-hour quota window gives you roughly 30-50 moderate coding tasks. That's it.
  • Visible output per 1% of quota: 9K-16K tokens. The rest is extended thinking tokens you never see.
  • Cache hit rates dropped to 36.1% in certain versions when they should be above 90%.
  • Rolling back from v2.1.89 to v2.1.68 restored cache performance to 97.6%.

The quota system uses dual sliding windows: a 5-hour counter and a 7-day counter. In 100% of captured requests, the 5-hour window is the binding constraint, regardless of time of day.
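Conceptually, a dual sliding-window limiter admits a request only if it fits under both counters, and whichever window has the least headroom is the binding one. A toy model of that mechanism (the caps here are made up for illustration, not Anthropic's actual limits):

```python
from collections import deque
import time

class SlidingWindowQuota:
    """Toy dual-window quota: a spend passes only if it fits under
    both the 5-hour and the 7-day token budgets."""

    def __init__(self, short_cap, long_cap,
                 short_s=5 * 3600, long_s=7 * 86400):
        self.windows = [
            # [window length in seconds, token cap, recorded events]
            [short_s, short_cap, deque()],
            [long_s, long_cap, deque()],
        ]

    def _used(self, window, now):
        length, _, events = window
        while events and events[0][0] <= now - length:
            events.popleft()  # expire events that slid out of the window
        return sum(tokens for _, tokens in events)

    def try_spend(self, tokens, now=None):
        now = time.time() if now is None else now
        # The window with the least remaining headroom binds first.
        if any(self._used(w, now) + tokens > w[1] for w in self.windows):
            return False
        for w in self.windows:
            w[2].append((now, tokens))
        return True

quota = SlidingWindowQuota(short_cap=100, long_cap=1000)
print(quota.try_spend(60, now=0))            # True
print(quota.try_spend(60, now=10))           # False: 5-hour window binds
print(quota.try_spend(60, now=5 * 3600 + 11))  # True: first spend expired
```

With a short cap this much tighter than the long one, the short window rejects first in every case, matching the observed behavior where the 5-hour counter was always the binding constraint.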

Six confirmed bugs, four still unfixed

The investigation uncovered six distinct bugs; only two have been fixed so far.

Fixed in v2.1.91

Sentinel Bug. The standalone binary broke the cache prefix, reducing cache effectiveness to 4-17% and multiplying costs by 20x.

Resume Bug. The --resume flag replayed full context without using cache, causing complete cache misses on 500K-token conversations.

Still unfixed

Budget Cap. Tool results face a 200K character aggregate limit. After roughly 15-20 file reads, older results get truncated to 1-41 characters. This happened in 72,839 measured events with a 100% truncation rate. You're paying for 1M-token context. You're not getting it for tool results.
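The reported behavior is easy to model: tool results are kept in full only while their aggregate size stays under a fixed character budget, and once it overflows, the oldest results shrink to a stub. A sketch of that policy (the 200K budget is from the report; the stub length and truncation order are assumptions for illustration):

```python
BUDGET = 200_000  # aggregate character cap reported for tool results
STUB = 30         # keep only a short prefix of truncated results

def apply_budget_cap(tool_results):
    """Truncate oldest-first until the aggregate fits the budget.

    tool_results: list of strings, oldest first. Returns a new list;
    truncated entries shrink to a short stub, mimicking the 1-41
    character remnants seen in the captured sessions.
    """
    results = list(tool_results)
    total = sum(len(r) for r in results)
    for i, r in enumerate(results):
        if total <= BUDGET:
            break
        total -= len(r) - min(len(r), STUB)
        results[i] = r[:STUB]
    return results

# Twenty 15K-character file reads blow the 200K budget: the oldest
# reads come back as stubs while the newest survive intact.
reads = ["x" * 15_000 for _ in range(20)]
capped = apply_budget_cap(reads)
print(sum(len(r) for r in capped))  # now <= 200_000
```

Under this model, roughly the first third of those twenty reads get gutted, which is why starting a fresh session (see the remedies below) is currently the only reset.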

False Rate Limiter. The client generates fake "Rate limit reached" messages. Across 65 sessions, 151 synthetic errors occurred even though no actual API calls were attempted.

Microcompact. The server silently removes old tool results from context: 3,782 clearing events removed 15,998 items total.

JSONL Inflation. Extended thinking duplicates entries in local logs, creating an average 2.37x token inflation across 532 files.
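That inflation is measurable from the session logs themselves: if the same thinking block gets re-serialized into multiple JSONL records, the bytes on disk overcount what the model emitted once. A rough checker (the record layout here, a top-level `"thinking"` string, is a guessed schema for illustration, not Claude Code's actual log format):

```python
import json

def inflation_ratio(jsonl_lines):
    """Ratio of total logged thinking text to unique thinking text.

    Assumes each line is a JSON record that may carry a "thinking"
    string; blocks duplicated across records inflate the numerator.
    """
    total = 0
    unique = {}
    for line in jsonl_lines:
        record = json.loads(line)
        text = record.get("thinking", "")
        total += len(text)
        unique[text] = len(text)
    unique_total = sum(v for k, v in unique.items() if k)
    return total / unique_total if unique_total else 1.0

# The same thinking block serialized three times inflates 3x.
lines = [json.dumps({"thinking": "plan the refactor"})] * 3
print(inflation_ratio(lines))  # 3.0
```

Run across a directory of session files, an average ratio near 2.37x would reproduce the investigation's number.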

The budget cap and microcompact mechanisms are controlled by server-side A/B testing flags. Anthropic can modify behavior without releasing client updates. No environment variable provides overrides.

Independent confirmation

Two separate analyses back this up:

  • An 18-day forensic comparison showed token consumption dropping from 3.2B to 88M between unlimited and limited periods, at 90% utilization. Visible token counts alone don't explain that reduction.
  • A second analysis documented a 34-143x capacity reduction and confirmed that cache-related fixes improved efficiency, but the capacity drop appeared independent of client-side bugs.

Why this matters

When your tooling silently burns through quota via invisible tokens, you lose the ability to plan capacity or debug cost spikes. You hit a wall mid-session with no warning and no recourse.

This is a straightforward observability problem. If you can see what your infrastructure is doing, you can catch these issues before they eat your budget. If you can't, you're flying blind.

What to do right now

  1. Update to v2.1.91 to fix the cache regression
  2. Stop using --resume and --continue; start fresh sessions instead
  3. Start new sessions periodically to reset the 200K budget cap
  4. Use a single terminal, since multiple terminals don't share cache
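And if you want the visibility the client doesn't give you, the same env var the investigation used works for any transparent proxy. A minimal setup sketch (the port and proxy command are illustrative; check the cc-relay README for its actual invocation):

```shell
# Route Claude Code's API traffic through a local monitoring proxy.
# ANTHROPIC_BASE_URL is the standard override the investigation used;
# localhost:8082 is an assumed port for whatever proxy you run there.
export ANTHROPIC_BASE_URL="http://localhost:8082"

# Sessions started from this shell now traverse the proxy, which can
# log every request's usage block before forwarding to Anthropic.
```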

Full analysis with raw data, proxy tool, and community tools: github.com/ArkNill/claude-code-hidden-problem-analysis

This was a 17-person community effort. Credit to the researchers in the full writeup.
