Context Compression for LLMs: The Complete Technical Guide (2026)

Seven context compression methods compared with real benchmarks: LLM summarization, opaque compression, verbatim compaction, LLMlingua token pruning, observation masking, ACON adaptive control, and multi-agent isolation. Factory.ai's 36K message eval, ACON's 26-54% reduction, JetBrains' zero-cost masking, and why prevention beats compression.

March 13, 2026 · 2 min read

Coding agents spend 60% of their time searching for code. Every file read, grep result, and tool output accumulates in the context window. The question is not whether to compress, but which method loses the least signal. Seven distinct approaches have emerged, each with different tradeoffs between compression ratio, fidelity, speed, and hallucination risk.

60% · Agent time spent searching (Cognition)
3.70/5 · Best compression score (Factory.ai, 36K msgs)
26-54% · Token reduction (ACON, 95%+ accuracy)
20x · Max compression (LLMlingua)

Why Context Compression Matters for LLMs

Context compression reduces the number of tokens an LLM processes while preserving the information it needs to produce correct output. For coding agents running multi-step tasks, context accumulates fast: file reads, grep results, tool outputs, error traces, and conversation history all compete for the same finite window.

The problem is not running out of space. It's signal dilution. As context rot research shows, LLM performance degrades as input length increases, even when the window is nowhere near full. Liu et al.'s "Lost in the Middle" paper found that LLM accuracy drops by over 30% when relevant information sits in the middle of the context rather than at the beginning or end. Every irrelevant token makes the model worse at attending to what matters.

Transformer attention is quadratic: at 10,000 tokens, the model tracks 100 million pairwise relationships. At 100,000 tokens, that number is 10 billion. More context doesn't just dilute relevance. It makes the model architecturally worse at locating the right tokens.
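The arithmetic is easy to check directly: a 10x longer context means 100x more pairwise relationships.

```python
def attention_pairs(tokens: int) -> int:
    # Self-attention relates every token to every other token,
    # so pairwise relationships grow as n^2.
    return tokens ** 2

print(attention_pairs(10_000))   # 100000000  (100 million)
print(attention_pairs(100_000))  # 10000000000 (10 billion)
print(attention_pairs(100_000) // attention_pairs(10_000))  # 100x cost for 10x context
```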

Jason Liu framed the value precisely: "If in-context learning is gradient descent, then compaction is momentum." It preserves the trajectory of the conversation while shedding the weight of irrelevant history.

Bigger windows don't solve the problem

A model with a 1M token context window still exhibits context rot at 50K tokens. Chroma tested 18 frontier models (GPT-4.1, Claude Opus 4, Gemini 2.5, Qwen 3) and found every model's reliability declines as input length increases, even on simple retrieval tasks. The degradation is non-uniform and hard to predict. You can't plan around it.

The Seven Context Compression Methods

Each method operates at a different level of abstraction, from token-level pruning to full architectural isolation. The right choice depends on what you need to preserve and what you can afford to lose.

| Method | Mechanism | Compression | Hallucination Risk | Speed |
| --- | --- | --- | --- | --- |
| LLM Summarization | Rewrites into sections | 70-90% | Medium | Slow (full LLM call) |
| Opaque Compression | Model-internal reduction | Variable | Unknown | Variable |
| Verbatim Compaction | Deletes noise, keeps text | 50-70% | Zero | 3,300+ tok/s |
| Token Pruning (LLMlingua) | Removes low-info tokens | 2-20x | Low | 3-6x faster than LLM |
| Observation Masking | Replaces outputs with placeholders | ~100% per output | Zero | Free (no compute) |
| ACON Adaptive Control | Task-aware observation trimming | 26-54% | Low | Distillable |
| Multi-Agent Isolation | Separate context per agent | Architectural | Zero | Parallel |

Method 1: LLM Summarization

The most widely deployed approach. An LLM rewrites conversation history into organized sections: completed work, current state, pending tasks. Claude Code's auto-compact uses this method, triggering when conversations approach context limits. OpenAI Codex runs server-side summarization after every turn.

Factory.ai ran the most rigorous public evaluation: 36,000 real software engineering messages from production coding sessions. Their structured summarizer scored 3.70/5 overall. OpenAI's opaque compression scored 3.35/5. The gap comes from structure: organized sections (what was done, what failed, what's next) help the model pick up where it left off.

3.70/5 · Factory.ai structured summary score
70-90% · Typical compression ratio
Full LLM call · Cost per compression

The tradeoff: Summarization achieves the highest compression ratios, but the model rewrites the original text. Code snippets get paraphrased. File paths become "the auth module." Line numbers vanish. Error messages are described rather than quoted. This creates the re-reading loop problem.

Claude Code lets you customize what compaction preserves. Adding /compact Focus on code samples and API usage or instructions in CLAUDE.md directs the summarizer to keep specific categories of content. This partially mitigates signal loss but doesn't eliminate it.

Summarization cost compounds

Every summarization call is itself an LLM inference. If you compress context by calling Claude or GPT-4 to summarize, you pay for that call in tokens and latency. For high-frequency compression (every few turns), the overhead compounds. A single summarization of 100K tokens costs the same as a full model inference.
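The compounding is simple to model. A rough sketch, using an illustrative per-token price (not any provider's actual rates):

```python
def summarization_overhead(context_tokens: int, turns: int,
                           compress_every: int, price_per_1k: float) -> float:
    # Each compression pass is itself a full LLM call over the current
    # context. Illustrative only: assumes a constant context size and
    # a flat token price.
    calls = turns // compress_every
    return calls * (context_tokens / 1000) * price_per_1k

# Compressing a 100K-token context every 5 turns over a 50-turn session:
cost = summarization_overhead(100_000, 50, 5, 0.01)
print(f"${cost:.2f}")  # $10.00 in summarization calls alone
```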

Method 2: Opaque Compression

OpenAI's Codex uses the /responses/compact endpoint to compress conversation state server-side. The compression logic is internal to the model. You send context in, you get a smaller representation back. You cannot inspect what was kept or dropped.

In Factory.ai's evaluation, this approach scored 3.35/5, below structured summarization's 3.70. The likely reason: without explicit structure (sections for done/failed/next), the compressed output is harder for the model to parse when it resumes work.

The opacity is the fundamental problem. When compression fails (and it will), you have no way to diagnose what was lost. With summarization, you can read the summary. With compaction, you can diff against the original. With opaque compression, the internal representation is a black box.

3.35/5 · Factory.ai opaque compression score
Variable · Compression ratio (opaque)
0% · Inspectability

Method 3: Verbatim Compaction

Verbatim compaction takes a fundamentally different approach: deletion, not rewriting. The model identifies which tokens carry signal and which are noise, then removes the noise. Every sentence that survives compression is word-for-word identical to the original.

3,300+ · Tokens per second
50-70% · Reduction ratio
98% · Verbatim accuracy
0% · Hallucination risk

Morph Compact implements this approach. It processes context at 3,300+ tokens per second with 50-70% reduction. Because the output is a strict subset of the input, there is zero risk of the compression step introducing errors. If a file path, error message, or code block survives compression, it is character-for-character identical to the original.

The historical problem with deletion-based compression was speed. Scoring every token for relevance is expensive, and early approaches took 8-15 seconds per compaction. Morph built custom inference engines optimized for the compaction workload, bringing latency under 3 seconds. Fast enough to run before every LLM call, not just as an emergency measure.
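Because the output is a strict subset of the input, the verbatim property is mechanically checkable. A minimal guardrail sketch (not Morph's implementation): reject any compacted result whose lines don't appear character-for-character in the original:

```python
def is_verbatim_subset(original: str, compacted: str) -> bool:
    # Every non-empty line of the compacted output must appear
    # character-for-character somewhere in the original context.
    original_lines = set(original.splitlines())
    return all(line in original_lines
               for line in compacted.splitlines() if line.strip())

original = ("Error at line 98: TypeError\n"
            "Lines 135-180: handleInvoicePaid()\n"
            "irrelevant progress output")
good = "Error at line 98: TypeError"
bad = "Error at line 98: a TypeError occurred"  # paraphrased, not verbatim

print(is_verbatim_subset(original, good))  # True
print(is_verbatim_subset(original, bad))   # False
```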

Summarization vs. verbatim compaction on real agent output

# Original context (from a coding agent session):
Tool output: Read file src/api/webhooks/stripe.ts (247 lines)
  Lines 42-89: handleSubscriptionUpdated()
  Lines 90-134: handlePaymentFailed() — has retry logic bug
  Lines 135-180: handleInvoicePaid()
Error at line 98: TypeError: Cannot read property 'retryCount'
  of undefined. subscription.metadata.retryCount is null when
  the customer has no prior failed payments.
Agent note: Need to add null check at line 98 before accessing
  retryCount. Also update the test at test/webhooks.test.ts:156.

# After SUMMARIZATION:
"Found a bug in the Stripe webhook handler related to retry
 logic. The subscription metadata needs a null check."
→ Lost: exact file path, line numbers, error message, test location

# After VERBATIM COMPACTION:
Tool output: Read file src/api/webhooks/stripe.ts
  Lines 90-134: handlePaymentFailed() — has retry logic bug
Error at line 98: TypeError: Cannot read property 'retryCount'
  of undefined. subscription.metadata.retryCount is null when
  the customer has no prior failed payments.
Agent note: Need to add null check at line 98 before accessing
  retryCount. Also update the test at test/webhooks.test.ts:156.
→ Kept: exact file, line numbers, error, test path. Removed: irrelevant functions.

The summarized version is shorter but lost every detail the agent needs for the next edit. The compacted version is longer but preserves all actionable information. For coding agents, where a wrong file path or missing line number triggers a full re-search, this distinction is the difference between making progress and entering a re-reading loop.

For a deeper comparison between these approaches, see compaction vs. summarization.

Method 4: Token-Level Pruning (LLMlingua)

Microsoft's LLMlingua (EMNLP 2023) compresses prompts at the token level. A smaller language model scores each token by information entropy, then removes low-information tokens. The result reads like telegraphic text: grammatically broken but semantically preserved.

Up to 20x · Maximum compression (LLMlingua)
2-5x · Practical compression (LLMlingua-2)
3-6x · Faster than LLM-based methods
21.4% · Performance boost at 4x fewer tokens

LLMlingua achieves up to 20x compression with minimal performance loss on benchmarks like GSM8K, BBH, and ShareGPT. Its successor, LLMlingua-2 (ACL 2024), reformulates compression as a token classification problem using bidirectional Transformer encoders (XLM-RoBERTa-large). This runs 3-6x faster than the original with 1.6-2.9x end-to-end latency acceleration.

The related LongLLMlingua variant targets long-context scenarios specifically, achieving 21.4% performance improvement with 4x fewer tokens on NaturalQuestions.

The limitation: Token-level pruning operates below the semantic level. It removes function words and low-entropy tokens, but it doesn't understand that src/api/webhooks/stripe.ts:98 is a high-signal reference that should survive intact. A pruned file path might lose its line number. A pruned error message might lose a key identifier. For coding agents, this matters. For general Q&A or summarization tasks, the tradeoff is often worth it.
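The mechanism can be sketched without the full model: score each token by surprisal and drop the lowest-scoring ones. This toy version uses in-text frequency as a stand-in for a small LM's entropy estimates (LLMlingua itself uses a real language model):

```python
import math
from collections import Counter

def prune_tokens(text: str, keep_ratio: float = 0.5) -> str:
    # Toy stand-in for LLMlingua: frequent tokens carry little
    # information (low surprisal) and are dropped first.
    tokens = text.split()
    freq = Counter(tokens)
    total = len(tokens)
    # Surprisal: -log p(token). Rare tokens score high and survive.
    scored = [(-math.log(freq[t] / total), i, t) for i, t in enumerate(tokens)]
    keep = sorted(scored, reverse=True)[: max(1, int(total * keep_ratio))]
    # Restore original order among the survivors.
    return " ".join(t for _, i, t in sorted(keep, key=lambda x: x[1]))

text = "the error the bug the crash at stripe.ts line 98 the the the"
print(prune_tokens(text, keep_ratio=0.6))
# error bug crash at stripe.ts line 98  -- filler dropped, identifiers kept
```

Note the telegraphic output: grammar is gone, but the high-signal tokens survive. The real risk the paragraph above describes is that a frequency-based scorer has no notion of which rare tokens (a line number, a hash) are load-bearing.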

For more on prompt-level compression approaches, see prompt compression.

Method 5: Observation Masking

The simplest approach: replace tool outputs with placeholders after the model has processed them. The model saw the file contents, made its decision, and now those contents are replaced with [File read: src/auth.ts, 247 lines]. No LLM call. No computation. Free.
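Masking needs no model at all; it is a pure substitution over message history. A minimal sketch, assuming a simple message-dict format (the roles and fields here are illustrative, not any specific SDK's schema):

```python
def mask_old_observations(messages: list[dict], keep_last: int = 2) -> list[dict]:
    # Replace tool outputs with placeholders, except the most recent
    # few, which the agent may still be acting on.
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_mask = set(tool_indices[:-keep_last] if keep_last else tool_indices)
    masked = []
    for i, m in enumerate(messages):
        if i in to_mask:
            placeholder = (f"[{m.get('label', 'tool output')}: "
                           f"{len(m['content'])} chars, already processed]")
            masked.append({**m, "content": placeholder})
        else:
            masked.append(m)
    return masked

history = [
    {"role": "tool", "label": "Read src/auth.ts", "content": "x" * 9000},
    {"role": "assistant", "content": "Found the bug at line 42."},
    {"role": "tool", "label": "Read src/db.ts", "content": "y" * 7000},
]
masked = mask_old_observations(history, keep_last=1)
print(masked[0]["content"])       # [Read src/auth.ts: 9000 chars, already processed]
print(len(masked[2]["content"]))  # 7000 -- most recent output untouched
```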

JetBrains tested this approach in Junie, their coding agent. They compared observation masking against full LLM summarization on SWE-bench and found that simple masking matched summarization quality at a fraction of the cost. The insight: once the model has acted on information, it doesn't need the raw data in subsequent turns. The placeholder is enough to remind the model that it already processed that content.

$0 · Cost per compression
~100% · Reduction per masked output
SWE-bench · Validated benchmark

The limitation: Masking is irreversible. If the model needs to re-reference a file it already read, the content is gone. It has to re-read the file, adding a new tool call and new context. This works well when agents rarely backtrack. It breaks down in iterative debugging where the agent revisits earlier outputs.

Masking also discards information that might be useful in aggregate. A single file read may not be important, but the pattern across 10 file reads might reveal an architectural issue. Masking destroys those cross-reference patterns.

Method 6: Adaptive Context Control (ACON)

The ACON framework from Microsoft Research targets the largest source of context bloat: tool call observations. File reads, grep results, and command outputs typically fill 60-80% of an agent's context window. ACON adaptively controls observation length based on task requirements.

26-54% · Peak token reduction
95%+ · Task accuracy preserved
46% · Performance gain for smaller LMs
3 · Benchmarks tested (AppWorld, OfficeBench, Multi-obj QA)

ACON works by analyzing paired trajectories: cases where full context succeeds but compressed context fails. An LLM examines why the compressed version failed and iteratively improves compression guidelines. Short observations pass through unchanged. Long ones get compressed using the learned guidelines.

The key insight: ACON treats compression as an optimization problem, not a fixed algorithm. It learns which types of observations can be safely compressed and which must be preserved. This makes it adaptive across tasks, not just a static compression threshold.

ACON also distills optimized compressors into smaller models, preserving over 95% of accuracy with lower compute overhead. This enables smaller language models to handle long-horizon agent tasks, achieving up to 46% performance improvement.
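The control loop can be sketched abstractly. This is only the shape of the idea, not ACON's implementation: short observations pass through, long ones are compressed under guidelines that grow whenever a compressed trajectory fails where the full one succeeded:

```python
def control_observation(obs: str, guidelines: list[str],
                        compress, threshold: int = 400) -> str:
    # Short observations pass through unchanged; long ones are
    # compressed using the currently learned guidelines.
    if len(obs) <= threshold:
        return obs
    return compress(obs, guidelines)

def refine_guidelines(guidelines: list[str], failure_analysis: str) -> list[str]:
    # When full context succeeds but compressed context fails, an LLM
    # explains why; the explanation becomes a new guideline.
    return guidelines + [failure_analysis]

# Stub compressor for illustration: keep only lines matching a guideline.
def keyword_compress(obs: str, guidelines: list[str]) -> str:
    return "\n".join(l for l in obs.splitlines()
                     if any(g in l for g in guidelines))

long_obs = "\n".join(["[progress] 42%"] * 50 + ["TypeError at line 98"])
compressed = control_observation(long_obs, ["TypeError", "line"], keyword_compress)
print(compressed)  # TypeError at line 98

guidelines = refine_guidelines(["TypeError", "line"], "preserve file paths")
```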

Method 7: Multi-Agent Isolation

The most architecturally distinct approach: don't compress at all. Instead, decompose the task so each sub-task runs in its own context window. A lead agent delegates search, analysis, and implementation to subagents. Each subagent processes only the context relevant to its task and returns a condensed summary to the lead.

Anthropic's multi-agent research system demonstrated this directly. An Opus 4 lead agent delegating to Sonnet 4 subagents outperformed a single Opus 4 agent by 90.2% on research tasks. Each subagent operates in its own context window, receives only task-relevant information, and returns a condensed summary.

90.2% · Multi-agent improvement (Anthropic)
0% · Hallucination from compression
3-4x · Effective context expansion

ContextEvolve formalizes multi-agent compression with three specialized agents: a Summarizer that condenses semantic state, a Navigator that extracts optimization direction, and a Sampler that manages experience distribution. On the ADRS benchmark, this achieves 33.3% performance improvement over baselines with 29% token reduction.

WarpGrep is a specialized search subagent trained with reinforcement learning. It runs in its own context window, processes search queries independently, and returns only the relevant code snippets (not entire files) to the parent agent. 0.73 F1 in 3.8 steps vs grep's 0.19 F1 in 12 steps. The parent agent never sees the search noise.

The tradeoff: Multi-agent isolation requires more total compute (N agents, each with their own context window). But the total tokens processed can be lower because each agent's context stays clean. Claude Code now supports agent teams with this pattern, though Anthropic notes teams use roughly 7x more tokens than standard sessions.
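The delegation pattern above can be sketched with stub agents (no real API calls; the structure, not any vendor's framework, is the point): each subagent starts from a fresh context holding only its task, and only a condensed summary flows back to the lead.

```python
def run_subagent(task: str, answer) -> str:
    # Each subagent starts with an empty context: it sees only its task,
    # never the lead's history or the other subagents' noise.
    context = [task]
    result = answer(context)  # stand-in for a real model call
    return result[:200]       # only a condensed summary returns

def lead_agent(goal: str, subtasks: list[str], answer) -> list[str]:
    # The lead's context holds the goal plus summaries -- never the
    # subagents' raw tool outputs or search results.
    lead_context = [goal]
    for task in subtasks:
        lead_context.append(run_subagent(task, answer))
    return lead_context

summaries = lead_agent(
    "Fix the Stripe retry bug",
    ["Find the failing handler", "Locate the test that covers it"],
    answer=lambda ctx: f"summary of: {ctx[0]}",
)
print(summaries)
# ['Fix the Stripe retry bug', 'summary of: Find the failing handler',
#  'summary of: Locate the test that covers it']
```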

For more on this approach, see context distillation and LLM context management.

Head-to-Head Benchmarks

Factory.ai: 36,000 Real Engineering Messages

Factory.ai ran the largest public evaluation of context compression. They tested compression methods on 36,000 real software engineering messages from production coding sessions. Not synthetic benchmarks. Not toy examples. Real multi-turn conversations with file reads, code edits, and debugging sessions.

| Method | Overall Score | Information Retention | Actionability |
| --- | --- | --- | --- |
| Structured Summary (Factory) | 3.70/5 | High | High |
| Opaque Compression (OpenAI) | 3.35/5 | Medium | Medium |
| Verbatim Compaction (Morph) | N/A (different axis) | Highest (verbatim) | Highest (exact code) |

Verbatim compaction doesn't appear in Factory's evaluation because it optimizes for a different metric: fidelity, not summarization quality. The methods are complementary, not competing. Summarization tells the model what happened. Compaction preserves exactly what was there.

ACON: Across Three Agent Benchmarks

ACON was tested on AppWorld, OfficeBench, and Multi-objective QA. Across all three, it achieved 26-54% peak token reduction while preserving 95%+ task accuracy. The distilled versions (smaller models trained on ACON's compression guidelines) enabled up to 46% performance improvement for smaller language models.

JetBrains: SWE-bench Masking vs. Summarization

JetBrains tested Junie with observation masking against full LLM summarization on SWE-bench. The result: masking matched summarization quality while costing nothing. No LLM call, no latency, no token spend. The takeaway: for many agent tasks, a simple placeholder after processing is as effective as an expensive summarization step.

LLMlingua: Academic Benchmarks

| Variant | Compression | Speed | Key Benchmark |
| --- | --- | --- | --- |
| LLMlingua (2023) | Up to 20x | Baseline | GSM8K, BBH, ShareGPT |
| LLMlingua-2 (2024) | 2-5x | 3-6x faster | MeetingBank, LongBench, ZeroScrolls |
| LongLLMlingua | 4x | 1.4-2.6x latency reduction | NaturalQuestions (+21.4%) |

Selective Context: Self-Information Filtering

The Selective Context approach uses self-information to filter less informative content, demonstrating effectiveness across summarization and question answering on academic papers, news articles, and conversation transcripts. A simpler baseline that validates the core principle: not all tokens matter equally.

The Re-Reading Loop Problem

The re-reading loop is the most expensive failure mode of summarization-based compression, and it explains why higher compression ratios don't always translate to better performance.

The loop works like this:

  1. The agent searches for code and finds relevant results
  2. Context fills up. Summarization compresses the search results
  3. The summary paraphrases file paths and line numbers
  4. The agent needs the exact file path for its next edit
  5. It re-searches, because the summary lost the precise reference
  6. New search results refill context. Summarization triggers again
  7. The cycle repeats

This loop burns tokens at roughly 2x the rate of a clean session. The agent does twice the work because compression discarded the details it needed. Cognition's measurement that agents spend 60% of their time searching is partially explained by this: some of that searching is re-searching for information that was summarized away.

Verbatim compaction breaks the loop because surviving content is exact. If a file path survives compaction, the agent doesn't need to re-search. If a line number is in the compacted output, it's the correct line number.

The re-reading loop in practice

# Turn 1: Agent reads file
> Read src/api/webhooks/stripe.ts → 247 lines added to context

# Turn 3: Agent reads another file
> Read src/lib/stripe-helpers.ts → 189 lines added to context

# Turn 5: Context at 80%. Summarization fires.
> Summary: "Reviewed Stripe webhook handler and helper functions.
>          Found retry logic issues in the payment flow."
> Context: 80% → 25%

# Turn 6: Agent needs to edit the bug
> Agent: "Where was that retry bug? I need the file and line number."
> Agent re-reads src/api/webhooks/stripe.ts → 247 lines AGAIN
> Context: 25% → 45%

# With VERBATIM COMPACTION instead:
> Compacted output preserves: "src/api/webhooks/stripe.ts:98"
> Agent proceeds to edit. No re-reading needed.
> Context: 80% → 35%, and stays there.

Prevention vs. Compression: The FlashCompact Approach

Every method above assumes context waste has already happened and needs to be cleaned up. Morph FlashCompact challenges that assumption. The most effective compression is the compression you never need to run.

Three sources account for most context waste in coding agents:

Search Waste

Grep returns 500 lines to find a 10-line function. The agent processes all 500 lines, but only 10 matter. WarpGrep returns only the relevant snippets: 0.73 F1 in 3.8 steps vs grep's 0.19 F1 in 12 steps.

Edit Waste

Full file rewrites echo the entire file back into context to change 3 lines. Fast Apply uses compact diffs at 10,500 tok/s with 98% accuracy. Only the changed lines enter context.

Residual Noise

Even after prevention, some noise accumulates from tool outputs, error traces, and exploration dead-ends. Morph Compact cleans up what remains: 3,300+ tok/s, 50-70% reduction, zero hallucination risk.

The combined effect: 3-4x longer context life. Auto-compact fires 3-4x less often. The agent spends more time reasoning and less time compressing and re-searching.

Morph achieves state-of-the-art results on SWE-Bench Pro with this prevention-first approach. WarpGrep lifts every frontier model to first place while being 15.6% cheaper and 28% faster than baseline configurations.

| Dimension | Compression (After) | Prevention (FlashCompact) |
| --- | --- | --- |
| When it runs | After context fills up | Before waste enters context |
| Re-reading loops | Causes them (lost details) | Prevents them (precise results) |
| Compaction frequency | Every N turns | 3-4x less often |
| Search accuracy | Grep: 0.19 F1 in 12 steps | WarpGrep: 0.73 F1 in 3.8 steps |
| Edit cost | Full file rewrite into context | Compact diff, 10,500 tok/s |
| Residual cleanup | Not addressed | Morph Compact at 3,300+ tok/s |

When to Compress: Timing Strategies

Even with prevention, compression remains necessary for long sessions. Two timing strategies govern when it runs.

Threshold-Based (Claude Code, Codex)

Most production agents use threshold-based compression. Claude Code triggers compaction when the conversation approaches context limits. The model summarizes history, preserves the current state, and continues. You can customize this with /compact and CLAUDE.md instructions.

The downside: context quality degrades before the threshold triggers. An agent at 60% capacity with 40% noise tokens is already performing worse than the same agent with clean context.

Inline / Continuous (Recommended)

Inline compression runs continuously. Tool outputs get compacted as they arrive. Observation data from file reads and grep results is reduced before it enters the main conversation context. The agent never accumulates noise in the first place.

OpenAI's Codex team explicitly recommends compaction as a default long-run primitive, not an emergency fallback. This aligns with inline compression: run it continuously as part of the normal agent loop.

Inline compression with Morph Compact

// Threshold-based: compress when you're almost out of space
// Problem: context already degraded by the time you compress
agent.on('contextLimit', async () => {
  const compressed = await compact(conversation);
  agent.restart(compressed);
});

// Inline: compress tool outputs before they enter context
// The agent's context stays clean throughout the session
agent.on('toolResult', async (result) => {
  if (result.tokens > 500) {
    result.content = await morph.compact(result.content);
  }
  agent.addToContext(result);
});

| Strategy | When It Triggers | Pros | Cons |
| --- | --- | --- | --- |
| Threshold-based | At 70-95% of window limit | Simple to implement | Noise degrades quality before trigger |
| Inline / continuous | On every tool output | Context stays clean throughout | Requires fast compression (>3K tok/s) |
| Hybrid | Inline for tools + threshold for history | Best of both | More complex to implement |
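The hybrid strategy chains the two triggers. A minimal sketch with stub compressors (the message format and token estimates are illustrative, not any SDK's schema):

```python
def hybrid_compress(messages: list[dict], window_limit: int,
                    compact, summarize,
                    tool_threshold: int = 500,
                    history_trigger: float = 0.8) -> list[dict]:
    # Stage 1 (inline): compact large tool outputs as they arrive.
    for m in messages:
        if m["role"] == "tool" and m["tokens"] > tool_threshold:
            m["content"] = compact(m["content"])
            m["tokens"] = len(m["content"]) // 4  # rough re-estimate
    # Stage 2 (threshold): if the window still nears its limit,
    # fall back to summarizing older history.
    if sum(m["tokens"] for m in messages) > history_trigger * window_limit:
        return summarize(messages)
    return messages

# Stub compressors for illustration.
halve = lambda text: text[: len(text) // 2]
keep_tail = lambda msgs: msgs[-3:]

msgs = [
    {"role": "tool", "tokens": 2000, "content": "x" * 8000},
    {"role": "assistant", "tokens": 50, "content": "noted"},
]
out = hybrid_compress(msgs, window_limit=10_000, compact=halve, summarize=keep_tail)
print(out[0]["tokens"])  # 1000 -- inline stage alone was enough here
```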

Implementation Guide

Morph Compact is available through the standard OpenAI SDK. Point the base URL at Morph's API and use the morph-compact model.

Basic usage with OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    api_key="your-morph-api-key",
    base_url="https://api.morphllm.com/v1"
)

# Compact a long context
response = client.chat.completions.create(
    model="morph-compact",
    messages=[
        {
            "role": "user",
            "content": long_context_string  # Your agent's accumulated context
        }
    ]
)

compacted = response.choices[0].message.content
# Every surviving sentence is verbatim from the original
# 50-70% smaller, zero hallucination risk

Inline compression in an agent loop (TypeScript)

import OpenAI from "openai";

const morph = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

// Rough token estimate (~4 chars per token); swap in a real tokenizer for accuracy.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

async function compactIfNeeded(content: string): Promise<string> {
  const tokens = estimateTokens(content);
  if (tokens < 500) return content;

  const response = await morph.chat.completions.create({
    model: "morph-compact",
    messages: [{ role: "user", content }],
  });

  return response.choices[0].message.content ?? content;
}

// In your agent loop:
for (const toolCall of pendingToolCalls) {
  const result = await executeTool(toolCall);
  const compacted = await compactIfNeeded(result.output);
  conversation.addToolResult(toolCall.id, compacted);
  // Agent never sees the noise — only high-signal tokens
}

Streaming for large context compression

# Streaming for large context compression
response = client.chat.completions.create(
    model="morph-compact",
    messages=[{"role": "user", "content": large_context}],
    stream=True
)

compacted_chunks = []
for chunk in response:
    if chunk.choices[0].delta.content:
        compacted_chunks.append(chunk.choices[0].delta.content)

compacted = "".join(compacted_chunks)

Frequently Asked Questions

What is context compression for LLMs?

Context compression reduces the number of tokens an LLM processes while preserving the information needed for correct output. Seven main approaches exist: LLM summarization (rewrites into organized sections), opaque compression (model-internal reduction), verbatim compaction (deletes noise while keeping surviving text identical), token-level pruning (LLMlingua), observation masking (replaces processed outputs with placeholders), adaptive context control (ACON), and multi-agent isolation (separate context windows per agent).

What is the difference between summarization and compaction?

Summarization rewrites context into shorter form. In Factory.ai's evaluation of 36,000 real engineering messages, structured summarization scored 3.70/5. Compaction deletes low-signal tokens while keeping every surviving sentence word-for-word identical. Summarization achieves higher compression (70-90%) but introduces hallucination risk. Compaction achieves 50-70% reduction with zero hallucination risk. For coding agents, where exact file paths, line numbers, and error messages must survive intact, compaction preserves what matters.

How much can context compression reduce token usage?

ACON achieves 26-54% peak token reduction while preserving 95%+ task accuracy. Morph Compact achieves 50-70% reduction at 3,300+ tok/s. LLMlingua can compress up to 20x, though practical ratios are 2-5x. JetBrains found observation masking matches summarization quality at zero compute cost. Prevention-first approaches (FlashCompact) extend effective context life by 3-4x.

Does context compression cause hallucinations?

Summarization-based compression can introduce hallucinations because it rewrites context in the model's own words. File paths, line numbers, and code snippets can be altered. Opaque compression is a black box with unknown fidelity. Verbatim compaction (Morph Compact) produces zero hallucinations: every surviving token is identical to the original input. Observation masking also avoids hallucination since it removes content rather than rewriting it.

What is the ACON framework?

ACON (Adaptive Context Optimization for Agents) targets tool call observations, which constitute 60-80% of context. It learns compression guidelines by analyzing why compressed versions fail, achieving 26-54% token reduction with 95%+ accuracy on AppWorld, OfficeBench, and Multi-objective QA. It also distills learned compressors into smaller models for efficient deployment.

How does Claude Code auto-compact work?

Claude Code runs automatic summarization when conversations approach context limits. It summarizes history into structured sections (completed work, current state, pending tasks). You can customize what it preserves with /compact Focus on code samples and API usage or add instructions in your CLAUDE.md file. Claude Code also supports subagent isolation for delegating verbose operations to separate context windows.

What is LLMlingua and how does it compress prompts?

LLMlingua (Microsoft, EMNLP 2023) compresses prompts at the token level using information entropy, achieving up to 20x compression. LLMlingua-2 (ACL 2024) uses bidirectional Transformer encoders (XLM-RoBERTa) for token classification, running 3-6x faster with 2-5x compression. Both are task-agnostic, work with any LLM, and have publicly available code.

What is the best context compression method for coding agents?

Prevention beats compression. Morph FlashCompact combines WarpGrep (semantic search returning only relevant snippets, 0.73 F1 in 3.8 steps), Fast Apply (compact diffs at 10,500 tok/s), and Morph Compact (verbatim deletion at 3,300+ tok/s). This extends context life by 3-4x. When compression is needed, verbatim compaction preserves exact code, file paths, and error messages without hallucination risk. Morph achieves state-of-the-art results on SWE-Bench Pro with this approach.

Related Resources

Extend Context Life by 3-4x

Morph FlashCompact prevents context waste at the source. WarpGrep returns only relevant code (0.73 F1). Fast Apply uses compact diffs (10,500 tok/s). Morph Compact cleans up the rest (3,300+ tok/s, zero hallucination). State-of-the-art on SWE-Bench Pro.