LLM costs scale linearly with input tokens. A 50% reduction in input tokens is a 50% reduction in input cost. But cost is the boring reason to compress prompts. The interesting reason: LLMs produce worse output as inputs get longer, even when the context window isn't full. Compression removes noise that causes models to miss signal. Fewer tokens, lower cost, better results.
What Is Prompt Compression
Prompt compression encompasses any technique that reduces the token count of an LLM input while retaining the semantic content needed for correct output. The term covers two distinct problems:
- Input compression: reducing a prompt before sending it to the model. Removing boilerplate from retrieved documents, pruning low-information tokens, or extracting only task-relevant sentences.
- Context compression: reducing accumulated context during long-running sessions. As coding agents read files, search codebases, and debug errors, their context windows fill with content that was useful earlier but is no longer relevant. Claude Code auto-compacts at 95% capacity. Codex runs server-side compaction after every turn. Cursor truncates old history.
Both forms solve the same underlying constraint: LLMs charge per token, and their performance degrades as input length increases. Sending fewer, higher-signal tokens costs less and produces better results.
Prompt compression vs. prompt engineering
Prompt engineering optimizes how you phrase a request. Prompt compression optimizes how much context accompanies that request. They're complementary. A well-engineered prompt with 50K tokens of noisy context will underperform a mediocre prompt with 10K tokens of relevant context.
Why Compression Improves Output Quality (Not Just Cost)
The standard framing is "compress to save money." The more important framing: compress to get better output.
Liu et al. (2023) demonstrated the "lost in the middle" effect: LLMs access information well at the beginning and end of a prompt but degrade significantly for content positioned in the middle. On multi-document QA and key-value retrieval, performance follows a U-shaped curve. The model doesn't just miss middle content by a little. It misses it substantially, even when the context window is far from full.
This has a direct implication for compression. If you have 100K tokens of context and only 20K tokens are relevant to the current task, sending all 100K doesn't just cost 5x more. It degrades accuracy because the model must reason through 80K tokens of noise. Relevant content in positions 30K-70K is particularly at risk of being ignored.
LongLLMLingua directly exploits this: it reorders documents to place high-relevance content at the beginning and end of the prompt, then compresses the rest. On NaturalQuestions, this combination boosts performance by 21.4% while using 4x fewer tokens. The compression itself improves accuracy.
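The reordering step can be sketched independently of the compressor. Given relevance scores, alternate documents between the front and back of the prompt so the weakest content lands in the middle. This is a minimal sketch of the idea, not LongLLMLingua's actual API; the function name and scoring inputs are assumptions:

```python
def reorder_for_position_bias(scored_docs: list[tuple[str, float]]) -> list[str]:
    """Place the most relevant documents at the beginning and end of the
    prompt, pushing low-relevance content into the middle, where
    lost-in-the-middle research shows models attend least."""
    ranked = sorted(scored_docs, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, (doc, _) in enumerate(ranked):
        # Alternate placement: best doc opens the prompt, second-best closes it
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]  # top documents occupy both high-attention ends
```

The compressor then prunes whatever remains in the low-attention middle.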
For coding agents, context rot compounds the problem. As agents accumulate tool outputs, file contents, and error messages across dozens of turns, the signal-to-noise ratio degrades. Early investigation steps become irrelevant once the bug is found. File contents from abandoned approaches clutter the context. Compression isn't cleanup. It's a performance optimization.
The Code vs. Prose Problem
Most prompt compression research evaluates on natural language: QA, summarization, reasoning benchmarks. Code is a fundamentally different compression target, and this asymmetry explains why general-purpose methods underperform on agent workloads.
| Property | Natural Language | Code |
|---|---|---|
| Token removal tolerance | High (redundant grammar) | Low (syntax-breaking) |
| Information density | Variable (boilerplate common) | High (every token meaningful) |
| Structural integrity | Flexible (reorderable) | Rigid (order-dependent) |
| Error tolerance | Graceful degradation | Binary (works or doesn't) |
| Critical details | Distributable | Exact (paths, line numbers, types) |
Consider a perplexity-based pruner processing this code context:
What perplexity-based pruning does to code
Original:
File: src/middleware/auth.ts:47
TypeError: Cannot read property 'jwt' of undefined
After token-level pruning (high-perplexity tokens kept):
File: src/middleware/:47
TypeError: Cannot read property '' of undefined
The pruner removed "auth.ts" (predictable given the path pattern) and "jwt"
(predictable given the error type). Both are the exact details the agent
needs to fix the bug.
The token "auth.ts" has low perplexity (predictable from the path pattern "src/middleware/"). The token "jwt" has low perplexity (predictable given "Cannot read property" in an auth context). A perplexity scorer correctly identifies these as low-information tokens. But for the downstream task of fixing the bug, these are the only tokens that matter.
This is why Morph FlashCompact uses verbatim compaction for code: it operates on semantic units (entire statements, blocks, files) rather than individual tokens. A file path is either present exactly as it appeared or absent entirely. No corrupted paths. No approximate error messages.
Eight Prompt Compression Techniques
Each technique makes different tradeoffs between compression ratio, speed, accuracy preservation, and hallucination risk. No single method is best for all use cases.
1. LLMLingua: Perplexity-Based Token Pruning
LLMLingua (Microsoft Research, EMNLP 2023, 5.9K GitHub stars) uses a small language model to score each token by information content and removes those with the lowest perplexity scores. Three components work together:
- Budget controller: allocates compression capacity across different prompt segments (instructions, demonstrations, questions), preserving more tokens in high-sensitivity regions.
- Token-level iterative compression: models interdependencies between compressed segments. Tokens that seem redundant in isolation may be critical when surrounding context has already been removed.
- Distribution alignment: instruction-tunes the small scoring model (GPT-2 or LLaMA-7B) to better match the target LLM's token distribution, improving compression decisions.
On reasoning benchmarks, the results are strong: GPT-4 recovered all 9 steps from compressed chain-of-thought prompts, producing answers nearly identical to those from uncompressed prompts. Compressed prompts also transfer across models: prompts compressed with a GPT-2-small scorer reached 76.27 on GSM8K (baseline 74.9), and the same compressed prompts took Claude v1.3 to 82.61 (baseline 81.8).
LLMLingua limitation: structured content
Token-level pruning can break JSON, code blocks, file paths, and any content where individual tokens carry structural meaning. The approach works best on natural language paragraphs, few-shot examples, and reasoning chains. For agent workloads that process code, it requires careful domain segmentation.
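One way to apply that domain segmentation, sketched here as an assumption rather than anything LLMLingua ships: split the prompt on fenced code blocks and run the pruner only on the prose segments, passing code through verbatim.

```python
import re

FENCE = "`" * 3  # markdown code fence delimiter

def compress_prose_only(prompt: str, compress_fn) -> str:
    """Run a token-level compressor on prose segments while passing
    fenced code blocks through untouched, so pruning can't break syntax."""
    # Capturing group keeps the fenced blocks in the split output
    pattern = f"({re.escape(FENCE)}.*?{re.escape(FENCE)})"
    parts = re.split(pattern, prompt, flags=re.DOTALL)
    return "".join(
        part if part.startswith(FENCE) else compress_fn(part)
        for part in parts
    )
```

Here `compress_fn` would wrap an LLMLingua call; any segmentation scheme (fences, file markers, JSON boundaries) works as long as structured spans are detected before pruning.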
2. LLMLingua-2: Token Classification with BERT
LLMLingua-2 (ACL 2024) reformulated compression from a perplexity calculation to a token classification problem. Two architectural changes:
- Bidirectional context: uses a Transformer encoder (XLM-RoBERTa-large or mBERT) instead of a unidirectional LM, capturing context from both directions when deciding which tokens to keep.
- Data distillation: trains the classifier on compression decisions distilled from GPT-4, learning directly what tokens matter rather than inferring it from perplexity.
The result: 3-6x faster than LLMLingua-1 with comparable compression quality, and 1.6-2.9x end-to-end latency reduction at 2-5x compression ratios. Tested on MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH with robust generalization across different target LLMs.
3. Selective Context: Sentence-Level Information Filtering
Selective Context (ACL 2023) computes self-information scores for each sentence and removes those below a threshold. Coarser than token-level methods, but it preserves sentence boundaries, so it can't break structured content the way token-level pruning can.
Evaluated on arXiv papers (summarization), news articles (QA), and conversation transcripts (response generation). The compression-quality tradeoff is gentle: 50% cost reduction with near-zero quality degradation. The main limitation is lower maximum compression ratios since you can only remove or keep entire sentences.
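A toy version of the idea, with a unigram frequency model standing in for the real self-information scorer (Selective Context uses a causal LM; everything below is an illustrative simplification):

```python
import math
from collections import Counter

def keep_informative_sentences(sentences: list[str], keep_ratio: float = 0.5) -> list[str]:
    """Score each sentence by the average self-information (-log p) of its
    words under a unigram model, then keep the top fraction. Sentence
    boundaries are preserved, so structured content can't be corrupted."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())

    def avg_self_info(sentence: str) -> float:
        words = sentence.lower().split()
        return sum(-math.log(counts[w] / total) for w in words) / len(words)

    n_keep = max(1, round(len(sentences) * keep_ratio))
    kept = set(sorted(sentences, key=avg_self_info, reverse=True)[:n_keep])
    return [s for s in sentences if s in kept]  # original order preserved
```

Repetitive sentences score low (their words are common in the context) and get dropped; rare, information-dense sentences survive.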
4. RECOMP: Trained Compressors for RAG
RECOMP trains dedicated compressors specifically for retrieval-augmented language models. Two compressor types serve different needs:
- Extractive compressor: selects task-relevant sentences from retrieved documents. No rewriting, so content integrity is preserved.
- Abstractive compressor: generates summaries synthesized across multiple documents. Higher compression but introduces rewriting risk.
A distinctive feature: compressors can return an empty string when retrieved documents are irrelevant. This selective augmentation prevents the model from being distracted by unhelpful context. At compression rates as low as 6% of original document length, RECOMP achieves minimal performance loss on language modeling and open-domain QA, significantly outperforming off-the-shelf summarization models.
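The empty-string behavior translates into a simple gate at prompt-assembly time. A sketch, where the prompt format is an assumption rather than RECOMP's own:

```python
def build_prompt(question: str, compressed_docs: str) -> str:
    """RECOMP-style selective augmentation: if the compressor judged the
    retrieval irrelevant and returned an empty string, send the question
    alone instead of distracting the model with unhelpful context."""
    if not compressed_docs.strip():
        return question  # skip augmentation entirely
    return f"Context:\n{compressed_docs}\n\nQuestion: {question}"
```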
5. 500xCompressor: Extreme Compression into Special Tokens
500xCompressor pushes to the theoretical extreme: compressing entire contexts into as few as one special token. It adds approximately 0.3% additional parameters to existing LLMs without requiring base model fine-tuning.
Compression ratios range from 6x to 480x, with models retaining 62-73% of their capabilities. A notable finding: KV cache values significantly outperform embeddings for preserving information at high compression ratios, suggesting that the key-value representation captures more recoverable structure than dense embeddings.
The tradeoff is clear: at 480x compression, you retain only ~63% of capability. This is viable for background context or reference material where approximate understanding suffices, not for agent workloads where exact details matter.
6. AutoCompressors: Learned Summary Vectors
AutoCompressors (EMNLP 2023) fine-tune pretrained models (OPT, Llama-2) to compress long contexts into compact summary vectors that function as soft prompts. Trained on sequences up to 30,720 tokens using an unsupervised objective, the summary vectors substitute for plain-text demonstrations.
The approach improves perplexity on long contexts and demonstrates benefits across in-context learning, retrieval-augmented language modeling, and passage re-ranking. Unlike discrete compression (which produces readable text), AutoCompressors produce continuous vectors, making them useful as a component in larger systems but not directly inspectable.
7. Context Caching (Anthropic, Google)
Context caching isn't compression per se, but it reduces the cost of repeated prefixes. Anthropic's prompt caching charges 0.1x the base input price on cache reads (90% reduction). Cache writes cost 1.25x (5-minute TTL) or 2x (1-hour TTL).
| Model | Base Input | Cache Write (5m) | Cache Read | Savings |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $18.75 | $1.50 | 90% |
| Claude Sonnet 4 | $3.00 | $3.75 | $0.30 | 90% |
| Claude Haiku 3.5 | $0.80 | $1.00 | $0.08 | 90% |
Minimum cacheable lengths: 1,024-4,096 tokens depending on model. Up to 4 cache breakpoints per request. Effective for system prompts, reference documentation, and few-shot examples. Does not help with dynamic context that changes every request, which is most of what agents deal with.
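With Anthropic's API, caching is opt-in via a cache_control marker on the last block of the stable prefix; everything up to and including that block is cached. A sketch of the request payload (the system prompt text is illustrative):

```python
# Stable prefix: system prompt + reference docs, identical on every request.
# Marking the final stable block with cache_control caches the whole prefix;
# subsequent requests within the TTL read it at 0.1x the base input price.
reference_docs = "<several thousand tokens of API documentation>"

system_blocks = [
    {"type": "text", "text": "You are a coding assistant."},
    {
        "type": "text",
        "text": reference_docs,
        "cache_control": {"type": "ephemeral"},  # 5-minute TTL breakpoint
    },
]
# Passed as system=system_blocks to anthropic.Anthropic().messages.create(...)
```

Dynamic per-request content goes after the breakpoint, in `messages`, so it never invalidates the cached prefix.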
8. Verbatim Compaction (Morph Compact)
Morph Compact takes a fundamentally different approach: delete low-signal content while keeping every surviving token identical to the input. Nothing is rewritten. Nothing is paraphrased. The output is a strict subset of the input tokens.
This eliminates the hallucination risk inherent in summarization and the syntax corruption risk of token-level pruning. When an agent needs a file path, error code, or code snippet, verbatim compaction guarantees it's either present exactly as it appeared in the original or absent entirely.
| Technique | Compression | Speed | Accuracy Risk | Best For |
|---|---|---|---|---|
| LLMLingua | 2-20x | 1.7-5.7x speedup | Breaks code/JSON | Reasoning, NL docs |
| LLMLingua-2 | 2-5x | 3-6x faster than v1 | Same structural risk | Speed-sensitive NL |
| Selective Context | ~2x | Fast (sentence-level) | Low (boundary-preserving) | Boilerplate removal |
| RECOMP | Up to 17x | Training required | Extractive: low; Abstractive: medium | RAG pipelines |
| 500xCompressor | 6-480x | Fast (small params) | 62-73% capability retained | Background reference |
| AutoCompressors | Variable | Inference cost | Non-inspectable output | Soft prompt substitution |
| Context caching | N/A (cost only) | Instant | None (no modification) | Repeated prefixes |
| Morph Compact | 50-70% | 3,300+ tok/s | Zero (verbatim output) | Agent context, code |
Benchmarks: Compression vs. Task Performance
The critical question isn't "how much can you compress?" It's "how much can you compress before task performance degrades?" Each technique has a different compression-performance curve.
LLMLingua on Reasoning Tasks
LLMLingua's strongest results are on reasoning benchmarks where the prompt contains chain-of-thought examples. At 20x compression on GSM8K and BBH, performance drops only 1.5 points. GPT-4 can recover all 9 reasoning steps from compressed CoT prompts. The key insight: reasoning chains contain massive redundancy in natural language connectives and transitions. The mathematical content (numbers, operations) carries high perplexity and is preserved.
LongLLMLingua on Long-Context QA
| Benchmark | Compression | Performance Change | Key Finding |
|---|---|---|---|
| NaturalQuestions | 4x fewer tokens | +21.4% | Compression + reordering improves accuracy |
| LooGLE | 94% cost reduction | Maintained | Extreme cost savings on long docs |
| End-to-end latency | 2-6x compression | 1.4-2.6x faster | Latency scales sub-linearly |
Factory Context Compaction (36K Real Coding Messages)
The most practically relevant benchmark comes from Factory, which tested three compression approaches on 36,000 messages from real Claude Code coding sessions. This is the only large-scale evaluation on actual agent workloads rather than academic benchmarks.
| Method | Overall Score | Compression | Key Weakness |
|---|---|---|---|
| Factory structured summaries | 3.70/5 | 98.6% | Custom implementation, not public |
| Anthropic summaries | 3.44/5 | 98.7% | Loses file paths/error specifics |
| OpenAI opaque | 3.35/5 | 99.3% | Lowest accuracy on exact details |
The critical finding: all three methods achieved 98%+ compression ratios. The differentiator wasn't compression ratio. It was accuracy on specific details. File paths, line numbers, error messages, stack traces. These are exactly the tokens coding agents need to function, and summarization-based approaches systematically degraded them.
Compression ratio vs. accuracy
A 99% compression ratio that loses a critical file path is worse than a 60% compression ratio that preserves it exactly. For coding agents, accuracy-per-surviving-token matters more than raw compression ratio. This is the core argument for verbatim compaction: lower compression ratio, but guaranteed token-level accuracy on everything that survives.
Selective Context on Three Domains
Selective Context's results demonstrate the gentlest compression-quality curve: 50% context reduction with only 0.023 BERTscore drop and 0.038 faithfulness drop across arXiv summarization, news QA, and conversation response generation. The sentence-level granularity means it can't achieve the extreme ratios of token-level methods, but it also can't corrupt structured content.
Cost Analysis at Scale
Input tokens dominate agent costs because agents consume far more context than they produce. A coding agent might read 50 files, execute 20 searches, and process 30 error messages in a single session. Most of that input is consumed once and never referenced again.
| Model | Input $/M | No Compress | 30% Compress | 50% Compress | 70% Compress |
|---|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $22,500 | $15,750 | $11,250 | $6,750 |
| Claude Sonnet 4 | $3.00 | $4,500 | $3,150 | $2,250 | $1,350 |
| GPT-4.1 | $2.00 | $3,000 | $2,100 | $1,500 | $900 |
| Gemini 2.5 Pro | $1.25 | $1,875 | $1,313 | $938 | $563 |
These figures assume a workload of roughly 1.5B input tokens per month (100 sessions per day at 500K tokens each) on a single model. The savings amplify in three ways:
- Fewer retries: compressed context means fewer hallucinations and wrong turns, reducing the total number of turns per task.
- Faster completion: less input means lower latency per request, which compounds across multi-turn sessions.
- Extended session life: agents can work longer before hitting context limits, completing more complex tasks without starting over.
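The table's arithmetic is easy to reproduce; the workload (100 sessions/day at 500K input tokens) is the assumption:

```python
MONTHLY_TOKENS = 100 * 30 * 500_000  # 100 sessions/day x 500K input tokens, 30 days

def monthly_input_cost(price_per_million: float, compression: float = 0.0) -> float:
    """Monthly input cost after removing `compression` fraction of input tokens."""
    return MONTHLY_TOKENS / 1_000_000 * price_per_million * (1 - compression)

# e.g. Claude Opus 4 at $15/M with 50% compression: monthly_input_cost(15.00, 0.5) -> 11250.0
```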
Why Prevention Beats Compression
Every compression method discussed above operates after context has already been consumed. The deeper insight: the best compression is the compression you never need to run.
Cognition (Devin) measured that agents spend 60% of their time searching for code. Each search dumps results into context. Each file read adds the entire file. Each code edit echoes the full file back. The context fills not because agents need all that information, but because their tools are blunt instruments that return far more than necessary.
WarpGrep: Targeted Retrieval
Returns only relevant code snippets instead of entire files. 0.73 F1 in 3.8 steps. An agent searching for a function definition gets the 10-line function, not the 500-line file it lives in. Context consumption drops by 90%+ per search operation.
Fast Apply: Compact Diffs
Applies code changes as compact diffs at 10,500 tok/s instead of echoing the entire modified file back into context. A 3-line change to a 200-line file consumes ~10 tokens of context instead of ~1,000.
Morph Compact: Cleanup
Verbatim compaction for whatever noise remains. 50-70% compression at 3,300+ tok/s with zero hallucination risk. Operates on semantic units, not individual tokens, so code structure is preserved.
The combination extends effective context life by 3-4x. An agent that would hit compaction at turn 15 now hits it at turn 45-60. This means compaction fires 3-4x less often, and when it does fire, the context is cleaner because less noise accumulated in the first place.
FlashCompact: the three-layer stack
Morph FlashCompact combines all three layers: WarpGrep for targeted retrieval (prevent waste), Fast Apply for compact diffs (prevent echo), and Morph Compact for verbatim compaction (clean up the rest). State-of-the-art on SWE-Bench Pro.
Implementation Guide
Morph Compact exposes an OpenAI-compatible API. Integration requires minimal code changes regardless of your framework.
Basic prompt compression with Morph Compact
from openai import OpenAI

client = OpenAI(
    base_url="https://api.morphllm.com/v1",
    api_key="your-morph-api-key"
)

# Compress a long context before sending to your main model
long_context = open("conversation_history.txt").read()

response = client.chat.completions.create(
    model="morph-compact",
    messages=[
        {"role": "user", "content": long_context}
    ]
)

compressed = response.choices[0].message.content
# compressed is a strict subset of the original tokens:
# no rewriting, no hallucination, just the high-signal content

# Now use compressed context with your main model, via a separate
# client pointed at your provider's OpenAI-compatible endpoint
main_client = OpenAI()
main_response = main_client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": compressed},
        {"role": "user", "content": "Fix the auth bug described above"}
    ]
)

Agent context compression pipeline
# Compress accumulated agent context before each reasoning step
def compress_context(messages: list[dict], threshold: int = 5) -> list[dict]:
    """Compress old messages, keep recent ones intact.

    The key insight: compress early, before context fills up.
    Don't wait for 95% capacity like Claude Code's default.
    Compress at 60-70% to maintain higher signal density.
    """
    if len(messages) <= threshold:
        return messages  # nothing to compress yet

    # Compress older messages, keep last N untouched
    old_messages = messages[:-threshold]
    recent_messages = messages[-threshold:]
    old_text = "\n".join(m["content"] for m in old_messages if m.get("content"))

    response = client.chat.completions.create(
        model="morph-compact",
        messages=[{"role": "user", "content": old_text}]
    )

    compressed_msg = {
        "role": "user",
        "content": f"[Compressed context]\n{response.choices[0].message.content}"
    }
    return [compressed_msg] + recent_messages

LLMLingua integration (for natural language contexts)
# pip install llmlingua
from llmlingua import PromptCompressor

# Initialize with a small scoring model
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # Use the faster BERT-based version
)

# Compress a prompt
original_prompt = open("prompt.txt").read()  # your long natural-language prompt
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.5,  # Target 50% compression
    force_tokens=["def ", "class ", "import ", "return "],  # Preserve code markers
    drop_consecutive=True,
)

print("Original:", compressed['origin_tokens'], "tokens")
print("Compressed:", compressed['compressed_tokens'], "tokens")
print("Ratio:", compressed['ratio'])

# Note: LLMLingua works well for natural language contexts.
# For code-heavy agent contexts, use Morph Compact instead
# to avoid syntax corruption from token-level pruning.

Prompt Compression for RAG Pipelines
RAG is itself a form of prompt compression: instead of sending full documents, you retrieve and send only relevant chunks. But retrieved chunks still contain noise. Combining RAG with a compression layer gives you two levels of filtering.
RAG + Morph Compact pipeline
from openai import OpenAI
from langchain_core.documents import Document

morph = OpenAI(
    base_url="https://api.morphllm.com/v1",
    api_key="your-morph-api-key"
)

def compact_documents(docs: list[Document]) -> list[Document]:
    """Compress retrieved documents with verbatim compaction.

    Two-stage compression:
    1. RAG retriever selects relevant chunks (coarse filter)
    2. Morph Compact removes noise within chunks (fine filter)

    Every token in the output existed in the original document.
    """
    compressed = []
    for doc in docs:
        response = morph.chat.completions.create(
            model="morph-compact",
            messages=[{"role": "user", "content": doc.page_content}]
        )
        compressed.append(Document(
            page_content=response.choices[0].message.content,
            metadata=doc.metadata
        ))
    return compressed

# Integration with LangChain:
# 1. Retrieve documents with your existing retriever
# 2. Compact them before sending to the reasoning model
# 3. Every token in the output existed in the original: zero hallucination

The combination addresses RAG's two weaknesses. First, retrieved chunks often contain irrelevant paragraphs alongside relevant ones. Compression removes the noise within each chunk. Second, when multiple chunks are retrieved, there's often redundancy between them. Compacting the combined context eliminates the overlap.
RECOMP vs. post-retrieval compression
RECOMP trains compressors end-to-end for retrieval tasks, learning to compress documents specifically for downstream QA or language modeling. This can outperform generic post-retrieval compression because the compressor learns what matters for the task. The tradeoff: RECOMP requires training data and a fixed task format. For general-purpose agent workloads where tasks vary, post-retrieval verbatim compaction is more flexible.
Frequently Asked Questions
What is prompt compression?
Prompt compression reduces the number of tokens in an LLM prompt while preserving meaning and task accuracy. Techniques range from token-level pruning (LLMLingua scores tokens by perplexity and prunes low-information ones, achieving up to 20x compression), to sentence-level filtering (Selective Context), to retrieval-based compression (RECOMP), to verbatim compaction (Morph Compact keeps every surviving token identical to the original).
How does LLMLingua prompt compression work?
LLMLingua (Microsoft Research, EMNLP 2023) uses a small language model (GPT-2 or LLaMA-7B) to compute perplexity for each token. Tokens with low perplexity (highly predictable from surrounding context) are removed. A budget controller allocates compression across prompt segments, and distribution alignment tunes the scorer to match the target LLM. LLMLingua-2 (ACL 2024) reformulated this as BERT-based token classification, running 3-6x faster.
Does prompt compression affect LLM output quality?
Often compression improves output quality. Lost-in-the-middle research shows LLMs degrade on content positioned in the middle of long inputs. Removing low-signal tokens reduces noise, helping the model focus on relevant information. LongLLMLingua demonstrated a 21.4% accuracy improvement on NaturalQuestions with 4x fewer tokens. The risk depends on technique: token-level pruning can break code syntax; summarization can hallucinate; verbatim compaction preserves exact tokens with zero rewriting risk.
What is the difference between prompt compression and summarization?
Summarization rewrites content, risking hallucinated details or lost specifics. Prompt compression techniques like verbatim compaction remove content without rewriting. Every surviving token is identical to the original. Factory's benchmarks on 36K real coding messages found that accuracy on specific details (file paths, line numbers, error messages) was the biggest differentiator, not raw compression ratio. See our compaction vs. summarization deep dive.
What is the best prompt compression technique for code?
Code is structurally fragile. Token-level pruning (LLMLingua) can remove tokens that break file paths, JSON syntax, or function signatures. Summarization loses exact paths and line numbers. Verbatim compaction operates on semantic units (entire statements, blocks), so a file path is either present exactly or absent entirely. For coding agents, the best strategy is prevention: WarpGrep returns only relevant code snippets instead of entire files, reducing what needs compression.
How do I use prompt compression with LangChain?
LangChain provides ContextualCompressionRetriever that wraps any retriever with a compression layer. You can integrate LLMLingua via pip (pip install llmlingua), use LLM-based extractors, or call the Morph Compact API as a post-retrieval compression step. For code-heavy workloads, prefer Morph Compact over LLMLingua to avoid syntax corruption.
How much money does prompt compression save?
Savings scale linearly. At 50% compression on Claude Opus ($15/M input tokens), you save $7.50 per million tokens. An agent team running 100 sessions/day at 500K tokens each saves $11,250/month ($135K/year). Even on cheaper models like GPT-4.1 ($2/M), the same workload saves $1,500/month. Savings compound because compressed context also reduces retries and corrections.
What is verbatim compaction?
Verbatim compaction deletes low-signal content while keeping every surviving token identical to the input. The output is a strict subset of the original tokens. Unlike summarization (which rewrites and can hallucinate), unlike token-level pruning (which can break structured content), verbatim compaction guarantees that file paths, error codes, and code snippets are either preserved exactly or removed entirely. Morph Compact achieves 50-70% compression at 3,300+ tok/s with this approach.
Related Resources
Compress Prompts Without Losing Accuracy
Morph FlashCompact combines targeted retrieval (WarpGrep), compact diffs (Fast Apply), and verbatim compaction (Morph Compact). Extend context life by 3-4x. Zero hallucination risk. State-of-the-art on SWE-Bench Pro.