Prompt Compression: The Complete Technical Guide to Reducing LLM Costs Without Losing Performance

Prompt compression reduces LLM input tokens while preserving task accuracy. This guide covers 8 techniques (LLMLingua, Selective Context, RECOMP, verbatim compaction), real benchmarks, compression-vs-performance curves, and why code requires different compression than prose.

March 13, 2026 · 6 min read

LLM costs scale linearly with input tokens. A 50% reduction in input tokens is a 50% reduction in input cost. But cost is the boring reason to compress prompts. The interesting reason: LLMs produce worse output as inputs get longer, even when the context window isn't full. Compression removes noise that causes models to miss signal. Fewer tokens, lower cost, better results.

20x
Max compression (LLMLingua on reasoning)
3,300+
Tokens/sec (Morph Compact)
0%
Hallucination risk (verbatim compaction)
$135K/yr
Savings at 100 agent sessions/day

What Is Prompt Compression

Prompt compression encompasses any technique that reduces the token count of an LLM input while retaining the semantic content needed for correct output. The term covers two distinct problems:

  • Input compression: reducing a prompt before sending it to the model. Removing boilerplate from retrieved documents, pruning low-information tokens, or extracting only task-relevant sentences.
  • Context compression: reducing accumulated context during long-running sessions. As coding agents read files, search codebases, and debug errors, their context windows fill with content that was useful earlier but is no longer relevant. Claude Code auto-compacts at 95% capacity. Codex runs server-side compaction after every turn. Cursor truncates old history.

Both forms solve the same underlying constraint: LLMs charge per token, and their performance degrades as input length increases. Sending fewer, higher-signal tokens costs less and produces better results.

Prompt compression vs. prompt engineering

Prompt engineering optimizes how you phrase a request. Prompt compression optimizes how much context accompanies that request. They're complementary. A well-engineered prompt with 50K tokens of noisy context will underperform a mediocre prompt with 10K tokens of relevant context.

Why Compression Improves Output Quality (Not Just Cost)

The standard framing is "compress to save money." The more important framing: compress to get better output.

Liu et al. (2023) demonstrated the "lost in the middle" effect: LLMs access information well at the beginning and end of a prompt but degrade significantly for content positioned in the middle. On multi-document QA and key-value retrieval, performance follows a U-shaped curve. The model doesn't just miss middle content by a little. It misses it substantially, even when the context window is far from full.

This has a direct implication for compression. If you have 100K tokens of context and only 20K tokens are relevant to the current task, sending all 100K doesn't just cost 5x more. It degrades accuracy because the model must reason through 80K tokens of noise. Relevant content in positions 30K-70K is particularly at risk of being ignored.

U-shaped
Performance curve on long inputs
50%
Context cost reduction (Selective Context)
21.4%
Performance boost with 4x fewer tokens (LongLLMLingua)

LongLLMLingua directly exploits this: it reorders documents to place high-relevance content at the beginning and end of the prompt, then compresses the rest. On NaturalQuestions, this combination boosts performance by 21.4% while using 4x fewer tokens. The compression itself improves accuracy.

For coding agents, context rot compounds the problem. As agents accumulate tool outputs, file contents, and error messages across dozens of turns, the signal-to-noise ratio degrades. Early investigation steps become irrelevant once the bug is found. File contents from abandoned approaches clutter the context. Compression isn't cleanup. It's a performance optimization.

The Code vs. Prose Problem

Most prompt compression research evaluates on natural language: QA, summarization, reasoning benchmarks. Code is a fundamentally different compression target, and this asymmetry explains why general-purpose methods underperform on agent workloads.

| Property | Natural Language | Code |
|---|---|---|
| Token removal tolerance | High (redundant grammar) | Low (syntax-breaking) |
| Information density | Variable (boilerplate common) | High (every token meaningful) |
| Structural integrity | Flexible (reorderable) | Rigid (order-dependent) |
| Error tolerance | Graceful degradation | Binary (works or doesn't) |
| Critical details | Distributable | Exact (paths, line numbers, types) |

Consider a perplexity-based pruner processing this code context:

What perplexity-based pruning does to code

Original:
  File: src/middleware/auth.ts:47
  TypeError: Cannot read property 'jwt' of undefined

After token-level pruning (high-perplexity tokens kept):
  File: src/middleware/:47
  TypeError: Cannot read property '' of undefined

The pruner removed "auth.ts" (predictable given the path pattern) and "jwt"
(predictable given the error type). Both are the exact details the agent
needs to fix the bug.

The token "auth.ts" has low perplexity (predictable from the path pattern "src/middleware/"). The token "jwt" has low perplexity (predictable given "Cannot read property" in an auth context). A perplexity scorer correctly identifies these as low-information tokens. But for the downstream task of fixing the bug, these are the only tokens that matter.

This is why Morph FlashCompact uses verbatim compaction for code: it operates on semantic units (entire statements, blocks, files) rather than individual tokens. A file path is either present exactly as it appeared or absent entirely. No corrupted paths. No approximate error messages.

Eight Prompt Compression Techniques

Each technique makes different tradeoffs between compression ratio, speed, accuracy preservation, and hallucination risk. No single method is best for all use cases.

1. LLMLingua: Perplexity-Based Token Pruning

LLMLingua (Microsoft Research, EMNLP 2023, 5.9K GitHub stars) uses a small language model to score each token by information content and removes those with the lowest perplexity scores. Three components work together:

  • Budget controller: allocates compression capacity across different prompt segments (instructions, demonstrations, questions), preserving more tokens in high-sensitivity regions.
  • Token-level iterative compression: models interdependencies between compressed segments. Tokens that seem redundant in isolation may be critical when surrounding context has already been removed.
  • Distribution alignment: instruction-tunes the small scoring model (GPT-2 or LLaMA-7B) to better match the target LLM's token distribution, improving compression decisions.
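To make the token-scoring idea concrete, here is a toy sketch that ranks tokens by unigram self-information computed over the prompt itself. LLMLingua actually uses a small causal LM's conditional perplexity plus the budget controller described above; the function below is illustrative only, not LLMLingua's API:

```python
import math
from collections import Counter

def prune_tokens(text: str, keep_ratio: float = 0.5) -> str:
    """Toy perplexity-style pruning: score each whitespace token by
    unigram self-information (-log p) within the prompt itself, then
    drop the most predictable tokens. A real implementation scores
    tokens with a small LM's conditional perplexity instead.
    """
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens)
    # Frequent (predictable) tokens score low and get pruned first
    scores = [-math.log(counts[t] / total) for t in tokens]
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Keep the n_keep highest-scoring tokens, preserving original order
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in keep_idx)
```

Even this crude version shows the failure mode discussed above: tokens that repeat often (like a path prefix) are scored as low-information and removed, regardless of whether the downstream task needs them.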
20x
Max compression (GSM8K/BBH)
1.5 pt
Performance loss at 20x
1.7-5.7x
End-to-end speedup
~7B
Scoring model params

On reasoning benchmarks, the results are strong: GPT-4 recovered all 9 steps from compressed chain-of-thought prompts, producing answers nearly identical to those from uncompressed prompts. Cross-model transfer also works. GPT-2-small scored 76.27 on GSM8K (baseline 74.9), and Claude v1.3 scored 82.61 (baseline 81.8).

LLMLingua limitation: structured content

Token-level pruning can break JSON, code blocks, file paths, and any content where individual tokens carry structural meaning. The approach works best on natural language paragraphs, few-shot examples, and reasoning chains. For agent workloads that process code, it requires careful domain segmentation.

2. LLMLingua-2: Token Classification with BERT

LLMLingua-2 (ACL 2024) reformulated compression from a perplexity calculation to a token classification problem. Two architectural changes:

  • Bidirectional context: uses a Transformer encoder (XLM-RoBERTa-large or mBERT) instead of a unidirectional LM, capturing context from both directions when deciding which tokens to keep.
  • Data distillation: trains the classifier on compression decisions distilled from GPT-4, learning directly what tokens matter rather than inferring it from perplexity.

The result: 3-6x faster than LLMLingua-1 with comparable compression quality, and 1.6-2.9x end-to-end latency reduction at 2-5x compression ratios. Tested on MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH with robust generalization across different target LLMs.

3. Selective Context: Sentence-Level Information Filtering

Selective Context (ACL 2023) computes self-information scores for each sentence and removes those below a threshold. Coarser than token-level methods, but it preserves sentence boundaries, so it can't break structured content the way token-level pruning can.

50%
Context cost reduction
0.023
BERTscore drop
0.038
Faithfulness drop

Evaluated on arXiv papers (summarization), news articles (QA), and conversation transcripts (response generation). The compression-quality tradeoff is gentle: 50% cost reduction with near-zero quality degradation. The main limitation is lower maximum compression ratios since you can only remove or keep entire sentences.
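A minimal sketch of the sentence-level idea, scoring each sentence by summed word self-information and keeping the highest-signal fraction. Selective Context computes self-information with a language model; this version substitutes unigram statistics over the document, so it is a rough approximation, not the paper's method:

```python
import math
import re
from collections import Counter

def filter_sentences(text: str, keep_ratio: float = 0.5) -> str:
    """Toy Selective Context sketch: score each sentence by total
    word self-information (unigram -log p over the document), then
    drop the lowest-scoring sentences. Because only whole sentences
    are removed, structured content inside a kept sentence is intact.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = text.split()
    counts = Counter(words)
    total = len(words)

    def self_info(sentence: str) -> float:
        return sum(-math.log(counts[w] / total) for w in sentence.split())

    n_keep = max(1, int(len(sentences) * keep_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: self_info(sentences[i]),
                    reverse=True)
    keep = sorted(ranked[:n_keep])  # preserve original sentence order
    return " ".join(sentences[i] for i in keep)
```

Note how repeated boilerplate sentences score low and disappear, while a dense, unique sentence (the one with the file path and error) survives untouched.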

4. RECOMP: Trained Compressors for RAG

RECOMP trains dedicated compressors specifically for retrieval-augmented language models. Two compressor types serve different needs:

  • Extractive compressor: selects task-relevant sentences from retrieved documents. No rewriting, so content integrity is preserved.
  • Abstractive compressor: generates summaries synthesized across multiple documents. Higher compression but introduces rewriting risk.

A distinctive feature: compressors can return an empty string when retrieved documents are irrelevant. This selective augmentation prevents the model from being distracted by unhelpful context. At compression rates as low as 6% of original document length, RECOMP achieves minimal performance loss on language modeling and open-domain QA, significantly outperforming off-the-shelf summarization models.
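The extractive selection and the empty-string fallback can be sketched with simple lexical overlap. RECOMP learns this selection with a trained compressor, not word overlap, so treat this as a shape-of-the-idea illustration; `extractive_compress` and its threshold are hypothetical:

```python
def extractive_compress(query: str, documents: list[str],
                        threshold: float = 0.2) -> str:
    """Extractive-compressor sketch in the spirit of RECOMP: rank
    sentences from retrieved documents by lexical overlap with the
    query and keep the best one per document. Returns an empty string
    when nothing clears the threshold (selective augmentation), so
    irrelevant retrievals never reach the model.
    """
    query_words = set(query.lower().split())
    selected = []
    for doc in documents:
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        best, best_score = None, 0.0
        for s in sentences:
            words = set(s.lower().split())
            score = len(words & query_words) / max(len(words), 1)
            if score > best_score:
                best, best_score = s, score
        if best is not None and best_score >= threshold:
            selected.append(best + ".")
    return " ".join(selected)
```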

5. 500xCompressor: Extreme Compression into Special Tokens

500xCompressor pushes to the theoretical extreme: compressing entire contexts into as few as one special token. It adds approximately 0.3% additional parameters to existing LLMs without requiring base model fine-tuning.

Compression ratios range from 6x to 480x, with models retaining 62-73% of their capabilities. A notable finding: KV cache values significantly outperform embeddings for preserving information at high compression ratios, suggesting that the key-value representation captures more recoverable structure than dense embeddings.

The tradeoff is clear: at 480x compression, you retain only ~63% of capability. This is viable for background context or reference material where approximate understanding suffices, not for agent workloads where exact details matter.

6. AutoCompressors: Learned Summary Vectors

AutoCompressors (EMNLP 2023) fine-tune pretrained models (OPT, Llama-2) to compress long contexts into compact summary vectors that function as soft prompts. Trained on sequences up to 30,720 tokens using an unsupervised objective, the summary vectors substitute for plain-text demonstrations.

The approach improves perplexity on long contexts and demonstrates benefits across in-context learning, retrieval-augmented language modeling, and passage re-ranking. Unlike discrete compression (which produces readable text), AutoCompressors produce continuous vectors, making them useful as a component in larger systems but not directly inspectable.

7. Context Caching (Anthropic, Google)

Context caching isn't compression per se, but it reduces the cost of repeated prefixes. Anthropic's prompt caching charges 0.1x the base input price on cache reads (90% reduction). Cache writes cost 1.25x (5-minute TTL) or 2x (1-hour TTL).

| Model | Base Input ($/MTok) | Cache Write (5m) | Cache Read | Savings |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $18.75 | $1.50 | 90% |
| Claude Sonnet 4 | $3.00 | $3.75 | $0.30 | 90% |
| Claude Haiku 3.5 | $0.80 | $1.00 | $0.08 | 90% |

Minimum cacheable lengths: 1,024-4,096 tokens depending on model. Up to 4 cache breakpoints per request. Effective for system prompts, reference documentation, and few-shot examples. Does not help with dynamic context that changes every request, which is most of what agents deal with.
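The break-even math follows directly from the published multipliers (1.25x write, 0.1x read). A small sketch, assuming the cache TTL never expires between requests; the helper name is ours:

```python
def cached_prefix_cost(base_price_per_mtok: float, prefix_mtok: float,
                       n_requests: int, write_mult: float = 1.25,
                       read_mult: float = 0.10) -> tuple[float, float]:
    """Cost of a shared prompt prefix with and without caching:
    one cache write on the first request, cache reads afterwards.
    Returns (uncached_cost, cached_cost) in dollars.
    """
    uncached = base_price_per_mtok * prefix_mtok * n_requests
    cached = base_price_per_mtok * prefix_mtok * (
        write_mult + read_mult * (n_requests - 1))
    return uncached, cached
```

For a 10K-token system prompt on Claude Opus 4 reused across 100 requests, the cached path costs roughly $1.67 against $15.00 uncached. Caching pays for itself on the second request.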

8. Verbatim Compaction (Morph Compact)

Morph Compact takes a fundamentally different approach: delete low-signal content while keeping every surviving token identical to the input. Nothing is rewritten. Nothing is paraphrased. The output is a strict subset of the input tokens.

This eliminates the hallucination risk inherent in summarization and the syntax corruption risk of token-level pruning. When an agent needs a file path, error code, or code snippet, verbatim compaction guarantees it's either present exactly as it appeared in the original or absent entirely.
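The strict-subset property is mechanically checkable: the compressed output must be a subsequence of the original's tokens. This hypothetical verifier is not part of the Morph API; it just demonstrates that the invariant can be asserted in tests:

```python
def is_verbatim_subset(original: str, compressed: str) -> bool:
    """Check the verbatim-compaction invariant: every token in the
    compressed output appears in the original, in the same relative
    order (the output is a subsequence of the input's tokens).
    """
    remaining = iter(original.split())
    # `tok in remaining` advances the iterator, enforcing ordering
    return all(tok in remaining for tok in compressed.split())
```

A summarizer that rewrites `auth.ts` to `auth_v2.ts` fails this check; a verbatim compactor passes it by construction.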

50-70%
Compression ratio
3,300+
Tokens/second throughput
0%
Hallucination risk
Semantic
Unit of operation (not token-level)
| Technique | Compression | Speed | Accuracy Risk | Best For |
|---|---|---|---|---|
| LLMLingua | 2-20x | 1.7-5.7x speedup | Breaks code/JSON | Reasoning, NL docs |
| LLMLingua-2 | 2-5x | 3-6x faster than v1 | Same structural risk | Speed-sensitive NL |
| Selective Context | ~2x | Fast (sentence-level) | Low (boundary-preserving) | Boilerplate removal |
| RECOMP | Up to 17x | Training required | Extractive: low; abstractive: medium | RAG pipelines |
| 500xCompressor | 6-480x | Fast (small params) | 62-73% capability retained | Background reference |
| AutoCompressors | Variable | Inference cost | Non-inspectable output | Soft prompt substitution |
| Context caching | N/A (cost only) | Instant | None (no modification) | Repeated prefixes |
| Morph Compact | 50-70% | 3,300+ tok/s | Zero (verbatim output) | Agent context, code |

Benchmarks: Compression vs. Task Performance

The critical question isn't "how much can you compress?" It's "how much can you compress before task performance degrades?" Each technique has a different compression-performance curve.

LLMLingua on Reasoning Tasks

LLMLingua's strongest results are on reasoning benchmarks where the prompt contains chain-of-thought examples. At 20x compression on GSM8K and BBH, performance drops only 1.5 points. GPT-4 can recover all 9 reasoning steps from compressed CoT prompts. The key insight: reasoning chains contain massive redundancy in natural language connectives and transitions. The mathematical content (numbers, operations) carries high perplexity and is preserved.

LongLLMLingua on Long-Context QA

| Benchmark | Compression | Performance Change | Key Finding |
|---|---|---|---|
| NaturalQuestions | 4x fewer tokens | +21.4% | Compression + reordering improves accuracy |
| LooGLE | 94% cost reduction | Maintained | Extreme cost savings on long docs |
| End-to-end latency | 2-6x compression | 1.4-2.6x faster | Latency scales sub-linearly |

Factory Context Compaction (36K Real Coding Messages)

The most practically relevant benchmark comes from Factory, which tested three compression approaches on 36,000 messages from real Claude Code coding sessions. This is the only large-scale evaluation on actual agent workloads rather than academic benchmarks.

| Method | Overall Score | Compression | Key Weakness |
|---|---|---|---|
| Factory structured summaries | 3.70/5 | 98.6% | Custom implementation, not public |
| Anthropic summaries | 3.44/5 | 98.7% | Loses file paths/error specifics |
| OpenAI opaque | 3.35/5 | 99.3% | Lowest accuracy on exact details |

The critical finding: all three methods achieved 98%+ compression ratios. The differentiator wasn't compression ratio. It was accuracy on specific details. File paths, line numbers, error messages, stack traces. These are exactly the tokens coding agents need to function, and summarization-based approaches systematically degraded them.

Compression ratio vs. accuracy

A 99% compression ratio that loses a critical file path is worse than a 60% compression ratio that preserves it exactly. For coding agents, accuracy-per-surviving-token matters more than raw compression ratio. This is the core argument for verbatim compaction: lower compression ratio, but guaranteed token-level accuracy on everything that survives.

Selective Context on Three Domains

Selective Context's results demonstrate the gentlest compression-quality curve: 50% context reduction with only 0.023 BERTscore drop and 0.038 faithfulness drop across arXiv summarization, news QA, and conversation response generation. The sentence-level granularity means it can't achieve the extreme ratios of token-level methods, but it also can't corrupt structured content.

Cost Analysis at Scale

Input tokens dominate agent costs because agents consume far more context than they produce. A coding agent might read 50 files, execute 20 searches, and process 30 error messages in a single session. Most of that input is consumed once and never referenced again.

Monthly input cost at 100 sessions/day, 500K tokens/session (1.5B input tokens/month):

| Model | Input $/M | No Compress | 30% Compress | 50% Compress | 70% Compress |
|---|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $22,500 | $15,750 | $11,250 | $6,750 |
| Claude Sonnet 4 | $3.00 | $4,500 | $3,150 | $2,250 | $1,350 |
| GPT-4.1 | $2.00 | $3,000 | $2,100 | $1,500 | $900 |
| Gemini 2.5 Pro | $1.25 | $1,875 | $1,313 | $938 | $563 |
$7.50
Saved per 1M tokens (50% on Opus)
$3.75
Saved per 500K-token session
$11,250
Monthly savings (100 sessions/day)
$135K
Annual savings at this scale

These numbers assume 50% compression on a single model. The savings amplify in three ways:

  • Fewer retries: compressed context means fewer hallucinations and wrong turns, reducing the total number of turns per task.
  • Faster completion: less input means lower latency per request, which compounds across multi-turn sessions.
  • Extended session life: agents can work longer before hitting context limits, completing more complex tasks without starting over.
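The headline savings figures reduce to a few lines of arithmetic. A sketch, assuming 30-day months (which is how $375/day becomes $11,250/month); `annual_savings` is our own illustrative helper:

```python
def annual_savings(price_per_mtok: float, tokens_per_session: int,
                   sessions_per_day: int, compression: float) -> dict:
    """Input-token spend saved by compressing each session's context.
    compression=0.5 means half the input tokens are eliminated.
    """
    mtok_per_day = tokens_per_session * sessions_per_day / 1_000_000
    daily = mtok_per_day * price_per_mtok * compression
    return {"daily": daily, "monthly": daily * 30, "annual": daily * 30 * 12}
```

Plugging in the Claude Opus 4 scenario (500K tokens/session, 100 sessions/day, 50% compression) reproduces the $375/day, $11,250/month, and $135K/year figures above.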

Why Prevention Beats Compression

Every compression method discussed above operates after context has already been consumed. The deeper insight: the best compression is the compression you never need to run.

Cognition (Devin) measured that agents spend 60% of their time searching for code. Each search dumps results into context. Each file read adds the entire file. Each code edit echoes the full file back. The context fills not because agents need all that information, but because their tools are blunt instruments that return far more than necessary.

WarpGrep: Targeted Retrieval

Returns only relevant code snippets instead of entire files. 0.73 F1 in 3.8 steps. An agent searching for a function definition gets the 10-line function, not the 500-line file it lives in. Context consumption drops by 90%+ per search operation.

Fast Apply: Compact Diffs

Applies code changes as compact diffs at 10,500 tok/s instead of echoing the entire modified file back into context. A 3-line change to a 200-line file consumes ~10 tokens of context instead of ~1,000.
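The size asymmetry is easy to demonstrate with Python's standard difflib. Fast Apply's actual edit format is different; this only illustrates why emitting hunks beats echoing the full file:

```python
import difflib

def edit_as_diff(before: str, after: str) -> str:
    """Emit only the changed hunks (unified diff) instead of the
    whole modified file. Context shrinkage comes from the fact that
    the diff's length scales with the edit, not the file.
    """
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        n=1,  # one line of surrounding context per hunk
    )
    return "".join(diff)
```

For a one-line change to a 200-line file, the diff is a handful of lines, roughly the 100:1 reduction described above.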

Morph Compact: Cleanup

Verbatim compaction for whatever noise remains. 50-70% compression at 3,300+ tok/s with zero hallucination risk. Operates on semantic units, not individual tokens, so code structure is preserved.

The combination extends effective context life by 3-4x. An agent that would hit compaction at turn 15 now hits it at turn 45-60. This means compaction fires 3-4x less often, and when it does fire, the context is cleaner because less noise accumulated in the first place.

FlashCompact: the three-layer stack

Morph FlashCompact combines all three layers: WarpGrep for targeted retrieval (prevent waste), Fast Apply for compact diffs (prevent echo), and Morph Compact for verbatim compaction (clean up the rest). State-of-the-art on SWE-Bench Pro.

Implementation Guide

Morph Compact exposes an OpenAI-compatible API. Integration requires minimal code changes regardless of your framework.

Basic prompt compression with Morph Compact

from openai import OpenAI

client = OpenAI(
    base_url="https://api.morphllm.com/v1",
    api_key="your-morph-api-key"
)

# Compress a long context before sending to your main model
long_context = open("conversation_history.txt").read()

response = client.chat.completions.create(
    model="morph-compact",
    messages=[
        {"role": "user", "content": long_context}
    ]
)

compressed = response.choices[0].message.content
# compressed is a strict subset of the original tokens
# no rewriting, no hallucination — just the high-signal content

# Now use the compressed context with your main model
# (separate client for your main provider; `client` above points at Morph)
main_client = OpenAI()
main_response = main_client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": compressed},
        {"role": "user", "content": "Fix the auth bug described above"}
    ]
)

Agent context compression pipeline

# Compress accumulated agent context before each reasoning step
def compress_context(messages: list[dict], threshold: int = 5) -> list[dict]:
    """Compress old messages, keep recent ones intact.

    The key insight: compress early, before context fills up.
    Don't wait for 95% capacity like Claude Code's default.
    Compress at 60-70% to maintain higher signal density.
    """
    if len(messages) <= threshold:
        return messages  # nothing to compress yet

    # Compress older messages, keep last N untouched
    old_messages = messages[:-threshold]
    recent_messages = messages[-threshold:]

    old_text = "\n".join(m["content"] for m in old_messages if m.get("content"))

    response = client.chat.completions.create(
        model="morph-compact",
        messages=[{"role": "user", "content": old_text}]
    )

    compressed_msg = {
        "role": "user",
        "content": f"[Compressed context]\n{response.choices[0].message.content}"
    }

    return [compressed_msg] + recent_messages

LLMLingua integration (for natural language contexts)

# pip install llmlingua
from llmlingua import PromptCompressor

# Initialize with a small scoring model
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # Use the faster BERT-based version
)

# Compress a prompt
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.5,            # Target 50% compression
    force_tokens=["def ", "class ", "import ", "return "],  # Preserve code markers
    drop_consecutive=True,
)

print("Original:", compressed['origin_tokens'], "tokens")
print("Compressed:", compressed['compressed_tokens'], "tokens")
print("Ratio:", compressed['ratio'])

# Note: LLMLingua works well for natural language contexts.
# For code-heavy agent contexts, use Morph Compact instead
# to avoid syntax corruption from token-level pruning.

Prompt Compression for RAG Pipelines

RAG is itself a form of prompt compression: instead of sending full documents, you retrieve and send only relevant chunks. But retrieved chunks still contain noise. Combining RAG with a compression layer gives you two levels of filtering.

RAG + Morph Compact pipeline

from openai import OpenAI
from langchain_core.documents import Document

morph = OpenAI(
    base_url="https://api.morphllm.com/v1",
    api_key="your-morph-api-key"
)

def compact_documents(docs: list[Document]) -> list[Document]:
    """Compress retrieved documents with verbatim compaction.

    Two-stage compression:
    1. RAG retriever selects relevant chunks (coarse filter)
    2. Morph Compact removes noise within chunks (fine filter)

    Every token in the output existed in the original document.
    """
    compressed = []
    for doc in docs:
        response = morph.chat.completions.create(
            model="morph-compact",
            messages=[{"role": "user", "content": doc.page_content}]
        )
        compressed.append(Document(
            page_content=response.choices[0].message.content,
            metadata=doc.metadata
        ))
    return compressed

# Integration with LangChain:
# 1. Retrieve documents with your existing retriever
# 2. Compact them before sending to the reasoning model
# 3. Every token in the output existed in the original — zero hallucination

The combination addresses RAG's two weaknesses. First, retrieved chunks often contain irrelevant paragraphs alongside relevant ones. Compression removes the noise within each chunk. Second, when multiple chunks are retrieved, there's often redundancy between them. Compacting the combined context eliminates the overlap.

RECOMP vs. post-retrieval compression

RECOMP trains compressors end-to-end for retrieval tasks, learning to compress documents specifically for downstream QA or language modeling. This can outperform generic post-retrieval compression because the compressor learns what matters for the task. The tradeoff: RECOMP requires training data and a fixed task format. For general-purpose agent workloads where tasks vary, post-retrieval verbatim compaction is more flexible.

Frequently Asked Questions

What is prompt compression?

Prompt compression reduces the number of tokens in an LLM prompt while preserving meaning and task accuracy. Techniques range from token-level pruning (LLMLingua scores tokens by perplexity and prunes low-information ones, achieving up to 20x compression), to sentence-level filtering (Selective Context), to retrieval-based compression (RECOMP), to verbatim compaction (Morph Compact keeps every surviving token identical to the original).

How does LLMLingua prompt compression work?

LLMLingua (Microsoft Research, EMNLP 2023) uses a small language model (GPT-2 or LLaMA-7B) to compute perplexity for each token. Tokens with low perplexity (highly predictable from surrounding context) are removed. A budget controller allocates compression across prompt segments, and distribution alignment tunes the scorer to match the target LLM. LLMLingua-2 (ACL 2024) reformulated this as BERT-based token classification, running 3-6x faster.

Does prompt compression affect LLM output quality?

Often compression improves output quality. Lost-in-the-middle research shows LLMs degrade on content positioned in the middle of long inputs. Removing low-signal tokens reduces noise, helping the model focus on relevant information. LongLLMLingua demonstrated a 21.4% accuracy improvement on NaturalQuestions with 4x fewer tokens. The risk depends on technique: token-level pruning can break code syntax; summarization can hallucinate; verbatim compaction preserves exact tokens with zero rewriting risk.

What is the difference between prompt compression and summarization?

Summarization rewrites content, risking hallucinated details or lost specifics. Prompt compression techniques like verbatim compaction remove content without rewriting. Every surviving token is identical to the original. Factory's benchmarks on 36K real coding messages found that accuracy on specific details (file paths, line numbers, error messages) was the biggest differentiator, not raw compression ratio. See our compaction vs. summarization deep dive.

What is the best prompt compression technique for code?

Code is structurally fragile. Token-level pruning (LLMLingua) can remove tokens that break file paths, JSON syntax, or function signatures. Summarization loses exact paths and line numbers. Verbatim compaction operates on semantic units (entire statements, blocks), so a file path is either present exactly or absent entirely. For coding agents, the best strategy is prevention: WarpGrep returns only relevant code snippets instead of entire files, reducing what needs compression.

How do I use prompt compression with LangChain?

LangChain provides ContextualCompressionRetriever that wraps any retriever with a compression layer. You can integrate LLMLingua via pip (pip install llmlingua), use LLM-based extractors, or call the Morph Compact API as a post-retrieval compression step. For code-heavy workloads, prefer Morph Compact over LLMLingua to avoid syntax corruption.

How much money does prompt compression save?

Savings scale linearly. At 50% compression on Claude Opus ($15/M input tokens), you save $7.50 per million tokens. An agent team running 100 sessions/day at 500K tokens each saves $11,250/month ($135K/year). Even on cheaper models like GPT-4.1 ($2/M), the same workload saves $1,500/month. Savings compound because compressed context also reduces retries and corrections.

What is verbatim compaction?

Verbatim compaction deletes low-signal content while keeping every surviving token identical to the input. The output is a strict subset of the original tokens. Unlike summarization (which rewrites and can hallucinate), unlike token-level pruning (which can break structured content), verbatim compaction guarantees that file paths, error codes, and code snippets are either preserved exactly or removed entirely. Morph Compact achieves 50-70% compression at 3,300+ tok/s with this approach.

Related Resources

Compress Prompts Without Losing Accuracy

Morph FlashCompact combines targeted retrieval (WarpGrep), compact diffs (Fast Apply), and verbatim compaction (Morph Compact). Extend context life by 3-4x. Zero hallucination risk. State-of-the-art on SWE-Bench Pro.