# Async Context Compression
Filter v1.2.2
Reduces token consumption in long conversations through intelligent summarization while maintaining conversational coherence.
## Overview
The Async Context Compression filter helps manage token usage in long conversations by:
- Intelligently summarizing older messages
- Preserving important context
- Reducing API costs
- Maintaining conversation coherence
This is especially useful for:
- Long-running conversations
- Complex multi-turn discussions
- Cost optimization
- Token limit management
## Features
- Smart Compression: AI-powered context summarization
- Async Processing: Non-blocking background compression
- Context Preservation: Keeps important information
- Cost Reduction: Minimize token usage
- Frontend Debugging: Debug logs in browser console
- Enhanced Error Reporting: Clear error status notifications
- Open WebUI v0.7.x Compatibility: Dynamic DB session handling
- Improved Compatibility: Summary role changed to `assistant`
- Enhanced Stability: Resolved race conditions in state management
- Preflight Context Check: Validates context fit before sending
- Structure-Aware Trimming: Preserves document structure
- Native Tool Output Trimming: Trims verbose tool outputs, as sketched after this list (Note: Non-native tool outputs are not fully injected into context)
- Detailed Token Logging: Granular token breakdown
- Smart Model Matching: Inherit config from base models
- Multimodal Support: Images are preserved but tokens are NOT calculated
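The tool-output trimming referenced above can be pictured as a head-and-tail truncation that keeps the beginning and end of a verbose result so its overall structure survives. This is an illustrative sketch only; `max_chars` and `trim_tool_output` are hypothetical names, not part of the filter's documented options.

```python
# Illustrative only: head-and-tail truncation for oversized tool outputs.
# max_chars is a hypothetical parameter, not a documented valve.
def trim_tool_output(text: str, max_chars: int = 4000) -> str:
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    marker = "\n... [tool output trimmed] ...\n"
    # Keep the head and tail of the output so context on both ends survives.
    return text[:half] + marker + text[-half:]
```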
## Installation
1. Download the plugin file: `async_context_compression.py`
2. Upload it to Open WebUI: Admin Panel → Settings → Functions
3. Configure the compression settings
4. Enable the filter
## How It Works
```mermaid
graph TD
    A[Incoming Messages] --> B{Token Count > Threshold?}
    B -->|No| C[Pass Through]
    B -->|Yes| D[Summarize Older Messages]
    D --> E[Preserve Recent Messages]
    E --> F[Combine Summary + Recent]
    F --> G[Send to LLM]
```
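As a rough illustration of the decision node above, the check reduces to comparing the history's token count against the configured threshold. The sketch below is not the filter's actual code: `count_tokens` is a stand-in using `tiktoken`, and the real plugin may tokenize differently.

```python
# Illustrative sketch of the compression trigger, not the filter's actual code.
# Assumes OpenAI-style message dicts; tiktoken is a stand-in tokenizer.
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(content) -> int:
    # Only text is counted; image parts are skipped, matching the note
    # that image tokens are not calculated.
    if isinstance(content, str):
        return len(_enc.encode(content))
    return 0

def should_compress(messages: list[dict], threshold: int = 64000) -> bool:
    """Compress only when the running total exceeds the configured threshold."""
    total = sum(count_tokens(m.get("content")) for m in messages)
    return total > threshold
```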
## Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| `compression_threshold_tokens` | integer | 64000 | Trigger compression above this token count |
| `max_context_tokens` | integer | 128000 | Hard limit for context size |
| `keep_first` | integer | 1 | Always keep the first N messages |
| `keep_last` | integer | 6 | Always keep the last N messages |
| `summary_model` | string | None | Model to use for summarization |
| `summary_model_max_context` | integer | 0 | Max context tokens for the summary model |
| `max_summary_tokens` | integer | 16384 | Maximum tokens for the generated summary |
| `enable_tool_output_trimming` | boolean | false | Enable trimming of large tool outputs |
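For orientation, these options map naturally onto Open WebUI's standard `Valves` pattern (a pydantic model on the filter class). The sketch below is hypothetical and only mirrors the documented names and defaults; the plugin's real class may differ.

```python
# Hypothetical Valves declaration mirroring the table above.
from pydantic import BaseModel, Field

class Valves(BaseModel):
    compression_threshold_tokens: int = Field(64000, description="Trigger compression above this token count")
    max_context_tokens: int = Field(128000, description="Hard limit for context size")
    keep_first: int = Field(1, description="Always keep the first N messages")
    keep_last: int = Field(6, description="Always keep the last N messages")
    # Table default is None; an empty string stands in for "unset" here.
    summary_model: str = Field("", description="Model to use for summarization")
    summary_model_max_context: int = Field(0, description="Max context tokens for the summary model")
    max_summary_tokens: int = Field(16384, description="Maximum tokens for the summary")
    enable_tool_output_trimming: bool = Field(False, description="Enable trimming of large tool outputs")
```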
## Example

### Before Compression
```
[Message 1] User: Tell me about Python...
[Message 2] AI: Python is a programming language...
[Message 3] User: What about its history?
[Message 4] AI: Python was created by Guido...
[Message 5] User: And its features?
[Message 6] AI: Python has many features...
... (many more messages)
[Message 20] User: Current question
```
### After Compression
```
[Summary] Previous conversation covered Python basics, history, features, and common use cases...
[Message 18] User: Recent question about decorators
[Message 19] AI: Decorators in Python are...
[Message 20] User: Current question
```
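The transformation above can be sketched as a simple list rebuild, assuming OpenAI-style message dicts and the documented `keep_first`/`keep_last` behavior. The summary is injected with the `assistant` role, per the compatibility note under Features; `build_compressed_history` is an illustrative name, not the plugin's API.

```python
def build_compressed_history(messages: list[dict], summary_text: str,
                             keep_first: int = 1, keep_last: int = 6) -> list[dict]:
    """Keep the first/last N messages verbatim; replace the middle with a summary."""
    if len(messages) <= keep_first + keep_last:
        return messages  # history too short to benefit from compression
    head = messages[:keep_first]
    tail = messages[-keep_last:]
    summary_msg = {
        "role": "assistant",  # summaries use the assistant role for compatibility
        "content": "[Summary of earlier conversation]\n" + summary_text,
    }
    return head + [summary_msg] + tail
```

With the defaults (`keep_first=1`, `keep_last=6`), a 20-message history collapses to 8 entries: the opening message, the summary, and the six most recent messages.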
## Requirements

**Prerequisites**

- Open WebUI v0.3.0 or later
- Access to an LLM for summarization

**Best Practices**

- Set token thresholds appropriate to your model's context window
- Preserve more recent messages (a higher `keep_last`) for technical discussions
- Test compression settings in non-critical conversations first
## Troubleshooting

**Compression not triggering?**
Check whether the conversation's token count actually exceeds `compression_threshold_tokens`. Enable debug logging for more details.

**Important context being lost?**
Increase `keep_last` so more recent messages survive compression, or raise `max_summary_tokens` to allow a more detailed summary.