# Async Context Compression
Filter v1.2.2
Reduces token consumption in long conversations through intelligent summarization while maintaining conversational coherence.
## Overview
The Async Context Compression filter helps manage token usage in long conversations by:
- Intelligently summarizing older messages
- Preserving important context
- Reducing API costs
- Maintaining conversation coherence
This is especially useful for:
- Long-running conversations
- Complex multi-turn discussions
- Cost optimization
- Token limit management
## Features
- Smart Compression: AI-powered context summarization
- Async Processing: Non-blocking background compression
- Context Preservation: Keeps important information
- Cost Reduction: Minimize token usage
- Frontend Debugging: Debug logs in browser console
- Enhanced Error Reporting: Clear error status notifications
- Open WebUI v0.7.x Compatibility: Dynamic DB session handling
- Improved Compatibility: Summary role changed to `assistant`
- Enhanced Stability: Resolved race conditions in state management
- Preflight Context Check: Validates context fit before sending
- Structure-Aware Trimming: Preserves document structure
- Native Tool Output Trimming: Trims verbose tool outputs, as sketched after this list (Note: Non-native tool outputs are not fully injected into context)
- Detailed Token Logging: Granular token breakdown
- Smart Model Matching: Inherit config from base models
- Multimodal Support: Images are preserved but tokens are NOT calculated
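The tool-output trimming referenced above can be pictured as a head-and-tail truncation that keeps the beginning and end of a verbose result so its overall structure survives. This is an illustrative sketch only; `max_chars` and `trim_tool_output` are hypothetical names, not part of the filter's documented options.

```python
# Illustrative only: head-and-tail truncation for oversized tool outputs.
# max_chars is a hypothetical parameter, not a documented valve.
def trim_tool_output(text: str, max_chars: int = 4000) -> str:
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    marker = "\n... [tool output trimmed] ...\n"
    # Keep the head and tail of the output so context on both ends survives.
    return text[:half] + marker + text[-half:]
```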
## Installation
1. Download the plugin file: `async_context_compression.py`
2. Upload it to Open WebUI: Admin Panel → Settings → Functions
3. Configure the compression settings
4. Enable the filter
## How It Works
```mermaid
graph TD
    A[Incoming Messages] --> B{Token Count > Threshold?}
    B -->|No| C[Pass Through]
    B -->|Yes| D[Summarize Older Messages]
    D --> E[Preserve Recent Messages]
    E --> F[Combine Summary + Recent]
    F --> G[Send to LLM]
```
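As a rough illustration of the decision node above, the check reduces to comparing the history's token count against the configured threshold. The sketch below is not the filter's actual code: `count_tokens` is a stand-in using `tiktoken`, and the real plugin may tokenize differently.

```python
# Illustrative sketch of the compression trigger, not the filter's actual code.
# Assumes OpenAI-style message dicts; tiktoken is a stand-in tokenizer.
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(content) -> int:
    # Only text is counted; image parts are skipped, matching the note
    # that image tokens are not calculated.
    if isinstance(content, str):
        return len(_enc.encode(content))
    return 0

def should_compress(messages: list[dict], threshold: int = 64000) -> bool:
    """Compress only when the running total exceeds the configured threshold."""
    total = sum(count_tokens(m.get("content")) for m in messages)
    return total > threshold
```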
## Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| `compression_threshold_tokens` | integer | 64000 | Trigger compression above this token count |
| `max_context_tokens` | integer | 128000 | Hard limit for context size |
| `keep_first` | integer | 1 | Always keep the first N messages |
| `keep_last` | integer | 6 | Always keep the last N messages |
| `summary_model` | string | None | Model to use for summarization |
| `summary_model_max_context` | integer | 0 | Max context tokens for the summary model |
| `max_summary_tokens` | integer | 16384 | Maximum tokens for the generated summary |
| `enable_tool_output_trimming` | boolean | false | Enable trimming of large tool outputs |
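For orientation, these options map naturally onto Open WebUI's standard `Valves` pattern (a pydantic model on the filter class). The sketch below is hypothetical and only mirrors the documented names and defaults; the plugin's real class may differ.

```python
# Hypothetical Valves declaration mirroring the table above.
from pydantic import BaseModel, Field

class Valves(BaseModel):
    compression_threshold_tokens: int = Field(64000, description="Trigger compression above this token count")
    max_context_tokens: int = Field(128000, description="Hard limit for context size")
    keep_first: int = Field(1, description="Always keep the first N messages")
    keep_last: int = Field(6, description="Always keep the last N messages")
    # Table default is None; an empty string stands in for "unset" here.
    summary_model: str = Field("", description="Model to use for summarization")
    summary_model_max_context: int = Field(0, description="Max context tokens for the summary model")
    max_summary_tokens: int = Field(16384, description="Maximum tokens for the summary")
    enable_tool_output_trimming: bool = Field(False, description="Enable trimming of large tool outputs")
```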
## Example

### Before Compression
```
[Message 1] User: Tell me about Python...
[Message 2] AI: Python is a programming language...
[Message 3] User: What about its history?
[Message 4] AI: Python was created by Guido...
[Message 5] User: And its features?
[Message 6] AI: Python has many features...
... (many more messages)
[Message 20] User: Current question
```
### After Compression
```
[Summary] Previous conversation covered Python basics, history, features, and common use cases...
[Message 18] User: Recent question about decorators
[Message 19] AI: Decorators in Python are...
[Message 20] User: Current question
```
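The transformation above can be sketched as a simple list rebuild, assuming OpenAI-style message dicts and the documented `keep_first`/`keep_last` behavior. The summary is injected with the `assistant` role, per the compatibility note under Features; `build_compressed_history` is an illustrative name, not the plugin's API.

```python
def build_compressed_history(messages: list[dict], summary_text: str,
                             keep_first: int = 1, keep_last: int = 6) -> list[dict]:
    """Keep the first/last N messages verbatim; replace the middle with a summary."""
    if len(messages) <= keep_first + keep_last:
        return messages  # history too short to benefit from compression
    head = messages[:keep_first]
    tail = messages[-keep_last:]
    summary_msg = {
        "role": "assistant",  # summaries use the assistant role for compatibility
        "content": "[Summary of earlier conversation]\n" + summary_text,
    }
    return head + [summary_msg] + tail
```

With the defaults (`keep_first=1`, `keep_last=6`), a 20-message history collapses to 8 entries: the opening message, the summary, and the six most recent messages.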
## Requirements

**Prerequisites**

- Open WebUI v0.3.0 or later
- Access to an LLM for summarization

**Best Practices**

- Set token thresholds appropriate to your model's context window
- Preserve more recent messages (a higher `keep_last`) for technical discussions
- Test compression settings in non-critical conversations first
## Troubleshooting

**Compression not triggering?**
Check whether the conversation's token count actually exceeds `compression_threshold_tokens`. Enable debug logging for more details.

**Important context being lost?**
Increase `keep_last` so more recent messages survive compression, or raise `max_summary_tokens` to allow a more detailed summary.