
Async Context Compression

Filter v1.2.2

Reduces token consumption in long conversations through intelligent summarization while maintaining conversational coherence.


Overview

The Async Context Compression filter helps manage token usage in long conversations by:

  • Intelligently summarizing older messages
  • Preserving important context
  • Reducing API costs
  • Maintaining conversation coherence

This is especially useful for:

  • Long-running conversations
  • Complex multi-turn discussions
  • Cost optimization
  • Token limit management

Features

  • Smart Compression: AI-powered context summarization
  • Async Processing: Non-blocking background compression
  • Context Preservation: Keeps important information
  • Cost Reduction: Minimize token usage
  • Frontend Debugging: Debug logs in browser console
  • Enhanced Error Reporting: Clear error status notifications
  • Open WebUI v0.7.x Compatibility: Dynamic DB session handling
  • Improved Compatibility: Summary role changed to assistant
  • Enhanced Stability: Resolved race conditions in state management
  • Preflight Context Check: Validates that the request fits the context window before sending (see the sketch after this list)
  • Structure-Aware Trimming: Preserves document structure
  • Native Tool Output Trimming: Trims verbose tool outputs (Note: Non-native tool outputs are not fully injected into context)
  • Detailed Token Logging: Granular token breakdown
  • Smart Model Matching: Inherit config from base models
  • Multimodal Support: Images are preserved, but their tokens are NOT counted toward the total
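
The Preflight Context Check can be pictured as a token estimate followed by a limit comparison. The sketch below is a minimal illustration, assuming tiktoken for counting; the function names and the tokenizer choice are assumptions, not the plugin's actual API. Note how image parts are skipped, matching the multimodal behavior above.

# Illustrative preflight check; names and tokenizer are assumptions,
# not taken from the plugin's source.
import tiktoken

def estimate_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Rough token estimate for a chat message list."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # generic fallback encoding
    total = 0
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # Multimodal message: count only text parts; image tokens are not counted.
            content = " ".join(part.get("text", "") for part in content
                               if isinstance(part, dict) and part.get("type") == "text")
        total += len(enc.encode(str(content)))
    return total

def preflight_check(messages: list[dict], max_context_tokens: int = 128000) -> bool:
    """Return True if the conversation fits within the hard context limit."""
    return estimate_tokens(messages) <= max_context_tokens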

Installation

  1. Download the plugin file: async_context_compression.py
  2. Upload to OpenWebUI: Admin Panel → Settings → Functions
  3. Configure compression settings
  4. Enable the filter

How It Works

graph TD
    A[Incoming Messages] --> B{Token Count > Threshold?}
    B -->|No| C[Pass Through]
    B -->|Yes| D[Summarize Older Messages]
    D --> E[Preserve Recent Messages]
    E --> F[Combine Summary + Recent]
    F --> G[Send to LLM]
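
The diagram can be read as a short piece of pseudologic. The sketch below is illustrative only: count_tokens uses a crude word-based estimate, and summarize stands in for the call to the configured summary model; neither is the plugin's actual code.

# Illustrative sketch of the flow above; not the plugin's implementation.
def count_tokens(messages):
    # Very rough estimate (~4/3 tokens per word); the real filter uses a tokenizer.
    return int(sum(len(str(m.get("content", "")).split()) for m in messages) / 0.75)

def summarize(messages):
    # Placeholder: the real filter asks the configured summary_model for a summary.
    return f"Summary of {len(messages)} earlier messages."

def compress_context(messages, threshold=64000, keep_first=1, keep_last=6):
    if count_tokens(messages) <= threshold or len(messages) <= keep_first + keep_last:
        return messages                                # pass through unchanged
    head = messages[:keep_first]                       # always keep the first N
    tail = messages[-keep_last:] if keep_last else []  # always keep the last N
    middle = messages[keep_first:len(messages) - keep_last]  # older turns to summarize
    summary_msg = {"role": "assistant", "content": summarize(middle)}
    return head + [summary_msg] + tail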

Configuration

Option                         Type     Default   Description
compression_threshold_tokens   integer  64000     Trigger compression above this token count
max_context_tokens             integer  128000    Hard limit for context
keep_first                     integer  1         Always keep the first N messages
keep_last                      integer  6         Always keep the last N messages
summary_model                  string   None      Model to use for summarization
summary_model_max_context      integer  0         Max context tokens for summary model
max_summary_tokens             integer  16384     Maximum tokens for the summary
enable_tool_output_trimming    boolean  false     Enable trimming of large tool outputs
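
Open WebUI filters conventionally expose their settings through a pydantic Valves model. The sketch below mirrors the table's names and defaults for illustration; the plugin's actual class may differ.

# Sketch of these options as an Open WebUI filter Valves model.
# Field names mirror the table above; the real class may differ.
from typing import Optional
from pydantic import BaseModel

class Valves(BaseModel):
    compression_threshold_tokens: int = 64000  # trigger compression above this count
    max_context_tokens: int = 128000           # hard limit for context
    keep_first: int = 1                        # always keep the first N messages
    keep_last: int = 6                         # always keep the last N messages
    summary_model: Optional[str] = None        # model to use for summarization
    summary_model_max_context: int = 0         # max context tokens for summary model
    max_summary_tokens: int = 16384            # maximum tokens for the summary
    enable_tool_output_trimming: bool = False  # trim large tool outputs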

Example

Before Compression

[Message 1] User: Tell me about Python...
[Message 2] AI: Python is a programming language...
[Message 3] User: What about its history?
[Message 4] AI: Python was created by Guido...
[Message 5] User: And its features?
[Message 6] AI: Python has many features...
... (many more messages)
[Message 20] User: Current question

After Compression

[Summary] Previous conversation covered Python basics,
history, features, and common use cases...

[Message 18] User: Recent question about decorators
[Message 19] AI: Decorators in Python are...
[Message 20] User: Current question
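
In chat-API terms, the compressed history corresponds to a message list like the one below. Note that the summary is injected with the assistant role (see Features); the summary wording is illustrative.

# The compressed history as a chat message list (summary text is illustrative).
compressed_messages = [
    {"role": "assistant",
     "content": "Summary: previous conversation covered Python basics, "
                "history, features, and common use cases..."},
    {"role": "user", "content": "Recent question about decorators"},
    {"role": "assistant", "content": "Decorators in Python are..."},
    {"role": "user", "content": "Current question"},
]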

Requirements

Prerequisites

  • OpenWebUI v0.3.0 or later
  • Access to an LLM for summarization

Best Practices

  • Set appropriate token thresholds based on your model's context window
  • Preserve more recent messages for technical discussions
  • Test compression settings in non-critical conversations first
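
As a starting point for a model with a 128k-token context window, settings along these lines follow the practices above. The values are illustrative (using the Valves sketch from the Configuration section), not recommendations from the plugin author.

# Illustrative starting values for a 128k-context model; tune to your workload.
valves = Valves(
    compression_threshold_tokens=64000,  # compress once roughly half the window is used
    max_context_tokens=128000,           # match the model's context window
    keep_last=10,                        # preserve more recent turns for technical chats
)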

Troubleshooting

Compression not triggering?

Check whether the token count actually exceeds your configured compression_threshold_tokens. Enable debug logging for more detail.

Important context being lost?

Increase keep_last (and, if needed, keep_first) so more messages survive compression, or raise max_summary_tokens so the summary retains more detail.


Source Code

View on GitHub