Context Window Management and Compaction in Luminy

Every large language model has a finite context window — a hard limit on how many tokens it can hold in memory at once. In long coding sessions, this limit is easy to hit: the agent reads files, produces diffs, calls tools, and accumulates turn after turn of conversation history. Luminy’s compaction system manages this automatically, summarizing older turns to reclaim space while keeping your most recent context verbatim and intact.

Context Usage Indicator

A progress bar appears at the top of the chat view when your context usage reaches 70% of the available window — an early warning that the context is filling up. The bar gives you a live read of how much of the context window is currently occupied. You don’t need to do anything at this point; it’s informational. Luminy will handle compaction automatically when needed.
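The indicator logic described above can be sketched as follows. This is only an illustration: the function names and the idea of passing raw token counts are assumptions, not Luminy's actual API. Only the 70% figure comes from the text.

```python
WARNING_THRESHOLD = 0.70  # the bar appears at 70% usage, per the text above


def context_usage(used_tokens: int, window_tokens: int) -> float:
    """Return the fraction of the context window currently occupied."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens / window_tokens


def should_show_indicator(used_tokens: int, window_tokens: int) -> bool:
    """True once usage reaches the warning threshold (hypothetical helper)."""
    return context_usage(used_tokens, window_tokens) >= WARNING_THRESHOLD
```

With a 128,000-token window, for example, `should_show_indicator(90_000, 128_000)` is true and `should_show_indicator(80_000, 128_000)` is false.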

How Compaction Works: The 4-Zone System

Luminy divides the conversation history token budget into four zones. As your session grows, messages move progressively through the zones, from Verbatim (newest) to Archive (oldest).

Verbatim  →  Summary  →  Ultra-compressed  →  Archive
  (50%)        (30%)           (10%)            (10%)

Zone	Budget share	What you see
Verbatim	50%	Recent messages shown in full, exactly as written
Summary	30%	Older turns replaced with a concise summary
Ultra-compressed	10%	Summary-of-summaries for even older turns
Archive	10%	Oldest messages removed from context entirely (still stored in the database)

Archived messages are never deleted. They remain in Luminy’s local SQLite database and are visible in your chat history — they just no longer consume tokens when the model is called.
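To make the budget shares concrete, here is a minimal sketch that turns the percentages in the table into token counts. The dictionary and function names are hypothetical. The sketch follows the table literally, including the 10% share listed for Archive, even though archived messages themselves consume no tokens. How Luminy accounts for that share internally is not specified here.

```python
# Budget shares from the table above, as whole percentages.
ZONE_SHARE_PERCENT = {
    "verbatim": 50,
    "summary": 30,
    "ultra_compressed": 10,
    "archive": 10,
}


def zone_budgets(history_budget_tokens: int) -> dict[str, int]:
    """Split a history token budget across the four zones (illustrative only)."""
    return {
        zone: history_budget_tokens * percent // 100
        for zone, percent in ZONE_SHARE_PERCENT.items()
    }
```

For an assumed history budget of 100,000 tokens, this yields 50,000 for Verbatim, 30,000 for Summary, and 10,000 each for Ultra-compressed and Archive.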

Zone transitions in detail

Verbatim is where all new messages begin. The most recent turns are always presented to the model exactly as they were written, with no information loss.

Summary zone messages have been summarized once. When Luminy assembles context for the model, it substitutes the summary for the full content of those turns. Tool call details are preserved in compressed form.

Ultra-compressed messages have been summarized a second time (a summary of a summary). They represent turns from much earlier in the session and contribute only a compact snapshot to the context.

Archive messages cost zero tokens. They are excluded from context assembly entirely, though the chat UI still displays them in your session history.
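The per-zone behavior during context assembly might look roughly like the sketch below. The `Zone`, `Message`, and `assemble_context` names are hypothetical and do not describe Luminy's real data model. The fallback to full content when a summary is missing is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Zone(Enum):
    VERBATIM = "verbatim"
    SUMMARY = "summary"
    ULTRA_COMPRESSED = "ultra_compressed"
    ARCHIVE = "archive"


@dataclass
class Message:
    content: str                   # full original text (kept in storage)
    zone: Zone = Zone.VERBATIM     # new messages start in Verbatim
    summary: Optional[str] = None  # current summary, if the message was compacted


def assemble_context(history: list[Message]) -> list[str]:
    """Build the text sent to the model, oldest message first."""
    parts: list[str] = []
    for message in history:
        if message.zone is Zone.ARCHIVE:
            continue  # archived messages cost zero tokens
        if message.zone is Zone.VERBATIM or message.summary is None:
            parts.append(message.content)  # shown in full
        else:
            parts.append(message.summary)  # summary replaces the full content
    return parts
```

Note that the full `content` stays on each message even after it is summarized or archived, mirroring the point that the chat history remains visible in the UI.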

Auto-Compaction Threshold

When context usage reaches 85%, Luminy triggers compaction automatically. You do not need to take any action. During compaction:

The oldest messages in the Verbatim zone are moved to Summary, their content replaced by a model-generated summary.
If the Summary zone is full, its oldest messages are moved to Ultra-compressed.
If Ultra-compressed is full, its oldest messages are archived.
The context is then re-assembled with the compacted history, freeing headroom for the conversation to continue.

Each compaction event is logged internally so there is an audit trail of when and how compaction occurred.
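The cascade above can be sketched as a small simulation. Everything here is an assumption made for illustration: the class and function names, the stand-in `summarize` that simply halves a turn's token cost, the choice to measure zone budgets against the full window, and the rule that at least the newest turn stays verbatim. Only the 85% trigger, the zone order, and the existence of an audit log come from the text. Re-assembling the context afterwards is left out.

```python
from collections import deque
from dataclasses import dataclass, field

COMPACTION_THRESHOLD = 0.85  # auto-compaction trigger from the text
ZONE_PERCENT = {"verbatim": 50, "summary": 30, "ultra_compressed": 10}


@dataclass
class Turn:
    text: str
    tokens: int


def summarize(turn: Turn) -> Turn:
    """Stand-in for a model-generated summary; assumes it halves the token cost."""
    return Turn(text=f"summary({turn.text})", tokens=max(1, turn.tokens // 2))


@dataclass
class History:
    window_tokens: int
    verbatim: deque = field(default_factory=deque)
    summary: deque = field(default_factory=deque)
    ultra_compressed: deque = field(default_factory=deque)
    archive: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def zone_tokens(self, zone: str) -> int:
        return sum(turn.tokens for turn in getattr(self, zone))

    def used_tokens(self) -> int:
        # Archived turns are excluded: they cost zero tokens.
        return sum(self.zone_tokens(zone) for zone in ZONE_PERCENT)

    def usage(self) -> float:
        return self.used_tokens() / self.window_tokens

    def over_budget(self, zone: str) -> bool:
        budget = self.window_tokens * ZONE_PERCENT[zone] // 100
        return self.zone_tokens(zone) > budget


def compact(history: History) -> bool:
    """Run one compaction pass; return True if compaction was triggered."""
    if history.usage() < COMPACTION_THRESHOLD:
        return False
    # Keep at least the newest turn verbatim (an assumption of this sketch).
    while history.usage() >= COMPACTION_THRESHOLD and len(history.verbatim) > 1:
        # Step 1: the oldest verbatim turn is summarized and moved to Summary.
        history.summary.append(summarize(history.verbatim.popleft()))
        history.audit_log.append("verbatim -> summary")
        # Step 2: if Summary is full, its oldest turns are summarized again.
        while history.over_budget("summary"):
            history.ultra_compressed.append(summarize(history.summary.popleft()))
            history.audit_log.append("summary -> ultra_compressed")
        # Step 3: if Ultra-compressed is full, its oldest turns are archived.
        while history.over_budget("ultra_compressed"):
            history.archive.append(history.ultra_compressed.popleft())
            history.audit_log.append("ultra_compressed -> archive")
    return True
```

In this sketch, a 1,000-token window holding nine 100-token verbatim turns (90% usage) is compacted by summarizing the two oldest turns, which brings usage down to 80% and records two entries in `audit_log`.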

Compaction is fully automatic. You can continue typing and the agent keeps running — compaction happens in the background without interrupting the conversation flow.

Manual Alternatives

If you’d prefer more direct control over context management, you have two options:

Fork the session

Forking creates a copy of your session from a specific message checkpoint. The fork starts with a clean context that only includes messages up to the fork point, in their current compaction state. See Sessions for details.
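Conceptually, a fork can be pictured as copying the message list up to the chosen checkpoint. The sketch below is a simplified model under that assumption. `Session`, `fork_session`, and the dictionary-based message shape are hypothetical and are not Luminy's API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    messages: list = field(default_factory=list)


def fork_session(session: Session, checkpoint_index: int, new_id: str) -> Session:
    """Copy messages up to and including the checkpoint into a new session.

    Each message keeps whatever compaction state (zone, summary) it had
    at fork time; later changes to either session do not affect the other.
    """
    if not 0 <= checkpoint_index < len(session.messages):
        raise IndexError("checkpoint_index is outside the session history")
    kept = copy.deepcopy(session.messages[: checkpoint_index + 1])
    return Session(session_id=new_id, messages=kept)
```

The deep copy is what keeps the two sessions independent: editing or compacting messages in the fork leaves the original untouched.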

Start a new session

If the context is fully saturated and the conversation has naturally reached a stopping point, starting a new session gives you a completely fresh context window with no compaction overhead.

Compaction Quality

The quality of the generated summaries depends on the model you’re using. More capable models (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 Flash) produce summaries that faithfully preserve the key decisions, file paths, and reasoning from earlier in the session. Smaller or less capable models may produce lower-quality summaries that omit details.

If you are using a small local model (e.g., a 7B parameter Ollama model) for a very long session, compaction summaries may lose important context from early turns. Consider forking the session at key milestones rather than relying on repeated compaction.

Tips for Long Sessions

Fork at checkpoints. After a major feature is complete or a bug is fixed, fork the session. The fork captures the resolved state as a clean baseline, and future context usage starts fresh from that point rather than accumulating from the very beginning of your session.

Use a model with a large context window. Gemini 2.0 Flash offers a 1 million token context window — even very long sessions with heavy tool use will hit the 70% context warning far less frequently than with a 128k-token model.
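A quick calculation shows the scale of the difference. The helper name is hypothetical, and the window sizes are taken as exactly 128,000 and 1,000,000 tokens for simplicity.

```python
def tokens_at(window_tokens: int, percent: int) -> int:
    """Number of tokens in use when the window is `percent` full."""
    return window_tokens * percent // 100


for window in (128_000, 1_000_000):
    warning = tokens_at(window, 70)     # context usage indicator appears
    compaction = tokens_at(window, 85)  # auto-compaction triggers
    print(f"{window:>9,} window: warning at {warning:,}, compaction at {compaction:,}")
```

Under these assumptions, the warning appears at 89,600 tokens on the smaller window and at 700,000 tokens on the larger one, so the larger window can absorb roughly eight times as much history before the indicator shows.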
