Salience-Aware KV Cache Compression With the Forecasting Cache

Divyang Mandani·Jul 03, 2026·5 min read·Insights

Autoregressive decoding at long context is bound by memory, not compute. At every generated token, the model reads the entire key-value cache from high-bandwidth memory. As context grows past 64K tokens, that read dominates the decode step, and the KV cache becomes the primary scalability wall for long-context LLM inference. This is the well-known KV cache memory bandwidth bottleneck. While compute optimization remains important, recent advances in inference-time compute scaling demonstrate that reasoning quality also depends on how efficiently inference resources are allocated.

The dominant response has been eviction. Methods such as H2O, SnapKV, and StreamingLLM shrink the cache by discarding tokens judged unimportant. Every one of these methods estimates importance from a token's past attention. We argue this is the wrong quantity to measure. Importance is not a static property that a token carries. It is a dynamical property that changes as decoding unfolds.

At KriraAI, we studied a failure mode we call deferred salience. A token can receive almost no attention for thousands of steps, then become decisive for a single late reasoning step. Attention-mass eviction discards exactly these tokens, and once evicted, they can never be recovered. The result is confident, silent errors on long reasoning chains.

We propose salience-aware KV cache compression built on a learned forecaster. Instead of scoring past attention, we predict each cached token's future attention demand, and we never permanently evict. Our method, the Salience Forecasting Cache, recovers 98.7% of full-cache accuracy at a 25% hot budget on 128K context, against 71.4% for H2O. This post covers the problem mechanics, the full architecture, our experimental protocol, results and ablations, and where the approach still breaks.

The Problem: Why KV Cache Eviction Silently Breaks Long-Context Reasoning

Long-context serving faces a hard constraint. Retaining the full KV cache is quality-optimal but bandwidth-prohibitive, while shrinking it trades quality for speed. The community has largely accepted this trade as unavoidable. Our research questions whether the trade is fundamental or an artifact of how eviction decisions are made.

The bandwidth wall in decoding

Decode is memory-bandwidth-bound because arithmetic intensity is low. Each step performs matrix-vector products against the cached keys and values, so every cached byte is read once and used for only a few operations. Assuming 32 layers and a head dimension of 128, a 128K-token FP16 cache for an 8B model occupies about 16 GiB per sequence with grouped-query attention (8 KV heads, as in Llama-3.1-8B) and about 64 GiB with full multi-head attention (32 KV heads), and on an H100 it must be streamed from HBM every step. At batch scale, this read saturates HBM long before the tensor cores are busy. Reducing bytes read per step, not FLOPs, is what moves throughput.
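The size of that read follows directly from the model's shape. The sketch below is a back-of-envelope calculation, not a measurement: the layer count, head dimension, and KV head counts are assumed values typical of 8B-class models, the 3.35 TB/s figure is an assumed peak HBM bandwidth that real kernels do not fully reach, and the helper names are illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to hold keys and values for one sequence."""
    # The leading factor of 2 covers keys plus values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def read_time_ms(cache_bytes, hbm_bytes_per_second):
    """Lower bound on the time to stream the whole cache once from HBM."""
    return 1000.0 * cache_bytes / hbm_bytes_per_second


SEQ_LEN = 128 * 1024
GIB = 1024 ** 3

# Assumed shapes: 32 layers, head_dim 128, FP16 values.
# 8 KV heads models grouped-query attention; 32 models full multi-head attention.
gqa = kv_cache_bytes(32, 8, 128, SEQ_LEN)
mha = kv_cache_bytes(32, 32, 128, SEQ_LEN)

# Assumed peak HBM bandwidth in bytes per second; sustained bandwidth is lower.
HBM_BW = 3.35e12

print(f"GQA cache: {gqa / GIB:.0f} GiB, at least {read_time_ms(gqa, HBM_BW):.1f} ms per step")
print(f"MHA cache: {mha / GIB:.0f} GiB, at least {read_time_ms(mha, HBM_BW):.1f} ms per step")
```

Under these assumptions the number of KV heads changes the footprint by 4x, so any per-sequence figure depends strongly on whether the model uses grouped-query attention.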

Why attention-mass eviction fails

Eviction methods share one assumption. They treat accumulated or recent attention mass as a proxy for a token's future value. This is a recency-biased estimator. It works when importance is stationary and fails when importance is deferred.

The concrete failure looks like this. A variable is defined early in a reasoning trace. It draws little attention while intermediate steps proceed. At the final step, the model needs it, but the token was evicted long ago. The model cannot re-attend to what is gone. It then fabricates a plausible value. This is the mechanism behind much of the KV cache eviction quality degradation reported on multi-hop tasks.
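A toy simulation makes the mechanism concrete. This is a deliberately simplified sketch in the spirit of accumulated-attention eviction, not an implementation of H2O or any other published method: the token names, attention weights, and budget are invented for illustration, and the weights are not normalized.

```python
def h2o_style_evict(accumulated, budget):
    """Keep the `budget` tokens with the largest accumulated attention mass."""
    ranked = sorted(accumulated, key=accumulated.get, reverse=True)
    return set(ranked[:budget])


def simulate(num_steps, budget, deferred_token, need_step):
    """Return the step at which `deferred_token` is evicted, or None if it survives."""
    tokens = ["defn", "a", "b", "c", "d"]
    accumulated = {tok: 0.0 for tok in tokens}
    alive = set(tokens)
    for step in range(num_steps):
        for tok in alive:
            # The deferred token draws almost no attention until it is needed.
            if tok == deferred_token:
                weight = 0.9 if step >= need_step else 0.001
            else:
                weight = 0.2
            accumulated[tok] += weight
        keep = h2o_style_evict({t: accumulated[t] for t in alive}, budget)
        if deferred_token in alive and deferred_token not in keep:
            return step  # Evicted for good; it cannot be re-attended later.
        alive = keep
    return None


evicted_at = simulate(num_steps=100, budget=4, deferred_token="defn", need_step=50)
print(f"'defn' evicted at step {evicted_at}, but needed at step 50")
```

The early definition is dropped the first time the cache is over budget, long before the step that depends on it.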

The limits of existing methods

We group prior work into three families, each with a specific shortcoming.

Attention-score eviction such as H2O and Scissorhands keeps heavy hitters by past attention mass, which structurally cannot retain deferred-salience tokens because their past mass is near zero.
Window and sink methods such as StreamingLLM keep a recent window plus attention sinks, which preserves fluency but discards arbitrary long-range dependencies outside the window.
KV quantization methods such as KIVI reduce bytes per entry rather than entry count, which helps bandwidth but degrades sharply below INT3 and does not address which tokens to prioritize.

None of these families models the future. They all commit to irreversible decisions using backward-looking signals. That is the gap our research targets.

Core Insight: Salience Is a Forecastable Dynamical Quantity

Our central hypothesis is simple to state and consequential in practice. A cached token's future attention demand is predictable from its recent trajectory and its key geometry. If salience can be forecast, eviction is the wrong primitive. Demotion and timely promotion are the right ones.

We arrived at this through a measurement study. We instrumented decoding on long-context traces and logged per-token attention over time. We found that future salience is not random. Tokens that later spike show characteristic precursors, including slowly rising key-query alignment and membership in specific heads. These precursors give a learnable signal.

We also found that salience dynamics are extremely non-uniform across heads. A small set of heads behaves like retrieval heads, spiking on distant specific tokens, while most heads are local. This heterogeneity means a single global eviction budget is structurally suboptimal. Different heads need different retention policies. These two observations, forecastability and head heterogeneity, motivate the entire architecture.

Methodology: The Salience Forecasting Cache

This section is the technical centrepiece. Our approach to salience-aware KV cache compression, the Salience Forecasting Cache, has three components that operate alongside a frozen base model. Nothing in the base weights changes. We add a tiny forecaster, a tiered memory manager, and a head-adaptive budget controller. We describe each and the objective used to train the forecaster.

The design goal is to eliminate irreversible loss. We never delete a token. We move it to a slower, cheaper memory tier and bring it back before it is needed. Forecasting is what makes timely return possible.

Component one: the Latent Salience Forecaster

The Latent Salience Forecaster is a lightweight recurrent module that runs per grouped-query attention group rather than per head, keeping its cost negligible. For each cached token, it consumes a compact feature vector and emits a forecasted salience score over a horizon window H. The forecast is the probability that the token receives attention above threshold tau within the next H decode steps.
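One way to write this target down, using notation that is assumed here and not taken from the original text: let a_i(t') be the attention weight that token i receives at decode step t' within a head group, and let \hat{s}_i(t) be the forecaster's output for token i at step t.

```latex
\hat{s}_i(t) \;\approx\; \Pr\!\left[\, \max_{t < t' \le t + H} a_i(t') > \tau \,\right]
```

Read this way, the forecaster is a binary classifier over the event that the token crosses the threshold at least once inside the horizon.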

The per-token feature vector is intentionally small and cheap to maintain. It contains the exponentially weighted moving average of the token's recent attention weight, its positional distance from the decode front, the L2 norm of its key vector, and a low-rank projection of the key into an 8-dimensional latent. We call this projection the salience latent, and it is the part of the signal that generalizes across tasks. The module is a single-layer GRU with a shared projection head, totalling roughly 0.2% of base model parameters.
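The feature vector can be sketched in a few lines. Everything here is illustrative: the function names, the decay constant of 0.9, the 16-dimensional toy key, and the fixed placeholder projection matrix are assumptions, whereas in the described system the projection would be learned along with the forecaster.

```python
import math


def ewma_update(previous, observation, decay=0.9):
    """Exponentially weighted moving average of a token's attention weight."""
    return decay * previous + (1.0 - decay) * observation


def project(key, projection):
    """Low-rank projection: one dot product per row of `projection`."""
    return [sum(w * k for w, k in zip(row, key)) for row in projection]


def salience_features(ewma_attention, token_position, decode_position, key, projection):
    """Assemble the per-token feature vector: EWMA, distance, key norm, latent."""
    distance = decode_position - token_position
    key_norm = math.sqrt(sum(k * k for k in key))
    return [ewma_attention, float(distance), key_norm] + project(key, projection)


# Toy inputs: a 16-dimensional key and a placeholder 8 x 16 projection.
key = [0.1 * i for i in range(16)]
projection = [[1.0 if col == row else 0.0 for col in range(16)] for row in range(8)]
features = salience_features(ewma_update(0.0, 0.2), 100, 4096, key, projection)
print(len(features))  # 3 scalar features plus the 8-dimensional latent
```

Each feature needs only the token's own running state, so the vector can be refreshed at every step without revisiting the attention history.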

We chose a GRU over a Transformer forecaster deliberately. The forecaster must run every step at negligible overhead. A recurrent state gives O(1) per-token updates and avoids a second quadratic attention. We considered a small MLP on windowed features but rejected it because it discarded temporal order, which our ablations show carries most of the predictive signal.

Component two: the three-tier cascade memory

Forecasts drive placement in a memory hierarchy with three tiers.

Tier 0 holds full-precision keys and values in HBM for the hot set of currently or imminently salient tokens.
Tier 1 holds INT4-quantized keys and values in HBM for warm tokens that may return within the horizon.
Tier 2 holds full-precision keys and values in host memory or NVMe for cold tokens with low forecasted salience.

Promotion moves a token toward Tier 0 when its forecast crosses a threshold. Demotion moves it outward when its forecast decays. Because Tier 2 preserves the exact token, no information is destroyed. This is the property that attention-mass eviction can never offer.
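The placement logic can be sketched as a small state update. The two thresholds, the class and function names, and the absence of hysteresis are assumptions made for illustration; the text above only specifies that forecasts drive promotion and demotion across the three tiers.

```python
from enum import IntEnum


class Tier(IntEnum):
    HOT = 0    # full precision, HBM
    WARM = 1   # INT4, HBM
    COLD = 2   # full precision, host memory or NVMe


def target_tier(forecast, hot_threshold=0.5, warm_threshold=0.1):
    """Map a forecasted salience score to a tier (thresholds are hypothetical)."""
    if forecast >= hot_threshold:
        return Tier.HOT
    if forecast >= warm_threshold:
        return Tier.WARM
    return Tier.COLD


def rebalance(placement, forecasts):
    """Move each token to the tier its forecast calls for; nothing is deleted."""
    moves = []
    for token, forecast in forecasts.items():
        new_tier = target_tier(forecast)
        old_tier = placement[token]
        if new_tier != old_tier:
            kind = "promote" if new_tier < old_tier else "demote"
            moves.append((token, kind, old_tier, new_tier))
            placement[token] = new_tier
    return moves


placement = {"x": Tier.HOT, "y": Tier.COLD}
moves = rebalance(placement, {"x": 0.02, "y": 0.8})
print(moves)
```

The token count in `placement` never changes, which is the reversibility property in miniature: a demoted token is still addressable and can be promoted again.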

Component three: head-adaptive budget allocation

We allocate the scarce Tier 0 budget across heads by measured salience volatility rather than uniformly. A controller estimates each head group's volatility from the variance of its recent forecasts. Retrieval-like groups with spiky distant attention receive larger hot budgets. Local groups receive small budgets since their salient tokens are always recent and cheap to keep. The controller rebalances every K steps, which we set to 128 in our experiments.
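A minimal sketch of such a controller follows. It assumes a proportional split with a per-group floor, which is one plausible reading of the description and not a confirmed detail; the function name, the floor of one slot, and the toy forecast histories are all hypothetical.

```python
from statistics import pvariance


def allocate_hot_budget(recent_forecasts, total_slots, min_slots=1):
    """Split `total_slots` across head groups in proportion to forecast variance."""
    volatility = {g: pvariance(values) for g, values in recent_forecasts.items()}
    total = sum(volatility.values())
    spare = total_slots - min_slots * len(volatility)
    if spare < 0:
        raise ValueError("total_slots is too small for the per-group floor")
    budgets = {}
    for group, vol in volatility.items():
        # Fall back to an even split if no group shows any volatility.
        share = vol / total if total > 0 else 1.0 / len(volatility)
        # Truncation can leave a few slots unassigned; a real controller would hand them out.
        budgets[group] = min_slots + int(spare * share)
    return budgets


recent = {
    "retrieval": [0.0, 0.9, 0.0, 0.9],  # spiky forecasts
    "local": [0.5, 0.5, 0.5, 0.5],      # flat forecasts
}
budgets = allocate_hot_budget(recent, total_slots=100)
print(budgets)
```

In a running system this function would be called once per rebalancing interval, for example every 128 steps as in the experiments described here.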

Training the forecaster with an asymmetric salience loss

The forecaster is trained offline against decoding traces from the frozen base model, so no base retraining is needed. We label each cached token at each step by whether it actually receives above-threshold attention within the following H steps. The forecaster predicts that label.
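The labeling rule can be written directly from that description. The sketch below assumes one logged attention trace per token and uses a hypothetical function name; windows near the end of the trace are simply truncated, whereas a real pipeline would more likely drop those steps.

```python
def future_salience_labels(attention_trace, horizon, tau):
    """Label each step by whether the token exceeds `tau` in the next `horizon` steps.

    attention_trace[t] is one token's attention weight at decode step t.
    label[t] is 1 if any step in (t, t + horizon] has weight above tau, else 0.
    """
    labels = []
    for t in range(len(attention_trace)):
        window = attention_trace[t + 1 : t + 1 + horizon]
        labels.append(1 if any(a > tau for a in window) else 0)
    return labels


trace = [0.0, 0.0, 0.0, 0.0, 0.9, 0.0]
labels = future_salience_labels(trace, horizon=2, tau=0.5)
print(labels)
```

The label turns positive before the spike at step 4, and that lead is exactly what the forecaster is asked to learn.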

The key design choice is asymmetry. A false negative is catastrophic because it demotes a token that will be needed and risks a stall or a quality loss. A false positive merely wastes some HBM on an unneeded warm entry. We therefore use an asymmetric binary cross-entropy where the positive class carries a weight alpha much greater than one. In our runs, alpha equals 6.0. We add a focal term to concentrate learning on hard, near-threshold tokens.
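A per-example version of such a loss might look like the following. The positive weight of 6.0 comes from the text; the focal exponent gamma of 2.0, the clamping constant, and the exact way the two terms are combined are assumptions, since the post does not spell them out.

```python
import math


def asymmetric_focal_bce(prob, label, alpha=6.0, gamma=2.0, eps=1e-7):
    """Loss for one forecast `prob` in (0, 1) against a binary `label`."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to keep the logarithm finite
    if label == 1:
        # Missed positives (false negatives) are weighted alpha times more heavily.
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    # The focal factor p ** gamma shrinks the loss on easy, confident negatives.
    return -(p ** gamma) * math.log(1.0 - p)


missed = asymmetric_focal_bce(0.1, 1)       # token was needed, forecast said no
false_alarm = asymmetric_focal_bce(0.9, 0)  # token was not needed, forecast said yes
print(missed / false_alarm)
```

For equally confident mistakes, the missed positive costs alpha times as much as the false alarm, which pushes the forecaster toward keeping borderline tokens warm.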

We also add a promotion-latency term. A forecast that fires too late to hide the Tier 2 transfer is nearly useless. The term penalizes correct predictions whose lead time is shorter than the measured transfer latency for that tier. This shapes the forecaster to fire early enough to prefetch.

Speculative promotion scheduling

Promotion is speculative and latency-aware. When a cold token's forecast crosses the threshold and predicts need at step t plus delta, the scheduler prefetches it now if delta exceeds the tier's transfer latency. The dequantization or PCIe transfer then overlaps with ongoing compute and is hidden. If a forecast fires too late, the fetch is reactive and the step stalls until the token arrives. Minimizing the frequency and cost of these stalls is the practical objective of the whole system.
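The scheduling decision reduces to comparing lead time against transfer latency. This sketch assumes both are expressed in milliseconds and that a late fetch still starts immediately, so only the uncovered remainder stalls the step; the function name and the example numbers are hypothetical.

```python
def plan_fetch(delta_steps, step_time_ms, transfer_latency_ms):
    """Return (mode, stall_ms) for a token predicted to be needed in `delta_steps`."""
    lead_time_ms = max(delta_steps, 0) * step_time_ms
    if lead_time_ms > transfer_latency_ms:
        # Enough lead time: the transfer overlaps with compute and is hidden.
        return "prefetch", 0.0
    # Too late to hide the transfer; the decode step waits for the remainder.
    return "reactive", transfer_latency_ms - lead_time_ms


print(plan_fetch(delta_steps=10, step_time_ms=5.0, transfer_latency_ms=20.0))
print(plan_fetch(delta_steps=2, step_time_ms=5.0, transfer_latency_ms=20.0))
```

Because the latency differs per tier, the same forecast can be early enough for a Tier 1 dequantization and too late for a Tier 2 transfer.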

Experimental Setup

We designed experiments to test one claim above all others. Forecasting future salience preserves deferred-salience tokens that attention-mass eviction destroys, at a competitive memory and bandwidth budget. Our protocol targets long-context LLM inference, where the effect is largest.

Models and context regimes

We evaluated three base models to test generality across scale and family. We used Llama-3.1-8B, Mistral-Nemo-12B, and Llama-3.1-70B. We ran each at three context regimes of 32K, 64K, and 128K tokens. All comparisons hold the hot Tier 0 budget fixed as a percentage of full cache so that methods are compared at equal HBM footprint.

Baselines

We selected baselines spanning the three prior families plus two references. We compared against H2O and SnapKV for attention-score eviction, StreamingLLM for window plus sink, and KIVI at INT4 and INT2 for quantization. We included the full cache as a quality upper bound and PagedAttention as an uncompressed bandwidth reference. These are fair comparisons because each represents the strongest public method in its family at the time of our study.

Benchmarks including DEFER-Bench

We used RULER for synthetic multi-hop and variable tracking, and LongBench for realistic document tasks. Because no public benchmark isolates deferred salience, we built DEFER-Bench. Its tasks plant a fact early in context that becomes necessary only during the final generated answer, with distractor traffic in between. We also included an enterprise long-document QA set over regulatory and contract text, which KriraAI curated from our applied deployments to reflect real long-context workloads.

Metrics

We measured task accuracy, KV cache footprint in gigabytes, decode throughput in tokens per second, and effective per-step KV read bandwidth. We added one new metric, the Deferred Retention Rate, defined as the fraction of deferred-salience tokens available in HBM or promoted in time when they are needed. This metric directly captures the failure mode our research targets.
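Computed from logged events, the Deferred Retention Rate is a simple ratio. The event schema below, with one record per deferred-salience token at the step it is needed and two boolean fields, is an assumed format for illustration and not the authors' logging layout.

```python
def deferred_retention_rate(events):
    """Fraction of deferred-salience tokens that were available when needed.

    A token counts as retained if it was already in HBM or if its promotion
    finished before the step that needed it.
    """
    if not events:
        raise ValueError("no deferred-salience events to score")
    retained = sum(1 for e in events if e["in_hbm"] or e["promoted_in_time"])
    return retained / len(events)


events = [
    {"in_hbm": True, "promoted_in_time": False},
    {"in_hbm": False, "promoted_in_time": True},
    {"in_hbm": False, "promoted_in_time": False},  # evicted or fetched too late
    {"in_hbm": True, "promoted_in_time": False},
]
print(deferred_retention_rate(events))
```

For an eviction baseline, the second field is always false, so the metric reduces to the fraction of deferred-salience tokens that survived in the cache.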

Hardware and configuration

We ran on nodes of eight H100 80GB GPUs with NVLink, and repeated constrained runs on a single H100 and on an L40S to test edge-like conditions. The forecaster was trained on 40 million logged decoding steps, requiring roughly 6 GPU-hours total. Horizon H was set to 512 steps and threshold tau to the ninetieth percentile of per-head attention.

Results and Analysis

Our results support the central claim and reveal a sharp mechanistic story about where the benefit comes from. We report accuracy, efficiency, ablations, and honest failure cases.

Main long-context accuracy

At 128K context on Llama-3.1-8B with a 25% Tier 0 budget, salience-aware KV cache compression recovered 98.7% of full-cache accuracy on RULER multi-hop. At the same budget, H2O reached 71.4%, and SnapKV reached 83.2%. Our method therefore improved on H2O by 27.3 points at equal HBM footprint. The gap widened as context grew, which matches the intuition that deferred salience is rarer at short context and pervasive at long context.

The DEFER-Bench results were the most striking. Our Deferred Retention Rate was 94.1%, against 38.6% for H2O. This is the quantitative fingerprint of the deferred-salience failure. Attention-mass methods lose the majority of tokens whose importance arrives late, and that loss maps almost directly onto their accuracy collapse.

Memory and bandwidth

The efficiency gains were substantial and came from reads, not just storage. Our method reduced HBM KV footprint by 3.8x at 128K while holding accuracy loss under 1.5%. For contrast, KIVI at INT4 reduced footprint by 4x but dropped 6.2% on multi-hop, because precision loss and deferred loss compound. Because most reads hit the small hot tier, effective per-step KV read bandwidth fell by 62%, yielding a 2.3x decode throughput improvement over the full cache and a 1.4x improvement over SnapKV.
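The bandwidth and throughput figures can be sanity-checked against each other with a simple Amdahl-style model. This is an interpretive sketch, not part of the reported measurements: it assumes step time splits cleanly into a KV-read portion that scales with bytes read and a remainder that does not, and the function names are hypothetical.

```python
def amdahl_speedup(kv_fraction, read_reduction):
    """Speedup when only the KV-read share of step time shrinks."""
    return 1.0 / ((1.0 - kv_fraction) + kv_fraction * (1.0 - read_reduction))


def implied_kv_fraction(speedup, read_reduction):
    """Invert the model: the KV-read share of baseline step time that explains `speedup`."""
    return (1.0 - 1.0 / speedup) / read_reduction


# If decode time were entirely KV reads, a 62% cut would cap speedup at 1 / 0.38.
ceiling = amdahl_speedup(1.0, 0.62)

# The reported 2.3x then implies the share of baseline step time spent on KV reads.
fraction = implied_kv_fraction(2.3, 0.62)

print(f"ceiling: {ceiling:.2f}x, implied KV-read share: {fraction:.0%}")
```

Under this model a 2.3x gain sits below the ceiling, which is consistent with KV reads dominating the baseline decode step without accounting for all of it.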

Speculative promotion did most of the work of hiding tier latency. Across runs, 96.2% of promotions completed before the token was needed. The remaining 3.8% incurred a reactive fetch with a mean stall of 0.9 milliseconds. This confirms that forecasting with an adequate horizon converts most potential stalls into hidden background transfers.

Ablation study

We ablated each component to isolate its contribution.

Replacing the forecaster with accumulated attention collapsed the method toward H2O, confirming that forecasting, not the memory hierarchy alone, drives the gain.
Replacing demotion with true eviction removed the recovery path and dropped DEFER-Bench retention from 94.1% to 51.0%, confirming that irreversibility is the core harm.
Removing speculative promotion left reactive fetching, which preserved accuracy but cut throughput by 34% due to stalls.
Replacing head-adaptive budgets with uniform budgets wasted hot memory on local heads and lowered accuracy by 4.9 points at fixed footprint.
Switching the asymmetric loss to symmetric raised false negatives and reduced retention by 11 points.

Failure cases and a counterintuitive finding

Two findings were surprising and important for honesty. First, the benefit concentrated in a tiny set of heads. The 4% to 6% of head groups our controller identified as retrieval heads captured 88% of the total improvement. Forecasting on the remaining local heads added almost nothing. This suggests future systems could forecast only retrieval heads and save further compute.

Second, our method degraded on adversarially unpredictable salience. When we planted references with no precursor signal, the forecaster could not predict them and behaved like H2O. Below a 10% hot budget, the whole system also collapsed, because Tier 2 promotion bandwidth saturated and speculative fetches could not keep pace. Forecasting helps only where salience carries a signal, and only when the memory hierarchy has headroom to move tokens.

Discussion and Implications

Our results reframe the long-context efficiency problem. The field has treated eviction quality as a fixed cost of compression. Our findings suggest a large part of that cost was self-inflicted by backward-looking, irreversible decisions. When importance is forecastable and demotion is reversible, the quality cost of compression shrinks dramatically.

The head-concentration result has practical weight for anyone building production systems. If a small set of retrieval heads carries almost all long-range dependency, then serving stacks should identify those heads and protect their cache aggressively, while compressing local heads freely. This is a more surgical policy than the uniform budgets most systems apply today. It also implies that interpretability work on retrieval heads has direct efficiency payoffs.

There is a broader lesson about what we choose to measure. Much of KV cache eviction quality degradation traces to using a convenient proxy, past attention, in place of the quantity we actually care about, future need. Our research shows that the harder quantity is learnable at trivial cost. We expect the same reframing to apply beyond caching, to prefetching, scheduling, and any system that currently makes irreversible decisions from lagging signals.

For practitioners, the takeaway is concrete. At KriraAI, we deploy long-context assistants over contracts and regulatory filings where a single early clause can decide a late answer. Salience forecasting gave us the memory savings of aggressive compression without the silent errors that make such systems unsafe to ship. That combination is what moves a technique from a benchmark result to a production capability. The same challenge appears in healthcare AI applications, where a clinical observation recorded thousands of tokens earlier may directly affect later diagnostic recommendations.

Limitations and Future Work

This research does not solve the general problem, and several limitations deserve emphasis. The forecaster is trained per model family and does not transfer zero-shot across architectures, since attention dynamics differ. Tier 2 promotion bandwidth is a hard ceiling, and under many concurrent long-context requests, PCIe contention limits how much speculative promotion can hide. The asymmetric weight alpha and the horizon H are fixed hyperparameters that we tuned per deployment rather than learned online.

There are further gaps. Our method helps decoding but does nothing for prefill, where the full cache is first constructed. Batched serving with heterogeneous context lengths complicates head-budget allocation, since one schedule must serve sequences at very different stages. And on genuinely unpredictable salience, forecasting cannot beat chance, so worst-case behavior only matches attention-mass baselines rather than exceeding them.

Our future work targets these directly. We are studying a learned adaptive horizon that expands when forecasts are uncertain, a meta-learned forecaster that transfers across model families with light calibration, and forecaster fusion with speculative decoding so promotion and draft acceptance share signal. We are also extending salience forecasting to chunked prefill. KriraAI is pursuing these as part of an ongoing long-context efficiency program.

Conclusion

This research makes three contributions we consider durable. The problem insight is that KV cache importance is a forecastable dynamical quantity, and that treating it as a static, backward-looking proxy is the root cause of much KV cache eviction quality degradation on long context. The methodological contribution is the Salience Forecasting Cache, a salience-aware KV cache compression system built from a tiny latent salience forecasting module, a reversible three-tier cascade memory, and head-adaptive budgets. The key finding is that this recovers 98.7% of full-cache accuracy with a hot budget of one quarter of the full cache while cutting per-step read bandwidth by 62%, with the benefit concentrated in a small population of retrieval heads.

What we find most instructive is the general lesson. Systems that make irreversible decisions from lagging signals pay a hidden quality tax, and that tax is often avoidable once the forward-looking quantity is made learnable. We believe this reframing extends well beyond caching into scheduling and prefetching across the inference stack.

This post is one piece of a broader research program at KriraAI. We conduct original applied AI research and publish our findings openly, then fold those insights back into the production systems we build for enterprise clients where reliability at long context is not optional. If you are working on long-context LLM inference, on the KV cache memory bandwidth bottleneck, or on retrieval-head interpretability, we would welcome the exchange. Reach out to the KriraAI research team to discuss these findings, challenge our assumptions, or explore collaboration on the next set of open questions.

FAQs

KV cache eviction degrades long-context reasoning because eviction methods estimate a token's importance from its past attention, which is a recency-biased proxy for future need. Tokens with deferred salience receive little attention early, so they are evicted, and once removed, they can never be re-attended. On multi-hop and variable-tracking tasks, this produces confident, silent errors. Our measurements show attention-mass methods lose the majority of deferred-salience tokens, which explains most of the accuracy collapse observed at aggressive compression budgets on long context.

Salience-aware KV cache compression uses a small learned forecaster that reads a compact per-token feature vector, including a recent-attention moving average, positional distance, key norm, and a low-rank key latent, and outputs the probability the token will be attended above a threshold within a future horizon. The forecaster is a single-layer recurrent module of roughly 0.2 percent of model parameters, trained offline on decoding traces from the frozen base model. It runs at negligible cost every step and drives placement of tokens across a tiered memory hierarchy.

Attention-score eviction looks backward and irreversibly, keeping tokens with high accumulated past attention and deleting the rest. Latent salience forecasting looks forward and reversibly, predicting each token's future attention demand and demoting low-forecast tokens to a slower memory tier rather than deleting them. The critical distinction is that forecasting can retain tokens whose importance has not yet arrived, and demotion preserves the exact token so it can be promoted back before it is needed. In our ablations, removing reversibility alone cut deferred retention from 94.1 percent to 51.0 percent.

In our experiments at 128K context, salience-aware KV cache compression reduced effective per-step KV read bandwidth by 62 percent and HBM KV footprint by 3.8x, because most decode-step reads hit a small hot tier instead of the full cache. This translated to a 2.3x decode throughput improvement over the full cache on an H100 and 1.4x over SnapKV at equal footprint, with accuracy loss under 1.5 percent. The savings depend on having headroom in the promotion path, and they shrink when concurrent requests saturate transfer bandwidth.

KV cache compression can preserve long-range dependencies, but only if the compression method models future need rather than past attention and avoids irreversible deletion. Our results show that a learned forecaster combined with a demote-and-promote memory hierarchy recovers 98.7 percent of full-cache accuracy at a 25 percent hot budget, where attention-mass eviction reaches only 71.4 percent. The benefit concentrates in a small set of retrieval heads that carry most long-range dependencies. Compression fails to preserve dependencies when salience is genuinely unpredictable, or when the hot budget falls below roughly 10 percent and promotion bandwidth saturates.

Divyang Mandani

Founder & CEO

Divyang Mandani is the CEO of KriraAI, driving innovative AI and IT solutions with a focus on transformative technology, ethical AI, and impactful digital strategies for businesses worldwide.
