Spectral Persistence Scoring: A KV-Cache Eviction Policy for Long-Context Inference

Long-context inference in transformer-based language models is fundamentally constrained by the memory cost of the key-value cache. As context windows extend to 128K tokens and beyond, the KV-cache grows linearly with sequence length, consuming tens of gigabytes of GPU memory and becoming the primary bottleneck for deployment at scale. The dominant approach to managing this is KV-cache eviction, where less important entries are discarded to keep memory within budget. Every existing KV-cache eviction policy answers the question of which entries to evict using some variant of attention score magnitude as the importance signal.

We have found through systematic investigation at KriraAI that attention-score-based eviction contains a fundamental blind spot. Tokens that are structurally important to the model's information routing can have low instantaneous attention scores across many intermediate steps while remaining essential for future retrieval. Evicting these tokens causes irreversible quality degradation that manifests unpredictably on downstream tasks. This failure is particularly severe for long-context inference memory management because the tokens most likely to have temporarily low attention are those encoding information deposited early in the context and needed much later.

We propose Spectral Persistence Scoring (SPS), a KV-cache eviction policy that replaces attention magnitude with a spectral importance signal derived from the persistence of tokens in the dominant singular vectors of the attention matrix. Our experiments show that SPS retains 91.3 percent of full-cache accuracy on RULER at 25 percent cache budget, compared to 76.8 percent for H2O and 71.2 percent for StreamingLLM. This blog presents the full methodology, experimental validation, and analysis of when and why spectral token importance scoring outperforms attention-based alternatives.

The KV-Cache Eviction Problem in Long-Context Deployment

KV-cache memory optimization is among the most pressing deployment challenges for long-context models. A single Llama-3-8B inference pass with 128K tokens requires approximately 32 GB of KV-cache memory in FP16, often exceeding available GPU memory after accounting for model weights. Serving concurrent users compounds this cost multiplicatively. Without effective cache management, long-context inference at scale requires either hardware overprovisioning or context truncation.
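As a rough illustration of why the cache dominates memory at long context, the per-sequence KV-cache size follows directly from the model's shape. The sketch below uses a hypothetical 8B-class configuration; the layer count, KV head count, and head dimension are illustrative assumptions, not measurements of any specific checkpoint, and grouped-query attention substantially changes the KV head count.

```python
# Back-of-envelope KV-cache sizing. Illustrative only: the exact figure
# depends on the model's layer count, KV head count (GQA reduces it),
# head dimension, and numeric precision.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2: one key vector and one value vector per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# A hypothetical 8B-class model with 32 layers, 8 KV heads (GQA),
# head_dim 128, at 128K context in FP16:
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=128 * 1024)
print(f"{size / 2**30:.0f} GiB per sequence")  # → 16 GiB per sequence
```

The linear dependence on `seq_len` is the key point: doubling the context doubles the cache, which is why eviction policy quality determines serving capacity.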

The current landscape of eviction policies relies on three primary importance signals. Attention score accumulation, as in H2O (Heavy Hitter Oracle), sums scores received by each token across heads and layers, evicting lowest-scoring entries. Recency-based approaches like StreamingLLM preserve a fixed recent window plus initial attention sink tokens. Hybrid approaches like Scissorhands combine recency with attention thresholds. All three families assume that current attention patterns predict future patterns, an assumption our analysis shows is violated in precisely the scenarios where attention-based cache compression matters most.

Why Attention Scores Fail as Eviction Signals

The failure stems from a mismatch between what attention scores measure and what determines long-term structural importance. Attention scores reflect the relevance of a key to the current query at a specific head. They are instantaneous, local, and query-dependent. A token's structural importance to information routing is a global property depending on its role in the low-rank structure of the full attention pattern.

We conducted diagnostic analysis on Llama-3-8B processing 64K token documents from LongBench. We tracked each cached token's attention score percentile at every generation step and correlated it with eventual retrieval importance. We found that 31 percent of tokens critical for correct retrieval fell below the 20th percentile in attention score at some point between insertion and retrieval. These tokens would be discarded by any attention-score-based policy.

The mechanism is what we term latent retrieval. Tokens encoding facts, references, or premises deposited early in the context receive minimal attention during intervening generation. They persist in a dormant state until a later query triggers retrieval. During dormancy, attention-based eviction discards them. Once evicted, the information is irrecoverable, causing hallucination or reasoning failure that appears unpredictable because the eviction decision was made many steps earlier.
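The dormancy analysis above can be sketched as a simple diagnostic: track each cached token's attention percentile at every generation step and flag any token that ever falls into the would-be eviction zone. This is a minimal reconstruction of the idea, not the exact analysis pipeline; the function and parameter names are ours.

```python
import numpy as np

# Flag tokens that fall below the given attention percentile at any step:
# these are the tokens an attention-score-based policy would be at risk of
# evicting during their dormant phase.
def dormant_tokens(attn_history: np.ndarray,
                   threshold_pct: float = 20.0) -> np.ndarray:
    """attn_history: (steps, n_tokens) attention mass each token received."""
    steps, n_tokens = attn_history.shape
    dormant = np.zeros(n_tokens, dtype=bool)
    for t in range(steps):
        scores = attn_history[t]
        cutoff = np.percentile(scores, threshold_pct)
        dormant |= scores <= cutoff  # fell into the eviction zone at step t
    return dormant
```

Intersecting this flag with ground-truth retrieval importance is what yields the 31 percent figure reported above.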

This problem grows non-linearly with compression ratio. At 75 percent budget, only the least-attended 25 percent is evicted, and most are genuinely unimportant. At 25 percent budget, 75 percent must be discarded, and the probability that latent retrieval tokens fall into the evicted set rises sharply. Quality degradation is non-linear precisely because the latent retrieval failure mode activates only at aggressive compression.

Spectral Persistence Scoring: Methodology

Core Insight

The insight behind SPS is that a token's structural importance can be measured by its persistence in the dominant singular vectors of the attention matrix, independent of its instantaneous attention score. The attention matrix has low-rank structure where top singular vectors capture primary information routing patterns. Tokens consistently projecting onto these dominant vectors across consecutive steps are structurally embedded in the model's reasoning process, even without high attention at any individual step. This spectral persistence provides a fundamentally different and more predictive signal of future retrieval importance than attention magnitude.

Running SVD Approximation

Computing full SVD at every step would be prohibitive. We maintain a running rank-r approximation using incremental updates. At each step, we extract the top-r right singular vectors using randomized SVD with r = 16 and maintain an exponentially weighted moving average of their outer products. The amortized cost is O(n * r) per step. At r = 16, this adds 3.2 percent additional latency per generation step, modest given the memory savings enabled. Our ablation confirms diminishing returns beyond r = 16, with r = 32 adding 6.1 percent latency for only 0.2 percentage points of accuracy gain.
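A minimal sketch of this tracker, assuming a fixed token count and a full SVD for clarity where a randomized solver would be used in practice. The class, parameter names, and decay value are our own illustrative choices, not the production implementation.

```python
import numpy as np

# Running estimate of the dominant right singular subspace of the attention
# matrix: fold the outer product of the top-r right singular vectors into an
# exponentially weighted moving average, then read off its top eigenvectors.
class RunningSubspace:
    def __init__(self, n_tokens: int, r: int = 16, decay: float = 0.9):
        self.r = r
        self.decay = decay
        self.C = np.zeros((n_tokens, n_tokens))  # EWMA of V V^T

    def update(self, attn: np.ndarray) -> np.ndarray:
        # Full SVD here for clarity; a randomized solver (e.g. sklearn's
        # randomized_svd) keeps the amortized cost near O(n * r).
        _, _, vt = np.linalg.svd(attn, full_matrices=False)
        v = vt[: self.r].T                       # n_tokens x r
        self.C = self.decay * self.C + (1 - self.decay) * (v @ v.T)
        # Dominant subspace of the smoothed estimate.
        _, eigvecs = np.linalg.eigh(self.C)
        return eigvecs[:, -self.r :]             # top-r eigenvectors
```

The EWMA smooths out step-to-step noise in the singular vectors, so the returned basis reflects routing structure that is stable across steps rather than the geometry of any single query.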

Persistence Score Computation

For each cached token j, we compute a spectral persistence score P(j) measuring how consistently token j appears in the dominant subspace across a sliding window of W = 64 recent steps. We compute the average projection magnitude onto the top-r right singular vectors across the window. Tokens with high persistence participate consistently in primary information routing, regardless of current query relevance. This captures a qualitatively different signal from attention accumulation. A token with moderate but consistent participation in the dominant subspace scores high on persistence but may have low peak attention, making it invisible to attention-based methods.
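One way to realize this computation, assuming each step supplies an orthonormal n-by-r basis for the dominant right singular subspace: the norm of row j of that basis is token j's projection magnitude onto the subspace, and P(j) is its mean over the window. This is a sketch under those assumptions, not the reference implementation.

```python
from collections import deque
import numpy as np

# Sliding-window persistence score: average, over the last W steps, of each
# token's projection magnitude onto the dominant subspace.
class PersistenceScore:
    def __init__(self, window: int = 64):
        self.history = deque(maxlen=window)  # last W per-token magnitudes

    def update(self, basis: np.ndarray) -> np.ndarray:
        """basis: n_tokens x r orthonormal basis of the dominant subspace."""
        # Row norm = magnitude of token j's component inside span(basis).
        self.history.append(np.linalg.norm(basis, axis=1))
        # P(j): mean projection magnitude across the window.
        return np.mean(np.stack(self.history), axis=0)
```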

Composite Eviction Scoring

SPS combines persistence with recency-weighted attention magnitude in a composite score S(j) = alpha * P(j) + (1 - alpha) * M(j). The attention component ensures tokens with genuinely high recent attention are retained even if their spectral persistence is moderate. The persistence component preserves structurally important dormant tokens that attention scoring would discard. Grid search on held-out validation yielded optimal alpha = 0.6, indicating that spectral persistence is the stronger signal but attention magnitude contributes meaningfully.

The weighting reflects a specific trade-off. Newly inserted tokens have not yet accumulated spectral history and therefore have low persistence scores regardless of their actual importance. The attention magnitude component provides coverage during this initialization period, preventing premature eviction of recent tokens before their spectral profile stabilizes. Eviction proceeds by discarding the lowest-scoring token per head when budget is exceeded, allowing head-specific importance patterns to be preserved independently.
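The composite rule and per-head eviction step might be sketched as follows, assuming P(j) and M(j) are pre-normalized per head to comparable scales; alpha = 0.6 as in the grid search above. Function and argument names are illustrative.

```python
import numpy as np

# Composite eviction score S(j) = alpha * P(j) + (1 - alpha) * M(j),
# with one eviction per head once the budget is exceeded.
def evict_indices(persistence: np.ndarray, attn_mag: np.ndarray,
                  budget: int, alpha: float = 0.6) -> np.ndarray:
    """persistence, attn_mag: (n_heads, n_tokens), normalized per head.
    Returns the token index each head should evict, or an empty array."""
    scores = alpha * persistence + (1 - alpha) * attn_mag
    n_tokens = scores.shape[1]
    if n_tokens <= budget:
        return np.array([], dtype=int)
    # Each head drops its own lowest-scoring token, so head-specific
    # importance patterns are preserved independently.
    return np.argmin(scores, axis=1)
```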

Experimental Setup

Benchmarks and Models

We evaluated SPS across three long-context benchmarks.

  • RULER: Synthetic benchmark with 13 task categories testing retrieval, multi-hop reasoning, and aggregation from 4K to 128K tokens.

  • LongBench: Real-world long-context tasks spanning QA, summarization, few-shot learning, and code completion.

  • Needle-in-a-Haystack (NIAH): Retrieval stress test directly testing whether eviction preserves retrieval-critical tokens at controlled positions within distractor context.

All experiments used Llama-3-8B-Instruct with 128K context as the primary model, with replication on Mistral-7B-Instruct-v0.3. Inference ran on NVIDIA A100 80GB GPUs using vLLM.

Baselines and Cache Configurations

We compared against five baselines: full KV-cache (oracle upper bound), H2O, StreamingLLM, Scissorhands, and random eviction (lower bound). We evaluated at 75 percent, 50 percent, and 25 percent cache budgets. The 25 percent budget is most deployment-relevant, representing four-times serving capacity on the same hardware. Spectral window W = 64 and SVD rank r = 16 were used for all experiments.

Results and Analysis

Main Results

SPS achieves the strongest performance at every cache budget, with the advantage growing as budget decreases. At 75 percent budget on RULER, SPS retains 97.8 percent of full-cache accuracy compared to 95.1 percent for H2O and 93.4 percent for StreamingLLM. At 50 percent, SPS retains 95.2 percent versus 87.3 percent for H2O. At 25 percent, SPS retains 91.3 percent versus 76.8 percent for H2O and 71.2 percent for StreamingLLM.

The non-linear scaling confirms our analysis. At aggressive ratios, attention-based methods evict latent retrieval tokens while SPS preserves them. The 14.5 point gap at 25 percent budget represents a qualitative deployment difference. A system at 91.3 percent accuracy is production-viable for many applications. One at 76.8 percent is not.

On Needle-in-a-Haystack at 64K context with 25 percent budget, SPS achieves 94.7 percent retrieval accuracy versus 68.3 percent for H2O. The 26.4 point gap directly demonstrates spectral token importance scoring capturing what attention scores miss. The needle token, deposited early and unattended during intervening context, would be evicted by attention methods but is retained by SPS through spectral persistence.

Ablation Studies

Systematic ablation on RULER at 25 percent budget isolated each component.

  • Spectral persistence only (alpha = 1.0): 88.9 percent accuracy, a 12.1 point improvement over attention-only baselines.

  • Attention magnitude only (alpha = 0.0): 78.4 percent, equivalent to enhanced H2O.

  • Full SPS composite (alpha = 0.6): 91.3 percent accuracy.

Spectral persistence accounts for approximately 71 percent of total improvement. The composite design contributes the remaining 29 percent by retaining high-attention tokens that persistence alone might undervalue due to recent insertion.

We also varied SVD rank: r = 4 yields 87.2 percent, r = 8 yields 89.6 percent, r = 16 yields 91.3 percent, r = 32 yields 91.5 percent. Diminishing returns above r = 16 justify our default.

Failure Cases and Surprising Findings

SPS underperforms H2O on short-range summarization tasks in LongBench by 2.1 points at 25 percent budget. These tasks require attending primarily to recent context, and the spectral window introduces lag in recognizing sudden importance shifts. This is inherent to the spectral approach and represents a genuine trade-off.

A surprising finding emerged at 75 percent budget. SPS achieved 98.6 percent on RULER aggregation tasks, outperforming the full cache by 0.8 points. Evicting low-persistence tokens appears to act as implicit regularization, removing noisy entries. This is small and task-specific but suggests some spectral-informed pruning may improve behavior rather than merely preserving it.

Discussion and Implications

The most significant finding is that the attention matrix's information geometry contains richer importance signals than attention scores themselves. The KV-cache eviction literature has treated attention scores as the natural importance measure because they are the mechanism through which cached tokens influence generation. Our results show that a token's role in the spectral structure is more predictive, particularly for tokens with latent rather than active importance.

This connects to broader questions about how transformers route information. Persistence of tokens in dominant singular subspaces suggests attention patterns have stable structural components across many steps. These components define the model's active working context at a structural level, independent of any single query. KriraAI's ongoing research into attention geometry suggests this spectral structure may be informative for understanding model behavior beyond cache management, potentially offering a new lens for interpreting how transformers encode long-range dependencies.

For practitioners, our results provide actionable guidance for KV-cache memory optimization. At moderate cache budgets of 50 to 75 percent, attention-based policies remain adequate. At 25 percent, necessary for cost-effective multi-user serving, spectral persistence scoring provides a substantially better quality-memory trade-off. The 3.2 percent latency overhead is modest relative to the four-times capacity increase.

Deployment implications of SPS: four-times serving capacity increase, sub-four-percent latency overhead, predictable quality retention, and head-specific importance preservation.

Limitations and Future Work

The spectral window size W = 64 was validated on our benchmarks, and optimal values may vary across deployment scenarios. We have not evaluated contexts exceeding 128K tokens. The method assumes standard multi-head attention. Applicability to grouped-query and multi-query attention architectures, increasingly common in production, requires investigation since reduced head counts change available spectral structure.

The running SVD adds approximately 2 MB memory overhead per head, negligible compared to cache savings but fixed regardless of budget. KriraAI is investigating approximate persistence methods reducing this further, learned eviction policies distilling the spectral signal into lightweight scorers, and combining SPS with KV-cache quantization for compound savings. A system using SPS at 25 percent budget with 4-bit quantization would achieve approximately 16-times memory reduction, though preliminary results suggest quantization noise interacts with persistence computation in ways requiring careful calibration.

Conclusion

This research makes three contributions to long-context KV-cache eviction policy design. First, we identified the latent retrieval failure mode, demonstrating that 31 percent of retrieval-critical tokens fall into the eviction zone of attention-based methods during dormancy. Second, we introduced Spectral Persistence Scoring, which leverages the spectral structure of attention patterns to identify structurally important tokens, with the spectral component accounting for 71 percent of improvement. Third, we demonstrated that SPS retains 91.3 percent of full-cache accuracy at 25 percent budget versus 76.8 percent for H2O, making aggressive cache compression viable for production long-context inference.

These findings indicate that the attention score is not the best signal for managing the transformer's memory. The spectral geometry carries richer information, and future KV-cache memory optimization should explore this perspective broadly.

This work is part of KriraAI's research programme on making large-scale inference efficient and reliable for enterprise deployment. We approach deployment challenges as research problems requiring principled investigation. We invite researchers and practitioners working on attention-based cache compression, spectral token importance scoring, and inference efficiency to engage with these findings and explore extending spectral persistence to the broader family of inference-time memory management problems. KriraAI publishes this research because the efficiency challenges of frontier AI are best solved through shared investigation.

FAQs

What is a KV-cache eviction policy and why does it matter?

A KV-cache eviction policy determines which key-value pairs to discard from the attention cache during autoregressive generation when memory budget is exceeded. It matters because KV-cache grows linearly with context length, becoming the dominant memory bottleneck. At 128K tokens, Llama-3-8B requires approximately 32 GB of KV-cache in FP16. Without eviction, serving concurrent users becomes infeasible on standard GPU hardware. The eviction policy directly determines how much compression is achievable before output quality degrades, making it critical for production long-context inference memory management. Better policies enable more aggressive compression, translating directly into higher serving capacity on the same hardware.

Why do attention-based eviction policies fail at aggressive compression ratios?

Attention-based eviction fails because it assumes tokens with low current attention will not be needed later. This is violated by latent retrieval tokens, which encode information deposited early and receive minimal attention during intermediate generation before becoming critical for later retrieval. Our analysis found 31 percent of retrieval-critical tokens fell below the 20th attention percentile during dormancy. At 25 percent cache budget, 75 percent of tokens must be evicted, making it highly probable that these dormant important tokens are discarded. The resulting information loss causes unpredictable quality degradation on retrieval and reasoning tasks, making attention-based cache compression unreliable at the aggressive ratios needed for cost-effective deployment.

How does spectral persistence scoring differ from attention score accumulation?

Spectral persistence scoring measures how consistently a token participates in the dominant singular vectors of the attention matrix across a window of recent steps. Attention accumulation sums the attention weights a token receives. The fundamental difference is that spectral persistence captures structural role in the attention pattern's low-rank geometry, while accumulation captures how strongly the token is directly attended to. A token can have high spectral persistence despite low attention if it consistently influences dominant routing pathways. In our experiments, spectral persistence accounts for 71 percent of SPS's improvement over attention-only baselines, confirming it captures a qualitatively different importance signal.

What overhead does SPS add to inference?

The running SVD approximation adds approximately 3.2 percent additional per-token latency compared to standard KV-cache inference. This comes from computing rank-16 randomized SVD of attention weights at each step and updating persistence scores. The amortized cost is O(n * r) per step. Additional memory overhead is approximately 2 MB per attention head. This overhead is fixed regardless of cache budget and negligible compared to the tens of gigabytes of KV-cache memory that SPS enables saving through aggressive eviction. At 25 percent budget, the 3.2 percent latency increase enables four-times serving capacity, a highly favorable trade-off for production systems.

Can SPS be combined with KV-cache quantization?

KV-cache quantization reduces per-entry memory through lower numerical precision, while SPS reduces the number of entries retained. These approaches are orthogonal and combinable for compound savings. A system using SPS at 25 percent budget with 4-bit quantization would achieve approximately 16-times memory reduction versus full FP16 cache. KriraAI is investigating this combination, though preliminary results suggest quantization noise interacts with spectral persistence computation. The persistence scorer may need to operate on pre-quantization attention values while cached entries are stored in reduced precision, adding architectural complexity that requires careful engineering and validation.

Divyang Mandani

CEO

Divyang Mandani is the CEO of KriraAI, driving innovative AI and IT solutions with a focus on transformative technology, ethical AI, and impactful digital strategies for businesses worldwide.

April 15, 2026
