Adaptive Layer-wise Attention Entropy Regularisation for Deep Transformer Stability

There is a failure mode in deep transformer training that receives less attention than its consequences deserve. As transformer depth increases beyond 24 layers, the multi-head attention distributions in upper layers progressively concentrate onto narrow token subsets, producing near-uniform hidden representations across positions by the time the signal reaches the final projection layers. This phenomenon, which we term progressive attention entropy collapse, is not a curiosity. It is a systematic training pathology that limits the representational capacity of deep models, destabilises fine-tuning on downstream tasks, and partially explains the diminishing returns observed when scaling transformer depth beyond certain thresholds.
Existing mitigations are either architectural, such as rotary positional embeddings and ALiBi, which address a related but distinct problem, or regularisation-based at a global level, such as uniform entropy bonuses applied identically across all layers. Neither approach addresses what we identified as the core mechanism: the sensitivity of attention entropy to gradient magnitude varies significantly across layers, and a uniform regularisation coefficient that is strong enough to prevent collapse in upper layers is strong enough to distort the well-behaved attention distributions in lower layers.
In this post, we present KriraAI's Adaptive Layer-wise Attention Entropy Regularisation framework, which we call ALAER. The framework introduces a learned gradient-sensitivity scheduler that assigns per-layer entropy penalty coefficients dynamically during training, targeting regularisation precisely where and when entropy collapse is observed to occur. We describe the architecture in full, present our experimental findings across three model scales and four benchmark tasks, and discuss what our results reveal about the mechanics of depth-related attention pathology. We also present ablation results isolating the contribution of each component and an honest account of where ALAER does not fully solve the problem.
Understanding Attention Entropy Collapse in Deep Transformers
Attention entropy collapse in deep transformers is a precise phenomenon with a specific mechanistic cause. To understand it, consider what the attention entropy of a single head measures: it is the Shannon entropy of the softmax-normalised attention weight distribution over the sequence positions attended to from a given query position. High entropy means attention is distributed broadly across positions. Low entropy means attention has collapsed onto one or a small number of positions. When entropy approaches zero, every query attends to essentially the same key, and the attention mechanism becomes informationally degenerate.
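To make the quantity concrete, here is a minimal sketch of per-head attention entropy in NumPy. The function name and shapes are illustrative, not taken from any particular codebase; it assumes the attention weights have already been softmax-normalised over the key dimension.

```python
import numpy as np

def attention_entropy(attn_weights):
    """Mean Shannon entropy (in nats) of attention distributions.

    attn_weights: array of shape (heads, queries, keys), where each
    query's row is softmax-normalised and sums to 1.
    """
    eps = 1e-12  # guard against log(0) for fully collapsed rows
    p = np.clip(attn_weights, eps, 1.0)
    per_query = -(p * np.log(p)).sum(axis=-1)  # (heads, queries)
    return per_query.mean()

# A uniform distribution over 16 keys has entropy ln(16) ~ 2.77 nats;
# a one-hot (fully collapsed) distribution has entropy ~ 0 nats.
uniform = np.full((1, 1, 16), 1 / 16)
collapsed = np.zeros((1, 1, 16))
collapsed[..., 0] = 1.0
print(round(float(attention_entropy(uniform)), 2))    # 2.77
print(round(float(attention_entropy(collapsed)), 2))  # 0.0
```

The uniform case gives the maximum possible entropy for the sequence length, which is why the entropy values reported later in this post (for example, the 2.8-nat ceiling) should be read relative to the log of the attended sequence length.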
Why Collapse Occurs in Upper Layers
The cause of collapse is not random. We identified three cooperating mechanisms through careful layer-by-layer analysis of gradient flow and activation statistics during training.
First, residual stream amplification creates an increasing signal-to-noise ratio as depth increases. The residual connections accumulate representational signal across layers, meaning that by layer 20 of a 32-layer model, the hidden states already encode rich semantic content. When the attention mechanism in upper layers tries to refine these already-rich representations, it tends to converge on a single dominant attended-to position, typically the class token or a high-frequency content token, because the gradient signal rewards marginal gains from attending to the most informative position rather than distributing attention for representational diversity.
Second, softmax temperature dynamics interact poorly with accumulated residual signal. As layer depth increases and the norm of the query and key projections grows through training, the dot-product logits passed to softmax grow in magnitude. Larger logit magnitudes sharpen the softmax output, accelerating entropy reduction in a self-reinforcing cycle. This is well understood in the context of single-layer attention, but its layer-dependent amplification in deep residual networks is underappreciated.
Third, gradient flow from the final task loss disproportionately rewards sharp attention in upper layers during early training. The shortest computational path from the task loss to the model parameters runs through the upper layers, meaning those layers receive the strongest gradient signal. If the task is solvable with concentrated attention, the model learns to collapse attention in upper layers quickly, and the collapsed representations then distort the gradient signal flowing back to lower layers.
Consequences for Model Behaviour
The practical consequences of this collapse are measurable. In our preliminary analysis across encoder-only models trained from scratch at depths of 12, 24, and 32 layers, we observed that the average per-head attention entropy in the top quartile of layers dropped to values below 0.8 nats by epoch 3 of training, compared to values above 2.4 nats in the same relative layer positions in the 12-layer model. Models exhibiting severe collapse showed a 19 percent reduction in performance on tasks requiring fine-grained token-level discrimination, such as nested named entity recognition and multi-hop document reading comprehension, relative to shallower models, despite having significantly more parameters.
Existing approaches using uniform entropy regularisation have attempted to address this, but they introduce a coefficient that must be set globally. A coefficient strong enough to prevent upper-layer collapse consistently perturbs the attention distributions in layers 4 through 10, where attention is naturally and appropriately focused, degrading the model's capacity to build sharp local feature representations in early layers. This tradeoff has been observed informally in the practitioner community but has not been formally characterised or addressed with a layer-adaptive solution.
The ALAER Framework: Adaptive Layer-wise Attention Entropy Regularisation

ALAER addresses the failure mode of attention entropy collapse in deep transformers by replacing the global entropy regularisation coefficient with a layer-wise adaptive coefficient schedule that is itself learned during training through a lightweight gradient-sensitivity meta-network. The core insight is that the regularisation need at any layer at any point in training can be inferred from the observed gradient magnitude flowing through that layer's attention parameters. Layers experiencing strong gradient flow and rapidly declining entropy are the layers where regularisation is most needed. Layers where entropy is stable and gradients are moderate should receive minimal regularisation to preserve their natural attention behaviour.
Architecture of the Gradient-Sensitivity Scheduler
The gradient-sensitivity scheduler is a shallow two-layer MLP we call the ALAER-Scheduler. It takes as input a feature vector constructed at each training step for each transformer layer, and outputs a scalar entropy penalty coefficient for that layer.
The input feature vector for layer l at training step t contains the following five signals:
Current attention entropy mean: The mean Shannon entropy across all heads in layer l, computed over the current batch.
Entropy rate of change: The exponentially smoothed first-order difference of attention entropy over the last 50 training steps, capturing the velocity of entropy change.
Gradient magnitude ratio: The ratio of the L2 norm of the gradient flowing through the attention projection weights in layer l to the median gradient magnitude across all layers, normalising for overall training dynamics.
Layer depth fraction: The relative position of layer l within the total depth of the model, expressed as l divided by L where L is total depth.
Training progress fraction: The current step divided by the total scheduled training steps.
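The five signals above can be sketched as a feature-assembly step. This is an illustrative reconstruction, not KriraAI's code: the function names, the smoothing constant, and the use of the median for normalisation follow the descriptions above, but exact values are assumptions.

```python
import numpy as np

def scheduler_features(layer_entropy, entropy_velocity, layer_grad_norm,
                       all_layer_grad_norms, layer_idx, num_layers,
                       step, total_steps):
    """Assemble the five-signal input vector for one layer at one step."""
    grad_ratio = layer_grad_norm / np.median(all_layer_grad_norms)
    return np.array([
        layer_entropy,            # mean attention entropy, this layer
        entropy_velocity,         # smoothed entropy rate of change
        grad_ratio,               # gradient magnitude vs cross-layer median
        layer_idx / num_layers,   # layer depth fraction
        step / total_steps,       # training progress fraction
    ])

def update_entropy_velocity(prev_ema, entropy_now, entropy_prev, beta=0.98):
    """Exponentially smoothed first-order difference of entropy."""
    return beta * prev_ema + (1 - beta) * (entropy_now - entropy_prev)

# Layer 28 of 32, entropy 1.1 nats and falling, gradient norm above median.
f = scheduler_features(1.1, -0.02, 3.0, np.array([1.0, 2.0, 3.0, 4.0]),
                       28, 32, 10_000, 250_000)
print(f)  # [1.1, -0.02, 1.2, 0.875, 0.04]
```

Negative entropy velocity combined with an above-median gradient ratio is exactly the signature the scheduler is trained to respond to with a larger coefficient.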
The ALAER-Scheduler is trained jointly with the main transformer using a meta-objective that combines two terms. The first term rewards the scheduler for assigning higher coefficients to layers where attention entropy is declining and gradient magnitude ratios are elevated, formalised as a negative correlation loss between the output coefficient and a composite collapse risk score. The second term penalises the scheduler for assigning any coefficient so large that it induces artificial entropy increase above a target entropy ceiling, preventing overregularisation. The scheduler contains approximately 8,000 parameters across both MLP layers, making it negligible relative to the main model.
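The first meta-objective term can be illustrated as a negative Pearson correlation between the scheduler's per-layer coefficients and the per-layer collapse risk scores. The functional form here is a plausible reading of the description above, not a confirmed detail of the ALAER implementation.

```python
import numpy as np

def correlation_meta_loss(coeffs, risk_scores):
    """Negative correlation between coefficients and collapse risk.

    Minimising this loss pushes the scheduler to assign its largest
    coefficients to the layers with the highest collapse risk.
    """
    c = (coeffs - coeffs.mean()) / (coeffs.std() + 1e-8)
    r = (risk_scores - risk_scores.mean()) / (risk_scores.std() + 1e-8)
    return -(c * r).mean()

# Coefficients that track risk layer-by-layer yield a loss near -1
# (perfect alignment); anti-aligned assignments yield a loss near +1.
coeffs = np.array([0.00, 0.00, 0.01, 0.05])
risk = 3.0 * coeffs + 0.5  # perfectly aligned, for illustration
print(correlation_meta_loss(coeffs, risk))  # ~ -1.0
```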
Per-Layer Entropy Penalty Integration
The per-layer entropy penalty is integrated into the training loss as a weighted additive term. For a transformer with L layers, the augmented training loss is written as the sum of the task loss and the entropy regularisation term. The entropy regularisation term is the sum over all layers l from 1 to L of the product of the ALAER-Scheduler output coefficient for layer l and the negative mean attention entropy across all heads in layer l.
The negative entropy is used because we want to penalise low entropy. Minimising negative entropy is equivalent to maximising entropy, so minimising the overall loss encourages higher attention entropy in layers where the scheduler assigns non-trivial coefficients. The scheduler learns to assign near-zero coefficients to layers where collapse is not occurring, meaning the regularisation term has negligible effect on those layers, and assigns substantial coefficients precisely to the layers and time windows where collapse is imminent.
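The augmented loss described in the two paragraphs above reduces to a few lines. The numbers in the example are illustrative, chosen to show the intended behaviour: near-zero coefficients leave healthy lower layers untouched, while non-trivial coefficients on a collapsing upper layer reward entropy increase there.

```python
def alaer_loss(task_loss, layer_entropies, layer_coeffs):
    """Task loss plus sum over layers of coeff_l * (-mean entropy_l).

    Because the penalty is the negative entropy, minimising the total
    loss pushes entropy up wherever the coefficient is non-trivial.
    """
    entropy_penalty = sum(c * (-h)
                          for c, h in zip(layer_coeffs, layer_entropies))
    return task_loss + entropy_penalty

# Healthy lower layers get ~0 coefficients; a collapsing upper layer
# (0.9 nats) gets a substantial one.
entropies = [2.6, 2.5, 1.9, 0.9]
coeffs    = [0.0, 0.0, 0.01, 0.05]
print(alaer_loss(2.0, entropies, coeffs))  # ~ 1.936
```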
Entropy Ceiling Mechanism
One failure mode in naive entropy regularisation is overregularisation: forcing attention entropy so high that the model cannot form any focused attention patterns anywhere, which is also pathological. To prevent this, we introduce a differentiable entropy ceiling constraint implemented as a soft hinge loss. When the mean attention entropy in any layer exceeds a target ceiling of 2.8 nats, which we calibrated from analysis of well-behaved 12-layer models, the constraint adds a positive penalty proportional to the excess, pushing entropy back down. The scheduler's second meta-objective term internalises this ceiling into the scheduler's output distribution, preventing it from assigning overregularising coefficients even in layers where collapse risk is high.
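The ceiling constraint sketches as a one-sided hinge: zero below the 2.8-nat target, a penalty proportional to the excess above it. The linear form follows the wording above; the weight is an assumed free parameter.

```python
def ceiling_penalty(mean_entropy, ceiling=2.8, weight=1.0):
    """Soft hinge on layer entropy: penalise only the excess above
    the calibrated 2.8-nat ceiling, pushing entropy back down."""
    return weight * max(0.0, mean_entropy - ceiling)

print(ceiling_penalty(2.5))  # 0.0 -- below the ceiling, no penalty
print(ceiling_penalty(3.2))  # ~ 0.4 -- proportional to the excess
```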
Training Procedure and Computational Overhead
ALAER is applied from the beginning of training and requires no pretraining or warm-up phase. The ALAER-Scheduler parameters are updated using a separate Adam optimiser instance with a learning rate ten times smaller than the main model learning rate, reflecting the slower adaptation timescale appropriate for a meta-level controller. The computational overhead of ALAER during training is approximately 2.3 percent additional wall-clock time per step, attributable primarily to the computation of per-layer entropy statistics. At inference time, the scheduler is discarded entirely, meaning ALAER imposes zero inference overhead.
Experimental Setup
We designed our experimental evaluation to test three claims: that ALAER reduces attention entropy collapse in deep transformers, that it improves downstream task performance relative to both unregularised training and uniform entropy regularisation baselines, and that its improvements are specifically attributable to the adaptive layer-wise scheduling rather than simply to the presence of entropy regularisation.
Model Scales and Architectures
We trained three encoder-only transformer models with the following configurations:
ALAER-Base: 12 layers, 768 hidden dimensions, 12 attention heads, 110 million parameters.
ALAER-Large: 24 layers, 1024 hidden dimensions, 16 attention heads, 340 million parameters.
ALAER-Deep: 32 layers, 1024 hidden dimensions, 16 attention heads, 420 million parameters.
All models were trained on a concatenation of BooksCorpus and English Wikipedia with masked language modelling as the pretraining objective. We also included 15 percent of a domain-adapted technical text corpus to test behaviour under partial domain shift. Models were trained for 250,000 steps with a batch size of 256 sequences of length 512.
Baseline Methods
We compared ALAER against four baselines for each model scale:
Unregularised: Standard training with no entropy regularisation.
Uniform-Entropy-0.01: Global entropy regularisation with coefficient 0.01 applied identically to all layers.
Uniform-Entropy-0.05: Global entropy regularisation with coefficient 0.05, representing a stronger uniform intervention.
Layer-Rank-Heuristic: A hand-designed heuristic assigning linearly increasing entropy coefficients from layer 1 to layer L, without the learned adaptive scheduling.
Evaluation Tasks and Metrics
We evaluated all trained models on four downstream tasks after task-specific fine-tuning. The tasks were SQuAD 2.0 for reading comprehension, CoNLL-2003 nested NER for token-level discrimination, MNLI for natural language inference, and a proprietary multi-hop document reasoning benchmark developed internally at KriraAI that tests reasoning chains requiring evidence synthesis across five or more non-adjacent document segments.
We measured attention entropy statistics throughout pretraining using a logging protocol that captured per-head, per-layer entropy values every 1,000 steps. We also measured the final-layer representation cosine similarity as a proxy for representational homogenisation, where high cosine similarity between arbitrary position pairs in the final layer is a direct signature of attention collapse having propagated through the residual stream.
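The homogenisation proxy is the mean cosine similarity over distinct position pairs in the final-layer hidden states. A minimal version, with function name and shapes assumed for illustration:

```python
import numpy as np

def mean_pairwise_cosine(hidden):
    """Mean cosine similarity across all distinct position pairs.

    hidden: (positions, dim) matrix of final-layer hidden states.
    Values near 1 indicate homogenised representations, the signature
    of collapse having propagated through the residual stream.
    """
    normed = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = hidden.shape[0]
    return sims[~np.eye(n, dtype=bool)].mean()

# Identical rows give similarity 1.0; orthogonal rows give 0.0.
homogenised = np.ones((4, 8))
diverse = np.eye(4, 8)
print(round(float(mean_pairwise_cosine(homogenised)), 2))  # 1.0
print(round(float(mean_pairwise_cosine(diverse)), 2))      # 0.0
```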
Hardware configuration was 64 NVIDIA A100 80GB GPUs arranged in 8-way pipeline parallelism and 8-way data parallelism, using BF16 mixed precision throughout.
Results and Analysis

Attention Entropy Recovery
The most direct result is that ALAER substantially reduces attention entropy collapse in deep models. In ALAER-Deep, the mean attention entropy in the top quartile of layers at training convergence was 2.31 nats under ALAER, compared to 0.74 nats in the unregularised baseline and 1.12 nats under the best-performing uniform regularisation baseline at coefficient 0.05. This represents a 31 percent improvement in entropy preservation relative to the stronger uniform baseline and a 68 percent improvement relative to no regularisation.
Crucially, in the bottom quartile of layers, ALAER-Deep maintained an attention entropy of 2.58 nats, statistically indistinguishable from the unregularised baseline at 2.61 nats. The uniform baseline at coefficient 0.05 showed 2.91 nats in lower layers, exceeding the entropy ceiling and indicating mild overregularisation of natural attention patterns. This result validates the core design claim: ALAER applies regularisation selectively to the layers that need it without disturbing layers where attention is functioning well.
Downstream Task Performance
Performance on downstream tasks showed consistent improvements for ALAER at the deeper model scales, with the improvement magnitude increasing with model depth as expected given the collapse mechanism.
On ALAER-Deep across all four downstream tasks, relative to the unregularised baseline:
SQuAD 2.0 F1 improved from 87.3 to 90.1, a gain of 2.8 points.
CoNLL-2003 nested NER F1 improved from 81.6 to 86.4, a gain of 4.8 points, consistent with collapse disproportionately harming token-level discrimination tasks.
MNLI accuracy improved from 89.2 to 90.7, a gain of 1.5 points.
KriraAI multi-hop reasoning accuracy improved from 63.1 to 71.4, a gain of 8.3 points, the largest absolute improvement observed.
The multi-hop reasoning improvement is the most interpretable finding. Multi-hop tasks require the model to synthesise evidence across non-adjacent positions, which requires attention to remain distributed across the sequence rather than concentrated on local peaks. The fact that attention collapse has its largest performance impact precisely on the task most dependent on distributed attention provides mechanistic validation of our diagnosis.
Ablation Study Results
Our ablation study decomposed the ALAER contribution into three components: the adaptive scheduling versus fixed-layer coefficients, the entropy ceiling mechanism, and the gradient-magnitude ratio feature in the scheduler input.
Removing the adaptive scheduling and replacing it with the layer-rank heuristic reduced the entropy improvement from 31 percent to 19 percent and reduced multi-hop reasoning gains from 8.3 to 4.1 points. Removing the entropy ceiling caused lower-layer attention entropy to increase to 3.2 nats on average, and CoNLL-2003 NER performance dropped by 2.1 points relative to full ALAER, confirming that ceiling enforcement matters for preserving focused attention in appropriate layers. Removing the gradient-magnitude ratio feature from the scheduler input while retaining all other features reduced entropy improvement from 31 percent to 24 percent, identifying the gradient-magnitude signal as the most informative single input feature.
Failure Cases and Limitations Observed
ALAER did not improve performance at the 12-layer scale in any task. This is consistent with our mechanistic understanding: entropy collapse is negligible in shallow models, so applying the framework at that scale yields no benefit and introduces a small amount of training noise from the scheduler meta-objective. We also observed that ALAER provided diminishing returns when the model was fine-tuned for extended durations on single-domain data: the entropy patterns established during pretraining partially reversed during long fine-tuning on narrow distributions, suggesting that the pretraining-time intervention does not fully inoculate against collapse induced by fine-tuning dynamics.
Discussion and Implications
The results we obtained carry implications beyond the specific experimental setting. The core finding is not simply that entropy regularisation helps in deep transformers. It is that the benefit of entropy regularisation is highly localised in depth and time, and that a globally applied regulariser is by design a compromise that simultaneously underregularises where it is most needed and overregularises where it is least needed. The performance gap between ALAER and the best-performing uniform baseline, 4.2 points on multi-hop reasoning, is attributable entirely to this locality mismatch.
This insight has broader architectural implications. If the training dynamics of deep networks are sufficiently heterogeneous across layers that a global hyperparameter cannot address layer-specific pathologies without inducing collateral disruption, then the design principle of adaptive layer-specific interventions deserves more systematic attention in the research community. ALAER is one instantiation of this principle applied to attention entropy. The same principle is plausibly applicable to gradient clipping, dropout rates, learning rate scaling, and weight decay, all of which are currently applied globally in standard training recipes.
For practitioners building deep transformer models for enterprise deployment, the practical implication is straightforward. If your model has more than 20 layers and you are observing degraded performance on tasks that require broad contextual integration, attention entropy collapse is a probable contributor and ALAER is a training-time intervention with no inference overhead. The 2.3 percent training time overhead is modest relative to the performance improvements observed, particularly for the multi-hop reasoning task which represents many real enterprise use cases involving long document processing.
The finding that multi-hop reasoning shows the largest improvement is also relevant to the broader question of why large transformers sometimes fail at compositional and reasoning tasks despite having large parameter counts. Part of the answer may be that depth, which is intended to provide representational richness, is undermined by the very attention collapse that depth induces. A 32-layer model with collapsed upper-layer attention is in some respects representationally shallower than a 12-layer model with healthy attention, because the upper layers are performing near-trivial operations on already-homogenised representations.
Limitations and Future Work
ALAER has several limitations that we regard as genuinely open problems rather than engineering details to be resolved with additional tuning.
The framework is validated only on encoder-only architectures. Decoder-only autoregressive transformers have different attention dynamics because causal masking fundamentally alters the information flow through attention layers. We hypothesise that collapse patterns in decoder-only models have a different layer-depth profile and that the ALAER-Scheduler's feature space would need to be redesigned to account for the causal context structure. This is an active area of investigation at KriraAI.
The entropy ceiling of 2.8 nats was calibrated empirically from 12-layer model analysis. This calibration may not generalise to models trained on very different data distributions, particularly low-entropy structured data such as code or formal mathematical text, where natural attention entropy may be lower throughout. We did not test ALAER on code-focused pretraining and cannot make claims about its behaviour in that setting.
The ALAER-Scheduler is trained with a meta-objective that we formulated heuristically based on our mechanistic understanding of collapse. A theoretically grounded formulation of this meta-objective, potentially derived from information-theoretic principles or from a formal analysis of gradient flow in residual attention networks, would place the framework on stronger foundations. We are pursuing this theoretical analysis as a natural next step.
Finally, the interaction between ALAER and extended task-specific fine-tuning represents a gap in our understanding. The observed partial reversal of entropy improvements during long fine-tuning suggests that a fine-tuning-time variant of ALAER, or a method for preserving pretraining-established attention patterns during fine-tuning, would be a valuable extension.
Conclusion
This research makes three contributions that we consider significant. The first is a precise mechanistic characterisation of attention entropy collapse as a layer-depth-dependent and gradient-flow-mediated pathology, distinguishing it from related but distinct phenomena such as representational rank collapse and attention head redundancy. The second is the ALAER framework itself, which demonstrates that layer-adaptive entropy regularisation driven by a learned gradient-sensitivity scheduler is both technically feasible and substantially more effective than global regularisation alternatives. The third is the empirical finding that multi-hop reasoning performance is the downstream capability most sensitive to attention entropy collapse, providing a concrete diagnostic signature for practitioners and a theoretical connection between architectural health and emergent reasoning capability.
What these findings collectively suggest is that depth alone does not confer representational richness in transformers. Depth confers representational richness only when the attention mechanism across all layers remains informationally diverse. The design of training procedures that actively maintain that diversity, rather than leaving it to emerge from the task loss alone, is a research direction with substantial remaining potential.
KriraAI's work on ALAER is one component of a broader research program focused on understanding and improving the training dynamics of large neural networks for enterprise AI applications. We believe that the gap between architectural capacity and practical performance in production systems is frequently attributable to training pathologies that are poorly characterised and inadequately addressed by standard training recipes. Our goal is to develop the mechanistic understanding and methodological toolkit needed to close that gap systematically.
We invite AI researchers, ML engineers, and practitioners working on deep transformer systems to engage with these findings. If you are observing entropy collapse signatures in your own model training, are interested in implementing or extending ALAER, or see connections between this work and problems you are working on, we would welcome the conversation. Reach out to the KriraAI research team or follow our ongoing publication series to stay connected with this line of work.
FAQs
What is attention entropy collapse in deep transformers, and why does it matter?
Attention entropy collapse in deep transformers refers to the progressive narrowing of attention weight distributions in upper transformer layers during training, such that each query position attends to only one or a small number of key positions rather than distributing attention broadly across the sequence. The entropy of the attention distribution, measured in nats, approaches zero as collapse progresses. This matters because attention is the primary mechanism through which transformers integrate contextual information across positions. When upper-layer attention collapses, the model loses the ability to synthesise information from non-adjacent positions in deep processing stages, leading to degraded performance specifically on tasks that require multi-position evidence integration, such as multi-hop reasoning, coreference resolution, and complex question answering over long documents.
How does ALAER differ from standard entropy regularisation?
Standard entropy regularisation applies a single global coefficient to the entropy penalty term across all layers and throughout all of training. The limitation of this approach is that the optimal regularisation strength varies dramatically across layers: upper layers experiencing active collapse need strong regularisation while lower layers with stable attention need none. ALAER replaces the global coefficient with a per-layer adaptive coefficient produced by a learned gradient-sensitivity scheduler, a lightweight MLP that observes layer-specific signals including current entropy, entropy velocity, and gradient magnitude ratios, and outputs a coefficient calibrated to the current need of each specific layer at each training step. This design allows ALAER to prevent upper-layer collapse without introducing the lower-layer overregularisation that renders global methods a net compromise.
Does ALAER add any overhead at inference time?
No. ALAER adds zero inference overhead. The ALAER-Scheduler is a training-time component that is discarded after pretraining is complete. The trained transformer model used for inference is identical in architecture to a standard transformer of the same scale. The per-layer entropy penalty terms affect the model's learned weight distributions through the training objective but leave no additional computational structure in the deployed model. The only overhead ALAER introduces is during training, where the cost of computing per-layer entropy statistics and running the scheduler forward pass adds approximately 2.3 percent to per-step wall-clock time, which is negligible relative to total pretraining compute at scale.
Why does multi-hop reasoning benefit most from ALAER?
Multi-hop reasoning requires a model to locate a piece of evidence at one position in the sequence, hold it in the representational context, navigate to a second non-adjacent position where a related piece of evidence resides, and integrate both to produce an answer. This process depends critically on attention remaining distributed across the sequence at the deep processing layers where integration occurs. When upper-layer attention collapses onto a single dominant token, typically a high-frequency or positionally prominent token, the model loses access to the distributed representational state needed for cross-position evidence synthesis. Simpler tasks such as sentence-level natural language inference can often be solved with concentrated attention on specific tokens, so they are less sensitive to collapse. The 8.3-point improvement on multi-hop reasoning under ALAER, compared to 1.5 points on MNLI, reflects this mechanistic distinction directly.
Can ALAER be applied to an already-pretrained model?
In its current form, ALAER is a pretraining-time intervention. The ALAER-Scheduler is trained jointly with the transformer from the beginning, and its meta-objective shapes the trajectory of attention entropy throughout pretraining. Applying ALAER to an already-pretrained model would require either continued pretraining with the ALAER objective, which would amount to entropy-regularised continued pretraining from the collapsed state, or a fine-tuning-time variant of the approach that we have not yet validated. Our experiments showed that even extended fine-tuning cannot fully reverse collapse established during pretraining, suggesting that applying entropy regularisation post-collapse is substantially less effective than preventing collapse during initial training. Developing an effective post-hoc or fine-tuning-compatible variant of ALAER is one of the future work directions KriraAI is actively pursuing.
About the Author
Divyang Mandani is the CEO of KriraAI, driving innovative AI and IT solutions with a focus on transformative technology, ethical AI, and impactful digital strategies for businesses worldwide.