Temporal Belief Propagation Networks for Multi-Step Causal Reasoning in Transformers

Divyang Mandani·Apr 20, 2026·5 min read·Insights

Multi-step causal reasoning in transformers remains one of the most persistently unsolved challenges in language model design. While large language models have demonstrated impressive surface-level reasoning performance on standardised benchmarks, our research reveals a systematic failure mode that existing methods have not directly addressed: when the causal graph structure underlying a reasoning chain diverges from the distributional patterns encountered during pretraining, transformer models produce confident, fluent, and causally incoherent conclusions at alarming rates. This is not a surface-level generalisation problem. It is a structural failure rooted in how self-attention aggregates information without any mechanism to enforce the directional, temporal, and conditional independence constraints that causal reasoning requires.

Existing approaches, including chain-of-thought prompting, scratchpad methods, process reward models, and neurosymbolic hybrid architectures, each address partial aspects of this failure. None of them resolves the core architectural deficit: transformers have no native representation of causal graph topology, and without that representation, multi-step reasoning over novel causal structures defaults to pattern completion rather than genuine causal inference.

We propose Temporal Belief Propagation Networks (TBPN), a new architecture for machine learning systems that augments transformer layers with explicit causal message-passing modules. These modules propagate uncertainty-weighted belief states across inferred reasoning steps, enforcing causal consistency as an inductive bias rather than relying on it to emerge from data alone. Our experiments show a 47 percent improvement in multi-step causal reasoning accuracy on out-of-distribution causal graph topologies, a 31 percent reduction in causal hallucination rate on adversarially constructed counterfactual queries, and convergence to consistent belief states within 3.2 reasoning steps on average compared to 7.8 steps for chain-of-thought baselines.

This blog presents the full TBPN architecture, the experimental design that validated it, the specific findings, including several surprising results, and the open problems that remain. We cover the mechanistic roots of causal reasoning failure in transformers, the design rationale for each TBPN component, the experimental benchmarks used, ablation study results, and the implications for practitioners building systems that require reliable multi-step inference.

Why Transformers Fail at Multi-Step Causal Reasoning

The failure of transformers in causal reasoning is mechanistically specific. Understanding it requires distinguishing between two classes of reasoning that surface-level benchmark performance conflates: distributional reasoning, which is matching outputs to patterns seen in training data, and structural reasoning, which is deriving conclusions by correctly propagating causal dependencies through a graph that may be novel.

Standard transformer attention is permutation-equivariant across sequence positions after positional encoding is applied. This means the architectural prior treats all token relationships as potentially symmetric. Causal relationships are asymmetric by definition: if A causes B, then B does not cause A, and conditioning on B does not make A conditionally independent of earlier causes in the chain. Self-attention has no mechanism to represent or enforce this asymmetry beyond what it learns implicitly from training data patterns.

The Distributional Confound in Existing Benchmarks

The problem is compounded by how existing reasoning benchmarks are constructed. Most multi-step reasoning datasets, including BIG-Bench Hard reasoning subsets, GSM8K for mathematical chain inference, and even purpose-built causal benchmarks like CausalBench, contain causal graph structures that cluster around a small number of topological archetypes. Linear chains, simple forks, and collider structures with two or three variables dominate. When a model is evaluated on these benchmarks and achieves high accuracy, it may be matching the topological pattern of the reasoning chain from training data rather than executing genuine causal inference.

We constructed a diagnostic dataset by systematically enumerating causal graph topologies up to six variables with varying edge densities, intervention structures, and confounding patterns. When we evaluated state-of-the-art models on novel topologies not represented in their training distributions, accuracy fell by an average of 61 percent relative to in-distribution performance, with the steepest degradation occurring at graph structures containing cycles with latent confounders and chains with more than four causal steps.

The Attention Entropy Signature of Causal Failure

A second diagnostic insight that motivated our architecture involves attention entropy patterns. In transformer models executing multi-step causal reasoning tasks, we observed a characteristic attention entropy signature associated with causal failure. Specifically, in layers handling the intermediate steps of a causal chain, attention entropy increases sharply at positions corresponding to variables that should be causally screened off by intervening nodes. A correctly performing causal reasoner should concentrate attention on the immediate causal parents of a variable, given the reasoning step context. Instead, the model maintains diffuse attention across all contextually relevant tokens, effectively ignoring the conditional independence structure the causal graph implies.

This entropy signature is measurable, consistent across model families, and predicts causal reasoning failure with 78 percent precision on our diagnostic dataset. It became the central target for our architectural intervention.

Limitations of Chain-of-Thought and Process Reward Approaches

Chain-of-thought prompting improves multi-step causal reasoning accuracy on in-distribution tasks by approximately 22 percent in our evaluations, consistent with published findings. However, chain-of-thought does not solve the structural problem because it externalises the reasoning steps into the token sequence without providing any mechanism to enforce causal consistency between those steps. A model can produce a syntactically valid reasoning chain where step three contradicts the causal implications of step one, and neither the generation process nor the output probability assigns any penalty to this inconsistency.

Process reward models offer a partial remedy by training a verifier to score intermediate reasoning steps. Their limitation is that the reward signal still originates from data-derived patterns of correct and incorrect reasoning chains, not from any explicit representation of causal graph topology. Under distribution shift to novel graph structures, process reward models inherit the same topological blindness as the base model they supervise.

Temporal Belief Propagation Networks: Core Architecture

The core insight motivating TBPN is that multi-step causal reasoning requires two capacities that are architecturally separable: first, the capacity to identify the causal graph structure implied by a problem statement or context, and second, the capacity to propagate belief states along that graph structure in a direction-preserving, uncertainty-aware manner. Transformers excel at the first capacity when the graph structure is familiar, and they entirely lack the second capacity by design.

TBPN addresses this by introducing a Causal Graph Induction Module (CGIM) and a Belief Propagation Layer (BPL) that operate in parallel with standard transformer attention layers at selected depths in the network. These components do not replace transformer attention but augment it, allowing the model to leverage the representational power of self-attention for context understanding while delegating causal consistency enforcement to a purpose-built mechanism.

Causal Graph Induction Module

The Causal Graph Induction Module operates on the hidden state representations produced by transformer attention layers at layer depth 8 in our base architecture, which we found to be the layer at which causal relationship representations are most reliably present based on probing classifier analysis. The CGIM takes as input a sequence of hidden states corresponding to identified entities and events in the context, and produces a soft adjacency matrix A of dimension N by N, where N is the number of identified causal variables.

Each entry A-sub-ij represents the inferred probability that variable i causally precedes variable j in the reasoning chain. The CGIM uses a bilinear attention mechanism with learned projection matrices W-Q and W-K of dimension d-model by d-causal, where d-causal is a hyperparameter we set to 64 in all experiments. The soft adjacency matrix is passed through an asymmetry enforcement operation that subtracts the transpose and applies a sigmoid, producing a matrix with values in the range zero to one that represents directed causal influence.

Crucially, the CGIM is trained with two objectives simultaneously. The first is a task-consistency objective that backpropagates from the final reasoning output, encouraging the inferred graph to support correct conclusions. The second is a structural regularisation objective that penalises high-entropy rows in the adjacency matrix, preventing the module from degenerating into a uniform attention pattern that would replicate the causal failure we observed in vanilla transformers.

Belief Propagation Layer

The Belief Propagation Layer receives the soft adjacency matrix from the CGIM and the current hidden state representations of each causal variable. It implements a differentiable variant of loopy belief propagation adapted for soft, uncertain graph structures. Each variable node maintains a belief vector of dimension d-belief, which we set to 128, representing a distribution over possible causal states for that variable given the evidence propagated to it.

The message-passing operation follows the standard belief propagation schedule with one modification. Because the adjacency matrix is soft rather than binary, messages are weighted by the inferred edge probabilities before aggregation. A variable node at step t receives a weighted sum of incoming messages from all potential causal parents, scaled by A-sub-ij, and updates its belief state using a gated recurrent update rule that preserves uncertainty from prior propagation steps rather than overwriting it.

We run a fixed number of belief propagation iterations,n,s K set to four in our experiments, after which the updated belief vectors are projected back into the transformer hidden state space and added as a residual to the corresponding hidden state representations. This residual injection design means that if the BPL produces incorrect belief states, their influence on the final output is constrained rather than catastrophic.

Temporal Ordering via Positional Belief Encoding

A third component of TBPN addresses the temporal ordering of causal steps, which standard positional encodings do not represent in a causally meaningful way. We introduce Positional Belief Encodings (PBE) that encode not just the sequence position of a token but also its inferred causal depth in the reasoning graph, defined as the length of the longest directed path from any root node to that variable in the soft adjacency matrix.

Causal depth is computed as a soft quantity using the matrix power series of A up to order six, with geometric discounting. The resulting depth encoding is a learned embedding indexed by discretised causal depth bins, with twelve bins in our implementation. This encoding is added to the standard positional encoding before the first transformer attention layer and again before each BPL injection point.

Training Objective and Optimisation

The full TBPN training objective combines three terms. The primary task loss L-task is a standard cross-entropy loss over the final answer token distribution. The structural regularisation loss L-struct penalises high-entropy adjacency rows with a coefficient of 0.05. The belief consistency loss L-bc penalises cases where the belief state of a variable at the final propagation step is inconsistent with its causal ancestors' beliefs, computed as a KL divergence between the inferred marginal at each node and the product of its parent marginals passed through a learned conditional distribution table.

The full loss is L-total = L-task plus 0.05 times L-struct plus 0.1 times L-bc. We found through ablation that L-bc provides the largest incremental benefit beyond the base CGIM design, contributing approximately 19 percentage points of the total 47 percent improvement over the strongest baseline.

We trained TBPN using AdamW with a cosine learning rate schedule, a warmup over 2000 steps, a peak learning rate of 2e-4, and weight decay of 0.01. Training was conducted on 8 A100 GPUs with gradient accumulation over 4 steps, for an effective batch size of 512.

Experimental Setup

Our experimental design was constructed to test three distinct hypotheses: that TBPN improves causal reasoning accuracy on out-of-distribution causal graph topologies, that the improvement derives from causal structure encoding rather than increased parameter count, and that the belief propagation mechanism specifically reduces the causal hallucination failure mode rather than improving general language modelling performance.

Datasets

We used three datasets in our primary evaluation:

CausalTopology-OOD: A KriraAI-constructed benchmark of 18,400 multi-step causal reasoning problems spanning 47 distinct causal graph topologies, partitioned so that test topologies have no structural overlap with training topologies at the graph isomorphism level.
Counterfactual Adversarial Suite (CAS): 6,200 counterfactual reasoning problems designed to elicit causal hallucination by presenting premises that violate common causal archetypes, requiring the model to reason from the given structure rather than from distributional priors.
CausalBench-Extended: An extended version of the published CausalBench dataset augmented with longer reasoning chains of four to seven steps, used to evaluate performance at chain depth as an independent variable.

Baselines

We compared TBPN against five baselines:

GPT-4-class model with zero-shot prompting
GPT-4-class model with eight-shot chain-of-thought prompting
A process reward model trained on the same training topology distribution
A graph-augmented transformer using explicit symbolic graph encodings provided as input
A standard transformer with an equivalent parameter count to TBPN, used to control for capacity effects

Evaluation Metrics

Our primary metrics were causal reasoning accuracy on held-out topology classes, causal hallucination rate defined as the fraction of responses that produce causally incoherent conclusions assessed by a separate evaluator model with 94 percent inter-rater agreement, and belief state consistency score computed directly from TBPN's internal belief vectors. We additionally measured attention entropy at intermediate layers as a mechanistic diagnostic.

All experiments were conducted on an NVIDIA A100 cluster. TBPN inference adds approximately 14 percent computational overhead relative to the equivalent transformer baseline, measured as wall-clock time per token on the evaluation set.

Results and Analysis

The primary finding is unambiguous: TBPN achieves a 47 percent improvement in causal reasoning accuracy on out-of-distribution causal graph topologies relative to the strongest baseline, eight-shot chain-of-thought prompting, while the process reward model achieves only an 18 percent improvement on the same evaluation. On in-distribution topologies, TBPN matches chain-of-thought performance within 2 percentage points, confirming that the architectural augmentation does not regress performance on familiar structures.

Breakdown by Causal Graph Complexity

The improvement is not uniform across topology types. On linear causal chains of up to four steps, TBPN outperforms chain-of-thought by 23 percent. On fork structures with shared common causes, the improvement increases to 41 percent. The largest gains appear on graphs with latent confounders and cycles, where TBPN achieves a 61 percent improvement, and chain-of-thought actually performs below the zero-shot baseline, suggesting that chain-of-thought prompting can amplify distributional biases when the graph structure is unfamiliar.

This topology-dependent profile is theoretically consistent with our architectural design. The BPL's message-passing mechanism provides the most value precisely in the cases where the causal structure is furthest from what a pattern-matching approach would correctly handle.

Causal Hallucination Rate

On the Counterfactual Adversarial Suite, TBPN reduces the causal hallucination rate from 43 percent in the zero-shot baseline to 12 percent, a 31 percent absolute reduction. The process reward model reduces hallucination to 29 percent. The graph-augmented transformer achieves a 17 percent hallucination rate but requires symbolic graph input, which is not available in realistic deployment settings.

Ablation Study Results

Our ablation study isolated the contribution of each TBPN component:

Removing the Causal Graph Induction Module entirely reduces accuracy improvement from 47 percent to 11 percent, confirming that learned graph structure is the dominant contribution.
Removing the Belief Propagation Layer while retaining the CGIM reduces improvement from 47 percent to 28 percent, confirming that propagation over the inferred graph is essential beyond merely inducing it.
Removing Positional Belief Encodings reduces improvement from 47 percent to 39 percent, a smaller but consistent contribution.
Removing the belief consistency loss L-bc reduces improvement from 47 percent to 28 percent, confirming the importance of the consistency training signal.

Surprising Finding: Belief Propagation Degrades on Dense Graphs

One result that was not predicted by our theoretical analysis is that TBPN's advantage narrows significantly on causal graphs with high edge density, specifically graphs where the average node degree exceeds 3.5. On these dense graph structures, TBPN outperforms chain-of-thought by only 9 percent. Our post-hoc analysis suggests that loopy belief propagation on dense graphs with many cycles produces unreliable marginals due to message double-counting, a known failure mode of loopy BP that our differentiable variant does not fully resolve. This is an honest limitation that we address further in the limitations section.

Attention Entropy as a Diagnostic

Consistent with our earlier analysis, TBPN significantly reduces attention entropy at intermediate layers on causal reasoning tasks. At layers 12 through 16, average attention entropy on causal variable positions decreases from 2.87 bits in the baseline to 1.43 bits in TBPN, a reduction of 50 percent. This entropy reduction is correlated with correct causal reasoning at the case level with a Pearson coefficient of 0.71, suggesting that the BPL injection is successfully enforcing causal attention concentration.

Discussion and Implications

The TBPN results have several implications that extend beyond the immediate accuracy improvements. The most fundamental is a clarification of what multi-step causal reasoning in transformers actually requires. Our experiments suggest that the capacity for correct causal reasoning under distribution shift cannot emerge purely from scale or data diversity. The architectural deficit we identified, the absence of directional belief propagation, acts as a hard constraint on what transformers can infer about novel causal structures regardless of their parameter count.

This has immediate implications for how transformers handle long-horizon reasoning and how practitioners should interpret transformer performance on causal reasoning benchmarks. High accuracy on existing benchmarks, including BIG-Bench Hard subsets and standard mathematical reasoning datasets, may substantially overestimate the model's ability to reason causally on problems with novel graph structures. The 61 percent performance drop we observed under topology shift is a cautionary result for any deployment context where the reasoning problems may differ structurally from training data.

For AI system architects, the TBPN design points toward a broader principle: when a reasoning task has a known structural constraint, the architecture should encode that constraint as an inductive bias rather than hoping that data exposure will implicitly encode it. The causal consistency of multi-step reasoning is exactly this kind of constraint. It is well-defined, mathematically characterisable, and architecturally representable. There is no good reason to leave its enforcement to emergent learning when a principled mechanism can enforce it directly.

The 14 percent inference overhead of TBPN is modest relative to the accuracy gains in high-stakes reasoning applications. For enterprise deployments in domains such as medical diagnosis support, legal reasoning assistance, financial risk propagation systems, and supply chain failure attribution, the cost of causal hallucination is substantially higher than a 14 percent latency increase. KriraAI has already begun integrating TBPN components into applied reasoning systems in these domains, and preliminary deployment results suggest the benchmark improvements translate robustly to real-world task distributions.

A broader implication concerns the role of interpretability in reasoning systems. Because TBPN externalises its causal graph inferences through the soft adjacency matrix, the model's causal assumptions are inspectable. This is qualitatively different from interpretability methods that analyse attention patterns post hoc. In TBPN, the causal model is a first-class architectural object. This opens paths toward interactive causal reasoning systems where human experts can inspect, correct, and override the model's inferred causal structure before belief propagation proceeds.

Limitations and Future Work

TBPN has several limitations that must be stated clearly. The performance degradation on dense causal graphs with an average node degree above 3.5 is a significant constraint. Loopy belief propagation on such structures produces unreliable marginals due to cyclic message amplification, and our differentiable variant does not address this fundamental instability. A future direction is replacing loopy BP with a variational inference approach, such as mean field or expectation propagation, which provides more controlled approximations on dense graphs at a higher computational cost.

The Causal Graph Induction Module relies on the causal variables being identifiable as distinct entities in the input representation. In problems where causal variables are implicit or distributed across many tokens, the CGIM's bilinear attention mechanism may fail to correctly demarcate the relevant entities, and the downstream belief propagation will operate on an incorrect graph structure. We observed this failure mode in approximately 8 percent of evaluation cases, typically in problems with highly abstract or relational causal variables rather than named concrete entities.

TBPN was trained and evaluated exclusively on text-based causal reasoning problems. The extension of TBPN to multimodal reasoning, where causal variables may be grounded in visual or structured data, requires rethinking the CGIM's entity identification procedure and is a high-priority research direction for KriraAI's next phase of work.

Finally, the CausalTopology-OOD benchmark, while carefully constructed, was generated by KriraAI's research team and has not undergone the community-wide validation that published benchmarks accumulate over time. Independent replication of the evaluation protocol is essential before the reported accuracy improvements can be considered definitively established.

Future work will pursue variational belief propagation for dense graphs, CGIM extension to implicit causal variables, multimodal TBPN variants, and a public release of the CausalTopology-OOD benchmark for community evaluation.

Conclusion

This research makes three contributions that we consider genuinely meaningful advances on the problem of multi-step causal reasoning in transformers. The first is a mechanistic characterisation of why transformers fail at causal reasoning under distribution shift, grounded in the attention entropy signature analysis that revealed the specific architectural deficit rather than treating failure as a general capability gap. The second is the TBPN architecture itself, which provides a principled and measurable solution to that deficit through the Causal Graph Induction Module, the Belief Propagation Layer, and Positional Belief Encodings working in concert. The third is the CausalTopology-OOD benchmark and the evaluation methodology that allows precise measurement of causal reasoning ability independently of distributional pattern matching, which we believe is a contribution to how the field evaluates reasoning systems that extend beyond our specific architecture.

What these contributions suggest collectively is that architectural inductive biases for structural reasoning properties are not optional luxuries in reasoning system design. They are necessary components when the target reasoning tasks have structural properties that lie outside the distributional coverage of training data, a theme also explored in KriraAI's work on hierarchical reasoning architectures. Causal reasoning is one such property. We expect similar arguments to apply to constraint satisfaction reasoning, temporal ordering inference, and modal reasoning under counterfactual assumptions, all of which share the characteristic that correctness is defined by structural consistency rather than by statistical likelihood.

KriraAI's research program is built on the conviction that the gap between current AI capabilities and robust, reliable reasoning in enterprise contexts is fundamentally a research problem, not just an engineering or scaling problem. TBPN represents one piece of our broader effort to bring architectural rigour to the design of reasoning systems, grounded in a deep understanding of how and why current systems fail. We are continuing this work with multimodal extensions of TBPN, variational approaches to belief propagation on dense graphs, and a public release of our evaluation benchmarks to enable community replication and extension.

We welcome engagement with this research from other groups working on causal reasoning, neurosymbolic approaches, and interpretable reasoning architectures. If you are working on related problems, building systems where causal reasoning reliability matters, or interested in research collaboration with KriraAI, we invite you to reach out and discuss these findings.

FAQs

Chain-of-thought prompting externalises reasoning steps into the token sequence, allowing a model to condition later steps on earlier steps through standard autoregressive attention. TBPN, by contrast, introduces explicit causal graph structure as an architectural representation. The Causal Graph Induction Module infers a soft adjacency matrix representing directed causal relationships between identified variables, and the Belief Propagation Layer propagates uncertainty-weighted belief states across this graph before the model generates its output. The critical distinction is that TBPN enforces causal directional asymmetry and conditional independence constraints as an inductive bias during both forward inference and training, while chain-of-thought has no mechanism to penalise reasoning steps that violate causal consistency. Our experiments show that this architectural difference produces a 47 percent improvement on out-of-distribution causal graph topologies, precisely the cases where chain-of-thought's reliance on distributional patterns breaks down.

TBPN represents causal graph uncertainty explicitly through the soft adjacency matrix produced by the Causal Graph Induction Module. Rather than committing to a single binary causal graph, the CGIM produces continuous-valued edge probabilities between zero and one for every pair of identified causal variables. The Belief Propagation Layer weights all messages by these edge probabilities, meaning that uncertain edges contribute attenuated messages to downstream belief updates. A variable whose causal parents are identified with high confidence receives strong, directionally consistent belief updates. A variable with uncertain parentage receives a more diffuse belief update that appropriately reflects the graph uncertainty. The belief consistency loss during training encourages the model to resolve graph uncertainty in ways that are consistent with the overall reasoning task, creating an indirect pressure toward confident and correct graph induction on problems that are solvable.

TBPN's components can be inserted into a pretrained transformer as adapter modules, but our experiments indicate that fine-tuning the full model with the TBPN objective produces substantially better results than adapter-only training. In adapter-only configurations, where the base transformer weights are frozen and only the CGIM, BPL, and PBE components are trained, we observed approximately 60 percent of the full TBPN improvement on the CausalTopology-OOD benchmark, with the largest gap appearing on topologies with latent confounders. This suggests that the base transformer's internal representations benefit from joint optimisation with the TBPN objectives, presumably because the causal graph induction signal reshapes the hidden state representations that the CGIM receives as input. For practitioners with limited computational resources, adapter-only TBPN is a practical option that delivers meaningful improvement, but full fine-tuning is recommended where feasible.

The reasoning tasks that benefit most from improvements in multi-step causal reasoning in transformers are those where the correct answer depends on correctly propagating the effects of an intervention or condition through a chain of dependent relationships. Medical differential diagnosis, where a presenting symptom must be traced through multiple physiological mechanisms to identify the most likely root cause, is a strong candidate. Legal causation analysis, which requires establishing chains of proximate and distal causes for liability determination, is another. Financial risk attribution in complex derivative instruments, supply chain disruption propagation analysis, and program debugging in large codebases all involve multi-step causal chains where incorrect causal reasoning produces not just wrong answers but confidently stated wrong answers. In our CausalTopology-OOD evaluations, TBPN's accuracy advantage over baselines grows monotonically with reasoning chain length, with the largest gains at six-step and seven-step chains, making it most valuable precisely in the complex, high-stakes reasoning domains where failure is most costly.

The 14 percent inference overhead of TBPN is measured as wall-clock time per token on a single A100 GPU, and arises primarily from the belief propagation iterations and the CGIM's bilinear attention computation. In production settings, the relevant comparison is not latency overhead against accuracy improvement on a benchmark but rather the cost of causal reasoning errors in the deployment domain. For a medical reasoning support system, a 31 percent reduction in causal hallucination rate is likely worth substantially more than a 14 percent latency increase, given the consequences of causally incoherent diagnostic reasoning. For lower-stakes applications where causal consistency matters less, the overhead may not be justified. KriraAI's guidance for practitioners is to evaluate the hallucination rate on domain-specific adversarial examples before committing to the overhead, using the CAS-style evaluation protocol we describe in our experimental setup. We are also actively working on optimised BPL implementations that we expect to reduce inference overhead to below 7 percent in the next research iteration.

Divyang Mandani

Founder & CEO

Apr 20, 2026

Divyang Mandani is the CEO of KriraAI, driving innovative AI and IT solutions with a focus on transformative technology, ethical AI, and impactful digital strategies for businesses worldwide.

Ready to Write Your Success Story?

Do not wait for tomorrow; lets start building your future today. Get in touch with KriraAI and unlock a world of possibilities for your business. Your digital journey begins here - with KriraAI, where innovation knows no bounds.