Epistemic Drift in Chain-of-Thought Reasoning: Introducing ESAR

Large language models that reason through chains of intermediate steps exhibit a systematic failure that receives less targeted research attention than it deserves: they cannot distinguish between what they know and what they are assuming. When a model generates a chain-of-thought reasoning sequence, every token is produced under the same autoregressive regime regardless of whether it represents a stated premise, a logically derived conclusion, or an ungrounded assertion generated to fill a reasoning gap. The consequence is epistemic drift in chain-of-thought reasoning, a phenomenon in which unverified assumptions accumulate silently, each subsequent step treating the assumption as established fact and compounding the original error with increasing expressed confidence. Our experiments show that standard chain-of-thought prompting produces epistemic drift in 61.3% of multi-step reasoning chains, a rate that self-consistency sampling with 40 chains lowers by only 6.1 percentage points.
At KriraAI, we encountered this failure repeatedly in production reasoning pipelines and found that process reward model scoring, self-consistency sampling, and self-reflection approaches all failed to address the root cause. The core issue is not simply that models make mistakes; it is that they make mistakes while expressing the same syntactic confidence as when they are correct, giving downstream reasoning steps no signal that the epistemic foundation has been compromised. We propose ESAR, Epistemic State-Aware Reasoning, a lightweight adapter framework that attaches a four-class epistemic provenance classifier to frozen base LLMs, constructs a runtime dependency graph to propagate epistemic contamination forward, and triggers targeted backtracking when contaminated conclusions contradict verified premises.
The Problem: Epistemic Drift in Multi-Step Chain-of-Thought Reasoning
Every statement in a well-formed reasoning chain belongs to one of four epistemic classes: Premises are propositions stated as given, Derivations are conclusions that follow logically from premises or prior derivations, Assumptions are assertions generated to fill gaps where no grounding exists, and Query is the target proposition being solved for. A competent human reasoner tracks these distinctions implicitly, flagging assumptions as conditional and remaining alert to whether downstream reasoning depends on grounded or ungrounded prior steps.
Language models treat all four classes identically during autoregressive generation, conditioning on every prior token with no marker of its epistemic status. When an ungrounded assumption enters a reasoning chain at step k, every step from k+1 through the final conclusion treats that assumption as a premise. The assumption never receives a correctness check, because subsequent steps are conditioned on all prior context and simply continue building. By the time the chain reaches its conclusion, the final answer may be formally consistent with the assumption while being entirely incorrect given the actual premises.
The consequences are severe and grow with chain length. Our analysis of the LogiQA 2.0 benchmark finds that 38.4% of incorrect answers trace to a specific assumption introduced at an intermediate step rather than to an arithmetic error or query misinterpretation. On problems requiring eight or more reasoning steps, this proportion rises to 57.1%.
Why Existing Approaches Fall Short on Reasoning Chain Error Propagation
Self-consistency sampling generates multiple independent reasoning chains and selects the most frequent final answer by majority vote, improving accuracy by averaging over random errors. It fails against systematic assumption errors, however: a contextually plausible assumption is the highest-probability continuation given the context, so the model generates it in the majority of sampled chains and the vote reinforces it. In our experiments, SC-CoT with 40 sampled chains reduces the Epistemic Drift Rate by only 6.1 percentage points relative to single-chain CoT, indicating that the correction mechanism is largely orthogonal to the failure mode.
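A toy illustration of why the vote cannot help may be useful here. The sketch below is not drawn from our experiments: both 40-chain samples are invented for illustration, with "42" standing in for the correct answer. It only shows that majority voting recovers the right answer when errors scatter and locks in the wrong one when most chains share the same assumption.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer (self-consistency selection)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical 40-chain samples; the counts are illustrative, not measured data.
# Random errors scatter across different wrong answers, so "42" still wins.
random_error_sample = ["42"] * 22 + ["41"] * 6 + ["40"] * 6 + ["44"] * 6

# A shared plausible assumption sends most chains to the same wrong answer.
shared_assumption_sample = ["36"] * 27 + ["42"] * 13

print(majority_vote(random_error_sample))       # vote over scattered errors
print(majority_vote(shared_assumption_sample))  # vote over a systematic error
```

In the second sample the vote is working exactly as designed; the problem is that the majority itself is contaminated.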
Process reward models score intermediate reasoning steps based on their empirical correlation with correct final answers, conflating logical validity with epistemic grounding. A step that introduces a plausible assumption which happens to be true receives a high process reward, even though it is epistemically unsound. Process reward models therefore reward what looks like good reasoning rather than sound epistemic structure, and cannot detect assumption propagation as a distinct failure class.
Reflexion and related self-reflection approaches prompt the model to critique its own reasoning chain after generation. The fundamental limitation is that the critique is conditioned on the same prior context that made the original assumption seem plausible. When an assumption is realistic enough to pass the model's generation filter, it is typically realistic enough to pass the model's self-critique filter as well. Reflexion with two reflection rounds reduces LogiQA 2.0 error rates by 1.2 percentage points in our experiments, compared to a 7.7 percentage point improvement from ESAR using a single reasoning pass.
ESAR: Epistemic State-Aware Reasoning for LLM Chains

ESAR is built around a central architectural commitment: epistemic provenance must be tracked explicitly and structurally, not inferred retrospectively from correctness signals or aggregate sampling statistics. The framework consists of four interacting components that together provide the first explicit epistemic state tracking mechanism for language models operating without base model modification.
Epistemic Token Classification
The Epistemic Token Classification module is a three-layer multi-layer perceptron trained on the frozen hidden states of a base LLM. For each sentence-level reasoning segment, identified by a lightweight boundary detector that operates on per-token attention entropy at punctuation positions, the ETC module reads the final token hidden state and outputs a four-way softmax distribution over the epistemic taxonomy: Premise, Derivation, Assumption, and Query.
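As a rough sketch of the classifier's shape, the following pure-Python forward pass runs a three-layer MLP with a four-way softmax over a stand-in hidden state. The layer widths, the ReLU activation, the random initialisation, and all function names are assumptions made for illustration; the weights are untrained, so the predicted label is meaningless and only the data flow is of interest.

```python
import math
import random

CLASSES = ["Premise", "Derivation", "Assumption", "Query"]

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def etc_forward(hidden_state, layers):
    """Three dense layers, ReLU between them (assumed), softmax over four classes."""
    x = hidden_state
    for i, (w, b) in enumerate(layers):
        x = linear(x, w, b)
        if i < len(layers) - 1:
            x = relu(x)
    return softmax(x)

rng = random.Random(0)
# Toy widths 16 -> 8 -> 8 -> 4; a real head would read the base model's hidden size.
layers = [init_layer(rng, 16, 8), init_layer(rng, 8, 8), init_layer(rng, 8, 4)]
# Stand-in for the frozen final-token hidden state of one reasoning segment.
final_token_state = [rng.gauss(0.0, 1.0) for _ in range(16)]

probs = etc_forward(final_token_state, layers)
label = CLASSES[probs.index(max(probs))]
print(dict(zip(CLASSES, (round(p, 3) for p in probs))), label)
```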
The ETC head is trained on 52,000 synthetically generated reasoning chains, produced by pairing a first-order logic generator over Horn clause constraints with GPT-4o annotation that translates formal proof steps into natural language. Each step inherits its epistemic class from the underlying formal structure, eliminating annotation ambiguity. We train using cross-entropy with the base LLM fully frozen, adding fewer than 50 million parameters to a 70B-parameter base model, and achieve 91.2% classification accuracy on a held-out set of 8,000 annotated chains. This result is theoretically significant: the ETC head is a readout of existing latent representations, confirming that epistemic provenance is already encoded in LLM hidden states even though models produce no explicit epistemic markers during generation.
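To make the parameter budget concrete, the snippet below counts weights and biases for one hypothetical three-layer configuration. The input width of 8192 matches the hidden size of Llama-3-70B; the intermediate widths of 4096 and 2048 are assumptions chosen for illustration and are not the reported ETC configuration.

```python
def mlp_param_count(widths):
    """Weights plus biases for consecutive dense layers of the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

# 8192 is the Llama-3-70B hidden size; 4096 and 2048 are hypothetical widths.
widths = [8192, 4096, 2048, 4]
total = mlp_param_count(widths)

print(f"parameters: {total:,}")
print(f"under 50M budget: {total < 50_000_000}")
print(f"share of a 70B base model: {total / 70e9:.4%}")
```

Under these assumed widths the head comes to roughly 42 million parameters, which sits inside the stated budget of fewer than 50 million and is a small fraction of a percent of the base model.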
The Epistemic Propagation Graph
The Epistemic Propagation Graph is a runtime data structure requiring no learned parameters. As the reasoning chain is generated step by step, the EPG receives the ETC classification for each completed step, together with lexical dependency links extracted by a co-reference resolution pass that identifies when the current step references a prior step's conclusion. Each reasoning step becomes a node in a directed acyclic graph. The EPG propagates contamination forward via a single transitive rule: any Derivation node with at least one parent that is an Assumption, or that is itself Assumption-contaminated, is reclassified as Assumption-contaminated, so contamination reaches all downstream dependents. Premise nodes and the Query node are immutable under propagation.
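The propagation rule can be sketched in a few lines. This is a minimal illustration rather than the production EPG: the `Step` class, its field names, and the example chain are hypothetical, and the dependency links are supplied by hand instead of coming from a co-reference pass.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    step_id: int
    label: str                                   # Premise, Derivation, Assumption, or Query
    parents: list = field(default_factory=list)  # ids of earlier steps this one depends on
    contaminated: bool = False

def propagate(steps):
    """Mark Derivation steps that depend, directly or transitively, on an Assumption."""
    by_id = {}
    for step in steps:  # generation order is already a topological order of the DAG
        tainted_parent = any(
            by_id[p].label == "Assumption" or by_id[p].contaminated
            for p in step.parents
        )
        # Only Derivation nodes change status; Premise and Query nodes are immutable.
        step.contaminated = step.label == "Derivation" and tainted_parent
        by_id[step.step_id] = step
    return steps

# Hypothetical seven-step chain with one assumption entering at step 4.
chain = [
    Step(1, "Premise"),
    Step(2, "Premise"),
    Step(3, "Derivation", [1, 2]),
    Step(4, "Assumption"),
    Step(5, "Derivation", [3, 4]),
    Step(6, "Derivation", [5]),
    Step(7, "Query", [6]),
]

for s in propagate(chain):
    print(s.step_id, s.label, "contaminated" if s.contaminated else "clean")
```

Step 3 stays clean because both of its parents are premises, while steps 5 and 6 are marked: step 5 depends on the assumption directly and step 6 inherits the contamination through step 5.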
Uncertainty Accumulation Loss and Epistemic Backtracking Signal
The Uncertainty Accumulation Loss is an optional training signal for settings where supervised fine-tuning of the base model is available. When the EPG marks a step as Assumption-contaminated, the UAL applies an additive cross-entropy penalty, weighted at λ = 0.3, over a vocabulary subset of 47 high-confidence reasoning markers identified through corpus analysis, including terms such as "therefore" and "it follows that." The objective trains the model to produce less assertive phrasing in contaminated contexts without requiring explicit uncertainty language, reducing the share of Assumption-contaminated steps that use high-confidence markers by 22.7% in fine-tuned conditions.
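One possible reading of this objective is sketched below for a single token position. The exact functional form is an assumption: here the penalty is the negative log of the probability mass left outside the marker subset, added to the ordinary cross-entropy only when the step is contaminated. The function name, the toy vocabulary, and the marker ids are all hypothetical; only the weight of 0.3 comes from the description above.

```python
import math

LAMBDA = 0.3  # penalty weight stated in the text

def ual_loss(probs, target, marker_ids, contaminated):
    """Cross-entropy on the target token, plus an assumed penalty on the
    probability mass assigned to high-confidence markers when the current
    step is Assumption-contaminated."""
    ce = -math.log(probs[target])
    if not contaminated:
        return ce
    marker_mass = sum(probs[i] for i in marker_ids)
    # Penalty grows as more mass lands on markers; clamp to avoid log(0).
    penalty = -math.log(max(1.0 - marker_mass, 1e-12))
    return ce + LAMBDA * penalty

# Toy 5-token vocabulary; ids 0 and 1 stand in for markers such as "therefore".
probs = [0.40, 0.10, 0.30, 0.15, 0.05]
markers = {0, 1}

clean = ual_loss(probs, target=2, marker_ids=markers, contaminated=False)
tainted = ual_loss(probs, target=2, marker_ids=markers, contaminated=True)
print(round(clean, 4), round(tainted, 4))
```

The penalty leaves clean steps untouched, which matches the intent that assertive phrasing is discouraged only where the graph has flagged contamination.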
The Epistemic Backtracking Signal is a deterministic inference-time mechanism requiring no learned parameters. When the EPG detects that an Assumption-contaminated node contradicts a Premise-class node in the dependency graph, the EBS flags the originating Assumption node, inserts a structured revision prompt that labels the assumption as unverified, and re-enters generation to either ground the assumption or acknowledge the conditional nature of subsequent conclusions. In our evaluation on EpistemicBench, the EBS triggers an average of 1.3 backtracking events per problematic chain and resolves the originating contradiction in 84.6% of triggered cases.
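The backtracking step can be sketched as a walk back through graph ancestry followed by prompt construction. Everything named here is hypothetical: contradiction detection is taken as an input from some detector outside the sketch, and the revision prompt wording is invented, since the template itself is not specified above.

```python
def originating_assumptions(node_id, parents, labels):
    """Walk EPG ancestry from a contaminated node back to its Assumption roots."""
    found, stack, seen = [], [node_id], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if labels[current] == "Assumption":
            found.append(current)
        stack.extend(parents.get(current, []))
    return sorted(found)

def revision_prompt(assumption_id, text):
    # Illustrative wording only; not the prompt used in the evaluation.
    return (f'Step {assumption_id} is an unverified assumption: "{text}". '
            "Ground it in the stated premises, or mark every conclusion "
            "that depends on it as conditional.")

def backtrack(contradictions, parents, labels, texts):
    """contradictions: (contaminated_node, premise_node) pairs from a detector
    that is outside this sketch. Returns one revision prompt per flagged assumption."""
    flagged = []
    for node, _premise in contradictions:
        for a in originating_assumptions(node, parents, labels):
            if a not in flagged:
                flagged.append(a)
    return {a: revision_prompt(a, texts[a]) for a in flagged}

# Hypothetical four-step chain: step 4 contradicts premise 1 and depends on assumption 2.
labels = {1: "Premise", 2: "Assumption", 3: "Derivation", 4: "Derivation"}
parents = {3: [1, 2], 4: [3]}
texts = {2: "the discount applies to every item"}

prompts = backtrack([(4, 1)], parents, labels, texts)
for step_id, prompt in prompts.items():
    print(step_id, prompt)
```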
EpistemicBench: A Controlled Framework for Assumption Error Evaluation
EpistemicBench is a controlled benchmark of 2,000 multi-step reasoning problems designed specifically to measure assumption detection in neural reasoning chains rather than only final answer accuracy. Standard benchmarks evaluate whether the final answer is correct but provide no ground-truth annotation of where errors originate, making it impossible to distinguish reasoning chain error propagation from arithmetic errors or query misinterpretations.
EpistemicBench problems are generated at four difficulty levels corresponding to chain lengths of two, four, six, and eight steps. For each level, a single ungrounded assumption is injected at an early, middle, or late position, and ground-truth annotations identify exactly which subsequent steps depend on it, enabling computation of the Epistemic Drift Rate: the proportion of incorrect reasoning chains in which the error traces specifically to assumption propagation through the EPG ancestry. KriraAI releases EpistemicBench as an open evaluation resource alongside this publication to support reproducibility and future research on epistemic state tracking in language models.
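Read literally, the metric is a ratio over incorrect chains. The sketch below computes it from per-chain annotations; the record fields (`correct`, `error_step`, `tainted_steps`) and the four example chains are hypothetical and do not reflect the released EpistemicBench schema.

```python
def epistemic_drift_rate(chains):
    """Share of incorrect chains whose first error falls on the injected
    assumption or on a step annotated as depending on it."""
    incorrect = [c for c in chains if not c["correct"]]
    if not incorrect:
        return 0.0
    drifted = sum(1 for c in incorrect if c["error_step"] in c["tainted_steps"])
    return drifted / len(incorrect)

# Hypothetical annotations: tainted_steps holds the injected assumption plus
# every step the ground truth marks as depending on it.
chains = [
    {"correct": True,  "error_step": None, "tainted_steps": {3, 4, 5}},
    {"correct": False, "error_step": 4,    "tainted_steps": {3, 4, 5}},
    {"correct": False, "error_step": 2,    "tainted_steps": {3, 4, 5}},  # arithmetic slip
    {"correct": False, "error_step": 6,    "tainted_steps": {2, 4, 6}},
]

print(epistemic_drift_rate(chains))
```

Two of the three incorrect chains trace to assumption propagation here, so the rate is two thirds; the correct chain does not enter the denominator.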
Experimental Setup
Datasets and Baselines
We evaluate ESAR across five benchmarks chosen to span reasoning modalities and difficulty levels.
GSM8K: Evaluated on the 1,847 problems requiring five or more reasoning steps, isolating the subset where assumption propagation risk is highest due to chain length.
MATH Level 4 and 5 (Hendrycks et al.): High school competition mathematics requiring long-horizon multi-step derivation with abundant assumption entry points in algebraic setup.
LogiQA 2.0: Logical reasoning over natural language passages with formal correct-answer annotations across four logical reasoning categories.
EntailmentBank: Multi-step textual entailment with step-level ground-truth annotations enabling evaluation of intermediate reasoning accuracy, not only final answer correctness.
EpistemicBench: Our controlled assumption injection benchmark described above.
We compare against five baselines.
Standard chain-of-thought prompting on Llama-3-70B-Instruct and GPT-4o.
Self-consistency CoT with 40 sampled chains and majority vote.
Process reward model scoring via Math-Shepherd applied to CoT chains.
Reflexion with two self-reflection rounds.
Step-Back Prompting.
Evaluation Metrics and Computational Configuration
For final answer evaluation we use task-standard metrics: exact match for GSM8K and MATH, classification accuracy for LogiQA 2.0. For EntailmentBank we additionally compute step-level accuracy, and our primary proposed metric is the Epistemic Drift Rate. All ESAR experiments use Llama-3-70B-Instruct on eight NVIDIA A100 80GB GPUs, with ETC head training using AdamW at a learning rate of 2e-4 and batch size 128. Total inference overhead for ESAR relative to base CoT is 8.3% per chain, comprising the ETC forward pass, EPG construction per step, and EBS detection logic.
Results and Analysis

Main Performance Results
On EpistemicBench, ESAR reduces the Epistemic Drift Rate from 61.3% for standard CoT to 18.7%, a 69.5% relative reduction. This primary result directly validates the hypothesis that explicit epistemic provenance tracking provides a qualitatively different class of error correction compared to statistical sampling or retrospective reflection. It also confirms that multi-step reasoning failure modes in LLMs are addressable at the structural level, not only through additional sampling.
On the GSM8K five-plus-step subset, ESAR improves accuracy from 72.4% to 81.6% over standard CoT on Llama-3-70B-Instruct, a 9.2 percentage point improvement (12.7% relative). Self-consistency CoT achieves 78.1% on this subset using 40 forward passes per problem, meaning ESAR surpasses SC-CoT with a single reasoning pass plus lightweight overhead. On LogiQA 2.0, ESAR achieves 74.9% versus 67.2% for standard CoT and 71.1% for SC-CoT. On MATH Level 4 and 5, accuracy improves from 31.8% to 39.4%, and on EntailmentBank step-level accuracy improves from 78.3% to 87.1%, confirming that ESAR improves the intermediate reasoning process rather than merely shifting final answer selection.
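Because these results mix percentage point changes with relative changes, it may help to recompute both from the before and after figures. The snippet below uses only the numbers quoted in this section; the helper names are ours.

```python
def point_change(before, after):
    """Absolute change in percentage points."""
    return after - before

def relative_change(before, after):
    """Change expressed as a fraction of the starting value."""
    return (after - before) / before

# (standard CoT, ESAR) pairs as reported in this section.
results = {
    "EpistemicBench drift rate": (61.3, 18.7),
    "GSM8K 5+ steps accuracy": (72.4, 81.6),
    "LogiQA 2.0 accuracy": (67.2, 74.9),
    "MATH L4-5 accuracy": (31.8, 39.4),
    "EntailmentBank step accuracy": (78.3, 87.1),
}

for name, (before, after) in results.items():
    print(f"{name}: {point_change(before, after):+.1f} points, "
          f"{relative_change(before, after):+.1%} relative")
```

For the drift rate the 69.5% figure is the relative reduction (42.6 points absolute), and for GSM8K the gain is 9.2 points, which corresponds to 12.7% in relative terms.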
Ablation Study Findings
We ablate each ESAR component independently, holding remaining components fixed.
Without ETC: Replacing epistemic classification with uniform Premise-class assignment eliminates 73% of the performance gain on EpistemicBench. Epistemic classification is the primary mechanism; graph and backtracking components amplify it but cannot substitute for it.
Without EPG: Evaluating each step's epistemic status in isolation without propagating contamination forward causes an 8.4 percentage point degradation on eight-step EpistemicBench problems while producing negligible impact on two-step problems. The EPG contributes specifically to long-chain error detection across multiple dependency hops.
Without EBS: Removing backtracking causes a 3.2 percentage point degradation on EpistemicBench, concentrated in the longest (eight-step) chains, where contaminated conclusions are most likely to contradict verified premises explicitly.
Without UAL: Removing the uncertainty accumulation loss has negligible effect on accuracy but forfeits the 22.7% reduction in high-confidence marker use within Assumption-contaminated steps, with implications for the trustworthiness of model output in human-in-the-loop deployment.
Error Analysis and Failure Cases
ESAR underperforms standard CoT on very short reasoning chains. On GSM8K problems requiring one to three steps, ESAR is 1.4 percentage points below base CoT. These problems contain no genuine assumptions, so the ETC and EPG add overhead without correction benefit, and occasional Derivation-to-Assumption misclassifications introduce unnecessary caution. The false positive rate for Assumption detection on two-step chains is 12.1%.
ESAR also degrades on problems where abductive reasoning is the correct intended strategy. In abductive contexts, generating a plausible hypothesis and reasoning forward to verify it is correct procedure, but the EBS treats the hypothesis step as a drift event and triggers backtracking. The four-class taxonomy does not distinguish assumptions being implicitly accepted from hypotheses being explicitly tested, and addressing this requires an extended epistemic class for the hypothesis-under-test role in abductive inference.
Discussion: What Epistemic Drift Reveals About LLM Reasoning Architecture
The 91.2% classification accuracy of the ETC head trained on frozen LLM representations is the most theoretically significant result we report. The ETC module reads hidden states produced by a model that was never trained to track epistemic status and recovers epistemic class labels with high reliability, establishing that epistemic provenance is already present in large language model representations. The limitation being addressed by ESAR is therefore a connectivity problem: the model encodes the distinction between premises and assumptions in its hidden states but does not use that encoding to modulate surface-level generative confidence or to flag contaminated conclusions.
This finding challenges a core assumption in process reward model research: that step-level quality must be learned from empirical correctness signals because no better ground truth is available. Our results suggest an alternative, namely that epistemic class is recoverable from existing hidden state structure with a lightweight supervised classifier trained on synthetic data. The representations need not be learned anew; they need to be connected to generative behavior. The UAL makes that connection through training, and activation steering is a further direction for doing so.
The implications for practitioners are concrete. Systems deployed in high-stakes domains including clinical decision support, legal reasoning, and financial analysis face specific exposure to a failure mode in which the system constructs reasoning on an invented assumption and expresses the conclusion with high confidence. ESAR's EBS mechanism provides detection and structured recovery with 8.3% inference overhead, modest relative to the reliability improvement in applications where incorrect confident reasoning carries operational consequences.
Limitations and Future Work
Several constraints bound the current ESAR results. The ETC head is trained on synthetically generated reasoning chains derived from a first-order logic generator, and its 91.2% accuracy has not been validated against manually annotated chains from domain-specific tasks such as clinical literature synthesis or legal argumentation, where the surface forms of epistemic classes differ substantially from mathematical reasoning. Domain-specific ETC fine-tuning will likely be required before reliable deployment in specialised high-stakes contexts.
The 12.1% Assumption false positive rate on short chains is a meaningful limitation for high-throughput inference systems. Without chain-length-conditioned threshold calibration, unnecessary backtracking events introduce latency overhead that may be unacceptable at scale. A principled threshold selection method based on chain length and domain characteristics is an active engineering priority for production integration.
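To illustrate what chain-length-conditioned calibration could look like, the sketch below requires a higher Assumption probability on short chains and relaxes the requirement as chains lengthen. This is not a method we have validated: the linear schedule, the threshold values of 0.80 and 0.50, the pivot length of 8, and the function names are all hypothetical placeholders.

```python
def assumption_threshold(chain_length, short=0.80, long=0.50, pivot=8):
    """Probability required to accept an Assumption label.
    Interpolates linearly from `short` at length 1 to `long` at `pivot` steps.
    All default values are hypothetical, not tuned settings."""
    if chain_length >= pivot:
        return long
    frac = max(chain_length - 1, 0) / (pivot - 1)
    return short + (long - short) * frac

def accept_assumption(p_assumption, chain_length):
    """Keep the Assumption label only if the classifier is confident enough
    for a chain of this length; otherwise no contamination is propagated."""
    return p_assumption >= assumption_threshold(chain_length)

for length in (2, 4, 6, 8):
    print(length, round(assumption_threshold(length), 3), accept_assumption(0.6, length))
```

A schedule of this kind trades recall on short chains for fewer unnecessary backtracking events, which is where the false positives described above are concentrated.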
The current four-class taxonomy does not capture abductive hypothesis testing, tool-retrieved information, or algorithmically computed values, all of which carry distinct epistemic statuses in agentic pipelines. These settings require taxonomy extension and new propagation rules that account for the reliability characteristics of each information source.
Three research directions anchor KriraAI's near-term work. First, we are extending ESAR to agentic contexts where tool calls introduce Retrieved and Computed epistemic classes with domain-specific contamination semantics. Second, we are investigating whether the latent epistemic representations identified by ETC analysis can be targeted through activation steering in the base model, potentially eliminating the separate classification head. Third, we are developing a continuous epistemic uncertainty score to replace the discrete four-class taxonomy, enabling softer contamination propagation in domains where the boundary between derivation and assumption is inherently graded.
Conclusion
This research makes three contributions to the study of multi-step reasoning in language models. The first is a precise characterisation of epistemic drift in chain-of-thought reasoning as a structural failure mode, accounting for 57.1% of errors on LogiQA 2.0 problems requiring eight or more reasoning steps. The second is the ESAR framework, four interacting components that provide explicit epistemic state tracking, three of which operate on a fully frozen base LLM. The third is the finding that epistemic provenance is latently encoded in LLM hidden states, confirmed by the ETC module's 91.2% classification accuracy from frozen activations, establishing that reasoning reliability requires connecting existing latent knowledge to generative behavior.
These findings have practical implications for how multi-step reasoning systems are designed and evaluated. The Epistemic Drift Rate provides a structural quality diagnostic unavailable from accuracy metrics alone, and EpistemicBench provides a controlled evaluation framework for any approach targeting assumption propagation errors.
At KriraAI, this work is one part of a broader research program investigating the specific failure modes that matter most when AI reasoning is deployed in enterprise contexts where confident-but-wrong outputs carry real operational consequences. We publish these findings openly because we believe rigorous applied research on production failure modes belongs in the public domain. We welcome discussion, methodological critique, and collaboration from researchers working on reasoning reliability, epistemic calibration, and production AI evaluation. If you are building multi-step reasoning systems and would like to explore the ESAR framework or EpistemicBench, KriraAI invites the conversation.
FAQs
What is epistemic drift in chain-of-thought reasoning?
Epistemic drift in chain-of-thought reasoning occurs when a language model introduces an ungrounded assumption at an intermediate step and all subsequent steps treat it as established fact. Unlike random errors that vary across independently sampled chains, epistemic drift is systematic: a plausible assumption appears in the majority of chains generated for the same problem, making statistical correction approaches ineffective. In production AI systems, this failure produces confident-sounding conclusions that are formally coherent given the internally generated assumption but entirely incorrect given the actual user-provided premises, with no surface signal indicating the structural breakdown.
How does the Epistemic Token Classification module work?
The ETC module is a three-layer multi-layer perceptron that reads the frozen hidden states of a base LLM at sentence boundaries and classifies each reasoning step as Premise, Derivation, Assumption, or Query. The base model is never modified. Achieving 91.2% classification accuracy from frozen activations confirms that large language models already encode epistemic provenance in their hidden state representations without any epistemic-tracking training objective. The ETC module is therefore a precision readout of existing latent knowledge, adding fewer than 50 million parameters to a 70B-parameter base model and incurring roughly 1.2% additional overhead per reasoning step.
What is the Epistemic Propagation Graph, and how does it differ from process reward models?
The Epistemic Propagation Graph is a parameter-free runtime directed acyclic graph that propagates epistemic contamination forward: any Derivation depending on an Assumption is reclassified as Assumption-contaminated, transitively through all downstream dependents. Process reward models assign scalar quality scores based on empirical correlation with correct final answers, meaning a plausible but unverified assumption receives a high process reward if it happens to be true, whereas the EPG flags it as epistemically unsound regardless of accidental correctness. The EPG captures a structural property of reasoning chains that correctness-based scoring cannot detect by design.
Can ESAR be applied without fine-tuning the base model?
Yes, three of ESAR's four components operate with fully frozen base model weights. The ETC head is a separately trained lightweight module, the Epistemic Propagation Graph is a parameter-free runtime structure, and the Epistemic Backtracking Signal is a deterministic inference-time mechanism. Only the Uncertainty Accumulation Loss requires supervised fine-tuning. Experiments using frozen Llama-3-70B-Instruct achieve a 69.5% relative reduction in Epistemic Drift Rate on EpistemicBench and a 9.2 percentage point accuracy improvement (12.7% relative) on GSM8K five-plus-step problems, making ESAR deployable against any instruction-tuned LLM whose hidden states are accessible.
Why is EpistemicBench needed alongside standard reasoning benchmarks?
Standard reasoning benchmarks measure final answer accuracy without ground-truth annotation of where errors originate, making it impossible to distinguish assumption propagation failures from arithmetic errors or query misinterpretations. EpistemicBench addresses this by injecting a single ungrounded assumption at a controlled position and annotating exactly which subsequent steps depend on it, enabling computation of the Epistemic Drift Rate. This metric captures whether errors are caused specifically by assumption propagation through EPG ancestry, a structural diagnostic that accuracy metrics cannot surface, allowing direct evaluation of assumption detection in neural reasoning rather than only final output correctness.
Founder & CEO
Divyang Mandani is the CEO of KriraAI, driving innovative AI and IT solutions with a focus on transformative technology, ethical AI, and impactful digital strategies for businesses worldwide.