The Rise of Latent Reasoning Models: Beyond Chain of Thought

Divyang Mandani · 5 min read · Insights

The most consequential shift in reasoning systems over the next two years will not be a larger model or a better reward function. It will be the migration of the reasoning process itself off the discrete token stream and into continuous latent space. Every production reasoning system deployed today externalizes its intermediate computation as natural language tokens, which means the model is forced to compress every intermediate state through a vocabulary bottleneck of roughly one hundred thousand discrete symbols and then re-read that compression on the next step. This is an accident of how we trained these systems, not a property of what reasoning requires, and the research community has already begun to treat it as a temporary constraint rather than a permanent architecture.

The reason this matters to anyone building enterprise AI systems is that latent reasoning models change the fundamental economics of a reasoning call. When a model reasons in tokens, the amount of computation it can spend on a problem is coupled to the number of tokens it emits, and each of those tokens costs a full forward pass, a KV cache write, and an autoregressive dependency that cannot be parallelized. When a model reasons in latent space, that coupling breaks. Compute becomes a dial that can be turned independently of output length, and the intermediate representation stops being a lossy string and becomes a high-dimensional vector that carries far more information per unit of computation.

Recent research has made this trajectory concrete rather than speculative. Chain of continuous thought, often called Coconut, demonstrated that feeding a model's last hidden state directly back as its next input embedding, instead of decoding it to a token first, lets the model reason in a continuous space and search over multiple reasoning paths in superposition. Work on recurrent depth reasoning has shown that a model can iterate a recurrent block at test time to spend variable compute per problem without emitting a single extra token. These are early results, but they point to the same conclusion from different directions, and that convergence is the signal worth paying attention to.

This post is a technical forecast of where latent reasoning models are heading, why the token bottleneck is reaching its architectural ceiling, how adaptive test-time compute reshapes serving economics, what interpretability debt this creates, and what engineers building reasoning systems should be doing now to prepare for an inference stack that no longer assumes reasoning equals text.

The Token Bottleneck: Why Chain of Thought Is Reaching Its Architectural Ceiling

Chain of thought prompting delivered enormous gains because it gave models a scratchpad, and that scratchpad happened to be made of tokens because tokens were the only interface we had. The framing that reasoning must be verbalized is now so deeply embedded in tooling, evaluation, and intuition that most practitioners treat it as fundamental. It is not fundamental. It is a design choice with a specific and increasingly expensive set of consequences that become visible the moment you look at reasoning as a computational process rather than as a text generation task.

The core problem is that a token is a terrible container for an intermediate cognitive state. A single forward pass through a frontier model produces a hidden state that lives in a space of several thousand dimensions and encodes a rich, continuous distribution over possible continuations. Chain of thought throws almost all of that away by sampling a single discrete token, and then the next step must reconstruct the relevant context from that impoverished symbol. We are running high-bandwidth computation and then forcing it through a low-bandwidth serial channel on every single step.

Discrete tokens as a lossy reasoning channel

The information-theoretic view makes the loss precise. A hidden state vector carries on the order of thousands of continuous values, while a token selection carries at most the base-2 logarithm of the vocabulary size in bits, which is under seventeen bits (about 16.6) for a hundred-thousand-token vocabulary. Every reasoning step in a token-based system therefore passes through a bottleneck that is orders of magnitude narrower than the representation the model actually computed. Latent reasoning removes that bottleneck by keeping the intermediate state in its native continuous form and passing it forward without discretization.
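The arithmetic behind that comparison is short enough to check directly. The sketch below is illustrative only: the hidden width of 4096 and the 16-bit storage per value are assumptions chosen for the example, and the raw storage size of a hidden state is an upper bound on what it can carry, not a measurement of its information content.

```python
import math

def token_channel_bits(vocab_size: int) -> float:
    """Upper bound on the bits conveyed by selecting one token from the vocabulary."""
    return math.log2(vocab_size)

def hidden_state_storage_bits(dim: int, bits_per_value: int = 16) -> int:
    """Raw storage size of one hidden state (an upper bound, not its information content)."""
    return dim * bits_per_value

vocab_size = 100_000  # "roughly one hundred thousand" symbols, as in the text
hidden_dim = 4096     # assumed model width, for illustration only

token_bits = token_channel_bits(vocab_size)
state_bits = hidden_state_storage_bits(hidden_dim)

print(f"one token:        {token_bits:.2f} bits at most")
print(f"one hidden state: {state_bits} bits of raw storage")
print(f"ratio:            about {state_bits / token_bits:.0f}x")
```

Even if only a small fraction of the stored bits is usable signal, the gap to a single token selection remains several orders of magnitude under these assumptions.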

This is why continuous chain of thought can represent something that token-based reasoning structurally cannot, which is a genuine superposition of reasoning paths. When the intermediate state is a token, the model has committed to one branch of the search tree at every step and must backtrack explicitly if it was wrong. When the intermediate state is a continuous vector, the model can carry a weighted blend of several candidate lines of reasoning forward simultaneously and let later computation collapse them, which is closer to a breadth-first search encoded in the geometry of the representation than to a single depth-first trajectory.

The compute per token constant that latent reasoning breaks

The second structural ceiling is that token-based reasoning ties the compute budget to the output length through a fixed constant, which is the per-token forward pass cost. If a hard problem needs more computation than an easy one, the only lever a token-based model has is to emit more tokens, which is why reasoning traces balloon to thousands of tokens and why serving costs for reasoning models have grown faster than parameter counts. The model cannot think harder about a single step. It can only think longer across many steps.

Latent reasoning breaks this constant by decoupling depth of computation from length of output. A recurrent depth architecture can iterate its core block four times on an easy token and forty times on a hard one, spending compute exactly where the problem demands it, and never paying the KV cache and detokenization tax that a token would incur. I expect that by 2027 the leading reasoning systems will report their inference budget in terms of latent compute iterations or effective forward passes rather than reasoning tokens, because token count will have stopped being a meaningful proxy for how much the model actually thought.

What Latent Reasoning Models Actually Are

Latent reasoning models are systems in which some or all of the intermediate reasoning computation occurs in the model's continuous representation space rather than being decoded into and re-encoded from natural language tokens. The defining property is that the reasoning trajectory is a sequence of hidden states connected directly, without a discretization step in between, so that the full precision of each intermediate representation is preserved and passed forward. This is a strict generalization of chain of thought, because a token-based trace is simply the special case where you force each latent state to collapse to a vocabulary element before continuing.

There are two broad families emerging under this umbrella, and they will likely converge into hybrid systems rather than one winning outright. The first family keeps the autoregressive loop but replaces some of the emitted tokens with raw latent states, which is the Coconut style approach. The second family keeps a normal token interface but adds internal depth recurrence so the model can loop over its own computation at test time before committing to any output, which is the recurrent depth approach. Both are attempts to give the model somewhere to think that is not the visible text.

Continuous chain of thought and the Coconut mechanism

The Coconut mechanism is mechanically simple, which is part of why it is so promising as a direction. Instead of decoding the last hidden state into a token and then embedding that token for the next step, the last hidden state is fed back directly as the input embedding for the next forward pass, so the model reasons through a chain of continuous thoughts before it is asked to produce any words. The model is trained with a curriculum that gradually replaces language reasoning steps with these continuous latent steps, teaching it to use the latent channel as working memory rather than as a place to store a verbalized plan.
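The difference between the two feedback loops can be shown with a toy that has nothing transformer-specific in it. This is a minimal sketch, not the Coconut implementation: the three-token vocabulary, the 2-dimensional embeddings, the matrix `W`, and the functions `forward`, `decode`, `token_reasoning`, and `latent_reasoning` are all hypothetical stand-ins, and no training curriculum is modeled.

```python
import math

# Toy 3-token vocabulary with 2-dimensional embeddings (hypothetical values).
EMBED = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [-1.0, 0.0]}
# Stand-in for the entire model: a single fixed linear map followed by tanh.
W = [[0.6, 0.5], [-0.5, 0.6]]

def forward(x):
    """One 'forward pass': map an input embedding to a last hidden state."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]

def decode(hidden):
    """Greedy decode: the token whose embedding has the largest dot product."""
    return max(EMBED, key=lambda t: sum(h * e for h, e in zip(hidden, EMBED[t])))

def token_reasoning(x, steps):
    """Token chain: discretize after every step, then re-embed the chosen token."""
    trace = []
    for _ in range(steps):
        token = decode(forward(x))
        trace.append(token)
        x = EMBED[token]  # everything not captured by the token is lost here
    return trace

def latent_reasoning(x, steps):
    """Continuous thought: feed the hidden state straight back as the next input."""
    for _ in range(steps):
        x = forward(x)  # no discretization between steps
    return x

print(token_reasoning(EMBED["A"], 3))
print(latent_reasoning(EMBED["A"], 3))
```

Starting from token A in this toy, the token path decodes A again at every step, so nothing new is carried forward, while the latent path keeps evolving because the full hidden state survives between steps.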

What makes this more than a compression trick is the behavior it induces. Because the latent state is continuous, the model does not have to commit to a single next reasoning step, and the reported results suggest it effectively explores multiple candidate paths and prunes them internally, which is the explanation offered for why continuous chain of thought outperformed token chains on certain logical reasoning tasks that require substantial backtracking during planning, while using fewer reasoning steps at inference. The engineering implication is significant, because a system that searches in latent space could solve constraint satisfaction and planning problems that would blow up combinatorially if expressed as explicit token-level tree search. I expect the first widely used open-weight model shipping a continuous chain-of-thought mode to appear within the next twelve to eighteen months, initially as an optional reasoning path rather than the default.

Recurrent depth reasoning and looped transformers

The recurrent depth approach attacks the same problem from the compute side rather than the representation side. Instead of a fixed stack of layers executed exactly once, a looped or recurrent block is iterated a variable number of times, so the model can add effective depth at test time by looping more, which lets it scale reasoning compute without growing the parameter count and without emitting extra tokens. Research on this direction has demonstrated models that meaningfully improve on reasoning benchmarks purely by increasing the number of recurrent iterations at inference, which is a form of test-time compute scaling that lives entirely inside a single forward pass.

This matters because it revives adaptive computation time in a form that actually trains stably at scale, which earlier attempts struggled with. A looped transformer can learn a halting behavior where easy inputs exit after a few iterations and hard inputs recurse deeper, and because the recurrence shares weights, the memory footprint stays roughly constant while the compute varies. The practitioner takeaway is that this architecture family will let a single deployed model span a much wider difficulty range than a fixed depth model, which is exactly the property you want when your traffic mixes trivial and hard queries through one endpoint.
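The halting behavior described above can be illustrated with a numeric analogy. In this sketch the "recurrent block" is one Newton step toward a square root, reused with the same parameters on every iteration, and the loop exits once the state stops changing. The names `recurrent_block` and `run_adaptive_depth` and the convergence-based halting rule are assumptions for illustration; a learned halting mechanism in a looped transformer would look different.

```python
def recurrent_block(state: float, query: float) -> float:
    """Shared-weight block: one Newton step toward sqrt(query)."""
    return 0.5 * (state + query / state)

def run_adaptive_depth(query: float, tol: float = 1e-9, max_iters: int = 64):
    """Iterate the same block until the state stops moving; return (state, depth)."""
    state = 1.0  # every query starts from the same initial state
    for depth in range(1, max_iters + 1):
        new_state = recurrent_block(state, query)
        if abs(new_state - state) < tol:  # halting rule: the update is negligible
            return new_state, depth
        state = new_state
    return state, max_iters

for q in (1.0, 2.0, 1_000_000.0):
    result, depth = run_adaptive_depth(q)
    print(f"query={q:>11}: result={result:.6f}, iterations={depth}")
```

The memory footprint is a single state regardless of depth, and inputs that sit far from the starting state take more iterations, which mirrors the easy-exits-early, hard-recurses-deeper pattern in miniature.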

Test-Time Compute Scaling Enters Its Second Phase

The first phase of test-time compute scaling was external and sampling-based. It worked by generating many candidate solutions, sometimes hundreds, and then selecting among them with a verifier or a majority vote, which is how the current generation of reasoning systems buys accuracy with inference compute. This approach is effective but embarrassingly wasteful, because it discards almost all of the computation it performs, and because the parallel samples share no information and cannot learn from one another during the search.
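A minimal sketch of the sampling-and-vote pattern makes the waste visible. Everything here is hypothetical: `noisy_solver` stands in for one full independent generation, the 60 percent per-sample accuracy is an arbitrary assumption, and the cost is counted simply as the number of generations.

```python
import random
from collections import Counter

def noisy_solver(correct: int, accuracy: float, rng: random.Random) -> int:
    """Stand-in for one full, independent reasoning trace ending in a final answer."""
    if rng.random() < accuracy:
        return correct
    return correct + rng.choice([-2, -1, 1, 2])  # a wrong answer near the right one

def majority_vote(n_samples: int, correct: int = 42, accuracy: float = 0.6, seed: int = 0):
    """Run n independent samples and return (winning answer, its votes, generations paid for)."""
    rng = random.Random(seed)
    answers = [noisy_solver(correct, accuracy, rng) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes, n_samples

winner, votes, cost = majority_vote(25)
print(f"answer={winner}, votes={votes}/{cost}, generations paid for={cost}")
```

Note that no sample sees any other sample: each call to `noisy_solver` starts from scratch, and only the final answers are compared, which is the lack of information sharing described above.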

The second phase of test-time compute scaling will be internal and adaptive, and this is where latent reasoning becomes the enabling substrate. Rather than sampling many full traces and voting, a latent reasoning model will spend a variable amount of continuous computation inside a single trajectory, allocating more iterations to the hard sub-steps and fewer to the easy ones, which is a far more efficient use of the same compute budget. The shift from search over many outputs to depth within one output is the defining move of the next generation, and it reframes what the inference budget is even buying.

From sampling-based search to intrinsic adaptive depth

The economic argument for this transition is straightforward. Sampling-based search has a poor scaling exponent, because to double your effective accuracy on the hard tail you often need to multiply your sample count by a large factor, and each sample is a full independent generation with no compute sharing. Intrinsic adaptive depth amortizes far better, because the model reuses a shared latent state and only pays for additional iterations where the problem is genuinely hard, so the marginal compute goes to the marginal difficulty rather than being sprayed uniformly across redundant samples.

I expect the frontier to move decisively toward intrinsic adaptive depth as the primary reasoning lever within two model generations, with sampling-based search surviving only as an outer loop for the very hardest problems where verifiable diversity still helps. Concretely, I expect leading systems by late 2027 to match or exceed today's best sampling-based reasoning accuracy at a fraction of the inference FLOPs, on the order of a threefold to tenfold reduction, by replacing wide sampling with deep latent iteration. This will not eliminate sampling entirely, but it will change the default, and the default is what determines the shape of the serving stack.

Adaptive Compute Transformers and the Death of the Fixed Forward Pass

The single deepest architectural assumption in every inference engine deployed today is that a forward pass is a fixed amount of work. Batching, scheduling, memory planning, and cost models all assume that each token costs the same known quantity of computation, which is what makes static batching and paged attention tractable. Adaptive compute transformers break this assumption at the root, because the whole point is that different tokens and different queries do different amounts of work depending on difficulty, and that variability is dynamic and data-dependent.

This is the change that most practitioners have not yet priced in, and it is the one with the widest blast radius across the stack. When compute per token becomes variable, the neat abstractions that make current serving efficient start to leak, because you can no longer plan a batch as a fixed rectangle of work. A model that halts early on easy tokens and recurses deeply on hard ones will produce a ragged compute profile within a single batch, and the scheduler has to deal with a distribution of work rather than a constant.
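The cost of a ragged compute profile can be estimated with one ratio. The sketch below assumes a batch that runs in lockstep, so every sequence occupies its slot until the deepest one halts; the function name `lockstep_utilization` and the example depths are illustrative assumptions.

```python
def lockstep_utilization(depths):
    """Fraction of slot-iterations doing useful work when the batch waits for its deepest member."""
    useful = sum(depths)                 # iterations the sequences actually needed
    padded = max(depths) * len(depths)   # iterations the lockstep batch occupies
    return useful / padded

uniform = [8, 8, 8, 8]   # the fixed-rectangle case: every token costs the same
ragged = [4, 4, 4, 40]   # three easy tokens and one deep recurser

print(f"uniform batch utilization: {lockstep_utilization(uniform):.3f}")
print(f"ragged batch utilization:  {lockstep_utilization(ragged):.3f}")
```

With one deep recurser among three easy tokens, most of the occupied slot-iterations are idle padding, which is why the scheduler has to treat depth as a distribution rather than a constant.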

The transition will therefore be as much a systems engineering story as a modeling story, and the teams that win will be the ones who treat it that way. Serving infrastructure will need to support variable depth execution, dynamic halting, and continuation state that is neither a clean token nor a clean KV cache entry, which is a genuinely new object in the inference stack. I expect the major inference frameworks to add first-class support for variable depth and latent continuation state within roughly two years, because the modeling side will force it, and the teams that build this capability early will have a durable efficiency advantage that is hard to replicate quickly.

There is a hardware dimension here that reinforces the trajectory. Adaptive depth turns a compute-bound workload into one with far more dynamic control flow, which interacts awkwardly with the wide static dataflow that current accelerators are optimized for, and this tension will push both compiler and kernel work toward supporting efficient conditional and iterative execution. The accelerators and kernels that handle variable iteration counts gracefully, without stalling the whole batch on the deepest recurser, will define the practical efficiency ceiling for latent reasoning models over the next several years.

Latent Reasoning vs Chain of Thought: The Tradeoff Practitioners Underestimate

The comparison of latent reasoning vs chain of thought is usually framed as pure upside, with faster and cheaper reasoning and better search, and that framing is dangerously incomplete. The honest version acknowledges a real and consequential tradeoff, which is that you are exchanging a transparent, inspectable, verbalized reasoning process for an opaque, high-dimensional one that no human and no simple monitor can read directly. Chain of thought is legible almost by accident, and that legibility has become load-bearing for debugging, evaluation, and safety in ways the industry now takes for granted.

When reasoning moves into latent space, that legibility does not degrade gracefully. It disappears in a discrete step, because a continuous hidden state has no natural rendering into language, and any attempt to decode it back into words is a lossy and potentially misleading approximation of what the model actually computed. Teams that have built their evaluation and guardrail stack on reading and pattern matching over reasoning traces will find those tools returning nothing useful the day they switch to a latent reasoning path.

The tradeoff is not a reason to avoid latent reasoning, because the efficiency and capability gains are too large to leave on the table, and competitive pressure will force adoption regardless. It is a reason to build the replacement observability layer before you need it, rather than after an incident makes its absence obvious. The practitioners who understand the latent reasoning vs chain of thought tradeoff as an observability migration, and not merely a performance upgrade, are the ones who will deploy these systems without walking into an interpretability wall.

There is also a subtler capability tradeoff that deserves attention. Token-based chain of thought benefits from the fact that natural language is a compressed, structured prior that the model has absorbed from pretraining, so verbalizing a plan can sometimes regularize the reasoning and prevent it from drifting into incoherent states. Pure latent reasoning gives up that linguistic scaffold, which is why I expect the durable production architecture to be hybrid, keeping language for the high-level plan and using latent computation for the dense inner loops where verbalization only wastes bandwidth.

The Serving and Systems Consequences No One Has Priced In

Reasoning models already broke the assumption that output length is short, and latent reasoning models will break several more assumptions that the current serving stack depends on. The most immediate is that the relationship between a request and its resource footprint becomes far harder to predict in advance, because the compute a query consumes now depends on its difficulty rather than only on its length, and difficulty is not visible until the model starts working on it. Capacity planning that assumes a known distribution of token counts will need to be rebuilt around a distribution of compute depths.

This has direct consequences for how these systems are priced and metered, which is a question every platform team will have to answer. Billing by output token stops being a fair or even coherent proxy for cost once a model can spend forty latent iterations producing a three-token answer, so I expect commercial reasoning APIs to move toward billing on effective compute or reasoning depth within the next two to three years. The provider that gets this metering right will be able to serve adaptive depth efficiently and pass through honest costs, while providers stuck on token billing will either overcharge easy queries or subsidize hard ones.
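The metering mismatch is easy to see with two requests that emit the same number of tokens. This is a sketch under stated assumptions: the `Usage` record, the two billing functions, and the integer prices are hypothetical and do not describe any provider's actual pricing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Usage:
    output_tokens: int
    latent_iterations: int  # total recurrent iterations summed over all output tokens

def token_bill(usage: Usage, price_per_token: int) -> int:
    """Bill by emitted tokens only."""
    return usage.output_tokens * price_per_token

def compute_bill(usage: Usage, price_per_iteration: int) -> int:
    """Bill by effective compute, measured here in latent iterations."""
    return usage.latent_iterations * price_per_iteration

easy = Usage(output_tokens=3, latent_iterations=3 * 4)    # 4 iterations per token
hard = Usage(output_tokens=3, latent_iterations=3 * 40)   # 40 iterations per token

for name, usage in (("easy", easy), ("hard", hard)):
    print(name, "token bill:", token_bill(usage, 2), "compute bill:", compute_bill(usage, 1))
```

Under token billing the two requests cost the same even though the hard one consumed ten times the iterations, so the provider either overcharges the easy request or subsidizes the hard one.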

KV cache, batching, and variable depth inference

The KV cache is the specific place where latent reasoning stresses current infrastructure hardest, and it deserves concrete attention. In a recurrent depth model, iterating the core block multiple times raises the question of what state persists across iterations and how it interacts with the attention cache, because the model may need to attend to intermediate latent states that were never emitted as tokens and therefore have no clean cache slot in today's designs. The cache abstraction was built around tokens, and latent reasoning introduces intermediate state that does not fit that abstraction cleanly.

Batching gets harder in a related way, because continuous batching schedulers assume each sequence advances by one token per step, and a variable depth model breaks that lockstep. A well-engineered system will need to schedule iterations rather than tokens, keep deep recursers from starving the batch, and reclaim memory from sequences that halt early, all of which are solvable but none of which today's default schedulers do. I expect the first robust open-source implementations of variable depth continuous batching to appear within about two years, and their arrival will be the practical signal that latent reasoning is ready for mainstream production rather than research demos. This is precisely the kind of systems-level frontier where KriraAI focuses its applied research, because the modeling advance is only realized in production once the serving stack catches up to it.
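Scheduling iterations rather than tokens can be sketched as a small simulation. The code below is a toy under loud assumptions: each sequence's required depth is known up front (a real model only discovers it by running), there is no KV cache or memory model, and fairness toward deep recursers is not handled. The function names are hypothetical and do not correspond to any inference framework's API.

```python
from collections import deque

def iteration_level_schedule(depths, slots):
    """Steps needed when a halted sequence frees its slot for the next one immediately."""
    pending = deque(depths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < slots:
            active.append(pending.popleft())      # admit work into any free slot
        active = [d - 1 for d in active]          # one iteration for every active sequence
        active = [d for d in active if d > 0]     # reclaim slots from sequences that halted
        steps += 1
    return steps

def static_batch_schedule(depths, slots):
    """Steps needed when each batch runs until its deepest member halts."""
    return sum(max(depths[i:i + slots]) for i in range(0, len(depths), slots))

depths = [40, 4, 4, 4, 4, 4, 4, 4]  # one deep recurser followed by easy sequences
print("static batching:     ", static_batch_schedule(depths, slots=2), "steps")
print("iteration scheduling:", iteration_level_schedule(depths, slots=2), "steps")
```

In this example the easy sequences cycle through the second slot while the deep recurser keeps the first one, so total time is bounded by the deepest sequence instead of by the sum of per-batch maxima.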

The Interpretability Debt Latent Reasoning Creates

Every architecture that improves capability by hiding computation incurs an interpretability debt, and latent reasoning takes on a large one. The reason chain-of-thought monitoring works at all is that the reasoning is expressed in a channel humans and classifiers can read, which lets us catch a model reasoning toward a bad outcome before it acts, at least when the trace is faithful. There is already good evidence that token traces are not always faithful to the underlying computation, but a somewhat unfaithful readable trace is still far more useful than no trace at all, and latent reasoning threatens to take us to no trace at all.

Paying down this debt is one of the most important open research directions the field faces, and it will not be solved by wishful thinking. Interpretability of latent reasoning will need to advance from reading tokens to probing continuous state, which means training probes and decoders that can extract a faithful summary of what a latent trajectory is doing, ideally in real time and cheaply enough to run as a monitor in production. This is a genuinely hard problem, because a continuous state is not obligated to organize itself into human-legible concepts, and forcing it to may cost some of the capability that made latent reasoning attractive in the first place.

I expect three parallel approaches to develop over the next few years, and mature systems will likely combine them rather than rely on any single one.

  1. Latent probes will be trained to decode intermediate hidden states into structured summaries of the reasoning in progress, giving monitors a lossy but useful readout of a trajectory the model never verbalized.

  2. Constrained latent architectures will be designed to keep reasoning in a subspace that is deliberately shaped to be more decodable, trading a small amount of raw capability for a large gain in inspectability.

  3. Hybrid reasoning modes will keep a verbalized high-level plan while pushing only the dense inner computation into latent space, preserving a readable skeleton even when the details are opaque.
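The first approach in the list, a latent probe, can be sketched as a linear classifier trained on hidden states. The example below uses synthetic data in place of real model activations: the 3-dimensional "hidden states", the binary property encoded along dimension 0, and the helper names `make_latents`, `train_probe`, and `probe_accuracy` are all assumptions made for illustration. A real probe would be trained on activations captured from the model and validated for faithfulness, which this toy does not address.

```python
import math
import random

def make_latents(n, rng):
    """Synthetic hidden states: dimension 0 encodes a binary property, the rest is noise."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        signal = (1.0 if label else -1.0) + rng.gauss(0.0, 0.3)
        data.append(([signal, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], label))
    return data

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, epochs=100, lr=0.5):
    """Logistic-regression probe trained with full-batch gradient descent."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b

def probe_accuracy(w, b, data):
    """Fraction of held-out states whose property the probe reads out correctly."""
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y) for x, y in data)
    return hits / len(data)

rng = random.Random(0)
train_set, held_out = make_latents(200, rng), make_latents(200, rng)
weights, bias = train_probe(train_set)
accuracy = probe_accuracy(weights, bias, held_out)
print(f"probe accuracy on held-out latents: {accuracy:.2f}")
```

The probe is cheap enough to run alongside inference, which is the property a production monitor needs, but high probe accuracy only shows that the property is linearly decodable, not that the model actually uses it.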

The teams that treat interpretability of latent reasoning as a first-class engineering requirement, rather than a research nicety to bolt on later, will be the ones able to deploy these systems in regulated and high-stakes settings. The teams that ignore it will find themselves unable to satisfy an auditor, a regulator, or their own incident response process, because they will have no answer to the question of why the model did what it did. KriraAI treats this observability layer as part of the core system design rather than an afterthought, because a reasoning system you cannot inspect is a reasoning system you cannot responsibly ship into production.

What Engineers Should Be Building Toward Now

The right response to a shifting architecture is not to rewrite everything today for systems that are not yet mature, and it is also not to wait until the transition is complete and be caught flat-footed. The right response is to make the specific preparatory investments that pay off regardless of exactly which latent reasoning variant wins, because those investments reduce your migration cost and increase your optionality. The following are the concrete moves that a technical team building reasoning systems should be making in the near term.

  1. Decouple your reasoning budget abstraction from token count now, so that your internal cost model, rate limiting, and capacity planning are expressed in terms of compute or effective forward passes rather than emitted tokens, because that abstraction will survive the transition and a token-based one will not.

  2. Build an observability layer that does not depend on reading chain-of-thought text, so that when reasoning moves into latent space you already have probes, aggregate behavioral signals, and outcome-based monitors that keep working when the trace goes dark.

  3. Instrument your traffic to measure the difficulty distribution of real queries, because adaptive depth only pays off if you understand how much of your traffic is easy versus hard, and most teams have never actually measured this on their production distribution.

  4. Keep your serving layer modular around the forward pass, so that when variable depth execution and latent continuation state arrive in your inference framework, you can adopt them without a ground-up rewrite of your scheduler and memory management.

  5. Treat reasoning faithfulness and monitorability as an explicit design requirement in any system that touches regulated, safety-relevant, or high-value decisions, so that you are not forced to choose between capability and compliance at deployment time.
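The first and third items in the list above can be prototyped before any latent reasoning model is deployed. The sketch below expresses a per-request budget in effective forward passes and buckets observed usage into a difficulty histogram. The class `ReasoningBudget`, the function `difficulty_histogram`, and the bucket edges of 8 and 32 passes are hypothetical choices for illustration, not recommended values.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ReasoningBudget:
    """Budget expressed in effective forward passes, independent of emitted tokens."""
    max_forward_passes: int
    used: int = 0

    def charge(self, forward_passes: int) -> bool:
        """Reserve compute; refuse (and change nothing) if it would exceed the budget."""
        if self.used + forward_passes > self.max_forward_passes:
            return False
        self.used += forward_passes
        return True

def difficulty_histogram(passes_per_query, easy_max=8, medium_max=32):
    """Bucket observed per-query compute into easy, medium, and hard."""
    def bucket(passes):
        if passes <= easy_max:
            return "easy"
        return "medium" if passes <= medium_max else "hard"
    return Counter(bucket(p) for p in passes_per_query)

budget = ReasoningBudget(max_forward_passes=100)
print(budget.charge(60), budget.charge(50), budget.used)
print(dict(difficulty_histogram([4, 4, 10, 40])))
```

Because the budget counts forward passes, the same abstraction works today, when one emitted token is one pass, and later, when a single token may consume many latent iterations.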

The strategic point underneath these tactics is that latent reasoning models will not arrive as a single dramatic release that you can react to on the day. They will arrive as a gradual migration of computation off the token stream, one capability and one serving feature at a time, and the teams positioned to benefit are the ones whose abstractions already anticipate that migration. This is the kind of forward-calibrated architecture decision that KriraAI helps technical teams navigate, because the cost of building for where the technology is today is a rewrite tomorrow.

Conclusion

The migration of reasoning off the token stream and into continuous latent space is the architectural shift that will define the next generation of reasoning systems, and it changes three things at once for the people who build them. It changes how AI systems will be architected, because the fixed forward pass gives way to adaptive depth and variable compute, and the reasoning trace stops being text and becomes state. It changes what engineering decisions matter now, because the teams that decouple their cost abstractions from token count and build observability that survives the loss of readable traces will migrate cheaply while others rewrite under pressure. And it opens a capability frontier where reasoning depth scales inside a single trajectory rather than across wasteful parallel samples, which is a fundamentally more efficient use of inference compute than anything deployed today.

The practitioners who internalize latent reasoning models as an observability and serving migration, and not merely a performance upgrade, will be the ones who deploy these systems responsibly and at scale. This is precisely the space where KriraAI operates, at the intersection of applied AI research and production deployment, building reasoning systems designed for where the architecture is heading rather than where it sits today. KriraAI conducts the systems-level research that turns a modeling advance like continuous chain of thought into something that actually serves efficiently, monitors faithfully, and holds up under real production traffic, because the frontier capability is only real once the full stack around it is real.

Technical teams evaluating how to prepare their reasoning infrastructure for adaptive test-time compute scaling and latent reasoning are exactly the peers we build for, and we welcome the conversation about how to architect for this transition deliberately. The token bottleneck was never fundamental, and the systems that recognize that first will be the ones defining what reasoning models can do next.

FAQs

How do latent reasoning models differ from chain of thought?

Latent reasoning models keep the intermediate reasoning state as a continuous hidden vector passed directly forward, while chain of thought decodes each step into a discrete token and re-encodes it. This removes the vocabulary bottleneck, preserves far more information per step, and lets the model search multiple reasoning paths in superposition rather than committing to one verbalized branch at every step.

What is continuous chain of thought?

Continuous chain of thought, implemented by the Coconut approach, feeds a model's last hidden state back as its next input embedding instead of decoding it to a token, so reasoning happens in latent space. A training curriculum gradually replaces verbalized reasoning steps with these continuous latent steps, teaching the model to use the latent channel as working memory that carries richer state than any single token could.

Does latent reasoning make models harder to monitor?

Yes, latent reasoning removes the readable text trace that chain-of-thought monitoring depends on, because a continuous hidden state has no natural rendering into language. Recovering monitorability will require trained latent probes, deliberately decodable architectures, or hybrid modes that keep a verbalized high-level plan, and teams should build this observability layer before deploying rather than after an incident.

How does adaptive test-time compute scaling change serving and billing?

Adaptive test-time compute scaling decouples computation from output length, so a model can spend variable depth per query based on difficulty rather than emitting more tokens. This breaks the fixed forward pass assumption behind current batching and billing, and I expect serving frameworks to add variable depth support and providers to shift toward compute-based metering within roughly two to three years.

Will latent reasoning replace chain of thought?

Latent reasoning will not fully replace chain of thought, because verbalized plans provide a useful linguistic prior and essential legibility for high-stakes systems. The durable production architecture will be hybrid, keeping language for the high-level plan and pushing dense inner computation into latent space, which captures most of the efficiency gain while retaining a readable reasoning skeleton for monitoring and debugging.

Divyang Mandani

Founder & CEO

Divyang Mandani is the CEO of KriraAI, driving innovative AI and IT solutions with a focus on transformative technology, ethical AI, and impactful digital strategies for businesses worldwide.
