Inference-Time Compute Scaling: Architecting Reasoning AI

When the published ARC-AGI evaluations of late 2024 demonstrated that scaling inference compute across a single model could move a benchmark from below thirty percent to above eighty percent without changing weights, mainstream coverage treated the result as a curiosity about prompting and sampling. Researchers who track scaling laws read it differently. What had been demonstrated, with public numbers and reproducible methodology, was that the compute budget required to extract a given capability could be paid at inference rather than training, and that the exchange rate was favourable enough to alter the long-run economics of how foundation models will be built. The trajectory from that observation through DeepSeek-R1's pure reinforcement learning recipe, through process reward models reaching parity with outcome reward models on harder benchmarks, through the first production deployments of variable-depth reasoning systems, points toward an architectural reorganisation of foundation model design that is already underway in research and not yet absorbed by most engineering teams.

Inference-time compute scaling is the central technical lever of the next generation of AI systems, and the engineering decisions that depend on it have not been made by most teams currently shipping production AI. The fixed-compute inference assumption that underpins almost every serving stack written before 2024 is no longer the dominant design point in frontier research, and within eighteen months it will not be the dominant design point in production either. Practitioners who continue to architect around it will find their cost models, latency budgets, and capability ceilings calibrated against a paradigm that has already shifted.

This post is a forward-looking technical analysis of where inference-time compute scaling is heading at the level of architecture, training methodology, and production engineering. It is written for practitioners who already understand transformers, reinforcement learning from human feedback, and chain-of-thought, and who want to think clearly about what comes after the current generation of reasoning models. The analysis covers the migration of compute from training to inference, the architectural emergence of learned compute controllers and verifier-guided search, the closed-loop training paradigms that will dominate the next eighteen months, the production engineering implications for serving infrastructure, and the open research problems that will determine which approaches reach scale first. The thesis throughout is that the systems being built today against the assumption of fixed-compute inference are designed against a paradigm that is already obsolete in research and will be obsolete in production within twenty-four months.

The Research Result That Changed the Scaling Conversation

The pretraining scaling laws established by Kaplan et al. and refined by Hoffmann et al. gave a generation of practitioners a clean mental model. Loss decreased predictably with compute, parameters, and tokens, and the relationship was a power law with consistent exponents. That model has held up for the dominant compute regime since 2020, and it remains useful. What changed in late 2024 and through 2025 was the empirical confirmation that a second, distinct scaling law exists for compute spent at inference, and that its exponents are competitive with the pretraining law on a meaningful range of capabilities.

The o1 release demonstrated that a model trained to use chain-of-thought as an action space, with reinforcement learning over reasoning traces, exhibited capability gains on hard reasoning benchmarks that followed a power law in inference tokens. The o3 result extended this to compute regimes four orders of magnitude beyond conventional inference, and the gains continued to follow a clean trajectory. DeepSeek-R1 then demonstrated that the training recipe required to produce such models was less exotic than initially assumed: its R1-Zero variant used a pure outcome-supervised reinforcement learning approach to elicit chain-of-thought from a base model without any supervised fine-tuning on reasoning traces.

The combined implication of these results is the central observation that frames everything that follows. The compute frontier of foundation model capability has split into two distinct dimensions, and the optimal allocation between them is no longer obviously dominated by pretraining. When the inference scaling exponent is favourable, additional capability can be purchased by scaling test-time computation against a fixed-weight model, and this purchase can be made selectively per query rather than uniformly across the entire training run.

Why this is not just better sampling

It would be easy to read the inference scaling result as a more sophisticated version of best-of-N sampling, and many engineering teams have done exactly that. This reading is incorrect in a way that matters for architectural decisions. Naive best-of-N sampling scales linearly in compute and sub-logarithmically in capability for most tasks, which is a poor exchange rate that does not justify infrastructure rebuilds. Inference-time compute scaling under verifier guidance, dynamic depth, and learned exploration policies scales as a power law in compute with exponents that, on the empirical evidence available, sit in the same range as pretraining exponents. The exchange rate is fundamentally different, and the architectural implications are correspondingly different.

Why the Compute Budget Is Migrating From Training to Inference

The economic argument for shifting compute from training to inference is not subtle, but it is often misstated. The argument is not that training is becoming unimportant. The argument is that the marginal compute dollar is increasingly better spent at inference for two reasons. First, pretraining returns are now visibly diminishing for the largest models, with frontier labs reporting that each generation of frontier pretraining requires substantially more compute for a smaller capability lift. Second, inference scaling has not yet exhibited diminishing returns at the compute levels being explored, and the structural reasons to expect diminishing returns are weaker than for pretraining.

The economics of pretraining have reached visible diminishing returns

A frontier pretraining run in 2026 costs at least two orders of magnitude more than the equivalent run in 2022, and the resulting capability gains, measured on the difficult benchmarks that matter, are not two orders of magnitude larger. This is not a controversial observation among researchers at frontier labs, though the public framing has been mixed. The data wall has constrained scaling along the parameters-times-tokens axis, the quality of additional web data has degraded, and the synthetic data approaches that were supposed to relieve the constraint have introduced their own problems with mode collapse and distribution narrowing. The pretraining curve has not flattened entirely, but its second derivative is negative, and the marginal economics are unfavourable for many capability targets.

The economics of test-time search have not

Inference compute, by contrast, has not yet encountered its analogous wall. The reason is structural. Inference compute can be spent in ways that are individually tailored to the difficulty of each query, can be guided by verifiers that themselves improve, and can be allocated across architectures that admit substantially more parallelism than serial autoregressive decoding. The scaling exponent for inference compute under verifier guidance is currently being characterised on increasingly difficult benchmarks, and the published numbers are consistent with continued returns at compute levels two to four orders of magnitude beyond current production deployments. By the end of 2026, at least one major frontier lab will publish results demonstrating verifier-guided search that scales as a power law in compute up to four orders of magnitude beyond current reasoning model deployments, and this will reframe how training compute is allocated at every lab that takes the result seriously.

The migration is not symmetric in time. Pretraining compute is committed in advance, amortised across all future inferences, and largely fixed once the run completes. Inference compute is paid per query, can be scaled adaptively, and admits a fundamentally different cost structure. This distinction matters enormously for how AI systems will be priced, served, and deployed, and it is the subject of the production engineering section below.

The Architecture of Adaptive Inference-Time Compute Allocation

The architectural pattern that is emerging in reasoning models is not a monolithic transformer with a longer context window, despite surface appearances. What is actually emerging is a composite system in which a policy model generates candidate reasoning trajectories, a verifier model scores those trajectories at varying granularities, and a controller decides how to allocate compute across exploration of the trajectory space. This composite is the new unit of frontier capability, and it has architectural implications that go beyond any individual component.

Process reward models as the routing substrate

Process reward models are the architectural innovation that makes verifier-guided search practical at scale. Outcome reward models, which assign a single score to a complete reasoning trace, provide sparse supervision that is insufficient to guide search efficiently through long deliberative trajectories. Process reward models score each step of a reasoning trajectory, providing dense supervision that allows search to prune unpromising branches early, allocate exploration toward partial solutions that look promising, and accumulate confidence about a candidate answer as the trace develops. The technical challenge in training process reward models at scale is step-level label noise, since the correctness of an intermediate reasoning step is not always well-defined and human annotation does not scale to the data volumes required.
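The difference between sparse and dense supervision can be sketched with a toy example. Everything here is an illustrative assumption: `toy_step_scorer` is a hypothetical rule-based stand-in for a learned process reward model, and the arithmetic trace is invented. The point is only that the outcome score says the trace failed, while the step scores say where.

```python
import operator
import re
from typing import Callable, List, Optional, Sequence

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
STEP = re.compile(r"\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*")


def toy_step_scorer(prefix: Sequence[str]) -> float:
    """Hypothetical PRM stand-in: 1.0 if the latest step's arithmetic holds."""
    m = STEP.fullmatch(prefix[-1])
    if m is None:
        return 0.0
    a, op, b, claimed = int(m[1]), m[2], int(m[3]), int(m[4])
    return 1.0 if OPS[op](a, b) == claimed else 0.0


def outcome_score(steps: Sequence[str], expected: int) -> float:
    """ORM-style signal: a single score based only on the final answer."""
    m = STEP.fullmatch(steps[-1])
    return 1.0 if m is not None and int(m[4]) == expected else 0.0


def process_scores(
    steps: Sequence[str], step_scorer: Callable[[Sequence[str]], float]
) -> List[float]:
    """PRM-style signal: one score per step, each computed on the prefix so far."""
    return [step_scorer(steps[: i + 1]) for i in range(len(steps))]


def first_weak_step(scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step a search procedure could prune at, if any."""
    for i, score in enumerate(scores):
        if score < threshold:
            return i
    return None


# Invented trace for (2 + 3) * 4 - 1 = 19, with an error in the second step.
trace = ["2 + 3 = 5", "5 * 4 = 21", "21 - 1 = 20"]
sparse = outcome_score(trace, expected=19)
dense = process_scores(trace, toy_step_scorer)
print(sparse, dense, first_weak_step(dense))
```

A search procedure with access to the dense scores can abandon this branch at the second step instead of paying for the rest of the trace.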

Current research is converging on a hybrid training regime in which process reward models are bootstrapped from outcome-supervised data through a combination of Monte Carlo rollout estimates of step value and automatic verifier generation for domains with checkable answers. The result is a verifier that is approximately calibrated at the step level without requiring step-level human labels at the scale of the policy model's training data. Within twenty-four months, process reward models trained through this hybrid regime will be standard components in frontier model training pipelines, and the open-weights ecosystem will include several competitive process reward model checkpoints suitable for guided inference.
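A minimal sketch of the Monte Carlo rollout idea, under stated assumptions: `rollout` is a hypothetical callable that completes a partial trajectory and returns a final answer, `is_correct` is an outcome checker, and `toy_rollout` is an invented policy used only to make the example run. The value of a step is estimated as the fraction of completions from that prefix that end in a correct answer, which yields step-level labels without step-level human annotation.

```python
import random
from typing import Callable, List, Sequence


def mc_step_values(
    steps: Sequence[str],
    rollout: Callable[[Sequence[str], random.Random], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 32,
    seed: int = 0,
) -> List[float]:
    """Estimate each step's value as the success rate of rollouts from its prefix."""
    rng = random.Random(seed)
    values = []
    for i in range(len(steps)):
        prefix = list(steps[: i + 1])
        wins = sum(is_correct(rollout(prefix, rng)) for _ in range(n_rollouts))
        values.append(wins / n_rollouts)
    return values


def toy_rollout(prefix: Sequence[str], rng: random.Random) -> str:
    """Invented policy: a prefix containing a bad step can never recover."""
    if "bad" in prefix:
        return "wrong"
    return "right" if rng.random() < 0.8 else "wrong"


values = mc_step_values(
    ["ok", "bad", "ok"], toy_rollout, lambda answer: answer == "right", n_rollouts=200
)
print(values)
```

The estimates are noisy for small `n_rollouts`, which is one source of the step-level label noise discussed above; the rollout count trades label quality against data generation cost.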

Learned compute controllers and dynamic depth

The second architectural innovation is the controller that decides how much compute to spend on a given query. Fixed sampling parameters such as temperature, top-p, and number of samples are the current crude proxy for compute allocation, and they leave substantial efficiency on the table. The emerging alternative is learned compute controllers, which take the query, optionally a partial trajectory, and the current verifier confidence as input, and output a compute allocation decision. The decision space includes how many parallel trajectories to sample, when to stop expanding a branch, when to terminate the entire reasoning process, and when to escalate to a larger or more specialised model.

Learned compute controllers represent a genuine architectural shift because they break the assumption that inference is a fixed-cost operation per query. The cost per query becomes a function of query difficulty, target confidence, and available compute budget, and the controller is the component that implements this function. Within twenty-four months, the dominant production reasoning model architecture will include a learned compute controller as a first-class component, distinct from the policy and verifier, with its own training objective and its own evaluation methodology.
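The controller's interface can be sketched as follows. This is an illustration of the input and decision space described above, not a learned controller: the hand-written rules in `heuristic_controller` are a placeholder for a trained policy, and every name, threshold, and field here is a hypothetical choice.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STOP = "stop"          # terminate reasoning and return the best answer so far
    EXPAND = "expand"      # sample more trajectories
    ESCALATE = "escalate"  # hand off to a larger or more specialised model


@dataclass(frozen=True)
class ControllerState:
    difficulty: float           # estimated query difficulty in [0, 1]
    verifier_confidence: float  # best verifier confidence seen so far, in [0, 1]
    tokens_spent: int
    token_budget: int


@dataclass(frozen=True)
class Decision:
    action: Action
    n_parallel: int = 0  # number of parallel trajectories when expanding


def heuristic_controller(
    state: ControllerState, target_confidence: float = 0.9, max_parallel: int = 8
) -> Decision:
    """Placeholder rules standing in for a learned compute allocation policy."""
    if state.verifier_confidence >= target_confidence:
        return Decision(Action.STOP)
    if state.token_budget - state.tokens_spent <= 0:
        return Decision(Action.STOP)
    # Hard query, at least half the budget gone, and still little confidence.
    stalled = (
        state.tokens_spent * 2 >= state.token_budget
        and state.verifier_confidence < 0.3
    )
    if state.difficulty > 0.8 and stalled:
        return Decision(Action.ESCALATE)
    n = 1 + int((max_parallel - 1) * state.difficulty)
    return Decision(Action.EXPAND, n_parallel=n)


print(heuristic_controller(ControllerState(0.5, 0.2, 0, 1000)))
```

Keeping the state and decision types explicit is the design choice worth noting: it lets the rule-based placeholder be swapped for a trained policy without changing the serving code that calls it.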

Verifier-guided search beyond best-of-N

The search algorithms that operate on top of policy and verifier are themselves evolving past the best-of-N baseline. Best-of-N is a degenerate case in which the search is uniform parallel sampling followed by terminal verification. The algorithms that are replacing it include beam search with process reward model scoring at each step, tree search with progressive widening based on verifier confidence, and Monte Carlo tree search variants adapted for the language modelling setting. Each of these algorithms exposes a different exchange rate between compute and capability, and the choice between them is increasingly a function of the query type rather than a fixed system-wide decision.
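As one concrete instance, beam search with step-level scoring can be sketched in a few lines. The callables `propose`, `score_prefix`, and `is_terminal` are hypothetical interfaces to the policy, the process reward model, and a stopping rule, and the toy task (reach a target sum) is invented so the example runs without a model.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = Tuple[str, ...]


def prm_beam_search(
    propose: Callable[[Trajectory], Sequence[str]],
    score_prefix: Callable[[Trajectory], float],
    is_terminal: Callable[[Trajectory], bool],
    beam_width: int = 2,
    max_depth: int = 8,
) -> Trajectory:
    """Keep the beam_width best partial trajectories at every step."""
    beam: List[Trajectory] = [()]
    finished: List[Trajectory] = []
    for _ in range(max_depth):
        candidates = [traj + (step,) for traj in beam for step in propose(traj)]
        if not candidates:
            break
        # The step-level scorer ranks partial trajectories; the rest are pruned.
        candidates.sort(key=score_prefix, reverse=True)
        beam = []
        for traj in candidates[:beam_width]:
            (finished if is_terminal(traj) else beam).append(traj)
        if not beam:
            break
    return max(finished or beam, key=score_prefix)


# Invented toy task: reach a total of 7 using steps of +1, +2, or +3.
TARGET = 7


def total(traj: Trajectory) -> int:
    return sum(int(step) for step in traj)


best = prm_beam_search(
    propose=lambda traj: ["+1", "+2", "+3"],
    score_prefix=lambda traj: -abs(TARGET - total(traj)),  # stand-in for a PRM
    is_terminal=lambda traj: total(traj) >= TARGET,
)
print(best, total(best))
```

Best-of-N corresponds to never scoring a prefix and ranking only complete trajectories; the beam variant spends its budget on the prefixes the scorer prefers.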

The deeper architectural implication is that the inference engine itself becomes a learned system rather than a deterministic decoding loop. The search policy, the expansion criteria, the pruning thresholds, and the termination conditions are all parameters that can be optimised through training, and there is no longer a clean separation between the model and the algorithm that runs it. This is the architectural reality that production systems will need to absorb, and KriraAI's applied research practice has been organised around exactly this convergence, building inference stacks that treat the search policy and the model as a single optimisation target.

How Reasoning Models Will Be Trained in the Next Eighteen Months

The training regime for reasoning models in the next eighteen months will be dominated by closed-loop systems in which the policy and verifier improve each other through synthetic data generation. This is a substantive departure from the human-feedback-centric loops that defined the previous generation of post-training, and it is driven by both economic and technical pressures. The economic pressure is that human annotation cannot scale to the data volumes required to train competitive reasoning models in the relevant domains. The technical pressure is that human annotation is also not the highest-quality signal available, since for many reasoning tasks an automated verifier can provide more accurate and more granular feedback than a human annotator.

The closed loop between policy and verifier

The closed-loop training regime works as follows. A policy model generates reasoning trajectories on a distribution of problems. A verifier scores those trajectories at the trajectory level, the step level, or both. High-scoring trajectories are kept as positive training data for the policy. Trajectories on which the verifier's score disagrees with outcome correctness are kept as corrective training data for the verifier. Both models are updated, and the loop iterates. The interesting technical question is the stability of this loop under repeated iteration, and the published evidence suggests that it is stable for several rounds of iteration before reward hacking or distribution narrowing degrades quality.
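The data-collection half of one loop iteration can be sketched as below. The callables `generate`, `verifier_score`, and `check_outcome` are hypothetical interfaces, and the model updates themselves are omitted. One assumption goes beyond the description above: this sketch also drops high-scoring trajectories whose outcome is known to be wrong, so that verifier false positives do not enter the policy's training set.

```python
from typing import Any, Callable, Iterable, List, Tuple


def closed_loop_round(
    problems: Iterable[Any],
    generate: Callable[[Any], Iterable[Any]],
    verifier_score: Callable[[Any, Any], float],
    check_outcome: Callable[[Any, Any], bool],
    accept_threshold: float = 0.5,
) -> Tuple[List[tuple], List[tuple]]:
    """Collect one round of training data for the policy and for the verifier."""
    policy_data: List[tuple] = []
    verifier_data: List[tuple] = []
    for problem in problems:
        for traj in generate(problem):
            accepted = verifier_score(problem, traj) >= accept_threshold
            correct = check_outcome(problem, traj)
            if accepted and correct:
                policy_data.append((problem, traj))
            if accepted != correct:
                # Verifier and outcome disagree: a labelled example for the verifier.
                verifier_data.append((problem, traj, correct))
    return policy_data, verifier_data


# Invented toy setup: problems are (a, b) pairs, trajectories are candidate sums,
# and the verifier has a blind spot (it accepts any answer of at least 5).
policy_data, verifier_data = closed_loop_round(
    problems=[(2, 3)],
    generate=lambda problem: [5, 6, 4],
    verifier_score=lambda problem, traj: 0.9 if traj >= 5 else 0.1,
    check_outcome=lambda problem, traj: traj == sum(problem),
)
print(policy_data, verifier_data)
```

Tracking the size of `verifier_data` across rounds is one cheap signal for the stability question: a growing share of disagreements suggests the policy is learning to exploit the verifier.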

The next twelve months will see the first production deployments of policy-verifier closed loops that generate their own training data at scale, with measured compute efficiency gains over single-loop reinforcement learning of two to five times on hard reasoning benchmarks. This will change which capabilities are practically attainable for which organisations, because the closed-loop regime substantially reduces the data acquisition cost of reaching frontier reasoning capability.

Synthetic data trajectories generated under verifier supervision

The second major training methodology is the use of verifier-guided synthetic data generation for domains beyond reasoning. The pattern generalises. Where a verifier exists, whether a code execution sandbox, a formal proof checker, a simulation environment, or a learned process reward model, synthetic data can be generated under verifier supervision and filtered to a high-quality subset that is then used for supervised fine-tuning or preference optimisation. This is the data flywheel that several frontier labs are building, and it is the reason the apparent data wall has not constrained capability growth as severely as the 2023 projections suggested it would.

The engineering reality of running these flywheels at scale involves substantial infrastructure that does not exist as a commodity. The verifier services need to handle high request volumes at low latency. The data generated needs to be deduplicated, decontaminated against evaluation sets, and filtered for distribution coverage. The training runs need to be checkpointed in ways that allow rollback if reward hacking is detected. KriraAI has built production-grade synthetic data pipelines under verifier supervision for client deployments where the domain admits an external verifier, and the engineering lessons from that work consistently show that the verifier infrastructure is the bottleneck before the training compute is.
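The filtering stage described above can be sketched as a single pass. This is an assumption-laden illustration, not a description of any production pipeline: `verifier` is a hypothetical callable, and both deduplication and decontamination use exact matching on normalised text, where a real pipeline would likely need fuzzier matching such as n-gram overlap.

```python
import hashlib
from typing import Callable, Iterable, List, Set, Tuple


def fingerprint(text: str) -> str:
    """Hash of case-folded, whitespace-normalised text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_synthetic(
    candidates: Iterable[Tuple[str, str]],
    verifier: Callable[[str, str], bool],
    eval_prompts: Iterable[str],
) -> List[Tuple[str, str]]:
    """Keep verified, non-duplicate samples whose prompts are not in the eval set."""
    eval_fps = {fingerprint(p) for p in eval_prompts}
    seen: Set[Tuple[str, str]] = set()
    kept: List[Tuple[str, str]] = []
    for prompt, answer in candidates:
        prompt_fp = fingerprint(prompt)
        if prompt_fp in eval_fps:
            continue  # decontamination against evaluation sets
        key = (prompt_fp, fingerprint(answer))
        if key in seen:
            continue  # deduplication
        if not verifier(prompt, answer):
            continue  # verifier rejects the sample
        seen.add(key)
        kept.append((prompt, answer))
    return kept


# Invented data and a lookup-table verifier, only to make the sketch runnable.
truth = {"what is 2+2?": "4", "what is 3+3?": "6"}
kept = filter_synthetic(
    candidates=[
        ("What is 2+2?", "4"),
        ("what is  2+2?", "4"),   # duplicate after normalisation
        ("What is 2+2?", "5"),    # fails verification
        ("What is 3+3?", "6"),    # appears in the evaluation set
    ],
    verifier=lambda p, a: truth.get(" ".join(p.lower().split())) == a,
    eval_prompts=["What is 3+3?"],
)
print(kept)
```

The cheap checks run before the verifier call on purpose, since the verifier is the expensive step and, as noted above, tends to be the bottleneck.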

The Production Engineering Implications of Inference-Time Compute Scaling

This section has the most direct relevance to teams shipping AI systems today. The production engineering implications of inference-time compute scaling cluster into three areas. Serving infrastructure becomes responsible for variable-cost inference. Cost models become two-dimensional in latency and quality rather than one-dimensional in throughput. Capacity planning becomes a function of query mix rather than query count.

Variable latency serving and SLO design

The first implication is that the latency SLOs that govern production AI systems will need to be redesigned around variable per-query inference cost. The fixed-latency assumption that allowed simple percentile-based SLOs to function will not hold for reasoning models with adaptive compute allocation. A query that can be answered confidently with a single forward pass will be served in milliseconds. A query that requires deliberative reasoning with verifier-guided search will be served in seconds or longer. The same model, the same endpoint, and the same client may produce both, and the SLO framework needs to accommodate this.

The emerging pattern is tiered SLOs in which the client specifies a quality target or a compute budget, and the serving system commits to a latency distribution conditional on that target. Within thirty-six months, this tiered SLO pattern will be the dominant interface for reasoning model serving, replacing the flat per-query latency SLO that governs current systems. Practitioners who design new AI infrastructure should not be building against the flat model, because the systems they will need to interoperate with in two years will not be flat.
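A tiered contract of this kind might look like the sketch below. The tier names, token budgets, latency commitments, and confidence floors are invented numbers for illustration, and `select_tier` is a hypothetical helper rather than any existing serving API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    max_tokens: int         # compute budget the tier may spend on one query
    p95_latency_s: float    # latency the server commits to for this tier
    min_confidence: float   # verifier confidence the tier aims to reach


# Hypothetical tiers, ordered from cheapest to most expensive.
TIERS = (
    Tier("fast", 512, 0.5, 0.0),
    Tier("standard", 8_192, 10.0, 0.7),
    Tier("deliberate", 131_072, 120.0, 0.9),
)


def select_tier(
    quality_target: Optional[float] = None,
    compute_budget: Optional[int] = None,
    tiers: Sequence[Tier] = TIERS,
) -> Tier:
    """Map a client's quality target and/or compute budget to a serving tier."""
    if quality_target is None and compute_budget is None:
        return tiers[0]
    eligible = [
        t for t in tiers if compute_budget is None or t.max_tokens <= compute_budget
    ]
    if not eligible:
        raise ValueError("compute budget is below the cheapest tier")
    if quality_target is None:
        return eligible[-1]  # spend as much of the budget as the tiers allow
    for tier in eligible:
        if tier.min_confidence >= quality_target:
            return tier  # cheapest tier that meets the quality target
    raise ValueError("no tier meets the quality target within the budget")


print(select_tier(quality_target=0.8))
```

The latency commitment is attached to the tier rather than to the endpoint, which is what allows one endpoint to serve both millisecond and multi-second queries under a coherent SLO.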

KV cache strategies for tree-structured inference

The second implication is that KV cache management becomes substantially more complex under tree-structured inference. Linear autoregressive decoding admits a simple KV cache strategy. Each query has a single growing cache that is discarded at the end of the query. Tree-structured inference, in which multiple candidate trajectories are explored in parallel and pruned based on verifier feedback, requires a cache strategy that supports sharing the cache across branches that share a prefix, allocating new cache memory for branches that diverge, and reclaiming cache memory when branches are pruned. Within thirty-six months, KV cache management for tree-structured inference will be a first-class concern in serving stacks, on par with batching and quantization as a determinant of throughput economics.

The naive implementation of tree-structured inference, in which each branch allocates its own full cache, is so memory-inefficient that it makes the entire approach uneconomical. The systems that will win in production are those that implement copy-on-write KV cache structures with explicit sharing across branches, and there is substantial engineering work to be done to make this efficient on current accelerator hardware. The serving frameworks that ship this capability first will capture a meaningful share of the production reasoning model deployment market.
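The sharing, divergence, and reclamation requirements can be sketched with reference-counted blocks. This is a simplified illustration under stated assumptions: lists of token ids stand in for key and value tensors, the block size is arbitrary, and the class and method names are hypothetical rather than taken from any serving framework.

```python
from typing import Dict, List, Optional, Sequence


class BlockPool:
    """Fixed-size cache blocks with reference counts."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.blocks: Dict[int, List[int]] = {}  # block id -> token ids (K/V stand-in)
        self.refcount: Dict[int, int] = {}
        self._next_id = 0

    def alloc(self, tokens: Sequence[int] = ()) -> int:
        bid = self._next_id
        self._next_id += 1
        self.blocks[bid] = list(tokens)
        self.refcount[bid] = 1
        return bid

    def retain(self, bid: int) -> None:
        self.refcount[bid] += 1

    def release(self, bid: int) -> None:
        self.refcount[bid] -= 1
        if self.refcount[bid] == 0:  # last owner gone: reclaim the memory
            del self.blocks[bid]
            del self.refcount[bid]


class Branch:
    """One trajectory in the search tree, stored as a list of block ids."""

    def __init__(self, pool: BlockPool, block_ids: Optional[Sequence[int]] = None) -> None:
        self.pool = pool
        self.block_ids: List[int] = list(block_ids or [])

    def fork(self) -> "Branch":
        """Share every existing block with the child; nothing is copied."""
        for bid in self.block_ids:
            self.pool.retain(bid)
        return Branch(self.pool, self.block_ids)

    def append(self, token: int) -> None:
        pool = self.pool
        if not self.block_ids or len(pool.blocks[self.block_ids[-1]]) == pool.block_size:
            self.block_ids.append(pool.alloc())
        elif pool.refcount[self.block_ids[-1]] > 1:
            # Copy-on-write: the partially filled tail block is shared, so copy it.
            old = self.block_ids[-1]
            self.block_ids[-1] = pool.alloc(pool.blocks[old])
            pool.release(old)
        pool.blocks[self.block_ids[-1]].append(token)

    def prune(self) -> None:
        for bid in self.block_ids:
            self.pool.release(bid)
        self.block_ids = []

    def tokens(self) -> List[int]:
        return [t for bid in self.block_ids for t in self.pool.blocks[bid]]


pool = BlockPool(block_size=4)
root = Branch(pool)
for tok in range(1, 7):
    root.append(tok)
child = root.fork()
child.append(7)
print(len(pool.blocks), root.tokens(), child.tokens())
```

After the fork and one divergent token, the full prefix block is still stored once; only the partially filled tail block was duplicated, and pruning the child returns that copy to the pool.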

Cost models for adaptive compute pricing

The third implication is that the cost models for AI services will need to fragment along the compute axis. Flat per-token pricing is an artifact of fixed-compute inference and does not match the underlying cost structure of adaptive inference. The pricing models that are emerging include per-trajectory pricing with quality tiers, per-confidence-threshold pricing in which the client pays more for higher verifier confidence on the returned answer, and budget-based pricing in which the client allocates a compute budget per query and the system returns the best answer it can produce within that budget. Within two years, inference-time compute pricing will fragment into tiered offerings based on verifier confidence and compute budget, replacing the flat per-token model that dominates current API pricing.
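Budget-based pricing implies an anytime serving loop: keep the best answer found so far and stop when the budget or the confidence target is reached. The sketch below assumes a hypothetical stream of `(answer, confidence, token_cost)` attempts produced by the reasoning system; the billable quantity is the tokens actually spent, not a flat per-token output count.

```python
from typing import Iterable, Optional, Tuple

Attempt = Tuple[str, float, int]  # (answer, verifier confidence, token cost)


def answer_within_budget(
    attempts: Iterable[Attempt],
    budget_tokens: int,
    target_confidence: Optional[float] = None,
) -> Tuple[Optional[Tuple[str, float]], int]:
    """Return the best (answer, confidence) found and the tokens actually spent."""
    best: Optional[Tuple[str, float]] = None
    spent = 0
    for answer, confidence, cost in attempts:
        if spent + cost > budget_tokens:
            break  # the next attempt would exceed the client's budget
        spent += cost
        if best is None or confidence > best[1]:
            best = (answer, confidence)
        if target_confidence is not None and confidence >= target_confidence:
            break  # confidence-threshold contract satisfied; stop spending
    return best, spent


# Invented attempts with increasing cost and confidence.
attempts = [("a", 0.4, 100), ("b", 0.7, 300), ("c", 0.95, 600)]
print(answer_within_budget(attempts, budget_tokens=500))
```

The same loop covers two of the pricing models above: with only `budget_tokens` set it implements budget-based pricing, and with `target_confidence` set it implements a per-confidence-threshold contract capped by the budget.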

What Replaces Best-of-N Sampling

The comparison with current approaches is worth making explicit because best-of-N is the dominant baseline against which adaptive inference is currently measured, and understanding why it is being replaced clarifies what the replacements actually provide. Best-of-N is a parallel ensemble of independent samples scored by a terminal verifier. It is conceptually clean and easy to implement, and it provided a useful demonstration that verifier-guided selection improves quality. It is also wasteful, because it does not use any information about the partial trajectories during generation.

The replacement is verifier-guided search, in which the verifier scores partial trajectories during generation and the search policy uses those scores to allocate exploration. The compute savings are substantial. On reasoning benchmarks where best-of-N requires several hundred samples to reach a target accuracy, verifier-guided search with process reward models can reach the same accuracy with one to two orders of magnitude less compute, depending on the difficulty distribution of the queries. This is not a marginal improvement, and it changes which capabilities are economically viable at production scale.

The deeper observation is that best-of-N treats inference as a fixed-graph computation, while verifier-guided search treats inference as a learned algorithm. The implications cascade through the entire stack. Profiling tools designed for fixed-graph inference do not capture the relevant performance signals for learned-algorithm inference. Capacity planning tools that assume uniform per-query cost do not produce useful estimates. Monitoring systems that track latency distributions need to track quality-latency joint distributions instead. The transition from one paradigm to the other is the engineering work that will dominate AI infrastructure teams over the next two years.

The Open Problems That Will Define the Next Two Years

The trajectory described above is not without obstacles, and the obstacles are concrete enough to enumerate. The research community has identified the major open problems, and the relative progress on each will determine which deployment patterns reach scale first.

The first open problem is reward hacking under search pressure. Verifier-guided search amplifies any miscalibration in the verifier, because the search policy actively seeks trajectories that the verifier scores highly. If the verifier has systematic blind spots, the search will find them, and the resulting trajectories will score highly while being incorrect. Current research is addressing this through verifier ensembling, adversarial training of verifiers against search-found exploits, and explicit uncertainty quantification in verifier outputs. None of these approaches fully solves the problem, and the rate of progress on it is one of the determinants of how aggressive production deployment of verifier-guided search can be.
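One simple form of ensembling with uncertainty can be sketched as follows. The scoring rule, mean minus a multiple of the standard deviation across verifiers, is an illustrative choice rather than an established recipe, and the candidate names and scores are invented.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple


def conservative_score(
    scores: Sequence[float], risk_aversion: float = 1.0
) -> Tuple[float, float]:
    """Penalise candidates on which the verifiers disagree."""
    mu = mean(scores)
    sigma = pstdev(scores)  # disagreement across the ensemble
    return mu - risk_aversion * sigma, sigma


def pick_candidate(
    candidates: Dict[str, Sequence[float]], risk_aversion: float = 1.0
) -> str:
    """Select the candidate with the highest disagreement-penalised score."""
    return max(
        candidates,
        key=lambda name: conservative_score(candidates[name], risk_aversion)[0],
    )


# Invented scores from three verifiers. "exploit" fools only the first verifier.
candidates = {
    "exploit": [0.99, 0.20, 0.30],
    "solid": [0.80, 0.75, 0.85],
}
single_verifier_pick = max(candidates, key=lambda name: candidates[name][0])
ensemble_pick = pick_candidate(candidates)
print(single_verifier_pick, ensemble_pick)
```

Search against the first verifier alone selects the trajectory that exploits its blind spot, while the penalised ensemble score prefers the candidate all three verifiers agree on. This only helps when the verifiers' blind spots are not shared.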

The second open problem is distribution shift between training trajectories and inference-time exploration. The verifier is trained on a distribution of reasoning trajectories that the policy produces during training. At inference time, search policies explore a distribution that may differ substantially from the training distribution, and the verifier's calibration on this shifted distribution is not guaranteed. The technical approaches to this problem include online verifier updates, distributionally robust verifier training, and explicit out-of-distribution detection at inference time.

The third open problem is the calibration of verifier confidence across heterogeneous problem types. A verifier that is well-calibrated on mathematical reasoning may be poorly calibrated on code generation, and a single production system serves both. The approaches under exploration include problem-type-conditional verifiers, mixture-of-verifiers architectures with learned routing, and meta-calibration layers that adjust raw verifier scores based on problem features. The engineering implication is that production reasoning systems will eventually include a verifier orchestration layer that is distinct from any individual verifier, and KriraAI's research on multi-verifier orchestration for enterprise deployment has been organised around this expectation.

The fourth open problem is the stability of closed-loop training over many iterations. Several iterations of policy-verifier improvement have been shown to be stable. Many iterations have not, and the failure modes include mode collapse, reward hacking, and distribution narrowing toward easy problems. The research approaches include explicit diversity regularisation, curriculum design over problem difficulty, and periodic re-anchoring against human-labelled data. The pace of progress on this problem determines how much capability can be extracted from a fixed compute budget through closed-loop iteration.

What Practitioners Should Build For Now

The forward-looking analysis above has direct implications for engineering decisions being made today. The decisions that will look correct in retrospect are not necessarily the decisions that minimise cost against current workloads, because current workloads will look substantially different in eighteen months. The following preparation steps are concrete and ordered by leverage.

  1. Redesign serving infrastructure around variable per-query compute budgets rather than fixed latency targets. The serving stack should accept a quality target or compute budget as a per-query parameter, and the routing, batching, and caching layers should be aware of this parameter when making scheduling decisions. Teams that defer this redesign will find their infrastructure unable to serve the next generation of reasoning models efficiently.

  2. Implement KV cache sharing across branches in any serving stack that will run tree-structured inference. The naive per-branch cache approach is so wasteful that it makes verifier-guided search uneconomical at production scale. The engineering work to implement copy-on-write KV cache structures is non-trivial but well-defined, and the payoff is substantial.

  3. Build verifier infrastructure as a first-class production service rather than an afterthought attached to training pipelines. Verifiers will be invoked at inference time at request volumes that match or exceed the policy model, and they need to be deployed with the same operational maturity as the policy. This includes monitoring, versioning, canarying, and rollback capabilities.

  4. Instrument quality-latency joint distributions in production monitoring rather than tracking only latency percentiles. The relevant performance signal for adaptive inference is the joint distribution of quality outcomes and latency outcomes conditional on query difficulty, and monitoring stacks that only track latency miss the actual operating regime of the system.

  5. Develop internal expertise on process reward models, learned compute controllers, and verifier-guided search algorithms, because the architectural sophistication required to deploy these systems exceeds what is available in current off-the-shelf inference frameworks. Teams that develop this expertise early will be able to extract substantially more capability per compute dollar than teams that defer.
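As an illustration of step 4, the joint distribution can be tracked with a small amount of bookkeeping. The record format, the difficulty labels, and the latency bin edges are assumptions made for this sketch; in practice the difficulty estimate and the quality signal would come from the controller and the verifier.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Record = Tuple[str, float, bool]  # (difficulty bucket, latency in seconds, correct)


def joint_summary(
    records: Iterable[Record], latency_edges: Sequence[float] = (1.0, 10.0)
) -> Dict[Tuple[str, int], Dict[str, float]]:
    """Count and accuracy per (difficulty bucket, latency bin) cell."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [count, number correct]
    for bucket, latency, correct in records:
        # Bin 0 is below the first edge, bin 1 between the edges, and so on.
        latency_bin = bisect_right(latency_edges, latency)
        cell = cells[(bucket, latency_bin)]
        cell[0] += 1
        cell[1] += int(correct)
    return {
        key: {"count": count, "accuracy": n_correct / count}
        for key, (count, n_correct) in cells.items()
    }


# Invented traffic sample.
records = [
    ("easy", 0.2, True),
    ("easy", 0.3, True),
    ("hard", 0.5, False),
    ("hard", 12.0, True),
    ("hard", 15.0, True),
]
for cell, stats in sorted(joint_summary(records).items()):
    print(cell, stats)
```

In this invented sample a latency percentile alone would hide the pattern that matters: hard queries answered quickly are wrong, and hard queries that receive more compute are right.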

The common thread across these preparation steps is that the unit of optimisation is shifting from the model to the composite system of policy, verifier, controller, and search algorithm. Teams that continue to optimise individual model performance against fixed inference budgets will find their capability ceiling determined by a paradigm that is no longer competitive.

The Capability Frontier This Opens

The three most important technical implications of inference-time compute scaling, taken together, define the shape of the next generation of AI systems. The first implication is that foundation model architecture will become a composite of policy, verifier, controller, and search algorithm, with each component trained and deployed as a first-class system, and the unit of capability will be the composite rather than any single model. The second implication is that production AI serving infrastructure will need to be rebuilt around variable per-query compute, tiered quality-latency contracts, and KV cache strategies adapted to tree-structured inference, and the rebuild is engineering work that takes months to do well. The third implication is that the capability frontier accessible at a given compute budget will expand substantially through closed-loop policy-verifier training and verifier-guided inference, and teams that adopt these paradigms early will operate at a higher capability frontier than teams that defer.

The engineering decisions that follow from these implications need to be made now, not in eighteen months when the transition is already underway. The infrastructure rebuilds, the verifier service deployments, the monitoring system redesigns, the cost model migrations, and the internal expertise development are all multi-quarter efforts that compound over time, and the teams that start them in 2026 will be operating at a different capability and cost frontier in 2028 than the teams that start them in 2027.

KriraAI operates at the intersection of applied AI research and production deployment, building systems that are designed for where the technology is heading rather than where it is today. The work we do for enterprise clients on reasoning model architectures, verifier infrastructure, and adaptive inference serving is organised around the conviction that the fixed-compute inference paradigm is already obsolete in research and will be obsolete in production within twenty-four months. The teams that recognise this and begin the architectural transition early will operate with a structural advantage that compounds across product cycles. KriraAI's applied research practice and production deployment teams are available to technical leaders who want to engage seriously with these emerging capabilities and the engineering decisions they require, and we welcome conversations with engineering organisations that are thinking carefully about the shape of the next generation of AI systems.

FAQs

Inference-time compute scaling moves cost from amortised training expense to per-query variable compute, forcing serving infrastructure to price latency-quality tradeoffs explicitly rather than treating inference as a flat-rate operation across query types and difficulty levels.

Outcome reward models score only the final answer of a reasoning trajectory, while process reward models score each reasoning step independently, providing dense supervision that enables verifier-guided search and substantially better sample efficiency during both training and inference operations.

Learned verifiers allow targeted exploration of promising reasoning paths rather than uniform parallel sampling, reducing the compute needed to find a correct solution by one to two orders of magnitude on hard reasoning tasks where verifier signal quality is high.

Inference-time compute will not replace pretraining but will complement it, with the optimal allocation shifting toward inference as pretraining returns continue to diminish and verifier-guided search continues to demonstrate favourable scaling exponents on the difficult capability benchmarks that matter most.

The hardest open problems are step-level label noise, reward hacking under search pressure, distribution shift between training trajectories and inference-time exploration policies, and calibration of verifier confidence across heterogeneous out-of-distribution problem types that production systems encounter.

Ridham Chovatiya is the COO at KriraAI, driving operational excellence and scalable AI solutions. He specialises in building high-performance teams and delivering impactful, customer-centric technology strategies.
