Private working draft. Please do not circulate or reproduce without permission.

Causal Inference for Shared Compute

Evaluating allocation policies in stateful inference systems.

This chapter develops a practical framework for evaluating routing, rate limits, cache policies, fleet allocation, and other decisions in AI inference systems where an intervention affects both an individual request and the shared capacity system around it.

It builds on my prior writing about state-dependent inference cost and fat-tailed demand, together with my practical work as a pricing and product data scientist. An earlier version of these ideas was discussed on Live with Tim O'Reilly.

The central claim is that inference interventions should usually be evaluated as allocation policies under shared system state, rather than as isolated user-level treatments.

01

Shared Compute Is Not a Standard A/B Test

Inference interventions are dynamic allocation policies whose effects propagate through live system state.

In shared compute, the "treatment" is often a policy rather than a single static action. A router, rate limit, cache policy, or capacity allocation rule is dynamic: it chooses an action based on the request-level workload and the live system state.

Consider a running example. A router sends latency-sensitive, high-value requests to a premium fleet when capacity is available, then falls back to a cheaper fleet during congestion.
A = π(X, S)
X = pre-decision workload and user information
S = live decision-time system state: queue, cache, region, capacity, incidents, and recent demand

In the router example, premium-fleet assignment is not simply "treated." The same request may be routed to premium capacity when the queue is shallow and sent to a cheaper fleet when the system is congested. The estimand must attach to the allocation rule, not merely to the serving path that happened to execute.
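
A minimal sketch of the router example as a function of X and S. The field names (`latency_sensitive`, `high_value`, `premium_queue_depth`, `premium_capacity`) and the simple capacity check are illustrative assumptions, not a description of any production router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """X: pre-decision workload and user information (hypothetical fields)."""
    latency_sensitive: bool
    high_value: bool


@dataclass(frozen=True)
class SystemState:
    """S: live decision-time system state (hypothetical fields)."""
    premium_queue_depth: int
    premium_capacity: int


def route(x: Request, s: SystemState) -> str:
    """A = pi(X, S): pick a fleet from request features and live state."""
    premium_available = s.premium_queue_depth < s.premium_capacity
    if x.latency_sensitive and x.high_value and premium_available:
        return "premium"
    return "cheap"


req = Request(latency_sensitive=True, high_value=True)
# Same request, different state, different executed action.
print(route(req, SystemState(premium_queue_depth=2, premium_capacity=10)))   # premium
print(route(req, SystemState(premium_queue_depth=10, premium_capacity=10)))  # cheap
```

The same request receives two different serving paths, which is why the quantity being estimated has to be the value of `route` as a rule, not the effect of the label "premium."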

System State Is Both Confounder and Outcome

Shared inference systems make the timing of state measurement central. At time t, system state is an input to the current routing, admission, or fallback decision. At time t + 1, that same class of state variables can be downstream of earlier policy decisions.

A_t = π(X_t, S_t)
S_{t+1} = f(S_t, A_t, W_t)
W_t = arrivals, failures, provider events, cache eviction, and other external disturbances

This loop is what makes shared compute different from a standard product A/B test. Queue depth before a routing decision can be a confounder. Queue depth after the policy has run can be part of the policy's effect. When an experiment changes the queue, cache, or fleet state that later requests experience, the unit-level treatment label is no longer the whole story.
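
The feedback loop can be made concrete with a toy simulation. This is a sketch under strong assumptions: a single premium queue, a fixed service rate, and a threshold policy. The function name `simulate` and its parameters are hypothetical. Requests not admitted to the premium queue are assumed to go to a cheaper fleet that is not modeled.

```python
def simulate(threshold, arrivals, service_per_step=3):
    """Iterate S_{t+1} = f(S_t, A_t, W_t) for a toy premium queue.

    threshold: policy parameter; admit to premium while depth < threshold.
    arrivals:  W_t, the number of eligible requests arriving in each step.
    Returns the queue depth observed at each decision time (S_t).
    """
    queue = 0
    depths = []
    for w in arrivals:
        depths.append(queue)                    # S_t, measured before deciding
        admitted = 0
        for _ in range(w):
            if queue + admitted < threshold:    # A_t = pi(X_t, S_t)
                admitted += 1
        # S_{t+1}: carry over backlog after this step's service
        queue = max(0, queue + admitted - service_per_step)
    return depths


arrivals = [5, 5, 5, 5]              # identical external demand for both policies
print(simulate(4, arrivals))         # [0, 1, 1, 1]
print(simulate(8, arrivals))         # [0, 2, 4, 5]
```

Demand is identical in both runs, yet the decision-time state that later requests see differs by policy. Adjusting for that later state as if it were a pre-treatment covariate would remove part of the policy's effect.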

Fat-Tailed Costs and Capacity Risk

For AI infrastructure and inference products, a plain difference-in-means readout can be fragile when the metric is tokens, cost, or GPU-hours per user. All of these outcomes can be dominated by a small number of very large users, long-context workflows, or high-concurrency windows.

A note on vantage point. This section extends my earlier analysis of why fat-tailed costs emerge at scale, which was written from outside: I do not operate these systems, and the distributional claims are reasoned from serving architecture and public information rather than from operator telemetry. Treat the shape claims as hypotheses to check against your own logs. The actual distribution of tokens per request, and how strongly cost depends on system state, are empirical questions only an operator can settle.

Tokens per request plausibly follow something like a truncated power law. The mean is finite because the context window bounds every request, but the variance is large and likely growing as use cases become more heterogeneous and context windows expand with each model release. If that shape is roughly right, the distribution is estimable: the mean exists, it just converges slowly, and a handful of long-context or agentic workflows can dominate any short experiment.
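
To see what slow convergence looks like, here is a small simulation sketch. Everything numeric in it is an assumption chosen for illustration: a Pareto shape of 1.1, a lower bound of 100 tokens, and an upper bound of 200,000 tokens. None of these values come from operator data.

```python
import random
import statistics


def truncated_pareto_quantile(u, alpha, lo, hi):
    """Inverse CDF of a Pareto(alpha) distribution truncated to [lo, hi]."""
    tail_mass = 1.0 - (lo / hi) ** alpha
    return lo * (1.0 - u * tail_mass) ** (-1.0 / alpha)


def experiment_means(n_requests, n_experiments,
                     alpha=1.1, lo=100.0, hi=200_000.0, seed=0):
    """Mean tokens per request in repeated simulated experiments."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_experiments):
        draws = [truncated_pareto_quantile(rng.random(), alpha, lo, hi)
                 for _ in range(n_requests)]
        means.append(statistics.fmean(draws))
    return means


# Compare how much the experiment-level mean moves at two sample sizes.
for n in (100, 10_000):
    means = experiment_means(n, n_experiments=20)
    print(n, round(min(means)), round(max(means)))
```

If the assumed shape is roughly right, the spread between the smallest and largest experiment mean should narrow only gradually as the number of requests grows, because a few very long requests carry much of the total.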

True resource cost per request is harder. It depends on live system state: batch concurrency, KV-cache pressure, prefill/decode balance, GPU efficiency, and temporal demand. The effective capacity boundary moves in a way the context-window limit does not, so the conditional mean of resource cost may not settle inside the experiment window.

Tokens per request: bounded support, finite mean, fat tails → mean estimates converge slowly
Resource cost per request: cost = f(request, peak system state), state non-stationary → conditional mean may not converge
Implication: use robust estimation for token cost; treat resource cost as a capacity-state outcome, not a per-request scalar

KV-cache memory scales with sequence length and batch size, so unrelated long contexts arriving concurrently can push a fleet past its memory boundary even when every individual request looks ordinary. The tail event is not only "one expensive user." It can be an overcommitted fleet, cache-pressure spirals, elevated fallback rates, or a crash. A rare peak-concurrency window can erase the margin earned on the other requests served in that window.

Trimming or winsorizing tail users stabilizes a cross-sectional estimate, but it says nothing about behavior in the tail states that determine crash risk. Safety in tail states has to be measured directly, not inferred from a central-tendency readout.
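
A sketch of the distinction, using made-up numbers. `winsorized_mean` stabilizes the per-user cost readout, while `exceedance_rate` measures the tail state directly. Both function names, the cost values, and the utilization limit of 1.0 are hypothetical.

```python
def winsorized_mean(values, top_k=1):
    """Mean after capping the top_k largest values at the next-largest value."""
    ordered = sorted(values)
    cap = ordered[-(top_k + 1)]
    return sum(min(v, cap) for v in values) / len(values)


def exceedance_rate(window_peaks, limit):
    """Share of time windows whose peak utilization exceeded the limit."""
    return sum(p > limit for p in window_peaks) / len(window_peaks)


cost_per_user = [10, 12, 11, 9, 10, 500]            # one very large user
print(sum(cost_per_user) / len(cost_per_user))      # 92.0
print(round(winsorized_mean(cost_per_user), 2))     # 10.67

peak_memory_utilization = [0.60, 0.70, 0.65, 1.05]  # one overcommitted window
print(exceedance_rate(peak_memory_utilization, limit=1.0))  # 0.25
```

The winsorized mean looks calm, but it carries no information about the one window in four that crossed the memory boundary. The two numbers answer different questions and should be reported side by side.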

02

Name the Decision First

Start with the action the readout must support, then choose the comparison that frames the answer.

A useful discipline is to ask what you would measure if the experiment were perfect. If every request could be randomly routed to a performant or standard cluster with no interference, no capacity limits, and full observability, what would that experiment tell you? That line of questioning exposes the causal target and the measurement constraints, so the design can approximate the ideal experiment rather than merely summarize the logs.

  1. Name the action pattern. Is the decision to ship broadly, target a segment, change a routing rule, reserve capacity, keep a holdout, pause, roll back, or instrument before acting?
  2. Choose the comparison. The counterfactual might be the old policy, untreated users, a different cluster, a prior model checkpoint, a threshold-adjacent group, or a different time block.
  3. Frame the answer. Broad rollout, treated-population impact, targeting, and policy value answer different operational questions.
  4. Use unit economics as a guardrail. A policy can improve task success while burning too much compute. Pair retention or task completion with latency, errors, GPU-hours, and cost per successful request.
  5. End with a rule. Ship, ramp, target, hold out, pause, roll back, or instrument more. The readout should state the recommendation and the tradeoff that would change it.

Decision Pattern Cheat Sheet

| Decision pattern | Comparison frame | Potential tradeoff |
| --- | --- | --- |
| Ship broadly | Average Treatment Effect (ATE): what happens across the eligible population? | Can hide that only one workload or customer segment benefits enough to justify the cost. |
| Explain treated traffic | Average Treatment Effect on the Treated (ATT): what happened to users or requests that actually received treatment? | Can be less useful for future rollout if treated users were unusually selected. |
| Target scarce compute | Conditional Average Treatment Effect (CATE): who benefits enough to receive higher limits or better routing? | Can overfit unless the segment is stable, interpretable, and operationally usable. |
| Change the rule | Policy value: what is the expected outcome if the system runs rule π? | Needs support in the logs, system guardrails, and a clear cost/value objective. |

Policy value, and why it is not ATE or CATE: ATE and CATE are properties of a treatment: what does action A do on average, or for segment x? Policy value is a property of a decision rule: what happens if the system runs rule π, where π looks at context and chooses an action. A dynamic rule only treats some units, in some states, and its value depends on how often those contexts occur and on operational constraints.
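
A small worked example may help separate the three quantities. The segments, traffic shares, and success rates below are invented for illustration, and the calculation assumes no interference between requests, which is exactly the assumption shared compute tends to violate.

```python
# Hypothetical mean task-success rates by segment and serving path.
segments = {
    "interactive": {"share": 0.6, "y_standard": 0.70, "y_premium": 0.72},
    "agentic":     {"share": 0.4, "y_standard": 0.50, "y_premium": 0.70},
}


def cate(name):
    """Effect of premium routing for one segment."""
    s = segments[name]
    return s["y_premium"] - s["y_standard"]


def ate():
    """Share-weighted average of the segment effects."""
    return sum(s["share"] * (s["y_premium"] - s["y_standard"])
               for s in segments.values())


def policy_value(rule):
    """Expected outcome if the system runs `rule`, a map from segment to action."""
    return sum(s["share"] * s["y_" + rule(name)] for name, s in segments.items())


print(round(cate("interactive"), 3), round(cate("agentic"), 3))   # 0.02 0.2
print(round(ate(), 3))                                            # 0.092
print(round(policy_value(lambda seg: "standard"), 3))             # 0.62
print(round(policy_value(lambda seg: "premium"), 3))              # 0.712
print(round(policy_value(
    lambda seg: "premium" if seg == "agentic" else "standard"), 3))  # 0.7
```

In this toy setup the targeted rule recovers most of the gain from treating everyone while sending only 40% of traffic to premium capacity. That comparison is a statement about rules, and neither the ATE nor any single CATE expresses it.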

03

Experimental Designs

Randomization solves confounding by design, but the unit has to match the shared system.

Experiments still start with the same question: what is the unit that receives the policy, and what other traffic changes because that unit was treated? If the policy does not change shared state, a user or request RCT may be enough. If it changes queue, cache, or fleet state, the experiment has to randomize a larger unit or time block.

User or Workspace RCT

Use for: rate limits, prices, model access, priority tiers, or any change where the same user should keep seeing the same experience.

Why: if the outcome is retention, expansion, or sustained usage, the randomization unit should match how the user actually experiences the product.

Potential downside: randomizing by user usually has less statistical power than randomizing by request, and takes longer to read.

Mitigation: use short-run system metrics as early checks, keep a holdout for long-run outcomes.

Request RCT

Use for: latency, error rate, cost per request, retry rate, or whether a route succeeds.

Why: request-level randomization gives much more statistical power, especially for high-volume traffic. It is a good fit for isolated serving changes where the main outcome is request-level.

Potential downside: it can be a bad user experience if the same workflow gets mixed treatment.

Mitigation: use it only when the change is invisible or low-risk to the user and does not materially alter shared state.

Cluster Randomization

Use for: changes to batching, caching, regional routing, or fleet allocation where the intervention changes a shared capacity pool.

Why: spillovers break user-level randomization. If treated traffic changes queue depth, cache warmth, or capacity pressure, control users in the same pool experience a different system. Randomizing by cluster contains most interference within the randomized unit.

Potential downside: there are fewer independent units, so statistical power is worse. Clusters can also differ by chip generation, geography, user mix, latency, customer value, or incident patterns.

Mitigation: stratify before randomization, check pre-period balance, analyze at the cluster level, and report system-level outcomes instead of only treated-user outcomes.
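
One way to stratify before randomizing, sketched with the standard library. The cluster names and stratum labels are hypothetical. In practice the strata would be whatever pre-period characteristics matter most, such as chip generation or region.

```python
import random
from collections import defaultdict


def stratified_assignment(clusters, seed=0):
    """Randomize clusters to arms within strata.

    clusters: iterable of (cluster_name, stratum_label) pairs.
    Returns {cluster_name: "treatment" | "control"} with arms balanced
    to within one cluster inside every stratum.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for name, stratum in clusters:
        by_stratum[stratum].append(name)

    assignment = {}
    for stratum in sorted(by_stratum):
        names = sorted(by_stratum[stratum])
        rng.shuffle(names)
        offset = rng.randrange(2)  # random arm gets the extra cluster if odd
        for i, name in enumerate(names):
            arm = "treatment" if (i + offset) % 2 == 0 else "control"
            assignment[name] = arm
    return assignment


clusters = [("c1", "gen_a_us"), ("c2", "gen_a_us"),
            ("c3", "gen_b_eu"), ("c4", "gen_b_eu"), ("c5", "gen_b_eu")]
print(stratified_assignment(clusters, seed=1))
```

The analysis should then treat the cluster, not the request, as the unit when computing uncertainty.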

Switchback Experiment

Use for: cache policy, global routing configuration, fleet allocation, scheduler settings, or any change that has to be turned on for a whole system at once.

Why: when the policy changes queue or cache state for everyone, compare blocks of time under policy A versus policy B. In this design, state after the policy runs is an outcome, not a pre-treatment nuisance variable.

Potential downside: demand changes over time. Hour-of-day, day-of-week, incidents, launches, and seasonal traffic can get mixed into the treatment effect. Carryover is real: cache warmth, queue backlog, retry storms, and provider throttling can persist across blocks.

Mitigation: randomize across comparable time blocks, balance treatment across weekends and weekdays, record incidents, and include washout when cache or queue persistence is material. Blocks must be long enough for the policy to take effect but short enough to avoid conflating treatment with demand cycles.
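
A sketch of one possible switchback schedule. It pairs consecutive days within each time-of-day slot and randomizes which day in the pair gets policy A, so every slot sees both policies equally often. The function name, the pairing rule, and the washout field are assumptions. Pairing consecutive days does not by itself balance weekdays against weekends, so the pairs would need to be formed from comparable days in a real design.

```python
import random


def switchback_schedule(n_days, blocks_per_day, washout_minutes, seed=0):
    """Assign policy A or B to every (day, slot) block.

    Within each slot, days are paired and the two policies are randomly
    ordered inside each pair. Each block records how many leading minutes
    to discard as washout for cache or queue carryover.
    """
    if n_days % 2:
        raise ValueError("n_days must be even so every slot is balanced")
    rng = random.Random(seed)
    schedule = []
    for slot in range(blocks_per_day):
        for first_day in range(0, n_days, 2):
            arms = ["A", "B"]
            rng.shuffle(arms)
            for day, arm in zip((first_day, first_day + 1), arms):
                schedule.append({
                    "day": day,
                    "slot": slot,
                    "policy": arm,
                    "discard_first_minutes": washout_minutes,
                })
    return sorted(schedule, key=lambda b: (b["day"], b["slot"]))


for block in switchback_schedule(n_days=4, blocks_per_day=3, washout_minutes=10):
    print(block)
```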

Design principle: For a shared-fleet intervention, do not pretend requests are independently treated. Use a cluster-level design or a blocked switchback when the policy changes the queue and cache state experienced by other traffic.

04

When You Can't Randomize

In logs, identification depends on assignment, adjustment, support, and the variables you refuse to control for.

Observational work should not oversell the estimator. The central question is whether the historical comparison has the right pre-decision variables, enough overlap, and a credible assignment story. Unlike an experiment, confounding is not solved by design. It is the problem.

Assignment Story for Logs

[Diagram: assignment story for logs]
X = workload / user, before decision
S = system state, before decision
A = π(X, S) = allocation action
Mechanism = latency, errors
Y = outcome

X and S are plausible adjustment variables when they are pre-decision common causes. Latency, errors, cache hit, completion, and retry behavior sit after the action, so they should be reported as mechanisms or outcomes unless the question is specifically about a direct effect.

Observational Estimators

| Method | When to use it | Identification risk |
| --- | --- | --- |
| Outcome regression | Good first pass when the main assignment variables are logged: model, account tier, region, time of day, queue depth, prompt length, and past usage. | A model that predicts retention well can still estimate the treatment effect badly if the assignment story is wrong. |
| Propensity weighting or matching | Useful when treated and untreated units overlap but had different probabilities of treatment. | Extreme weights or poor overlap mean the result depends on a few unusual records. Show overlap and trim if needed. |
| Doubly robust estimation | Stronger default when both assignment and outcome can be modeled. | It still cannot fix missing confounders. Lead with the identification assumption, not the estimator name. |
| Double ML | Useful with rich telemetry and many covariates where flexible models help with nuisance functions. | Better prediction is not better causality by itself. Explain cross-fitting plainly and still show diagnostics. |
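
For readers who want to see the doubly robust idea in code, here is a minimal sketch of the standard augmented inverse-propensity-weighted (AIPW) score. It assumes the propensities `e` and the outcome predictions `m1` and `m0` have already been produced by models of your choice, ideally cross-fitted. The function name and the `min_overlap` trimming threshold are hypothetical.

```python
def aipw_ate(y, a, e, m1, m0, min_overlap=0.05):
    """AIPW estimate of the average treatment effect.

    y:  observed outcomes
    a:  treatment indicators (1 = treated, 0 = untreated)
    e:  estimated P(A = 1 | X, S) for each unit
    m1: predicted outcome under treatment for each unit
    m0: predicted outcome under no treatment for each unit
    Units with propensity outside [min_overlap, 1 - min_overlap] are
    dropped, which narrows the estimand to the overlap population.
    Returns (estimate, number_of_units_used).
    """
    scores = []
    for yi, ai, ei, m1i, m0i in zip(y, a, e, m1, m0):
        if not (min_overlap <= ei <= 1.0 - min_overlap):
            continue
        score = (m1i - m0i
                 + ai * (yi - m1i) / ei
                 - (1 - ai) * (yi - m0i) / (1.0 - ei))
        scores.append(score)
    if not scores:
        raise ValueError("no units inside the overlap region")
    return sum(scores) / len(scores), len(scores)


y = [1, 0, 1, 0]
a = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]
# Correct outcome model:
print(aipw_ate(y, a, e, m1=[1, 1, 1, 1], m0=[0, 0, 0, 0]))  # (1.0, 4)
# Badly wrong outcome model, correct propensities:
print(aipw_ate(y, a, e, m1=[0, 0, 0, 0], m0=[0, 0, 0, 0]))  # (1.0, 4)
```

Both calls return the same answer because the propensities are right in each case, which is the "doubly robust" property in miniature. Neither call can help if a variable that drove assignment is missing from the logs.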

Bad Controls in Shared Compute Logs

| Variable | How to treat it | Why |
| --- | --- | --- |
| Queue depth before routing | Usually adjust for it. | It can drive both assignment and outcome. |
| Queue depth after routing | Report it as an outcome or mechanism metric. | It can be downstream of the policy and part of the effect on the shared system. |
| Latency after routing | Do not control for it in a total-effect estimate. | It is probably part of the mechanism. |
| Completed requests only | Avoid conditioning on this sample. | Treatment may affect completion, so failures disappear from the readout. |
| Untreated-user latency | Use as a spillover guardrail. | It shows whether treated traffic harmed the shared system. |

If the policy used a field that was not logged, no observational method fixes that. The recommendation should be instrumentation or an experiment, not a fancier model.

Quasi-Experimental Designs

These are strongest when the business already created quasi-random variation: a phased rollout, a threshold, a beta invite, or one region/fleet/provider changing before the rest. Pick the design based on the assignment story, then state the assumption that could fail.

| Design | Recommendation | Tradeoff to check |
| --- | --- | --- |
| Difference-in-differences | Use when capacity, routing, pricing, or provider changes roll out to some units before others. | Pre-trends have to look credible. Also check incidents, launches, or demand shifts at the same time. |
| Synthetic control | Use when one region, fleet, provider, or customer tier changes and there is a long pre-period. | A clean post-period gap is not enough. Show pre-period fit, placebo units, and donor weights. |
| Instrumental variables / LATE | Use when assignment nudges treatment but actual take-up is imperfect or self-selected. | The estimate is local to compliers, and the instrument has to affect the outcome through treatment only. |
| Regression discontinuity | Use when eligibility, limits, priority, or pricing changes sharply at a known cutoff. | The result is local to the cutoff. Check bunching, covariate continuity, and bandwidth sensitivity. |

05

Offline Policy Evaluation

Use OPE only when the logs actually support the policy you want to evaluate.

Offline policy evaluation is useful for routers, schedulers, ranking policies, and allocation rules, but only if the logs contain enough examples of the actions the new policy wants to take. Check support before estimating anything.

Logged propensities can support short-horizon contextual routing evaluation, especially when the action affects the current request and the relevant state was logged. They are less sufficient for full dynamic-policy evaluation when actions alter future queue and cache state. In that setting, IPS or doubly robust estimates can help screen policies, but live validation is still needed before broad deployment.

  1. Check support. Did the old policy try the same actions for similar state often enough?
  2. Check propensities. If action probabilities were logged, IPS and doubly robust methods become more credible.
  3. Inspect weights. If a few records drive the answer, the offline estimate is too fragile for a broad rollout.
  4. Validate live. Even a good OPE result should usually become a limited ramp or controlled exploration bucket before full deployment.

| Estimator | When to use it | Identification risk |
| --- | --- | --- |
| Direct method | Quick baseline when rich state and enough outcomes exist for each action. | Low variance, but biased if the outcome model is wrong for rarely chosen actions. |
| IPS | Stochastic routers or experiments where action probabilities are known. | Unstable when weights explode, especially if the new policy chooses actions the old one rarely tried. |
| Doubly robust OPE | Logged propensities plus a reasonable reward model. | Stronger default, but it still needs support, the right state variables, and no major interference. |

Operational principle: Before trusting OPE, verify that the old policy actually tried the actions the new policy wants to take in comparable states. If support is weak, add controlled exploration.
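
The support and weight checks can be sketched alongside a basic IPS estimate. This assumes a deterministic target policy, logged propensities for the action actually taken, and short-horizon rewards that do not depend on other requests. The log schema (`state`, `action`, `propensity`, `reward`) and the function name are hypothetical.

```python
def ips_value(logs, target_policy):
    """Inverse propensity scoring for a deterministic target policy.

    logs: list of dicts with keys state, action, propensity, reward,
          where propensity is the logging policy's probability of the
          logged action in the logged state.
    target_policy: function mapping state -> action.
    Returns the IPS estimate plus two fragility diagnostics.
    """
    weights = []
    weighted_rewards = []
    for row in logs:
        if row["propensity"] <= 0:
            raise ValueError("logged propensity must be positive")
        matches = target_policy(row["state"]) == row["action"]
        w = 1.0 / row["propensity"] if matches else 0.0
        weights.append(w)
        weighted_rewards.append(w * row["reward"])

    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    return {
        "estimate": sum(weighted_rewards) / len(logs),
        # Kish effective sample size: small values mean a few rows dominate.
        "effective_sample_size": (sum_w ** 2) / sum_w2 if sum_w2 > 0 else 0.0,
        "matched_rows": sum(1 for w in weights if w > 0),
    }


logs = [
    {"state": "off_peak", "action": "premium", "propensity": 0.5, "reward": 1.0},
    {"state": "off_peak", "action": "cheap",   "propensity": 0.5, "reward": 0.0},
    {"state": "peak",     "action": "cheap",   "propensity": 0.5, "reward": 0.5},
    {"state": "peak",     "action": "premium", "propensity": 0.5, "reward": 1.0},
]
target = lambda state: "premium" if state == "off_peak" else "cheap"
print(ips_value(logs, target))
# {'estimate': 0.75, 'effective_sample_size': 2.0, 'matched_rows': 2}
```

A low effective sample size or a handful of matched rows is the signal to stop and add controlled exploration rather than report the point estimate.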

06

Targeting and Heterogeneity

Use segments to change the allocation rule, not to decorate the analysis.

Heterogeneity analysis earns its place only when it changes a decision. If a segment result does not change who gets capacity, which workload is routed differently, or which customer segment is worth serving differently, it is mostly decoration.

  1. Start with the average effect. That shows whether there is a real lever worth segmenting.
  2. Pre-specify operational segments. Good segment definitions include workload length, customer tier, model family, region, latency sensitivity, and peak versus off-peak demand.
  3. Use CATE only if targeting is the decision. The segment has to be stable, interpretable, and simple enough to operate.
  4. Convert lift into net value. A high treatment effect can still be a bad allocation if it uses expensive capacity for low-margin traffic.
  5. Keep some exploration alive. If the router only exploits the current best segment, future evaluation loses support for alternatives.

| Segment lens | What it helps decide | What to watch |
| --- | --- | --- |
| Workload shape | Whether long-context, batch, real-time, or agentic workflows should receive different routing or limits. | Aggregate usage can hide very different cost and latency profiles. |
| Customer tier | Whether scarce capacity should be prioritized for customers with higher value or stricter SLAs. | Higher value may also come with higher reliability expectations. |
| System state | Whether the policy should change during peak demand, incidents, or low-utilization periods. | The same treatment can be worth it off-peak and too expensive during congestion. |

Operational principle: The average effect answers whether the lever works. Heterogeneity answers where scarce compute should be spent first.
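
Converting lift into net value is simple arithmetic, sketched here with invented numbers. The segment names, lifts, values per successful request, and incremental costs are all hypothetical.

```python
def net_value_per_request(lift, value_per_success, extra_cost):
    """Expected incremental value of treating one request, net of compute."""
    return lift * value_per_success - extra_cost


# Hypothetical per-segment readout: success-rate lift, value, and added cost.
segments = {
    "agentic_enterprise": {"lift": 0.10, "value_per_success": 2.00, "extra_cost": 0.05},
    "short_chat_free":    {"lift": 0.15, "value_per_success": 0.10, "extra_cost": 0.05},
}

ranked = sorted(
    ((name, net_value_per_request(**s)) for name, s in segments.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in ranked:
    print(name, round(value, 3))
# agentic_enterprise 0.15
# short_chat_free -0.035
```

The segment with the larger treatment effect is the worse allocation here, because each success is worth less than the capacity it consumes.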

07

Short-Run Signals vs. Long-Run Outcomes

Use short-term metrics as evidence only after showing why they should predict the delayed outcome.

Short-run metrics are useful, but they should not be treated as business outcomes by default. First state what part of the mechanism they measure, then explain why that mechanism should translate into retention, expansion, or LTV.

Surrogate Ladder

  1. System metric: TTFT, latency, cache hit rate, error rate, retry rate, or cost per successful request.
  2. Session outcome: task completion, abandonment, re-run rate, or successful workflow completion.
  3. User behavior: return rate, deeper usage, broader adoption, or more high-value workflows.
  4. Business outcome: retention, expansion, renewal, LTV, or margin.

The ladder can break. A faster request is not automatically higher retention, and more usage can still hurt margin if the workload is expensive or low value. A metric that predicts retention is not necessarily a valid surrogate for retention. A valid surrogate claim requires evidence that movement in the short-run metric captures the causal path by which the policy changes the long-run outcome.

If the long-run outcome takes too long to observe, the recommendation should be conditional: ramp only if the mechanism metric improves, the cost and reliability guardrails hold, and early behavior moves in the direction past experiments suggest. For policies expected to affect retention or expansion, keep the smallest durable holdout that can answer the long-run question. If no holdout is possible, label the surrogate risk explicitly rather than pretending the short-run metric proves the whole case.

08

Checks That Change the Decision

Checks should tell you whether to ship, ramp, target, pause, roll back, or instrument more.

A check that cannot change the recommendation is diligence for its own sake. Before running one, state which result would move the decision toward ship, ramp, pause, roll back, a narrower target population, or more instrumentation.

| Check | What to look for | How it changes the recommendation |
| --- | --- | --- |
| Assignment | SRM, eligibility, exposure time, and whether treatment started when the logs say it did. | If assignment is broken, pause the readout. |
| Execution | Intended action, executed action, fallback reason, timeout path, and retry path. | If execution diverged, report ITT and separately analyze the operational failure. |
| Balance and overlap | Balance tables, propensity overlap, pre-trends, or pre-period synthetic fit depending on the design. | If overlap is poor, narrow the estimand or downgrade the claim. |
| System spillover | Untreated-user latency, errors, queue depth, cache state, and utilization. | If spillovers are material, move to cluster or switchback evidence. |
| Economic significance | User value per GPU-hour, cost per successful request, margin impact, and reliability risk. | If the practical effect does not clear the cost bar, do not ship just because the p-value is good. |
| Rollback trigger | Latency above X, error rate above Y, cost per success above Z, or no movement in task completion. | If those triggers fire, pause or roll back even if the average user metric looks fine. In the router example, elevated fallback rates during congestion should stop the ramp even if premium-routed requests improved. |
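
The SRM (sample ratio mismatch) check in the first row is a one-degree-of-freedom chi-square test, which can be computed with the standard library alone. The function name and the 0.001 alert threshold in the usage lines are assumptions. Teams pick their own thresholds.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm split.

    With one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    return math.erfc(math.sqrt(chi2 / 2.0))


p = srm_p_value(n_control=5000, n_treatment=5400)
if p < 0.001:  # assumed alert threshold
    print("SRM detected: pause the readout and check assignment and logging")
```

A small p-value here says nothing about the treatment effect. It says the observed split is unlikely under the intended assignment, so the comparison itself cannot yet be trusted.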

09

The Decision Readout

A compact structure for a technical decision memo or experiment readout.
Start by naming the counterfactual: for this population, what would this outcome have looked like under the old policy or the best operational alternative?

State how treatment or allocation was assigned. If randomization is available and spillovers are low, use the right RCT unit. If the policy changes shared capacity, use cluster randomization or a switchback. If the evidence is historical, look for rollout timing, a threshold, an instrument, or enough pre-decision controls.

Name the main failure mode: confounding, bad controls, noncompliance, spillovers, weak support, or tail-state risk. Mitigate it with balance checks, pre-trends, full-funnel logging, holdouts, cluster-level uncertainty, washout, or controlled exploration.

Close with the decision: ship, ramp, target a segment, keep a holdout, pause, roll back, instrument more, or run a stronger design, based on user lift, system guardrails, and compute cost.

Common Decision Patterns

| Observed situation | Decision pattern | Contingency |
| --- | --- | --- |
| Routed users retained better. | Do not compare routed vs not-routed directly. Ask what drove routing. | If queue state or workload type drove routing, control for pre-decision state or run an experiment. |
| The policy can be tested safely. | Randomize at the unit that matches the outcome. | If shared capacity matters, move up to cluster or switchback. |
| The policy already rolled out. | Use DiD/event study if the rollout timing gives a credible comparison. | If pre-trends fail or there was a simultaneous incident, do not lean on DiD. |
| There is a cutoff. | Use RDD near the threshold. | Check manipulation and keep the claim local. |
| Only router logs are available. | Use OPE only if propensities and support exist. | If support is weak, recommend controlled exploration. |

Operational principle: End with the decision, not the method. State what should ship, what should be monitored, and what result should stop the ramp.

10

Appendix: Minimum Logging for a Policy Readout

A practical checklist for evaluating allocation policies after the design is clear.

This table belongs at the end because the preceding sections explain why these fields matter. Decision-time state, intended action, executed action, and system guardrails are not generic telemetry. They are the minimum evidence needed to distinguish the policy effect from the state that caused the policy to act.

| Field | Why it matters | Failure if missing |
| --- | --- | --- |
| Decision-time state | Queue depth, region, cache state, incidents, and capacity pressure may drive both assignment and outcome. | You mistake a hard-to-serve state for a treatment effect. |
| Intended action | This is the policy recommendation: route, admit, defer, cache, limit, or prioritize. | You cannot report intent-to-treat. |
| Executed action | Serving paths can change because of timeouts, provider issues, or fallback logic. | You analyze the policy as if it actually ran. |
| System guardrails | Latency, errors, queue depth, cost per successful request, and untreated-user experience show spillovers. | A user-level lift hides damage to the shared system. |

Operational principle: Log the policy's inputs and execution path before trying to estimate its effect. If the routing field, fallback reason, or decision-time state is missing, the recommendation is instrumentation, not a more elaborate model.
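
One possible shape for such a log record, as a sketch. The class name, field names, and the list of required state keys are hypothetical and would need to match whatever the policy actually reads at decision time.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed minimum set of decision-time state keys for this illustration.
REQUIRED_STATE_KEYS = ("queue_depth", "region", "cache_state", "capacity_pressure")


@dataclass
class PolicyDecisionRecord:
    request_id: str
    decision_time_state: dict          # measured before the action is chosen
    intended_action: str               # what the policy recommended
    executed_action: str               # what the serving path actually did
    fallback_reason: Optional[str] = None
    guardrails: dict = field(default_factory=dict)  # latency, errors, cost, ...

    @property
    def executed_as_intended(self) -> bool:
        return self.intended_action == self.executed_action

    def missing_state_keys(self):
        """State fields absent from the log, which block adjustment later."""
        return [k for k in REQUIRED_STATE_KEYS if k not in self.decision_time_state]


record = PolicyDecisionRecord(
    request_id="req-001",
    decision_time_state={"queue_depth": 12, "region": "us-east"},
    intended_action="premium",
    executed_action="cheap",
    fallback_reason="premium_timeout",
)
print(record.executed_as_intended)   # False
print(record.missing_state_keys())   # ['cache_state', 'capacity_pressure']
```

Keeping intended and executed action as separate fields is what makes an intent-to-treat readout possible when fallbacks fire.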