Measurement for Shared Compute

Shared Compute Is Not a Standard A/B Test

In shared compute, the "treatment" is typically a dynamic rule rather than a static action. Hereafter, we refer to any such dynamic rule as a "policy." For example, a router, rate limit, cache, or capacity allocation rule selects an action based on both the request-level workload and the live system state.

Consider an example. A scheduler sends latency-sensitive, high-value requests to a geographically near cluster when capacity is available, then falls back to a farther cluster during congestion.

A = π(X, S)

X = pre-decision workload and user information

S = live decision-time system state: queue, cache, region, capacity, incidents, and recent demand

In the scheduler example, cluster assignment is not simply "treated" or "untreated." The same request may be routed to the nearby cluster when the queue is shallow and sent to a farther cluster when the system is congested.
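
A minimal sketch of the scheduler written as a policy A = π(X, S). The field names, the cluster labels, and the queue threshold are hypothetical, and sending every other request to the far cluster is an assumption made only to keep the example complete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """X: pre-decision workload and user information."""
    latency_sensitive: bool
    high_value: bool


@dataclass(frozen=True)
class SystemState:
    """S: live decision-time system state."""
    near_queue_depth: int
    near_queue_limit: int  # hypothetical congestion threshold


def policy(x: Request, s: SystemState) -> str:
    """Return the action A = pi(X, S): which cluster serves the request."""
    congested = s.near_queue_depth >= s.near_queue_limit
    if x.latency_sensitive and x.high_value and not congested:
        return "near"
    return "far"  # assumption: everything else falls back to the far cluster


req = Request(latency_sensitive=True, high_value=True)
shallow = SystemState(near_queue_depth=3, near_queue_limit=10)
congested = SystemState(near_queue_depth=12, near_queue_limit=10)

print(policy(req, shallow))    # near
print(policy(req, congested))  # far
```

The same request receives two different actions, which is why a log column that only says "near" or "far" cannot be read as a treatment flag without the state that produced it.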

System State Is Both Confounder and Outcome

Shared inference systems make the timing of state measurement central. At time t, system state is an input to the current routing, admission, or fallback decision. At time t + 1, that same class of state variables can be downstream of earlier policy decisions.

A_t = π(X_t, S_t)

S_{t+1} = f(S_t, A_t, W_t)

W_t = arrivals, failures, provider events, cache eviction, and other external disturbances

This loop is what differentiates shared compute from a conventional A/B test. Queue depth before a routing decision can be a confounder. Queue depth after the policy has run can be part of the policy's effect. When an experiment modifies the capacity state that later requests draw from, one request's outcome depends on how other requests were treated, so a unit-level treatment effect is no longer well defined (see Experimental Designs Under Shared Compute).
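
The feedback loop can be made concrete with a toy simulation. This is a sketch under strong simplifying assumptions: state is a single queue depth, only near-routed arrivals join the queue, and a fixed number of requests is served per step. The function names and all numbers are hypothetical.

```python
def policy(queue_depth: int, limit: int) -> str:
    """A_t = pi(S_t): route near while the queue is below the limit."""
    return "near" if queue_depth < limit else "far"


def step(queue_depth: int, action: str, arrivals: int, served: int) -> int:
    """S_{t+1} = f(S_t, A_t, W_t): near-routed arrivals join the queue."""
    added = arrivals if action == "near" else 0
    return max(0, queue_depth + added - served)


def simulate(limit: int, arrivals: list[int], served: int = 2) -> list[tuple[int, str]]:
    """Return the logged (S_t, A_t) pairs for one policy setting."""
    s, log = 0, []
    for w in arrivals:
        a = policy(s, limit)       # S_t is an input to the decision
        log.append((s, a))
        s = step(s, a, w, served)  # S_{t+1} is an outcome of the decision
    return log


arrivals = [4, 4, 4, 4, 4, 4]  # W_t: identical external demand for both runs
log_strict = simulate(limit=3, arrivals=arrivals)
log_loose = simulate(limit=6, arrivals=arrivals)
print(log_strict)
print(log_loose)
```

With identical external demand, the two policy settings log different queue depths from the third step onward. Adjusting for queue depth measured after the policy has been running would adjust away part of the policy's own effect.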

Fat-Tailed Costs and Capacity Risk

For AI infrastructure and inference products, a plain difference-in-means readout can be fragile when the metric is tokens, cost, or GPU-hours per user. All three metrics can be dominated by a small number of very large users, long-context workflows, or high-concurrency windows.

A note on vantage point. This section extends my earlier analysis of why fat-tailed costs emerge at scale, which was written from outside: I do not operate these systems, and the distributional claims are reasoned from serving architecture and public information rather than from operator telemetry. Treat the shape claims as hypotheses to check against your own logs. The actual distribution of tokens per request, and how strongly cost depends on system state, are empirical questions only an operator can settle.

Tokens per request plausibly follow something like a truncated power law. The mean is finite because the context window bounds every request, but the variance is large and likely growing as use cases become more heterogeneous and context windows lengthen with each model release. If that shape is roughly right, the distribution is estimable: the mean exists, it just converges slowly, and a handful of long-context or agentic workflows can dominate any short experiment.

True resource cost per request is harder. It depends on live system state: batch concurrency, KV-cache pressure, prefill/decode balance, GPU efficiency, and temporal demand. The effective capacity boundary is reactive to this system state in a way the fixed context-window limit is not, so the conditional mean of resource cost may not settle inside the experiment window.

Tokens per request: bounded support, finite mean, fat tails → mean estimates converge slowly

Resource cost per request: cost = f(request, peak system state), state non-stationary → conditional mean may not converge

Implication: use robust estimation for token cost; treat resource cost as a capacity-state outcome, not a per-request scalar
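
To see what "bounded support, finite mean, fat tails" can look like, the sketch below evaluates a truncated Pareto distribution on an even grid of quantiles. The lower bound, upper bound, and shape parameter are hypothetical and are not estimated from any real traffic; the point is only the gap between mean and median and the share of tokens carried by the largest requests.

```python
def truncated_pareto_quantile(u: float, lo: float, hi: float, alpha: float) -> float:
    """Inverse CDF of a Pareto(alpha) distribution truncated to [lo, hi]."""
    c = 1.0 - (lo / hi) ** alpha
    return lo * (1.0 - u * c) ** (-1.0 / alpha)


# Hypothetical parameters: minimum request size, context-window cap, tail shape.
LO, HI, ALPHA = 100.0, 200_000.0, 1.1
N = 100_000

# Evenly spaced quantiles stand in for N requests sorted by size.
tokens = [truncated_pareto_quantile((i + 0.5) / N, LO, HI, ALPHA) for i in range(N)]

mean_tokens = sum(tokens) / N
median_tokens = tokens[N // 2]
top_k = N // 100
top_1pct_share = sum(tokens[-top_k:]) / sum(tokens)

print(f"mean={mean_tokens:.0f} median={median_tokens:.0f} "
      f"top 1% share of tokens={top_1pct_share:.2f}")
```

Under these assumed parameters the mean sits well above the median and the top 1% of requests account for a large fraction of all tokens, which is the situation where a short experiment's difference in means depends heavily on which arm happened to receive the largest requests.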

KV-cache memory scales with sequence length and batch size, so unrelated long contexts arriving concurrently can push a resource pool past its memory boundary even when every individual request looks ordinary. The tail event is therefore better characterized as overcommitted resources than as "one expensive spike." It can cause elevated fallback rates or even a crash. A rare peak-concurrency window can erase the margin earned on the other requests served in that window.
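
A back-of-the-envelope calculation shows how ordinary requests can overcommit a pool. The sketch assumes a standard transformer KV cache (one key and one value vector per layer, per KV head, per token) and ignores paging, fragmentation, and weights. The model dimensions and the memory budget are hypothetical.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 2-byte elements.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

kv_budget = 40 * GIB              # hypothetical memory reserved for KV cache
seq_len = 32_000                  # tokens held in cache per request
one_request = seq_len * per_token
max_concurrent = kv_budget // one_request
sixteen_concurrent = 16 * one_request

print(f"per request: {one_request / GIB:.2f} GiB")
print(f"requests that fit at once: {max_concurrent}")
print(f"16 concurrent requests: {sixteen_concurrent / GIB:.1f} GiB")
```

Each request uses a small fraction of the budget, yet sixteen of them arriving together exceed it. The boundary is a property of the concurrent set, not of any single request.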

TK: explain the behavioral tail risk (dominant users) vs. system-state tail risk (peak-concurrency windows) and develop a POV on how to deal with each.

Measurement Implications

Make the time window a unit of analysis. If cost is a function of peak system state, per-user outcomes cannot carry the whole readout. Compute uncertainty at the window, cluster, or fleet level.
Ensure the design observes the boundary. Switchback blocks must include peak-concurrency periods; an experiment run entirely off-peak has zero support in the states where the policy can fail. If the boundary was never observed, the readout should say so.
Report boundary events as treatment effects, not incident noise. Fallback bursts, queue overflow, cache-eviction storms, and error spikes during peak windows are outcomes of the allocation policy under evaluation.
Prefer capacity-impact signals to per-request cost prediction. The marginal resource cost of a request is impractical to predict reliably, but signals of a request's impact on aggregate capacity state, such as predicted sequence length and compute time consumed, are measurable and make more practical guardrails.

Name the Decision First

A useful discipline is to ask what you would measure if the experiment were perfect. If every request could be randomly routed to a performant cluster with no interference, no capacity limits, and full observability, what would that experiment tell you? That line of questioning exposes the causal target and the measurement constraints, so the design can approximate the ideal experiment rather than merely summarize the logs.

Name the decision. Is the decision to ship broadly, target a segment, or change a routing rule?
Choose the comparison. The counterfactual might be the old policy, untreated users, a different cluster, a prior model checkpoint, or a different time block (peak vs. off-peak hours).
Frame the answer. Broad rollout, treated-population impact, targeting, and policy value answer different operational questions.
Use unit economics as a guardrail. A policy can improve task success while burning too much compute. Pair retention or task completion with latency, errors, GPU-hours, and cost per successful request.
End with a rule. Ship, ramp, target, hold out, pause, roll back, or instrument more. The readout should state the recommendation and the tradeoff that would change it.
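
The unit-economics guardrail can be sketched as a pairing of task success with cost per successful request. The class, the function, the threshold, and all numbers below are hypothetical and only illustrate the shape of the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmSummary:
    requests: int
    successes: int
    gpu_hours: float

    @property
    def success_rate(self) -> float:
        return self.successes / self.requests

    @property
    def gpu_hours_per_success(self) -> float:
        return self.gpu_hours / self.successes


def guardrail_holds(control: ArmSummary, treatment: ArmSummary,
                    max_cost_increase: float) -> bool:
    """True if cost per successful request rose by no more than the allowed fraction."""
    ratio = treatment.gpu_hours_per_success / control.gpu_hours_per_success
    return ratio - 1.0 <= max_cost_increase


control = ArmSummary(requests=10_000, successes=8_000, gpu_hours=400.0)
treatment = ArmSummary(requests=10_000, successes=8_400, gpu_hours=504.0)

print(treatment.success_rate > control.success_rate)                # True
print(guardrail_holds(control, treatment, max_cost_increase=0.10))  # False
```

In this made-up example task success improves while GPU-hours per success rise by about 20 percent, so a readout that reported only success rate would recommend shipping a policy that fails its cost guardrail.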

Decision Pattern Cheat Sheet

Decision pattern	Comparison frame	Potential tradeoff
Ship broadly	Average Treatment Effect (ATE): what happens across the eligible population?	Can hide that only one workload or customer segment benefits enough to justify the cost.
Explain treated traffic	Average Treatment Effect on the Treated (ATT): what happened to users or requests that actually received treatment?	Can be less useful for future rollout if treated users were unusually selected.
Target segment	Conditional Average Treatment Effect (CATE): who benefits enough to receive higher limits or better routing?	Can overfit unless the segment is stable, interpretable, and operationally usable.
Change the rule	Policy value: what is the expected outcome if the system runs rule π?	Needs support in the logs, system guardrails, and a clear cost/value objective.

Policy value, and why it is not ATE or CATE. ATE and CATE are properties of a treatment: what does action A do on average, or for segment x? Policy value is a property of a decision rule: what happens if the system runs rule π, where π looks at context and chooses an action? A dynamic rule only treats some units, in some states, and its value depends on how often those contexts occur and on operational constraints.
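
A small worked example of the distinction. It assumes a fixed effect per context and ignores the feedback from the policy into how often each state occurs, so it is a static simplification. The contexts, shares, baselines, and effects are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    workload: str
    state: str
    share: float     # how often this context occurs
    baseline: float  # expected outcome under the default action (far cluster)
    effect: float    # change in outcome if routed to the near cluster


# Hypothetical numbers chosen only for illustration.
CONTEXTS = [
    Context("latency_sensitive", "shallow", 0.25, 1.0, 0.5),
    Context("latency_sensitive", "congested", 0.25, 0.5, -0.5),
    Context("batch", "shallow", 0.25, 1.0, 0.0),
    Context("batch", "congested", 0.25, 0.5, -0.25),
]


def ate(contexts) -> float:
    """Average effect of the action over the whole eligible population."""
    return sum(c.share * c.effect for c in contexts)


def policy_value(contexts, rule) -> float:
    """Expected outcome if the system runs `rule`, which picks who is treated."""
    return sum(c.share * (c.baseline + (c.effect if rule(c) else 0.0)) for c in contexts)


def never(c: Context) -> bool:
    return False


def always(c: Context) -> bool:
    return True


def targeted(c: Context) -> bool:
    return c.workload == "latency_sensitive" and c.state == "shallow"


print(ate(CONTEXTS))                     # -0.0625
print(policy_value(CONTEXTS, never))     # 0.75
print(policy_value(CONTEXTS, always))    # 0.6875
print(policy_value(CONTEXTS, targeted))  # 0.875
```

The ATE is negative, so "treat everyone" loses to "treat no one," yet the state-aware rule beats both. The gap between always and never equals the ATE; the value of the targeted rule is a different quantity.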

Experimental Designs

Experiments still start with the same question: what is the unit that receives the policy, and what other traffic changes because that unit was treated? If the policy does not change shared state, a user-level or request-level randomized controlled trial (RCT) may be enough. If it changes queue, cache, or capacity state, the experiment has to randomize a larger unit or time block.

User or Workspace RCT

Use for: rate limits, prices, model access, priority tiers, or any change where the same user should keep seeing the same experience.

Why: if the outcome is retention, expansion, or sustained usage, the randomization unit should match how the user actually experiences the product.

Potential downside: randomizing by user usually has less statistical power than randomizing by request, and takes longer to read.

Mitigation: use short-run system metrics as early checks, and keep a holdout for long-run outcomes.
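
One common way to keep a user on the same experience is deterministic hash-based assignment. This is a sketch, not a description of any particular experimentation platform; the function name, the experiment label, and the ID format are hypothetical.

```python
import hashlib


def assign_arm(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to an arm."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # first 32 bits, scaled to [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same unit always lands in the same arm for a given experiment.
print(assign_arm("user-42", "rate-limit-v2") == assign_arm("user-42", "rate-limit-v2"))  # True
```

Passing a user or workspace ID as `unit_id` gives a user-level design; passing a request ID gives a request-level design. Salting the hash with the experiment name keeps assignments in different experiments from being correlated.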

Request RCT

Use for: latency, error rate, cost per request, retry rate, or whether a route succeeds.

Why: request-level randomization gives much more statistical power, especially for high-volume traffic. It is a good fit for isolated serving changes where the main outcome is request-level.

Potential downside: it can be a bad user experience if the same workflow gets mixed treatment.

Mitigation: use it only when the change is invisible or low-risk to the user and does not materially alter shared state.

Cluster Randomization

Use for: changes to batching, caching, regional routing, or fleet allocation where the intervention changes a shared capacity pool.

Why: spillovers break user-level randomization. If treated traffic changes queue depth, cache warmth, or capacity pressure, control users in the same pool experience a different system. Randomizing by cluster contains most interference within the randomized unit.

Potential downside: there are fewer independent units, so statistical power is worse. Clusters can also differ by chip generation, geography, user mix, latency, customer value, or incident patterns.

Mitigation: stratify before randomization, check pre-period balance, analyze at the cluster level, and report system-level outcomes instead of only treated-user outcomes.
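
Stratifying before randomization can be sketched as shuffling clusters within each stratum and splitting each stratum in half. The cluster names and stratum labels are hypothetical, and the sketch assumes an even number of clusters per stratum.

```python
import random
from collections import defaultdict


def stratified_assignment(clusters: dict[str, str], seed: int) -> dict[str, str]:
    """Map cluster name -> arm, balancing arms within each stratum.

    `clusters` maps cluster name -> stratum label (for example chip generation
    plus region).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for name, stratum in sorted(clusters.items()):
        by_stratum[stratum].append(name)

    arms = {}
    for names in by_stratum.values():
        rng.shuffle(names)
        half = len(names) // 2
        for i, name in enumerate(names):
            arms[name] = "treatment" if i < half else "control"
    return arms


# Hypothetical fleet: two strata of four clusters each.
CLUSTERS = {
    "c1": "gen-a/us", "c2": "gen-a/us", "c3": "gen-a/us", "c4": "gen-a/us",
    "c5": "gen-b/eu", "c6": "gen-b/eu", "c7": "gen-b/eu", "c8": "gen-b/eu",
}
arms = stratified_assignment(CLUSTERS, seed=7)
print(arms)
```

Because the cluster is the randomized unit, the analysis should then use one observation per cluster (for example the cluster-level mean) rather than treating every request as independent.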

Switchback Experiment

Use for: cache policy, global routing configuration, fleet allocation, scheduler settings, or any change that has to be turned on for a whole system at once.

Why: when the policy changes queue or cache state for everyone, compare blocks of time under policy A versus policy B. In this design, state after the policy runs is an outcome, not a pre-treatment nuisance variable.

Potential downside: demand changes over time. Hour-of-day, day-of-week, incidents, launches, and seasonal traffic can get mixed into the treatment effect. Carryover is real: cache warmth, queue backlog, retry storms, and provider throttling can persist across blocks.

Mitigation: randomize across comparable time blocks, balance treatment across weekends and weekdays, record incidents, and include washout when cache or queue persistence is material. Blocks must be long enough for the policy to take effect but short enough to avoid conflating treatment with demand cycles.
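
A minimal sketch of a switchback schedule and a block-level readout with washout. It balances arms within each time-of-day slot across days and drops the first intervals of every block before averaging. It does not handle weekday versus weekend balance, and the function names, the block structure, and the example values are hypothetical.

```python
import random
from statistics import mean


def switchback_schedule(n_days: int, blocks_per_day: int, seed: int) -> list[list[str]]:
    """For each time-of-day slot, assign policy A and B to an equal number of days."""
    if n_days % 2:
        raise ValueError("n_days must be even to balance A and B within each slot")
    rng = random.Random(seed)
    schedule = [[""] * blocks_per_day for _ in range(n_days)]
    for slot in range(blocks_per_day):
        arms = ["A", "B"] * (n_days // 2)
        rng.shuffle(arms)
        for day, arm in enumerate(arms):
            schedule[day][slot] = arm
    return schedule


def block_effect(blocks: list[tuple[str, list[float]]], washout: int) -> float:
    """Difference in mean block-level outcome, B minus A.

    Each block is (arm, metric values per interval in time order). The first
    `washout` intervals of every block are dropped to limit carryover.
    """
    means = {"A": [], "B": []}
    for arm, values in blocks:
        means[arm].append(mean(values[washout:]))
    return mean(means["B"]) - mean(means["A"])


schedule = switchback_schedule(n_days=4, blocks_per_day=6, seed=1)

# Hypothetical blocks where the first interval still reflects the previous policy.
blocks = [
    ("A", [9.0, 1.0, 1.0, 1.0]),
    ("B", [1.0, 3.0, 3.0, 3.0]),
    ("A", [3.0, 1.0, 1.0, 1.0]),
    ("B", [1.0, 3.0, 3.0, 3.0]),
]
print(block_effect(blocks, washout=0))  # 0.25
print(block_effect(blocks, washout=1))  # 2.0
```

In the made-up data, carryover in the first interval of each block shrinks the apparent effect when no washout is applied. The block, not the request, is the unit whose variation should drive the uncertainty estimate.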

The main takeaway: for a shared compute intervention, do not pretend requests are independently treated.

When You Can't Randomize

Observational work should not oversell the estimator. The central question is whether the historical comparison has the right pre-decision variables, enough overlap, and a credible assignment story. Unlike in an experiment, confounding is not solved by design. It is the problem.

Assignment Story for Logs

X and S are plausible adjustment variables when they are pre-decision common causes. Latency, errors, cache hit, completion, and retry behavior sit after the action, so they should be reported as mechanisms or outcomes unless the question is specifically about a direct effect.

Observational Estimators

Method	When to use it	Identification risk
Outcome regression	Good first pass when the main assignment variables are logged: model, account tier, region, time of day, queue depth, prompt length, and past usage.	A model that predicts retention well can still estimate the treatment effect badly if the assignment story is wrong.
Propensity weighting or matching	Useful when treated and untreated units overlap but had different probabilities of treatment.	Extreme weights or poor overlap mean the result depends on a few unusual records. Show overlap and trim if needed.
Doubly robust estimation	Stronger default when both assignment and outcome can be modeled.	It still cannot fix missing confounders. Lead with the identification assumption, not the estimator name.
Double ML	Useful with rich telemetry and many covariates where flexible models help with nuisance functions.	Better prediction is not better causality by itself. Explain cross-fitting plainly and still show diagnostics.
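
The simplest version of adjustment is stratifying on a pre-decision state variable. The sketch below compares a naive routed versus not-routed difference with a difference computed within queue-state strata and then averaged by stratum share. The aggregates are hypothetical and constructed so that routing depends on queue state.

```python
# Each row: (pre-decision queue state, routed_near, number of requests, mean outcome).
# Hypothetical aggregates, not real logs.
ROWS = [
    ("shallow", True, 80, 0.9),
    ("shallow", False, 20, 0.8),
    ("congested", True, 20, 0.6),
    ("congested", False, 80, 0.5),
]


def weighted_mean(rows) -> float:
    n = sum(r[2] for r in rows)
    return sum(r[2] * r[3] for r in rows) / n


def naive_difference(rows) -> float:
    """Routed mean minus not-routed mean, ignoring why requests were routed."""
    treated = [r for r in rows if r[1]]
    control = [r for r in rows if not r[1]]
    return weighted_mean(treated) - weighted_mean(control)


def stratified_difference(rows) -> float:
    """Within-state differences, averaged by each state's share of traffic."""
    total = sum(r[2] for r in rows)
    estimate = 0.0
    for state in sorted({r[0] for r in rows}):
        stratum = [r for r in rows if r[0] == state]
        share = sum(r[2] for r in stratum) / total
        estimate += share * naive_difference(stratum)
    return estimate


print(round(naive_difference(ROWS), 3))       # 0.28
print(round(stratified_difference(ROWS), 3))  # 0.1
```

Most of the naive gap comes from routed traffic being concentrated in the easy state. The adjusted number is only credible if queue state was measured before the decision and no other unlogged variable drove routing.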

Bad Controls in Shared Compute Logs

Variable	How to treat it	Why
Queue depth before routing	Usually adjust for it.	It can drive both assignment and outcome.
Queue depth after routing	Report it as an outcome or mechanism metric.	It can be downstream of the policy and part of the effect on the shared system.
Latency after routing	Do not control for it in a total-effect estimate.	It is probably part of the mechanism.
Completed requests only	Avoid conditioning on this sample.	Treatment may affect completion, so failures disappear from the readout.
Untreated-user latency	Use as a spillover guardrail.	It shows whether treated traffic harmed the shared system.

If the policy used a field that was not logged, no observational method fixes that. The recommendation should be instrumentation or an experiment, not a fancier model.

Quasi-Experimental Designs

These are strongest when the business already created quasi-random variation: a phased rollout, a threshold, a beta invite, or one region/fleet/provider changing before the rest. Pick the design based on the assignment story, then state the assumption that could fail.

Design	Recommendation	Tradeoff to check
Difference-in-differences	Use when capacity, routing, pricing, or provider changes roll out to some units before others.	Pre-trends have to look credible. Also check incidents, launches, or demand shifts at the same time.
Synthetic control	Use when one region, fleet, provider, or customer tier changes and there is a long pre-period.	A clean post-period gap is not enough. Show pre-period fit, placebo units, and donor weights.
Instrumental variables / LATE	Use when assignment nudges treatment but actual take-up is imperfect or self-selected.	The estimate is local to compliers, and the instrument has to affect the outcome through treatment only.
Regression discontinuity	Use when eligibility, limits, priority, or pricing changes sharply at a known cutoff.	The result is local to the cutoff. Check bunching, covariate continuity, and bandwidth sensitivity.
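
The difference-in-differences row reduces to simple arithmetic in the two-group, two-period case, and the same formula applied to two pre-rollout periods gives a basic placebo check. The latency numbers are hypothetical.

```python
def did(treated_pre: float, treated_post: float,
        control_pre: float, control_post: float) -> float:
    """Change in the treated group minus change in the comparison group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical p50 latency in milliseconds for a changed region and an unchanged one.
effect = did(treated_pre=200, treated_post=180, control_pre=210, control_post=205)

# Placebo: the same formula on two periods before the rollout. A value far
# from zero suggests the groups were already diverging.
placebo = did(treated_pre=201, treated_post=200, control_pre=211, control_post=210)

print(effect)   # -15
print(placebo)  # 0
```

A placebo near zero supports, but does not prove, the parallel-trends assumption. A simultaneous incident or launch in one group can still break the comparison.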

Offline Policy Evaluation

Offline policy evaluation (OPE) is useful for routers, schedulers, ranking policies, and allocation rules, but only if the logs contain enough examples of the actions the new policy wants to take. Check support before estimating anything.

Logged propensities can support short-horizon contextual routing evaluation, especially when the action affects the current request and the relevant state was logged. They are less sufficient for full dynamic-policy evaluation when actions alter future queue and cache state. In that setting, inverse propensity scoring (IPS) or doubly robust estimates can help screen policies, but live validation is still needed before broad deployment.

Check support. Did the old policy try the same actions in similar states often enough?
Check propensities. If action probabilities were logged, IPS and doubly robust methods become more credible.
Inspect weights. If a few records drive the answer, the offline estimate is too fragile for a broad rollout.
Validate live. Even a good OPE result should usually become a limited ramp or controlled exploration bucket before full deployment.

Estimator	When to use it	Identification risk
Direct method	Quick baseline when rich state and enough outcomes exist for each action.	Low variance, but biased if the outcome model is wrong for rarely chosen actions.
IPS	Stochastic routers or experiments where action probabilities are known.	Unstable when weights explode, especially if the new policy chooses actions the old one rarely tried.
Doubly robust OPE	Logged propensities plus a reasonable reward model.	Stronger default, but it still needs support, the right state variables, and no major interference.
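
A minimal IPS sketch with two of the diagnostics named above: effective sample size and the largest weight. The logged records, propensities, and rewards are hypothetical, and the estimator assumes the logged propensities are correct and that actions do not change future state.

```python
# Each logged record: (state, action, logging propensity of that action, reward).
LOGS = (
    [("shallow", "near", 0.8, 1.0)] * 8
    + [("shallow", "far", 0.2, 0.5)] * 2
    + [("congested", "near", 0.1, 0.2)] * 1
    + [("congested", "far", 0.9, 0.6)] * 9
)


def ips(logs, target_policy):
    """Return (IPS value estimate, effective sample size, largest weight)."""
    weights = [(1.0 / p if target_policy(s) == a else 0.0) for s, a, p, _ in logs]
    value = sum(w * rec[3] for w, rec in zip(weights, logs)) / len(logs)
    ess = sum(weights) ** 2 / sum(w * w for w in weights)
    return value, ess, max(weights)


def state_aware(state: str) -> str:
    return "near" if state == "shallow" else "far"


def always_near(state: str) -> str:
    return "near"


value_a, ess_a, max_w_a = ips(LOGS, state_aware)
value_b, ess_b, max_w_b = ips(LOGS, always_near)
print(f"state_aware: value={value_a:.2f} ess={ess_a:.1f} max_weight={max_w_a:.2f}")
print(f"always_near: value={value_b:.2f} ess={ess_b:.1f} max_weight={max_w_b:.2f}")
```

The always-near estimate rests on a single congested record carrying a weight of 10, and its effective sample size is a small fraction of the 20 logged records. That is the pattern that should send the policy to controlled exploration rather than a broad rollout.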

Operational principle: Before trusting OPE, verify that the old policy actually tried the actions the new policy wants to take in comparable states. If support is weak, add controlled exploration.

Targeting and Heterogeneity

Heterogeneity should change the allocation rule, not merely decorate the analysis. If the result does not change who gets capacity, which workload is routed differently, or which customer segment is worth serving differently, it is mostly decoration.

Start with the average effect. That shows whether there is a real lever worth segmenting.
Pre-specify operational segments. Good segment definitions include workload length, customer tier, model family, region, latency sensitivity, and peak versus off-peak demand.
Use CATE only if targeting is the decision. The segment has to be stable, interpretable, and simple enough to operate.
Convert lift into net value. A high treatment effect can still be a bad allocation if it uses expensive capacity for low-margin traffic.
Keep some exploration alive. If the router only exploits the current best segment, future evaluation loses support for alternatives.
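
Converting lift into net value can be sketched as subtracting the incremental compute cost from the incremental value, per segment. The segments, lifts, GPU-hour figures, and cost are hypothetical.

```python
GPU_HOUR_COST = 2.0  # hypothetical cost per GPU-hour, in the same units as value

# segment -> (incremental value per request, incremental GPU-hours per request)
SEGMENTS = {
    "long_context": (0.30, 0.20),
    "real_time": (0.10, 0.02),
    "batch": (0.02, 0.01),
}


def net_value(lift: float, extra_gpu_hours: float,
              gpu_hour_cost: float = GPU_HOUR_COST) -> float:
    """Incremental value minus the cost of the extra compute it consumes."""
    return lift - extra_gpu_hours * gpu_hour_cost


ranked = sorted(SEGMENTS, key=lambda s: net_value(*SEGMENTS[s]), reverse=True)
for segment in ranked:
    print(segment, round(net_value(*SEGMENTS[segment]), 3))
```

In this made-up example the segment with the largest raw lift ranks last once compute cost is subtracted, which is the case the list above warns about.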

Segment lens	What it helps decide	What to watch
Workload shape	Whether long-context, batch, real-time, or agentic workflows should receive different routing or limits.	Aggregate usage can hide very different cost and latency profiles.
Customer tier	Whether scarce capacity should be prioritized for customers with higher value or stricter SLAs.	Higher value may also come with higher reliability expectations.
System state	Whether the policy should change during peak demand, incidents, or low-utilization periods.	The same treatment can be worth it off-peak and too expensive during congestion.

Operational principle: The average effect answers whether the lever works. Heterogeneity answers where scarce compute should be spent first.

Short-Run Signals vs. Long-Run Outcomes

Short-run metrics are useful, but they should not be treated as business outcomes by default. First state what part of the mechanism they measure, then explain why that mechanism should translate into retention, expansion, or LTV.

Surrogate Ladder

System metric: time to first token (TTFT), latency, cache hit rate, error rate, retry rate, or cost per successful request.
Session outcome: task completion, abandonment, re-run rate, or successful workflow completion.
User behavior: return rate, deeper usage, broader adoption, or more high-value workflows.
Business outcome: retention, expansion, renewal, LTV, or margin.

The ladder can break. A faster request is not automatically higher retention, and more usage can still hurt margin if the workload is expensive or low value. A metric that predicts retention is not necessarily a valid surrogate for retention. A valid surrogate claim requires evidence that movement in the short-run metric captures the causal path by which the policy changes the long-run outcome.

If the long-run outcome takes too long to observe, the recommendation should be conditional: ramp only if the mechanism metric improves, the cost and reliability guardrails hold, and early behavior moves in the direction past experiments suggest. For policies expected to affect retention or expansion, keep the smallest durable holdout that can answer the long-run question. If no holdout is possible, label the surrogate risk explicitly rather than pretending the short-run metric proves the whole case.

Checks That Change the Decision

A check is not there to show diligence for its own sake. Each one should be able to change the recommendation: ship, ramp, pause, roll back, narrow the target population, or instrument more.

Check	What to look for	How it changes the recommendation
Assignment	Sample ratio mismatch (SRM), eligibility, exposure time, and whether treatment started when the logs say it did.	If assignment is broken, pause the readout.
Execution	Intended action, executed action, fallback reason, timeout path, and retry path.	If execution diverged, report intent-to-treat (ITT) and separately analyze the operational failure.
Balance and overlap	Balance tables, propensity overlap, pre-trends, or pre-period synthetic fit depending on the design.	If overlap is poor, narrow the estimand or downgrade the claim.
System spillover	Untreated-user latency, errors, queue depth, cache state, and utilization.	If spillovers are material, move to cluster or switchback evidence.
Economic significance	User value per GPU-hour, cost per successful request, margin impact, and reliability risk.	If the practical effect does not clear the cost bar, do not ship just because the p-value is good.
Rollback trigger	Latency above X, error rate above Y, cost per success above Z, or no movement in task completion.	If those triggers fire, pause or roll back even if the average user metric looks fine. In the router example, elevated fallback rates during congestion should stop the ramp even if premium-routed requests improved.
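
The assignment check in the first row can be sketched as a chi-square test for sample ratio mismatch. The counts are hypothetical, and the 0.001 significance level is a commonly used convention assumed here, not something the table specifies.

```python
def srm_chi_square(n_control: int, n_treatment: int,
                   expected_treatment_share: float = 0.5) -> float:
    """Chi-square statistic (1 degree of freedom) for observed vs. planned split."""
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    return ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)


CRITICAL_0_001 = 10.828  # chi-square critical value, 1 degree of freedom, alpha = 0.001


def has_srm(n_control: int, n_treatment: int) -> bool:
    return srm_chi_square(n_control, n_treatment) > CRITICAL_0_001


print(has_srm(50_000, 50_600))  # False
print(has_srm(50_000, 51_200))  # True
```

When the check fires, the split itself is evidence that assignment or logging is broken, so the readout is paused rather than adjusted.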

The Decision Readout

Start by naming the counterfactual: for this population, what would this outcome have looked like under the old policy or the best operational alternative?

State how treatment or allocation was assigned. If randomization is available and spillovers are low, use the right RCT unit. If the policy changes shared capacity, use cluster randomization or a switchback. If the evidence is historical, look for rollout timing, a threshold, an instrument, or enough pre-decision controls.

Name the main failure mode: confounding, bad controls, noncompliance, spillovers, weak support, or tail-state risk. Mitigate it with balance checks, pre-trends, full-funnel logging, holdouts, cluster-level uncertainty, washout, or controlled exploration.

Close with the decision: ship, ramp, target a segment, keep a holdout, pause, roll back, instrument more, or run a stronger design, based on user lift, system guardrails, and compute cost.

Common Decision Patterns

Observed situation	Decision pattern	Contingency
Routed users retained better.	Do not compare routed vs not-routed directly. Ask what drove routing.	If queue state or workload type drove routing, control for pre-decision state or run an experiment.
The policy can be tested safely.	Randomize at the unit that matches the outcome.	If shared capacity matters, move up to cluster or switchback.
The policy already rolled out.	Use DiD/event study if the rollout timing gives a credible comparison.	If pre-trends fail or there was a simultaneous incident, do not lean on DiD.
There is a cutoff.	Use RDD near the threshold.	Check manipulation and keep the claim local.
Only router logs are available.	Use OPE only if propensities and support exist.	If support is weak, recommend controlled exploration.

Operational principle: End with the decision, not the method. State what should ship, what should be monitored, and what result should stop the ramp.

Appendix: Minimum Logging for a Policy Readout

This table belongs at the end because the preceding sections explain why these fields matter. Decision-time state, intended action, executed action, and system guardrails are not generic telemetry. They are the minimum evidence needed to distinguish the policy effect from the state that caused the policy to act.

Field	Why it matters	Failure if missing
Decision-time state	Queue depth, region, cache state, incidents, and capacity pressure may drive both assignment and outcome.	You mistake a hard-to-serve state for a treatment effect.
Intended action	This is the policy recommendation: route, admit, defer, cache, limit, or prioritize.	You cannot report intent-to-treat.
Executed action	Serving paths can change because of timeouts, provider issues, or fallback logic.	You analyze the policy as if it actually ran.
System guardrails	Latency, errors, queue depth, cost per successful request, and untreated-user experience show spillovers.	A user-level lift hides damage to the shared system.

Operational principle: Log the policy's inputs and execution path before trying to estimate its effect. If the routing field, fallback reason, or decision-time state is missing, the recommendation is instrumentation, not a more elaborate model.
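
The four fields in the table can be sketched as a log record plus a check that flags what blocks a readout. The class name, the required state keys, and the example values are hypothetical and would need to match whatever the serving stack actually emits.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PolicyDecisionLog:
    request_id: str
    decision_time_state: dict   # e.g. queue depth, cache state, capacity pressure
    intended_action: str        # what the policy recommended
    executed_action: str        # what the serving path actually did
    fallback_reason: Optional[str] = None
    guardrails: dict = field(default_factory=dict)  # latency, errors, cost, etc.

    def diverged(self) -> bool:
        return self.intended_action != self.executed_action


REQUIRED_STATE_KEYS = ("queue_depth", "cache_state", "capacity_pressure")  # hypothetical


def readout_blockers(record: PolicyDecisionLog) -> list[str]:
    """List what is missing before this record can support a policy readout."""
    problems = [f"missing state: {key}" for key in REQUIRED_STATE_KEYS
                if key not in record.decision_time_state]
    if record.diverged() and record.fallback_reason is None:
        problems.append("executed action differs from intent, no fallback reason logged")
    return problems


complete = PolicyDecisionLog(
    request_id="r-1",
    decision_time_state={"queue_depth": 12, "cache_state": "warm", "capacity_pressure": 0.9},
    intended_action="near",
    executed_action="far",
    fallback_reason="near_cluster_timeout",
    guardrails={"latency_ms": 840.0},
)
incomplete = PolicyDecisionLog(
    request_id="r-2",
    decision_time_state={"queue_depth": 3},
    intended_action="near",
    executed_action="far",
)

print(readout_blockers(complete))    # []
print(readout_blockers(incomplete))
```

Keeping intended and executed action as separate fields is what makes an intent-to-treat readout possible when fallbacks occur.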
