Paired Evaluation Math (log-space, token-weighted)
Plain language: The reported perplexity ratio is just the exponential of the token-weighted mean Δlog-loss, and the confidence interval comes from exponentiating the same paired bootstrap; this note derives both facts in the report's operating context.
Claim
For paired evaluation windows i = 1..n with token counts t_i, the reported
ratio between two arms A and B (e.g., preview/final or edited/baseline)
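satisfies

$$
\frac{\mathrm{PPL}_A}{\mathrm{PPL}_B} = \exp\left(\overline{\Delta}_w\right), \qquad \Delta_i = \ell_i^{A} - \ell_i^{B},
$$

where $\ell_i^{X}$ is the per-token log-loss of arm $X \in \{A, B\}$ on window $i$, and the weighted mean is

$$
\overline{\Delta}_w = \frac{\sum_{i=1}^{n} t_i \, \Delta_i}{\sum_{i=1}^{n} t_i}.
$$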
The ratio confidence interval is obtained by exponentiating the paired ΔlogNLL CI computed on the same windows with BCa bootstrap (paired, token‑weighted).
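As a quick numeric check of the claim, the sketch below computes the ratio two ways: as the quotient of the two arms' token-weighted perplexities, and as the exponential of the token-weighted mean of the paired per-window log-loss differences. The window values and the names `tokens`, `loss_a`, `loss_b`, and `weighted_mean` are illustrative assumptions, not part of the report schema.

```python
from math import exp

# Hypothetical per-window data: token counts and per-token log-loss (nats) per arm.
tokens = [512, 256, 384]
loss_a = [3.70, 5.40, 4.10]
loss_b = [3.65, 5.55, 4.05]


def weighted_mean(values, weights):
    """Token-weighted mean: sum(w * v) / sum(w)."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Route 1: quotient of the two arms' token-weighted perplexities.
ppl_a = exp(weighted_mean(loss_a, tokens))
ppl_b = exp(weighted_mean(loss_b, tokens))
ratio_from_ppl = ppl_a / ppl_b

# Route 2: exponential of the token-weighted mean paired difference.
deltas = [a - b for a, b in zip(loss_a, loss_b)]
ratio_from_deltas = exp(weighted_mean(deltas, tokens))

print(ratio_from_ppl, ratio_from_deltas)
```

Both routes agree up to floating-point rounding, because the weighted mean is linear in the per-window losses.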
Derivation (sketch)
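For ppl-like primary metrics (perplexity), $\mathrm{PPL}_X = \exp(\bar{\ell}^{X})$, where $\bar{\ell}^{X} = \sum_i t_i \ell_i^{X} / \sum_i t_i$ is the token-weighted mean log-loss of arm $X$. Thus the ratio:

$$
\frac{\mathrm{PPL}_A}{\mathrm{PPL}_B} = \exp\left(\bar{\ell}^{A} - \bar{\ell}^{B}\right) = \exp\left(\frac{\sum_i t_i \, (\ell_i^{A} - \ell_i^{B})}{\sum_i t_i}\right) = \exp\left(\overline{\Delta}_w\right),
$$

with $\overline{\Delta}_w$ the token-weighted mean of the paired differences $\Delta_i = \ell_i^{A} - \ell_i^{B}$.
BCa applied to the paired vector $(\Delta_1, \dots, \Delta_n)$ (resampled with weights proportional to $t_i$) yields CI $[L, U]$ for $\overline{\Delta}_w$; exponentiate to obtain $[e^{L}, e^{U}]$ for the ratio.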
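The interval construction can be sketched with the standard library alone. This is a minimal illustration, not InvarLock's implementation: the function name `bca_interval`, its defaults, and the sample data are hypothetical, and it adopts one reading of the weighting, namely that windows are resampled uniformly with each window's token count kept attached to its Δ, so the weighted mean is recomputed per replicate. Degenerate inputs (all Δ equal) are assumed to be handled by a separate fallback.

```python
import random
from math import exp
from statistics import NormalDist

_NORM = NormalDist()


def weighted_mean(deltas, tokens):
    """Token-weighted mean of the paired log-loss differences."""
    return sum(t * d for d, t in zip(deltas, tokens)) / sum(tokens)


def bca_interval(deltas, tokens, replicates=2000, alpha=0.05, seed=0):
    """BCa interval for the token-weighted mean difference (log space)."""
    rng = random.Random(seed)
    n = len(deltas)
    point = weighted_mean(deltas, tokens)

    # Paired resampling: each draw keeps a window's (delta, token count) together.
    boots = []
    for _ in range(replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        boots.append(weighted_mean([deltas[i] for i in idx], [tokens[i] for i in idx]))
    boots.sort()

    # Bias correction z0 from the share of replicates below the point estimate,
    # clamped so inv_cdf never receives exactly 0 or 1.
    frac = sum(b < point for b in boots) / replicates
    frac = min(max(frac, 1 / (replicates + 1)), replicates / (replicates + 1))
    z0 = _NORM.inv_cdf(frac)

    # Acceleration from leave-one-window-out (jackknife) estimates.
    jack = [
        weighted_mean(deltas[:i] + deltas[i + 1:], tokens[:i] + tokens[i + 1:])
        for i in range(n)
    ]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(q):
        z = z0 + _NORM.inv_cdf(q)
        level = _NORM.cdf(z0 + z / (1.0 - accel * z))
        return boots[min(replicates - 1, max(0, int(level * replicates)))]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)


# Hypothetical paired differences and token counts for eight windows.
deltas = [0.02, -0.01, 0.05, 0.00, 0.03, 0.01, -0.02, 0.04]
tokens = [512, 256, 512, 128, 256, 512, 384, 256]
lo, hi = bca_interval(deltas, tokens)
ratio_ci = (exp(lo), exp(hi))  # exponentiate the log-space bounds
print((lo, hi), ratio_ci)
```

Because `exp` is monotone, exponentiating the two log-space bounds preserves their order and coverage, which is why the ratio interval needs no separate bootstrap.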
Estimation note in log space
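Let the token-weighted mean be $\hat{\mu}_w = \sum_i t_i \Delta_i / \sum_i t_i$, where $\Delta_i$ is the paired per-window log-loss difference. By linearity of expectation, with the token counts $t_i$ held fixed,

$$
\mathbb{E}[\hat{\mu}_w] = \frac{\sum_i t_i \, \mathbb{E}[\Delta_i]}{\sum_i t_i},
$$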
so, under the stated window-level assumptions, the estimator targets the log of the token‑weighted ratio. Under mild assumptions (ergodicity across windows), the point estimator converges to the population log‑ratio.
Jensen inequality note
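Let $r_i = \exp(\Delta_i)$ be the per-window perplexity ratio and $w_i = t_i / \sum_j t_j$ the normalized token weights. Then $\exp(\overline{\Delta}_w) = \prod_i r_i^{w_i}$ is the weighted geometric mean of $r_i$. By AM-GM (equivalently Jensen on $\log$), the weighted geometric mean is at most the weighted arithmetic mean of $r_i$: $\prod_i r_i^{w_i} \le \sum_i w_i r_i$. The ratio of mean perplexities is a different quantity and can be larger or smaller; see the counter-example below.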
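The inequality is easy to confirm numerically. The sketch below uses made-up differences and token counts (all names are illustrative) to show that the reported ratio equals the weighted geometric mean of the per-window ratios and does not exceed their weighted arithmetic mean.

```python
from math import exp, prod

# Hypothetical per-window log-loss differences and token counts.
deltas = [0.05, -0.10, 0.20]
tokens = [512, 256, 128]

weights = [t / sum(tokens) for t in tokens]  # normalized token weights
ratios = [exp(d) for d in deltas]            # per-window perplexity ratios

geometric = prod(r ** w for r, w in zip(ratios, weights))
arithmetic = sum(w * r for r, w in zip(ratios, weights))
reported = exp(sum(w * d for d, w in zip(deltas, weights)))

print(geometric, arithmetic, reported)
```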
Why log‑space vs ratio of means (counter‑example)
The naive ratio of mean perplexities can be biased toward high‑perplexity windows. A simple two‑window example shows the pitfall:
from math import exp, log

weights = [512, 256]      # tokens per window
preview = [40.0, 220.0]   # per-window perplexity, preview arm
final = [38.0, 260.0]     # high-perplexity window regresses strongly

# Log-space ratio: exp of the token-weighted mean of log(final) - log(preview).
ratio_log = exp(
    sum(w * (log(b) - log(a)) for w, a, b in zip(weights, preview, final))
    / sum(weights)
)
# Naive ratio of token-weighted mean perplexities.
ratio_means = (
    sum(w * b for w, b in zip(weights, final))
    / sum(w * a for w, a in zip(weights, preview))
)
print(ratio_log, ratio_means)  # 1.0217..., 1.12
InvarLock uses the exponential of the token‑weighted mean ΔlogNLL
(exp(weighted_mean(Δlog))), which respects pairing and avoids the bias.
Runtime Contract
- Reports must satisfy:
  - primary_metric.display_ci == exp(primary_metric.ci) (paired baseline path; ppl-like kinds).
  - dataset.windows.stats.paired_delta_summary records {mean,std,degenerate} for the paired Δ distribution.
  - dataset.windows.stats.window_match_fraction == 1.0 and dataset.windows.stats.window_overlap_fraction == 0.0.
- Runs abort in CI/Release profiles if preview/final counts differ or pairing < 1.0.
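A consumer of the report could verify these conditions along the following lines. This is a sketch under assumptions: the report is taken to be a nested dict that mirrors the dotted field paths, both intervals are taken to be two-element sequences, and `check_contract` with its tolerance is a hypothetical helper rather than an InvarLock API.

```python
from math import exp, isclose


def check_contract(report, rel_tol=1e-9):
    """Return a list of contract violations for a ppl-like paired report."""
    problems = []
    metric = report["primary_metric"]
    stats = report["dataset"]["windows"]["stats"]

    # display_ci must be the exponentiated log-space interval.
    for shown, logged in zip(metric["display_ci"], metric["ci"]):
        if not isclose(shown, exp(logged), rel_tol=rel_tol):
            problems.append("display_ci != exp(ci)")
            break

    # The paired delta summary must record all three fields.
    missing = {"mean", "std", "degenerate"} - set(stats.get("paired_delta_summary", {}))
    if missing:
        problems.append(f"paired_delta_summary missing {sorted(missing)}")

    # Perfect pairing and no overlap between windows.
    if stats.get("window_match_fraction") != 1.0:
        problems.append("window_match_fraction != 1.0")
    if stats.get("window_overlap_fraction") != 0.0:
        problems.append("window_overlap_fraction != 0.0")
    return problems


# Hypothetical report fragment with made-up values.
report = {
    "primary_metric": {"ci": [0.001, 0.030], "display_ci": [exp(0.001), exp(0.030)]},
    "dataset": {"windows": {"stats": {
        "paired_delta_summary": {"mean": 0.017, "std": 0.024, "degenerate": False},
        "window_match_fraction": 1.0,
        "window_overlap_fraction": 0.0,
    }}},
}
print(check_contract(report))
```

Returning a list of violations instead of raising on the first one lets a caller log every failed condition before deciding whether to abort.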
Observability
- primary_metric.{preview,final}: supports preview→final drift checks for ppl-like kinds.
- primary_metric.display_ci and primary_metric.ci: paired ΔlogNLL interval (check both log and exponentiated views).
- dataset.windows.stats.{window_match_fraction,window_overlap_fraction,paired_windows}.
- dataset.windows.stats.paired_delta_summary.{mean,std,degenerate} and dataset.windows.stats.bootstrap.{replicates,seed}.
- dataset.windows.stats.coverage.{preview,final}: confirms both arms honour window/coverage minima.
Edge cases & safeguards
- If all t_i are equal, weighting reduces to the simple mean: the implementation can short-circuit.
- Degenerate Δ (all equal): mark degenerate=true and collapse the CI to [μ, μ] with μ = mean(Δ); the report records the fallback.
- Label alignment & padding must not contribute to t_i (masked tokens excluded).
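The three safeguards above can be sketched as small helpers. All names here are hypothetical, and the `ignore_index=-100` convention for masked labels is an assumption borrowed from common training code, not something the report confirms.

```python
def count_scored_tokens(labels, ignore_index=-100):
    """t_i for one window: labels equal to ignore_index (padding, masked) are skipped."""
    return sum(1 for label in labels if label != ignore_index)


def token_weighted_mean(deltas, tokens):
    """Weighted mean of deltas; short-circuits to the plain mean when all t_i are equal."""
    if len(set(tokens)) == 1:
        return sum(deltas) / len(deltas)
    return sum(t * d for d, t in zip(deltas, tokens)) / sum(tokens)


def degenerate_ci(deltas, tol=0.0):
    """Return (degenerate, ci): ci collapses to (mu, mu) when every delta is equal."""
    if max(deltas) - min(deltas) <= tol:
        mu = sum(deltas) / len(deltas)
        return True, (mu, mu)
    return False, None


print(count_scored_tokens([5, -100, 7, -100]))
print(token_weighted_mean([1.0, 3.0], [1, 3]))
print(degenerate_ci([0.5, 0.5, 0.5]))
```

Checking for degeneracy before bootstrapping avoids resampling a distribution with zero spread, where the bias and acceleration terms are not meaningful.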