Paired Evaluation Math (log-space, token-weighted)
Plain language: The reported perplexity ratio is just the exponential of the token-weighted mean Δlog-loss, and the confidence interval comes from exponentiating the same paired bootstrap; this note derives both facts in the report's operating context.
Overview
| Aspect | Details |
|---|---|
| Purpose | Derive the paired log-space primary-metric ratio and its displayed confidence interval. |
| Audience | Report verifier maintainers, statistics reviewers, and contributors changing paired metric code. |
| Contract scope | PPL-like metrics on paired evaluation windows with known token counts and non-overlapping schedules. |
| Source of truth | src/invarlock/core/bootstrap.py, report pairing logic, and paired-CI contract tests. |
Claim
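For ppl-like metrics on paired evaluation windows $i = 1..n$ with token counts $t_i$, the reported
ratio between two arms A and B (e.g., preview/final or edited/baseline)
satisfies

$$
\frac{\mathrm{PPL}_B}{\mathrm{PPL}_A} = \exp\left(\overline{\Delta\ell}\right),
\qquad \Delta\ell_i = \ell_i^{B} - \ell_i^{A},
$$

where $\ell_i^{X}$ is the per-token log-loss of arm $X$ on window $i$, and the weighted mean is

$$
\overline{\Delta\ell} = \frac{\sum_{i=1}^{n} t_i \, \Delta\ell_i}{\sum_{i=1}^{n} t_i}.
$$

The ratio confidence interval is obtained by exponentiating the paired ΔlogNLL CI computed on the same windows with BCa bootstrap (paired, token-weighted).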
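As a quick numerical check of the claim, the sketch below compares the exponential of the token-weighted mean Δlog-loss with the ratio of the two arms' token-weighted perplexities. The window data and the helper name `weighted_mean` are hypothetical and chosen only for illustration; they are not taken from the InvarLock code.

```python
from math import exp, isclose


def weighted_mean(values, weights):
    """Token-weighted mean: sum(w * v) / sum(w)."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical paired windows: token counts and per-token log-losses (nats).
tokens = [512, 256, 384]
logloss_a = [3.70, 5.40, 4.10]  # arm A
logloss_b = [3.65, 5.55, 4.10]  # arm B, evaluated on the same windows

# Route 1: exponentiate the token-weighted mean of the paired differences.
delta = [b - a for a, b in zip(logloss_a, logloss_b)]
ratio_from_delta = exp(weighted_mean(delta, tokens))

# Route 2: divide the two token-weighted perplexities.
ppl_a = exp(weighted_mean(logloss_a, tokens))
ppl_b = exp(weighted_mean(logloss_b, tokens))
ratio_from_ppl = ppl_b / ppl_a

# Both routes agree up to floating-point rounding.
assert isclose(ratio_from_delta, ratio_from_ppl, rel_tol=1e-12)
```

The two routes agree because both arms share the same token counts, so the normalizer cancels inside the exponent.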
Derivation (sketch)
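For ppl-like primary metrics (perplexity), with $\ell_i^{X}$ the per-token log-loss of arm $X$ on window $i$ and $t_i$ the window's token count:

$$
\mathrm{PPL}_X = \exp\left(\frac{\sum_i t_i \, \ell_i^{X}}{\sum_i t_i}\right),
\qquad X \in \{A, B\}.
$$

Thus the ratio:

$$
\frac{\mathrm{PPL}_B}{\mathrm{PPL}_A}
= \exp\left(\frac{\sum_i t_i \left(\ell_i^{B} - \ell_i^{A}\right)}{\sum_i t_i}\right)
= \exp\left(\frac{\sum_i t_i \, \Delta\ell_i}{\sum_i t_i}\right).
$$

BCa applied to the paired vector $(\Delta\ell_i)_{i=1}^{n}$ (resampled with weights proportional to $t_i$) yields a CI $[L, U]$ for the weighted mean; exponentiate to obtain the ratio CI $[e^{L}, e^{U}]$.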
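The resample-then-exponentiate step can be sketched as follows. This is a simplified illustration, not the implementation in `src/invarlock/core/bootstrap.py`: it uses a plain percentile interval in place of BCa (no bias or acceleration correction), and it reads "weights proportional to the token counts" as drawing windows with probability proportional to their token count and averaging the drawn deltas. The function name, defaults, and data are hypothetical.

```python
import random
from math import exp


def paired_percentile_ci(delta, tokens, replicates=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the token-weighted mean of paired deltas.

    Assumption: windows are drawn with probability proportional to their
    token count, so the plain mean of a resample targets the token-weighted
    mean. This is a percentile interval, NOT BCa.
    """
    rng = random.Random(seed)
    n = len(delta)
    stats = []
    for _ in range(replicates):
        sample = rng.choices(delta, weights=tokens, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * (replicates - 1))]
    hi = stats[int((1 - alpha / 2) * (replicates - 1))]
    return lo, hi


# Hypothetical paired per-window deltas (arm B minus arm A) and token counts.
tokens = [512, 256, 384, 512, 128, 256]
delta = [-0.05, 0.15, 0.00, 0.02, 0.08, -0.01]

log_lo, log_hi = paired_percentile_ci(delta, tokens)
ratio_ci = (exp(log_lo), exp(log_hi))  # interval shown on the ratio scale
```

Because `exp` is monotone, exponentiating the endpoints of the log-space interval preserves its coverage on the ratio scale; no second bootstrap is needed.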
Estimation note in log space
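Let the token-weighted mean be $\hat{\mu} = \sum_i t_i \, \Delta\ell_i \,/\, \sum_i t_i$, where $\Delta\ell_i$ is the paired per-token log-loss difference on window $i$ and the known token counts $t_i$ are treated as fixed. By linearity of expectation,

$$
\mathbb{E}[\hat{\mu}] = \frac{\sum_i t_i \, \mathbb{E}[\Delta\ell_i]}{\sum_i t_i},
$$

so, under the stated window-level assumptions, the estimator targets the log of the token-weighted ratio. Under mild assumptions (ergodicity across windows), the point estimator converges to the population log-ratio.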
Jensen inequality note
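Let $r_i = \exp(\Delta\ell_i)$ be the per-window perplexity ratio and $w_i = t_i / \sum_j t_j$ the normalized token weights. Then

$$
\exp\left(\sum_i w_i \log r_i\right) = \prod_i r_i^{\,w_i}
$$

is the weighted geometric mean of the $r_i$. By AM-GM (equivalently Jensen on $\log$), the weighted geometric mean is at most the weighted arithmetic mean $\sum_i w_i r_i$ of the $r_i$. The ratio of mean perplexities is a different quantity and can be larger or smaller than the log-space ratio; see the counter-example below.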
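The inequality can be checked numerically. The sketch below uses the same two illustrative windows as the counter-example in the next section; the variable names are hypothetical.

```python
from math import exp, log

tokens = [512, 256]
ppl_a = [40.0, 220.0]  # per-window perplexity, first arm
ppl_b = [38.0, 260.0]  # per-window perplexity, second arm

total = sum(tokens)
w = [t / total for t in tokens]            # normalized token weights
r = [b / a for a, b in zip(ppl_a, ppl_b)]  # per-window ratios

# Weighted geometric mean of the ratios (the log-space quantity).
geo_mean = exp(sum(wi * log(ri) for wi, ri in zip(w, r)))
# Weighted arithmetic mean of the same ratios.
arith_mean = sum(wi * ri for wi, ri in zip(w, r))

# AM-GM: the geometric mean never exceeds the arithmetic mean.
assert geo_mean <= arith_mean
```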
Why log‑space vs ratio of means (counter‑example)
The naive ratio of mean perplexities can be biased toward high‑perplexity windows. A simple two‑window example shows the pitfall:
```python
from math import exp, log

weights = [512, 256]
preview = [40.0, 220.0]
final = [38.0, 260.0]  # high-perplexity window regresses strongly

# Log-space ratio: exp of the token-weighted mean of per-window log-ratios.
ratio_log = exp(
    sum(w * (log(b) - log(a)) for w, a, b in zip(weights, preview, final))
    / sum(weights)
)
# Naive ratio of token-weighted mean perplexities.
ratio_means = (
    sum(w * b for w, b in zip(weights, final))
    / sum(w * a for w, a in zip(weights, preview))
)
print(ratio_log, ratio_means)  # 1.0217..., 1.12
```
InvarLock uses the exponential of the token‑weighted mean ΔlogNLL
(exp(weighted_mean(Δlog))), which respects pairing and avoids the bias.
Runtime Contract
- Reports must satisfy:
  - `primary_metric.display_ci == exp(primary_metric.ci)` (paired baseline path; ppl-like kinds).
  - `dataset.windows.stats.paired_delta_summary` records `{mean,std,degenerate}` for the paired Δ distribution.
  - `dataset.windows.stats.window_match_fraction == 1.0` and `dataset.windows.stats.window_overlap_fraction == 0.0`.
- Runs hard-fail in CI/Release profiles when a baseline pairing context exists and preview/final counts differ, pairing is incomplete, or windows overlap. Verification also rejects invalid pairing in generated reports.
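A minimal sketch of how these clauses could be checked against a report is shown below. It assumes the report is available as a nested dictionary whose keys mirror the dotted field paths above and that each CI is a two-element `(low, high)` sequence; the function `check_paired_contract` and the example values are hypothetical and do not reproduce the actual verifier.

```python
from math import exp, isclose


def check_paired_contract(report, rel_tol=1e-9):
    """Return the list of violated clauses (an empty list means pass)."""
    errors = []
    pm = report["primary_metric"]
    stats = report["dataset"]["windows"]["stats"]

    # display_ci must be the exponentiated log-space CI.
    lo, hi = pm["ci"]
    dlo, dhi = pm["display_ci"]
    if not (isclose(dlo, exp(lo), rel_tol=rel_tol)
            and isclose(dhi, exp(hi), rel_tol=rel_tol)):
        errors.append("display_ci != exp(ci)")

    # The paired delta summary must carry all three fields.
    summary = stats.get("paired_delta_summary", {})
    missing = {"mean", "std", "degenerate"} - set(summary)
    if missing:
        errors.append(f"paired_delta_summary missing {sorted(missing)}")

    # Pairing must be complete and windows must not overlap.
    if stats.get("window_match_fraction") != 1.0:
        errors.append("window_match_fraction != 1.0")
    if stats.get("window_overlap_fraction") != 0.0:
        errors.append("window_overlap_fraction != 0.0")
    return errors


# Hypothetical report fragment that satisfies every clause.
log_ci = (-0.01, 0.03)
report = {
    "primary_metric": {
        "ci": log_ci,
        "display_ci": tuple(exp(x) for x in log_ci),
    },
    "dataset": {"windows": {"stats": {
        "paired_delta_summary": {"mean": 0.011, "std": 0.02, "degenerate": False},
        "window_match_fraction": 1.0,
        "window_overlap_fraction": 0.0,
    }}},
}
violations = check_paired_contract(report)
```

Returning a list of violations rather than raising on the first one keeps the sketch usable both for hard-fail profiles (fail if the list is non-empty) and for diagnostics.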
Observability
- `primary_metric.{preview,final}`: supports preview→final drift checks for ppl-like kinds.
- `primary_metric.display_ci` and `primary_metric.ci`: paired ΔlogNLL interval (check both log and exponentiated views).
- `dataset.windows.stats.{window_match_fraction,window_overlap_fraction,paired_windows}`.
- `dataset.windows.stats.paired_delta_summary.{mean,std,degenerate}` and `dataset.windows.stats.bootstrap.{replicates,seed}`.
- `dataset.windows.stats.coverage.{preview,final}`: confirms both arms honour window/coverage minima.
Edge cases & safeguards
- If all `t_i` are equal, weighting reduces to the simple mean; the implementation can short-circuit.
- Degenerate Δ (all equal): mark `degenerate=true` and collapse the CI to `[μ, μ]` with `μ = mean(Δ)`; the report records the fallback.
- Label alignment and padding must not contribute to `t_i` (masked tokens excluded).
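Two of these safeguards can be illustrated with a short sketch. The names `count_scored_tokens` and `summarize_paired_delta` are hypothetical, the use of `-100` as the masked-label value is an assumption borrowed from a common convention, and the population standard deviation is an arbitrary choice for the `std` field.

```python
from statistics import fmean, pstdev

IGNORE_INDEX = -100  # assumption: label value marking masked or padded positions


def count_scored_tokens(labels):
    """t_i for one window: positions that actually contribute to the loss."""
    return sum(1 for y in labels if y != IGNORE_INDEX)


def summarize_paired_delta(delta):
    """Summarize paired deltas and collapse the CI when they are all equal."""
    if not delta:
        raise ValueError("need at least one paired window")
    degenerate = all(d == delta[0] for d in delta)
    if degenerate:
        mu = delta[0]  # the mean of identical values is that value
        return {"mean": mu, "std": 0.0, "degenerate": True, "ci": (mu, mu)}
    # Non-degenerate case: the CI is left to the bootstrap.
    return {"mean": fmean(delta), "std": pstdev(delta),
            "degenerate": False, "ci": None}


t_i = count_scored_tokens([12, IGNORE_INDEX, 7, IGNORE_INDEX, 3])
flat = summarize_paired_delta([0.02, 0.02, 0.02])
```

Short-circuiting the degenerate case avoids running a bootstrap whose replicates would all be identical.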
References
- Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed. draft), chapters on language modeling and perplexity. https://web.stanford.edu/~jurafsky/slp3/
- Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
- Hugging Face Transformers. “Perplexity of fixed-length models.” https://huggingface.co/docs/transformers/perplexity