Paired Evaluation Math (log-space, token-weighted)

Plain language: The reported perplexity ratio is the exponential of the token-weighted mean Δlog-loss, and its confidence interval is the paired bootstrap interval for that mean, exponentiated endpoint by endpoint. This note derives both facts in the report's operating context.

Claim

For paired evaluation windows i = 1..n with token counts t_i, the reported ratio between two arms A and B (e.g., preview/final or edited/baseline) satisfies

$$\text{ratio} = \exp\!\Big(\overline{\Delta \ell}_{\text{w}}\Big),\quad \Delta \ell_i = \ell^{(B)}_i - \ell^{(A)}_i,$$

where $\ell_i$ is the per‑token log‑loss on window $i$, and the weighted mean is

$$\overline{\Delta \ell}_{\text{w}} = \frac{\sum_i t_i \, \Delta \ell_i}{\sum_i t_i}.$$

The ratio confidence interval is obtained by exponentiating the paired ΔlogNLL CI computed on the same windows with BCa bootstrap (paired, token‑weighted).
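As a rough illustration of the claim, the sketch below computes the token‑weighted mean Δlog‑loss and the resulting ratio for three made‑up windows, then checks that it matches PPL(B)/PPL(A) computed from each arm separately. The helper names `weighted_mean` and `paired_ratio` and all numbers are assumptions for illustration, not part of the report pipeline.

```python
from math import exp, isclose

def weighted_mean(values, weights):
    """Token-weighted mean: sum(w * v) / sum(w)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def paired_ratio(loss_a, loss_b, tokens):
    """Return exp of the token-weighted mean of (loss_b - loss_a).

    loss_a, loss_b: per-token log-loss of arm A / arm B on each paired window.
    tokens: token count t_i, shared by both arms on window i.
    """
    if not (len(loss_a) == len(loss_b) == len(tokens)):
        raise ValueError("arms must be paired window-for-window")
    deltas = [b - a for a, b in zip(loss_a, loss_b)]
    return exp(weighted_mean(deltas, tokens))

# Hypothetical per-window log-losses (nats per token) and token counts.
loss_a = [3.70, 5.40, 4.10]
loss_b = [3.65, 5.55, 4.10]
tokens = [512, 256, 128]

ratio = paired_ratio(loss_a, loss_b, tokens)

# The same number falls out of each arm's own token-weighted perplexity.
ppl_a = exp(weighted_mean(loss_a, tokens))
ppl_b = exp(weighted_mean(loss_b, tokens))
assert isclose(ratio, ppl_b / ppl_a)
print(round(ratio, 4))
```

The equality holds only because both arms use the same `tokens` list; with different counts per arm the two computations would diverge.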

Visual Overview

┌─────────────────────────────────────────────────────────────────────────┐
│               PAIRED EVALUATION MATH (log-space, token-weighted)        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   WINDOW PAIR i    ┌─────────────────────────────────────────────────┐  │
│   ────────────────▶│  Arm A (baseline)    Arm B (subject)            │  │
│                    │  ────────────────    ────────────────           │  │
│                    │  ℓᵢ⁽ᴬ⁾ = log-loss    ℓᵢ⁽ᴮ⁾ = log-loss           │  │
│                    │  tᵢ   = token count  tᵢ   = token count         │  │
│                    └──────────────────────┬──────────────────────────┘  │
│                                           │                             │
│                                           ▼                             │
│                    ┌─────────────────────────────────────────────────┐  │
│                    │  Δℓᵢ = ℓᵢ⁽ᴮ⁾ − ℓᵢ⁽ᴬ⁾   (per-window Δlog-loss)   │  │
│                    └──────────────────────┬──────────────────────────┘  │
│                                           │                             │
│   FOR ALL WINDOWS i=1..n                  ▼                             │
│                    ┌─────────────────────────────────────────────────┐  │
│                    │      Σᵢ tᵢ · Δℓᵢ                                │  │
│                    │  Δℓ̄ₓ = ─────────────   (token-weighted mean)    │  │
│                    │         Σᵢ tᵢ                                   │  │
│                    └──────────────────────┬──────────────────────────┘  │
│                                           │                             │
│                                           ▼                             │
│            ┌──────────────────────────────┴─────────────┐               │
│            │                                            │               │
│            ▼                                            ▼               │
│   ┌─────────────────┐                       ┌────────────────────────┐  │
│   │     RATIO       │                       │   BCa BOOTSTRAP (CI)   │  │
│   │ ────────────────│                       │ ────────────────────── │  │
│   │ exp(Δℓ̄ₓ)        │                       │ Resample {Δℓᵢ} with    │  │
│   │ = PPL⁽ᴮ⁾/PPL⁽ᴬ⁾ │                       │ weights ∝ tᵢ → [L,U]   │  │
│   │                 │                       │ CI = [exp(L), exp(U)]  │  │
│   └────────┬────────┘                       └───────────┬────────────┘  │
│            │                                            │               │
│            └────────────────────┬───────────────────────┘               │
│                                 ▼                                       │
│            ┌─────────────────────────────────────────────┐              │
│            │                   report                    │              │
│            │  ratio_vs_baseline = exp(Δℓ̄ₓ)               │              │
│            │  display_ci       = [exp(L), exp(U)]        │              │
│            └─────────────────────────────────────────────┘              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Derivation (sketch)

For ppl-like primary metrics (perplexity), $\text{PPL} = \exp(\bar{\ell})$ where $\bar{\ell} = \sum_i t_i \ell_i / \sum_i t_i$. Because paired windows carry the same token counts $t_i$ in both arms, the difference of the two weighted means equals the weighted mean of the per-window differences. Thus the ratio in display space is:

$$\frac{\text{PM}^{(B)}}{\text{PM}^{(A)}} = \exp\Big(\bar{\ell}^{(B)} - \bar{\ell}^{(A)}\Big) = \exp\Bigg(\frac{\sum_i t_i \big(\ell^{(B)}_i - \ell^{(A)}_i\big)}{\sum_i t_i}\Bigg) = \exp\Big(\overline{\Delta \ell}_{\text{w}}\Big).$$

BCa applied to the paired vector $\{\Delta \ell_i\}$ (resampled with weights proportional to $t_i$) yields a log-space CI $[L, U]$. Since $\exp$ is monotone increasing, exponentiating the endpoints gives the ratio CI $[\exp(L), \exp(U)]$ at the same confidence level.
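The interval construction can be sketched in pure Python as follows. This is one possible reading of "resampled with weights proportional to $t_i$": windows are drawn with probability proportional to their token count and each replicate is a plain mean, while the acceleration term uses leave-one-out token-weighted means. The function name `bca_ci`, the defaults for `replicates`, `alpha` and `seed`, and the sample data are all assumptions, not the report's actual implementation.

```python
import random
from math import exp
from statistics import NormalDist

def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def bca_ci(deltas, tokens, replicates=2000, alpha=0.05, seed=0):
    """BCa interval (log space) for the token-weighted mean of paired deltas.

    deltas and tokens must be lists of equal length.
    """
    n = len(deltas)
    theta_hat = weighted_mean(deltas, tokens)
    if max(deltas) == min(deltas):
        return theta_hat, theta_hat  # degenerate: nothing to resample

    rng = random.Random(seed)
    # Draw windows with probability proportional to t_i, then average plainly;
    # each replicate has the token-weighted mean as its expectation.
    boot = sorted(
        sum(rng.choices(deltas, weights=tokens, k=n)) / n
        for _ in range(replicates)
    )

    norm = NormalDist()
    # Bias correction z0, clamped so inv_cdf never receives 0 or 1.
    below = sum(b < theta_hat for b in boot)
    frac = min(max(below / replicates, 1 / (replicates + 1)),
               replicates / (replicates + 1))
    z0 = norm.inv_cdf(frac)

    # Acceleration from leave-one-out (jackknife) token-weighted means.
    jack = [
        weighted_mean(deltas[:i] + deltas[i + 1:], tokens[:i] + tokens[i + 1:])
        for i in range(n)
    ]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(p):
        z = norm.inv_cdf(p)
        adj = norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        idx = min(max(int(adj * replicates), 0), replicates - 1)
        return boot[idx]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)

# Hypothetical paired deltas and token counts.
deltas = [-0.05, 0.15, 0.0, 0.02, -0.01, 0.08]
tokens = [512, 256, 128, 512, 384, 256]
lo, hi = bca_ci(deltas, tokens)
ratio_ci = (exp(lo), exp(hi))  # display-space interval
print(ratio_ci)
```

Fixing the seed makes the interval reproducible across runs, which is presumably why a bootstrap seed is worth recording alongside the replicate count.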

Estimation note in log space

Let the token‑weighted mean be $\overline{\Delta \ell}_{\text{w}} = \sum_i t_i\,\Delta \ell_i / \sum_i t_i$. By linearity of expectation,

$$\mathbb{E}\big[\overline{\Delta \ell}_{\text{w}}\big] = \frac{\sum_i t_i\, \mathbb{E}[\Delta \ell_i]}{\sum_i t_i} = \log\Bigg(\prod_i \Big(\tfrac{p_i^{(B)}}{p_i^{(A)}}\Big)^{\,t_i/\sum_j t_j}\Bigg),$$

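where the token counts $t_i$ are treated as fixed and $p_i^{(\cdot)}$ denotes the window-level population perplexity of each arm, so that $\mathbb{E}[\Delta \ell_i] = \log\big(p_i^{(B)}/p_i^{(A)}\big)$. Under these window-level assumptions, the estimator targets the log of the token‑weighted geometric‑mean ratio. Under mild assumptions (ergodicity across windows), the point estimator converges to the population log‑ratio.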

Jensen inequality note

Let $r_i = \exp(\Delta \ell_i) = \mathrm{PPL}^{(B)}_i / \mathrm{PPL}^{(A)}_i$. Then $\exp\big(\overline{\Delta \ell}_{\text{w}}\big)$ is the weighted geometric mean of the $r_i$. By AM-GM (equivalently Jensen on $\log$), the weighted geometric mean is $\le$ the weighted arithmetic mean of the $r_i$. The ratio of mean perplexities is a different quantity and can be larger or smaller; see the counter-example below.
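A quick numeric check of the inequality, using two made‑up per‑window ratios and token counts (both are illustrative assumptions):

```python
from math import exp, log

tokens = [512, 256]    # hypothetical token counts t_i
ratios = [0.95, 1.30]  # hypothetical per-window ratios r_i

total = sum(tokens)

# Weighted geometric mean: exp of the token-weighted mean of log(r_i).
geometric = exp(sum(t * log(r) for t, r in zip(tokens, ratios)) / total)

# Weighted arithmetic mean of the same ratios.
arithmetic = sum(t * r for t, r in zip(tokens, ratios)) / total

assert geometric <= arithmetic  # AM-GM with weights t_i / sum(t)
print(geometric, arithmetic)
```

The two means coincide only when every $r_i$ is the same; the more the per‑window ratios spread out, the wider the gap.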

Why log‑space vs ratio of means (counter‑example)

The naive ratio of mean perplexities can be biased toward high‑perplexity windows. A simple two‑window example shows the pitfall:

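from math import exp, log

# Two paired windows: token counts and per-window perplexities for each arm.
weights = [512, 256]
preview = [40.0, 220.0]
final = [38.0, 260.0]  # high-perplexity window regresses strongly

# Log-space estimate: exp of the token-weighted mean of log(final / preview).
ratio_log = exp(
    sum(w * (log(b) - log(a)) for w, a, b in zip(weights, preview, final))
    / sum(weights)
)

# Naive estimate: ratio of the token-weighted mean perplexities.
ratio_means = (
    sum(w * b for w, b in zip(weights, final))
    / sum(w * a for w, a in zip(weights, preview))
)

print(ratio_log, ratio_means)  # 1.0217..., 1.12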

InvarLock uses the exponential of the token‑weighted mean ΔlogNLL (exp(weighted_mean(Δlog))), which respects pairing and avoids the bias.

Runtime Contract

  • Reports must satisfy:

    • primary_metric.display_ci == exp(primary_metric.ci) (paired baseline path; ppl-like kinds).
    • dataset.windows.stats.paired_delta_summary records {mean,std,degenerate} for the paired Δ distribution.
    • dataset.windows.stats.window_match_fraction == 1.0 and dataset.windows.stats.window_overlap_fraction == 0.0.
  • Runs abort in CI/Release profiles if the preview and final window counts differ or if pairing is incomplete (pairing fraction < 1.0).
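The contract items above could be checked along these lines. The sketch assumes the report is available as a nested dict whose keys mirror the dotted field paths, that both intervals are two‑element sequences, and that a small relative tolerance is acceptable for the floating‑point comparison; `check_contract` and the sample report are hypothetical.

```python
from math import exp, isclose

def check_contract(report, rel_tol=1e-9):
    """Return a list of violated contract items (an empty list means pass)."""
    pm = report["primary_metric"]
    stats = report["dataset"]["windows"]["stats"]
    problems = []

    # display_ci must be the exponentiated log-space interval.
    expected = [exp(x) for x in pm["ci"]]
    if not all(isclose(d, e, rel_tol=rel_tol)
               for d, e in zip(pm["display_ci"], expected)):
        problems.append("display_ci != exp(ci)")

    # paired_delta_summary must record mean, std and degenerate.
    summary = stats.get("paired_delta_summary", {})
    missing = {"mean", "std", "degenerate"} - set(summary)
    if missing:
        problems.append(f"paired_delta_summary missing {sorted(missing)}")

    if stats["window_match_fraction"] != 1.0:
        problems.append("window_match_fraction != 1.0")
    if stats["window_overlap_fraction"] != 0.0:
        problems.append("window_overlap_fraction != 0.0")
    return problems

# Hypothetical report fragment that satisfies every item.
report = {
    "primary_metric": {"ci": [-0.01, 0.05],
                       "display_ci": [exp(-0.01), exp(0.05)]},
    "dataset": {"windows": {"stats": {
        "paired_delta_summary": {"mean": 0.02, "std": 0.03, "degenerate": False},
        "window_match_fraction": 1.0,
        "window_overlap_fraction": 0.0,
    }}},
}
print(check_contract(report))
```

Returning the full list of violations, rather than stopping at the first, makes a failed run easier to diagnose.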

Observability

  • primary_metric.{preview,final} — supports preview→final drift checks for ppl-like kinds.
  • primary_metric.display_ci and primary_metric.ci — paired ΔlogNLL interval (check both log and exponentiated views).
  • dataset.windows.stats.{window_match_fraction,window_overlap_fraction,paired_windows}.
  • dataset.windows.stats.paired_delta_summary.{mean,std,degenerate} and dataset.windows.stats.bootstrap.{replicates,seed}.
  • dataset.windows.stats.coverage.{preview,final} — confirms both arms honour window/coverage minima.

Edge cases & safeguards

  • If all t_i are equal, the weighted mean reduces to the simple mean, so the implementation can short‑circuit.
  • Degenerate Δ (all values equal): mark degenerate=true and collapse the CI to [μ, μ] with μ = mean(Δ); the report records the fallback.
  • Label-alignment and padding positions must not contribute to t_i; masked tokens are excluded from the count.
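The three safeguards might look like this in code. All helper names are hypothetical, and the `ignore_index=-100` default is only an assumption borrowed from the common label-masking convention; the actual mask value depends on the data pipeline.

```python
def count_tokens(labels, ignore_index=-100):
    """t_i: number of label positions that are not masked out."""
    return sum(1 for label in labels if label != ignore_index)

def mean_delta(deltas, tokens):
    """Token-weighted mean, short-circuiting when every t_i is equal."""
    if len(set(tokens)) == 1:
        return sum(deltas) / len(deltas)
    return sum(t * d for t, d in zip(tokens, deltas)) / sum(tokens)

def collapse_if_degenerate(deltas):
    """Return (degenerate, ci); ci is (mu, mu) when all deltas are equal."""
    if max(deltas) == min(deltas):
        mu = sum(deltas) / len(deltas)
        return True, (mu, mu)
    return False, None

# Hypothetical inputs.
print(count_tokens([17, -100, 42, -100]))         # masked positions skipped
print(mean_delta([0.1, 0.3], [4, 4]))             # equal t_i: simple mean
print(collapse_if_degenerate([0.1, 0.1, 0.1]))    # degenerate case
print(collapse_if_degenerate([0.1, 0.2]))         # regular case
```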
