Tier v1.0 Calibration (Pilot + Method)
Plain language: This appendix has two roles: (1) the pilot numbers we measured for GPT-2 small and BERT base (Nov 2025) that underpin the Balanced and Conservative tiers; and (2) the exact recipe to recalibrate from scratch on your setup (weight-based Spectral κ, activation-based RMT ε, VE min-effect, and window sizing). Every knob is surfaced in run reports so reviewers can audit or recompute.
For a key-by-key explanation of every value in the packaged tier file (`runtime/tiers.yaml`), see Tier Policy Catalog.
Spectral κ (z-caps) — Targets and Method
What the tier ships with (pilot)
- Balanced per-family κ caps: `ffn: 3.849`, `attn: 3.018`, `embed: 1.05`, `other: 0.0`, with Benjamini–Hochberg (BH) FDR control (α=0.05, m=4 families), deadband δ=0.10, scope: all 2-D weight matrices (LayerNorm excluded), no absolute clamp, and per-run WARN budget `max_caps = 5`.
- Conservative tightens caps and budget: `ffn: 3.849`, `attn: 2.6`, `embed: 2.8`, `other: 2.8`, Bonferroni (α=0.000625), and `max_caps = 3`.
Runtime visibility. Reports record per-family WARNs and effective caps under `spectral.*` (`summary`, `multiple_testing`, `families`, `family_caps`) and the resolved policy under `resolved_policy.spectral`.
Window Minima Rationale (counts/power)
- The CI profile targets 200×200 non-overlapping, paired windows with ≈ 1.2k BCa replicates. The Release profile targets 400×400 with ≈ 3.2k replicates. These counts follow a half-width sizing rule on the paired Δlog-loss CI (power ≈ 50% at the boundary for the chosen `min_effect_lognll`), verified on pilot runs.
- Release evidence must meet the requested counts; runs that under-cover preview/final windows or bootstrap replicates fail evaluation in CI/Release profiles (see Coverage & Pairing Plan).
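The half-width sizing rule can be sketched as follows. This is a normal-approximation sketch, not the packaged implementation: the function name, the 95% default, and the illustrative stdev are assumptions (a BCa bootstrap interval will differ slightly at the tails).

```python
from math import ceil
from statistics import NormalDist

def windows_for_half_width(stdev: float, half_width: float, conf: float = 0.95) -> int:
    """Paired windows needed so the CI half-width on mean delta log-loss
    is ~half_width. Power is ~50% when the true effect sits exactly at
    the boundary, matching the sizing rule described in the text."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.96 for a 95% interval
    return ceil((z * stdev / half_width) ** 2)

# Illustrative: per-window delta log-loss stdev 0.02, target half-width 0.001
n = windows_for_half_width(0.02, 0.001)
```

Note the quadratic cost: halving the target half-width quadruples the required window count, which is why the CI profile accepts a looser interval than Release.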
Spectral calibration provenance. Aggregated null-run stats are derived from calibration runs. Local tooling can parse evaluation report JSON files (glob pattern `**/evaluation.report.json`) to extract per-family z-scores and compute summary statistics (mean, stdev, quantiles). Persist results in CSV format for reproducibility and attach calibration reports to change proposals.
How to recalibrate κ on your machine (budget-aware)
Key idea. Keep the budget `max_caps` fixed (e.g., 5 for Balanced); tune per-family κ so a clean baseline produces ≤ that many WARNs per run under BH. Do not enable an absolute clamp in Balanced.
1. Gather per-module |z| by family. From a baseline run, collect spectral z-scores for each 2-D weight in family f. (Tip: ensure the guard emits `final_z_scores` so you have module-level |z|.)
2. Allocate the WARN budget across families. Let m_f be the module count in family f and M = Σ_f m_f the total across families. With budget B (Balanced: 5), assign B(f) = ⌊B · m_f / M + 0.5⌋.
3. Order-statistic recipe (recommended). Sort the |z| values in family f in descending order; set κ(f) to the B(f)-th largest |z| plus a small safety margin (e.g., 0.1) for robustness.
4. Parametric alternative. With a two-sided normal tail and target exceedance rate B(f)/m_f, set κ(f) = Φ⁻¹(1 − B(f)/(2·m_f)), then add the same small margin.
5. Keep these fixed (Balanced): `multiple_testing: {method: bh, alpha: 0.05, m: 4}`, `deadband: 0.10`, `scope: all`, `max_caps: 5`, `max_spectral_norm: null`.
Spectral is weight-based. z-tails are driven by weights, not evaluation windows; changing dataset seeds/windows does not move |z|. Prefer pooling per-module z across related baselines (e.g., 1B/3B/7B) rather than re-sampling windows.
Worked Example: Recalibrating Spectral κ for a Custom GPT-2 Run
Suppose you ran a baseline and extracted z-scores from the report:
```shell
# 1. Run baseline
invarlock run -c configs/presets/causal_lm/wikitext2_512.yaml \
  --profile ci --tier balanced --out runs/baseline_calib

# 2. Extract z-scores (example using jq)
jq '.guards[] | select(.name == "spectral") | .metrics.final_z_scores' \
  runs/baseline_calib/*/report.json > z_scores.json
```
With 120 total modules distributed as: FFN=40, Attn=40, Embed=8, Other=32.
Step-by-step κ calculation:
1. Allocate budget. With budget B=5 and M=120 total modules:
   - B(ffn) = ⌊5 × 40/120 + 0.5⌋ = ⌊2.17⌋ = 2
   - B(attn) = ⌊5 × 40/120 + 0.5⌋ = 2
   - B(embed) = ⌊5 × 8/120 + 0.5⌋ = ⌊0.83⌋ = 0, raised to 1 so the family keeps a slot
   - B(other) = ⌊5 × 32/120 + 0.5⌋ = ⌊1.83⌋ = 1
2. Sort |z| per family. Suppose FFN z-scores sorted descending are [2.1, 1.8, 1.6, 1.5, 1.4, 1.3, …].
3. Set κ using the order statistic. κ(ffn) = Z(ffn)^(B(ffn)) + margin = 1.8 + 0.1 = 1.9.
4. Repeat for the other families. If Attn's 2nd-largest |z| is 2.6, κ(attn) = 2.7.
5. Write a local override:

   ```yaml
   # configs/overrides/spectral_local.yaml
   guards:
     spectral:
       family_caps:
         ffn: 1.9
         attn: 2.7
         embed: 1.5
         other: 1.2
   ```

6. Re-run with the override:

   ```shell
   invarlock run -c configs/presets/causal_lm/wikitext2_512.yaml \
     -c configs/overrides/spectral_local.yaml \
     --profile ci --tier balanced
   ```

7. Verify. Check `report.guards[spectral].metrics.warnings_count ≤ 5` on clean baselines.
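The allocation and order-statistic steps above can be sketched in Python. The function names, the minimum-of-1 floor for tiny families, and the 0.1 margin default are assumptions chosen to reproduce the worked numbers, not a packaged API:

```python
from math import floor

def allocate_budget(counts: dict[str, int], budget: int) -> dict[str, int]:
    """Round-half-up share of the WARN budget per family, floored at 1
    so tiny families keep a slot (assumption matching the worked example)."""
    total = sum(counts.values())
    return {fam: max(1, floor(budget * m / total + 0.5))
            for fam, m in counts.items()}

def kappa_from_order_stat(abs_z: list[float], b: int, margin: float = 0.1) -> float:
    """Cap at the b-th largest |z| plus a small safety margin."""
    z_sorted = sorted(abs_z, reverse=True)
    return z_sorted[min(b, len(z_sorted)) - 1] + margin

counts = {"ffn": 40, "attn": 40, "embed": 8, "other": 32}
alloc = allocate_budget(counts, budget=5)
kappa_ffn = kappa_from_order_stat([2.1, 1.8, 1.6, 1.5, 1.4, 1.3], alloc["ffn"])  # ≈ 1.9
```

Note the floor means allocations can sum slightly above the nominal budget (here 6 vs. 5); the verify step against `max_caps` on clean baselines is what ultimately decides whether the caps are tight enough.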
RMT ε (acceptance bands)
What the tier ships with (pilot)
- Balanced ε per family: `{ffn: 0.01, attn: 0.01, embed: 0.01, other: 0.01}`
- Conservative: `{ffn: 0.01, attn: 0.01, embed: 0.01, other: 0.01}`
Acceptance rule per family f: with baseline edge-risk r_base(f) and current edge-risk r_cur(f), require r_cur(f) ≤ (1 + ε(f)) · r_base(f).
Runtime visibility. Report fields under `rmt.*` record baseline/current edge-risk, ε (default and by family), status, and `validation.rmt_stable`.
RMT calibration provenance. Aggregated null-run stats are derived from calibration reports. Local tooling can parse report JSON files to extract `rmt.families.*.{edge_base,edge_cur,delta}` per family, and report quantile summaries of Δ(f) = r_cur(f)/r_base(f) − 1 (skip cases with missing or zero baseline).
How to recalibrate ε
- Run null baselines (no edit) and compute per-family deltas Δ(f) = r_cur(f)/r_base(f) − 1 (skip cases with r_base(f) = 0).
- Set ε(f) to the q95–q99 quantile of the null Δ(f) distribution.
- Use a slightly larger ε for tiny families (discreteness: small module counts m_f matter).
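The ε recipe can be sketched as follows; the field names mirror the provenance note above, and the nearest-rank quantile and helper names are assumptions for illustration:

```python
def null_deltas(families: list[dict]) -> list[float]:
    """Per-family delta = edge_cur/edge_base - 1 from null (no-edit) runs,
    skipping entries with a missing or zero baseline edge-risk."""
    out = []
    for fam in families:
        base, cur = fam.get("edge_base"), fam.get("edge_cur")
        if base:  # falsy covers both missing and zero baselines
            out.append(cur / base - 1.0)
    return out

def epsilon_from_null(deltas: list[float], q: float = 0.95) -> float:
    """Nearest-rank q-quantile of the null delta distribution;
    pick q in [0.95, 0.99] per the recipe, larger for tiny families."""
    z = sorted(deltas)
    idx = max(0, min(len(z) - 1, round(q * len(z)) - 1))
    return z[idx]
```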
Variance Equalization (VE) — minimum effect
What the tier ships with (pilot)
- Balanced (one-sided, improvement-only): `min_effect_lognll = 0.0`
- Conservative (two-sided, improvement-only): `min_effect_lognll = 0.016`
Runtime visibility. Recorded in reports under `variance.predictive_gate` (CI, mean Δ, pass/fail reason) and under `resolved_policy.variance.{predictive_one_sided,min_effect_lognll}` (tier knobs).
VE calibration provenance. Summary stats are derived from calibration reports. Local tooling can parse report JSON files to extract `variance.predictive_gate.{delta_ci,mean_delta}` and compute the paired Δ standard deviation across runs.
How to recalibrate min-effect
For paired ΔlogNLL with per-window stdev s over n windows, size the minimum effect as min_effect_lognll ≈ z · s/√n, with Balanced using the one-sided z (≈ 1.645 at α = 0.05) and Conservative the two-sided z (≈ 1.96). VE enables only if the predictive CI upper bound ≤ −min_effect_lognll and the mean Δ ≤ −min_effect_lognll; a CI entirely above +min_effect_lognll is treated as regression (VE stays off).
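A sketch of this sizing, assuming the z · s/√n form (the packaged rule may differ in detail); the function name and illustrative inputs are assumptions:

```python
from math import sqrt
from statistics import NormalDist

def min_effect_lognll(stdev: float, n_windows: int,
                      alpha: float = 0.05, two_sided: bool = False) -> float:
    """Smallest paired delta-logNLL distinguishable from noise at level
    alpha: z * s / sqrt(n). One-sided for Balanced, two-sided for
    Conservative per the text above."""
    z = NormalDist().inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    return z * stdev / sqrt(n_windows)

# Illustrative: per-window stdev 0.05 over 200 paired windows
balanced = min_effect_lognll(0.05, 200)
conservative = min_effect_lognll(0.05, 200, two_sided=True)
```

As expected, the two-sided (Conservative) threshold is strictly larger than the one-sided (Balanced) one for the same stdev and window count.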
Evaluation window sizing (coverage)
Pick preview/final counts so the BCa half-width on ΔlogNLL is within target:
- Balanced pilot target: ±0.001 on GPT-2 release profile (CI profile uses fewer windows).
- Sweep to find the “coverage vs cost” knee; enforce non-overlap (`stride = seq_len`) and reuse baseline window IDs for perfect pairing.
Window sizing provenance. Window counts are controlled by the selected runtime profile (`--profile ...`), defined under `src/invarlock/_data/runtime/profiles/`. Repo-only runnable presets under `configs/presets/` set small defaults for unprofiled runs.
Runtime visibility. Reports expose window counts, coverage flags, and CI digests under `dataset.windows.stats` and `primary_metric`.
“Fast path” recalibration (summary)
- Baseline (release, Balanced). Run once and collect `final_z_scores`.
- Spectral κ. Allocate the budget B per family (B(f) = ⌊B · m_f/M + 0.5⌋); compute κ(f) via the order statistic (or parametric alternative) plus margin; keep BH, deadband, scope, `max_caps`, and no clamp.
- RMT ε. From null runs, set ε(f) to the q95–q99 quantile of Δ(f) per family (adjust for small m_f).
- VE min-effect. ≈ z · s/√n with tier-appropriate sidedness.
- Windows. Size to hit the half-width target; enforce non-overlap and pairing.
- Trial via override. Write calibrated values to a local override YAML (e.g., `configs/overrides/spectral_balanced_local.example.yaml`, copied locally and edited) and merge it into a local run preset under `guards:` instead of editing the global tier. Re-run baseline + edits; pre-screen gates; then build reports.
Note. These pilot numbers are defaults. Teams are encouraged to re-run calibration on their models/datasets/hardware and attach the resulting reports and summary statistics to change proposals. The report fields make such updates auditable end-to-end.
See Also
- Tier Policy Catalog — Policy keys and where they appear in reports
- Guards Reference — Guard configuration options
References
- Benjamini, Y., & Hochberg, Y. (1995). “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x