Tier Policy Tuning CLI (Calibration)

Scope note: this page covers Tier Policy Tuning via invarlock calibrate ..., which outputs tiers_patch_*.yaml recommendations for runtime/tiers.yaml. For proof-pack run-scoped preset derivation (CALIBRATION_RUN -> GENERATE_PRESET), see Proof Pack Internals.

Overview

| Aspect | Details |
| --- | --- |
| Purpose | Run policy-tuning sweeps to empirically derive guard thresholds and tier policy recommendations. |
| Audience | Operators recalibrating tier policies for new model families or updated guard contracts. |
| Primary commands | invarlock calibrate null-sweep, invarlock calibrate ve-sweep. |
| Requires | invarlock[hf] for HF workflows; base config YAML for each sweep type. |
| Network | Offline by default; enable per command with INVARLOCK_ALLOW_NETWORK=1. |
| Source of truth | src/invarlock/cli/commands/calibrate.py, src/invarlock/calibration/. |

Quick Start

# Run spectral null-sweep (noop edit) to calibrate κ/alpha
invarlock calibrate null-sweep \
  --config configs/calibration/null_sweep_ci.yaml \
  --out reports/calibration/null_sweep \
  --tier balanced --tier conservative \
  --n-seeds 10

# Run VE sweep (quant_rtn edit) to calibrate min_effect_lognll
invarlock calibrate ve-sweep \
  --config configs/calibration/rmt_ve_sweep_ci.yaml \
  --out reports/calibration/ve_sweep \
  --tier balanced --tier conservative \
  --n-seeds 10

Concepts

  • Policy-tuning sweeps: Run multiple seeds/tiers to build empirical distributions for threshold recommendations.
  • Null sweep: Uses a no-op edit to measure baseline spectral behavior and derive false-positive-controlled κ caps and α levels.
  • VE sweep: Uses a real edit (e.g., quant_rtn) to measure the variance guard's predictive-gate behavior and recommend min_effect_lognll.
  • Artifacts: Each sweep emits JSON (machine), CSV (spreadsheet), Markdown (human), and a tiers_patch_*.yaml recommendation file.
  • Artifact contract: The artifact file names listed on this page are treated as stable public outputs and may be consumed directly by verification, review, and policy-pack workflows.
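
To make the idea of a false-positive-controlled cap concrete, here is a minimal sketch that picks a cap from the empirical distribution of per-run maximum z-scores under the null. It is not taken from src/invarlock/calibration/: the function name recommend_cap, the nearest-rank quantile rule, and the multiplicative use of the safety margin are all assumptions made for illustration.

```python
import math


def recommend_cap(max_z_scores, target_rate=0.01, safety_margin=0.05):
    """Hypothetical helper: choose a cap that about `target_rate` of null runs exceed.

    Assumption: the cap is the nearest-rank (1 - target_rate) quantile of the
    per-run max z-scores, inflated multiplicatively by `safety_margin`.
    """
    if not max_z_scores:
        raise ValueError("need at least one null run")
    if not 0.0 <= target_rate < 1.0:
        raise ValueError("target_rate must be in [0, 1)")

    ordered = sorted(max_z_scores)
    # Nearest-rank quantile: smallest value with >= (1 - target_rate) of runs at or below it.
    rank = math.ceil((1.0 - target_rate) * len(ordered))
    rank = min(max(rank, 1), len(ordered))
    quantile = ordered[rank - 1]
    # Widen the cap by the safety margin so borderline null runs stay below it.
    return quantile * (1.0 + safety_margin)
```

With only a handful of seeds, a small target rate resolves to the largest observed value, which is one reason a sweep with more seeds gives a more informative recommendation.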

Published Basis vs Shipped Configs

The published assurance basis currently covers GPT-2 and BERT profiles. The repo also ships pilot calibration configs for additional families such as Mistral 7B and Qwen2 7B under configs/calibration/, but those configs are not part of the published assurance basis until supporting artifacts are attached.

Policy-Tuning Sweep → Tier Policy Flow

  ┌──────────────────┐
  │ Base Config YAML │
  └────────┬─────────┘
           │
           ▼
  ┌──────────────────┐
  │ policy tuning CLI│
  │ (null/ve sweep)  │
  └────────┬─────────┘
           │
           ▼
  ┌──────────────────┐
  │ Per-seed reports │
  │ (runs/<tier>/...)│
  └────────┬─────────┘
           │
           ▼
  ┌──────────────────┐      ┌─────────────────────┐
  │ Sweep artifacts  │ ───► │ tiers_patch_*.yaml  │
  │ (JSON/CSV/MD)    │      │ (copy → tiers.yaml) │
  └──────────────────┘      └─────────────────────┘

Reference

Policy-Tuning Commands

| Command | Purpose | Key outputs |
| --- | --- | --- |
| invarlock calibrate null-sweep | Calibrate spectral κ/alpha from null (noop) runs. | null_sweep_report.json, tiers_patch_spectral_null.yaml |
| invarlock calibrate ve-sweep | Calibrate VE min_effect_lognll from real edit runs. | ve_sweep_report.json, tiers_patch_variance_ve.yaml |

null-sweep

Runs a null (no-op edit) sweep and calibrates spectral κ/alpha empirically.

Usage: invarlock calibrate null-sweep --config <CONFIG> --out <OUT> [options]

| Option | Default | Description |
| --- | --- | --- |
| --config | configs/calibration/null_sweep_ci.yaml | Base null-sweep YAML (noop edit). |
| --out | reports/calibration/null_sweep | Output directory for calibration artifacts. |
| --tier | All tiers | Tier(s) to evaluate (repeatable). |
| --seed | --seed-start + range | Seed(s) to run (repeatable). Overrides --n-seeds/--seed-start. |
| --n-seeds | 10 | Number of seeds to run. |
| --seed-start | 42 | Starting seed. |
| --profile | ci | Run profile (ci, release, ci_cpu, dev). |
| --device | Auto | Device override. |
| --safety-margin | 0.05 | Safety margin applied to κ recommendations. |
| --target-any-warning-rate | 0.01 | Target run-level spectral warning rate under the null. |
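
The interaction between --seed, --n-seeds, and --seed-start can be sketched as follows. This is an illustration of the documented precedence, not the CLI's actual code: the helper name resolve_seeds is hypothetical, and it assumes the default seeds are consecutive integers starting at --seed-start.

```python
def resolve_seeds(explicit_seeds=None, n_seeds=10, seed_start=42):
    """Hypothetical helper mirroring the documented seed options.

    Explicit --seed values win; otherwise a consecutive range is assumed.
    """
    if explicit_seeds:
        # --seed overrides --n-seeds/--seed-start.
        return list(explicit_seeds)
    if n_seeds < 1:
        raise ValueError("n_seeds must be at least 1")
    # Assumed default: seed_start, seed_start + 1, ..., seed_start + n_seeds - 1.
    return list(range(seed_start, seed_start + n_seeds))
```

Under these assumptions the defaults yield seeds 42 through 51, and passing explicit seeds ignores the other two options entirely.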

Outputs:

  • null_sweep_report.json — Machine-readable sweep summary with per-tier recommendations.
  • null_sweep_runs.csv — Per-run metrics (max z-scores, candidate counts, etc.).
  • null_sweep_summary.md — Human-readable Markdown summary.
  • tiers_patch_spectral_null.yaml — Recommended spectral_guard settings for tiers.yaml.

ve-sweep

Runs VE predictive-gate sweeps and recommends min_effect_lognll per tier.

Usage: invarlock calibrate ve-sweep --config <CONFIG> --out <OUT> [options]

| Option | Default | Description |
| --- | --- | --- |
| --config | configs/calibration/rmt_ve_sweep_ci.yaml | Base VE sweep YAML (quant_rtn edit). |
| --out | reports/calibration/ve_sweep | Output directory for calibration artifacts. |
| --tier | All tiers | Tier(s) to evaluate (repeatable). |
| --seed | --seed-start + range | Seed(s) to run (repeatable). Overrides --n-seeds/--seed-start. |
| --n-seeds | 10 | Number of seeds to run. |
| --seed-start | 42 | Starting seed. |
| --window | 6, 8, 12, 16 | Variance calibration window counts (repeatable). |
| --target-enable-rate | 0.05 | Target expected VE enable rate (predictive-gate lower bound). |
| --profile | ci | Run profile (ci, release, ci_cpu, dev). |
| --device | Auto | Device override. |
| --safety-margin | 0.0 | Safety margin applied to min_effect recommendations. |

Outputs:

  • ve_sweep_report.json — Machine-readable sweep summary with per-tier recommendations.
  • ve_sweep_runs.csv — Per-run metrics (predictive gate deltas, CI widths, etc.).
  • ve_power_curve.csv — Mean CI width per (tier, windows) for power analysis.
  • ve_sweep_summary.md — Human-readable Markdown summary.
  • tiers_patch_variance_ve.yaml — Recommended variance_guard settings for tiers.yaml.
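
Because the artifact names are a stable contract, a downstream workflow can check that a sweep produced everything it needs before consuming the results. The sketch below uses the file names listed on this page; the helper name missing_artifacts is hypothetical, and it assumes the artifacts are written directly into the --out directory rather than a subdirectory.

```python
from pathlib import Path

# File names as documented for each sweep type.
EXPECTED_ARTIFACTS = {
    "null-sweep": [
        "null_sweep_report.json",
        "null_sweep_runs.csv",
        "null_sweep_summary.md",
        "tiers_patch_spectral_null.yaml",
    ],
    "ve-sweep": [
        "ve_sweep_report.json",
        "ve_sweep_runs.csv",
        "ve_power_curve.csv",
        "ve_sweep_summary.md",
        "tiers_patch_variance_ve.yaml",
    ],
}


def missing_artifacts(out_dir, sweep):
    """Return the expected artifact names not found in `out_dir`.

    Assumption: artifacts sit at the top level of the --out directory.
    """
    root = Path(out_dir)
    return [name for name in EXPECTED_ARTIFACTS[sweep] if not (root / name).is_file()]
```

An empty list means every documented artifact for that sweep type is present, for example missing_artifacts("reports/calibration/ve_sweep", "ve-sweep").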

Applying recommendations

After a sweep, merge the relevant tiers_patch_*.yaml file into your runtime/tiers.yaml:

# Review recommendations
cat reports/calibration/null_sweep/tiers_patch_spectral_null.yaml

# Merge into tiers.yaml (manual review recommended)
# The patch contains only the keys being updated:
#   balanced:
#     spectral_guard:
#       family_caps: { ... }
#       multiple_testing: { alpha: ... }
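
Since the patch contains only the keys being updated, applying it amounts to a recursive merge that leaves every other key in tiers.yaml untouched. The following is a minimal sketch of that merge on already-parsed mappings. The helper name merge_patch is hypothetical, loading and writing the YAML files is left to whichever parser you use, and merging nested mappings key by key (rather than replacing them wholesale) is an assumption about the intended semantics.

```python
import copy


def merge_patch(tiers, patch):
    """Hypothetical helper: return a copy of `tiers` with `patch` merged in recursively.

    Nested mappings are merged key by key; any other value in the patch
    replaces the existing value. Neither input is modified.
    """
    merged = copy.deepcopy(tiers)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend so that sibling keys absent from the patch are preserved.
            merged[key] = merge_patch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

Returning a new mapping instead of mutating in place makes it easy to diff the result against the current tiers.yaml during the manual review step.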

Troubleshooting

  • Missing config files: Ensure calibration configs exist under configs/calibration/.
  • Sweep failures: Check individual run reports under <out>/runs/<tier>/seed_*.
  • Unexpected recommendations: Review --safety-margin and the target rate option (--target-any-warning-rate for null-sweep, --target-enable-rate for ve-sweep).
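
When a sweep fails partway, it helps to see which per-run directories exist for each tier. The sketch below walks the <out>/runs/<tier>/seed_* layout described above; the helper name list_run_dirs is hypothetical, and nothing is assumed about the files inside each run directory.

```python
from pathlib import Path


def list_run_dirs(out_dir):
    """Hypothetical helper: map each tier name to its sorted seed_* run paths."""
    runs_root = Path(out_dir) / "runs"
    if not runs_root.is_dir():
        # Nothing was written yet, or the output path is wrong.
        return {}
    found = {}
    for tier_dir in sorted(p for p in runs_root.iterdir() if p.is_dir()):
        found[tier_dir.name] = sorted(str(p) for p in tier_dir.glob("seed_*"))
    return found
```

Comparing the number of entries per tier against the number of seeds you requested is a quick way to spot which tier or seed stopped early.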

Observability

  • Sweep artifacts include full provenance (config, profile, tiers, run count).
  • Per-run reports are preserved under <out>/runs/ for debugging.
  • Power curves (VE sweep) help assess sample size requirements.