Null Sweeps as Threshold Derivation, Not Tuning Folklore
Thresholds are stronger when they come from measured null behavior and end in a policy patch, not from knob-tuning folklore.
Research Note: thresholds should be measured before they are defended
Highlights
- A null sweep is a no-op edit run that measures clean spectral behavior under a declared target run-level warning rate.
- Outputs: JSON + CSV (machine), Markdown (human), and
tiers_patch_spectral_null.yaml(reviewable policy proposal). - The patch lands as calibrated
family_caps.*andmultiple_testing.alphakeys; floors, deadbands, and other caps remain explicit policy choices.
A lot of threshold tuning in machine learning is socially legible but methodologically weak. Someone adjusts a knob until the warnings "look about right," perhaps on a small pilot, then the result hardens into a default without a clear derivation story.
InvarLock's null-sweep surface goes further than that.
The public calibration docs describe a specific workflow: run no-op sweeps, collect empirical behavior under the null, summarize the resulting distribution, and emit a policy patch that can be reviewed before it is merged into tier defaults. That is a better story than folklore because it makes threshold setting observable and contestable.
What A Null Sweep Actually Measures
The calibration CLI reference is concrete here. A null sweep is not a vague "baseline test." It is a no-op edit sweep intended to measure baseline spectral behavior and derive false-positive-controlled kappa caps and alpha settings.
That distinction matters. The goal is not to show that nothing ever warns under clean conditions. The goal is to measure what warning behavior looks like under clean conditions, then choose policy values that keep false positives under a declared budget.
The CLI makes that budget visible too. Null sweeps expose a target run-level warning rate instead of forcing readers to infer the desired false-positive behavior from a chart after the fact.
This is why the calibration note focuses on null runs and target warning rates instead of only on downstream edited-model outcomes.
Why Null Behavior Is The Right Source For Spectral Thresholds
The tier-calibration assurance note makes the logic explicit: keep the warning budget fixed, measure clean baseline behavior, and derive per-family kappa thresholds against that budget. In other words, the null is not decorative. It is the reference surface for deciding how much spectral instability should count as unusual.
That is a much more defensible posture than "these values felt reasonable on a few runs." It connects the threshold to an empirical error budget.
The null-sweep guarantee is deliberately narrow: it gives evidence about guard behavior under clean conditions. It does not prove the entire acceptance policy is globally optimal.
Why The Output Patch Matters
The strongest part of the calibration surface is that it does not end in prose.
The public calibration reference says each sweep emits JSON, CSV, Markdown, and a tiers_patch_*.yaml file. That patch is the key artifact because it connects the measured result to the actual policy surface operators use. The tier-policy catalog and guards reference explain why that matters: resolved guard behavior shows up in resolved_policy.* and downstream report evidence, so calibration is not just analysis. It is policy derivation.
A reviewed null-sweep patch is intentionally narrow. A representative patch only has to carry the keys under review:
balanced:
spectral_guard:
family_caps:
ffn: 2.35
attn: 2.10
multiple_testing:
alpha: 0.05
That machine-consumable patch is what keeps the story from collapsing into "trust the calibration author." Readers can inspect the proposed keys, understand where they land, and decide whether the recommendation deserves to be adopted.
Why This Is Better Than Threshold Folklore
Null sweeps do not make calibration perfect. They do make it legible.
A folklore threshold often has no stable artifact, no target error budget, and no clean way to revisit the recommendation later. A null-sweep threshold has a run recipe, explicit outputs, and a patch-shaped recommendation. That means someone else can argue with it in the right place: the evidence and the policy diff, not the author's intuition.
This is the value of calibration work. It is not that thresholds become unquestionable. It is that they become reviewable.
What Null Sweeps Still Do Not Tell You
The limitations matter.
Null sweeps are one part of the policy story. They help derive spectral thresholds from clean behavior, but they do not by themselves settle every guard choice, every family transfer, or every future window budget. The public docs are careful about this: published assurance basis is still narrower than the full runnable surface, and teams are encouraged to recalibrate on their own models, data, and hardware.
So the correct claim is not "null sweeps discover the right thresholds once and for all." It is smaller: null sweeps are a disciplined way to turn clean empirical behavior into reviewable threshold recommendations.
Claim Map
The practical path is:
- run null sweeps under a declared profile and tier
- collect machine-readable and human-readable calibration artifacts
- emit
tiers_patch_spectral_null.yaml - review the patch before merging it into tier policy
That is a far stronger calibration story than hand-tuning values until the warnings feel acceptable.
Limitations
- Null sweeps cover the spectral threshold surface; family transfers, profile changes, and window-budget shifts still warrant fresh sweeps.
- Patch keys are a recommendation, not a verdict — review is the part of calibration this post argues should not be skipped.
- Command syntax, emitted files, and merge guidance live in the calibration reference; this post is about why the workflow is shaped that way.
Sources
More in Research Note
Continue through nearby posts in the same reading thread.
Research Note
Variance Enablement Should Be Evidence-Gated
Variance equalization is stronger when it must earn enablement through predictive evidence, explicit tier knobs, and report-visible provenance.
Research Note
The Minimum Evidence Surface for Trustworthy Weight-Edit Results
A trustworthy weight-edit result needs more than a benchmark delta. It needs a bounded claim, an exactly paired comparison, and verification that rejects incomplete evidence.
Research Note
From Sweep Outputs to Tier Policy
Calibration becomes operational when sweep artifacts end in reviewable YAML patches that later appear as resolved runtime policy in reports.