Public Evidence Walkthrough

Purpose

This walkthrough covers the public evidence floor shipped with the repository: the artifacts reviewers can verify without downloading model weights. It is deliberately BYOE-oriented. InvarLock validates baseline/subject comparison artifacts for externally materialized subjects; producing deployable quantized checkpoints is outside this public evidence floor.

public_evidence/README.md defines the evidence taxonomy. In short, fixture artifacts validate verifier contracts, while real-run artifacts are produced by invarlock evaluate against materialized baseline and subject checkpoints. Every public evidence artifact carries evidence.meta.json so reviewers can see whether they are looking at a fixture or a real run.
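As a rough illustration of the fixture-versus-real-run check described above, the sketch below reads evidence.meta.json from an artifact directory and returns whatever kind it records. The key name `evidence_kind` and the helper `classify_evidence` are hypothetical; the actual schema is defined in public_evidence/README.md and may differ.

```python
import json
from pathlib import Path

# ASSUMPTION: "evidence_kind" is a hypothetical key name used for illustration.
# Consult public_evidence/README.md for the real evidence.meta.json schema.
KIND_KEY = "evidence_kind"


def classify_evidence(artifact_dir, kind_key=KIND_KEY):
    """Return the evidence kind recorded in evidence.meta.json, or None if absent."""
    meta_path = Path(artifact_dir) / "evidence.meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return meta.get(kind_key)
```

Passing `kind_key` explicitly makes it easy to point the helper at the real field once it has been confirmed against the README.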

Published-basis pass

The repository ships strict-pass public-basis examples for GPT-2-style causal LM and BERT-style masked LM lanes:

invarlock verify --profile release --assurance strict \
  public_evidence/published_basis/gpt2/evaluation.report.json

invarlock verify --profile release --assurance strict \
  public_evidence/published_basis/bert/evaluation.report.json

Each directory includes:

File	Role
`evaluation.report.json`	Canonical verifier input with primary metric, guard evidence, policy digest, and assurance section.
`runtime.manifest.json`	Container runtime provenance manifest bound to the report by SHA-256.
`evidence_pack_recipe.json`	Recipe pointer for rebuilding a full validation evidence pack.
`artifact_package/`	Checkpoint references, report/runtime paths, signed-pack path, and exact verifier commands.
`evidence_pack/`	Signed, checksum-bound public evidence pack that verifies under strict release policy (shipped for the GPT-2 lane).

The support matrix records these paths under contracts/support_matrix.json as the published_basis evidence floor.
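The table above says the runtime manifest is bound to the report by SHA-256. A minimal sketch of what such a binding check could look like follows. The field name `report_sha256` is an assumption made for illustration, not the documented manifest schema; invarlock advanced runtime-verify remains the supported way to inspect the binding.

```python
import hashlib
import json
from pathlib import Path


def report_matches_manifest(evidence_dir, digest_key="report_sha256"):
    """Compare the report's SHA-256 with the digest recorded in the manifest.

    ASSUMPTION: digest_key ("report_sha256") is a hypothetical field name.
    """
    evidence_dir = Path(evidence_dir)
    manifest = json.loads(
        (evidence_dir / "runtime.manifest.json").read_text(encoding="utf-8")
    )
    recorded = str(manifest.get(digest_key, "")).lower()
    # Tolerate an optional "sha256:" prefix on the recorded digest.
    if recorded.startswith("sha256:"):
        recorded = recorded[len("sha256:"):]
    report_bytes = (evidence_dir / "evaluation.report.json").read_bytes()
    actual = hashlib.sha256(report_bytes).hexdigest()
    return recorded != "" and recorded == actual
```

Any byte-level change to evaluation.report.json changes the digest, which is why the two files are meant to travel together.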

The GPT-2 artifact_package/ is intentionally a checkpoint-reference package, not a weight dump. It names the baseline and subject checkpoint references, binds them to the report, runtime manifest, and signed pack, and keeps the exact verification commands in artifact_package/artifact_package.json. Large model weights remain external to the repository; the rebuild recipe is the source of truth for materializing a fresh BYOE evidence drop.

The GPT-2 lane also ships a small signed pack so reviewers can exercise the full offline evidence-pack verifier without rebuilding the suite:

FPR=$(python - <<'PY'
import json
from pathlib import Path

manifest = json.loads(
    Path("public_evidence/published_basis/gpt2/evidence_pack/manifest.json")
    .read_text(encoding="utf-8")
)
print(manifest["signing_key_fingerprint"])
PY
)

invarlock advanced evidence-pack verify \
  public_evidence/published_basis/gpt2/evidence_pack \
  --strict \
  --profile release \
  --report-assurance strict \
  --expected-fingerprint "$FPR"

The expected pack result is ok=true with authenticity=pinned. Without --expected-fingerprint, the signature still confirms integrity but not signer authenticity.
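The distinction between integrity and signer authenticity can be sketched as a small decision function. This is a conceptual model only, not the verifier's implementation: `authenticity_status` is a hypothetical helper, and every label other than `pinned` is invented for illustration.

```python
def authenticity_status(signature_valid, pack_fingerprint, expected_fingerprint=None):
    """Model how a pinned fingerprint upgrades an integrity check.

    ASSUMPTION: only the "pinned" label comes from the walkthrough; "invalid",
    "unpinned", and "mismatch" are hypothetical labels.
    """
    if not signature_valid:
        # The pack contents do not match the signature at all.
        return {"integrity": False, "authenticity": "invalid"}
    if expected_fingerprint is None:
        # The signature checks out, but nothing says who the signer should be.
        return {"integrity": True, "authenticity": "unpinned"}
    if pack_fingerprint == expected_fingerprint:
        return {"integrity": True, "authenticity": "pinned"}
    # Valid signature, but from a key other than the one the reviewer expected.
    return {"integrity": True, "authenticity": "mismatch"}
```

One consequence of this model: a fingerprint read from the pack's own manifest can only show that the pack is self-consistent. Pinning is presumably more meaningful when the expected value comes from a separate trusted source.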

Real model runs

The repository includes small real runs generated by the CLI on GPT-2-family checkpoints. They verify with --profile release and --assurance strict, and each ships a signed evidence pack.

The first run uses sshleifer/tiny-gpt2 as both baseline and subject, then applies the built-in quant_rtn edit, an RTN (round-to-nearest) dequantized weight-edit simulation:

uv run invarlock verify \
  public_evidence/real_runs/tiny_gpt2_quant_rtn/evaluation.report.json \
  --profile release \
  --assurance strict

uv run invarlock advanced evidence-pack verify \
  public_evidence/real_runs/tiny_gpt2_quant_rtn/evidence_pack \
  --strict \
  --profile release \
  --report-assurance strict \
  --expected-fingerprint sha256:cc17b2af6579f5de01e74d91e93528b04670ff89f907ec3ba786a69065435605

The exact invarlock evaluate command is in public_evidence/real_runs/tiny_gpt2_quant_rtn/run_command.txt. The pack remains small enough for the repo because it references model checkpoints rather than vendoring weights. The built-in edit remains a demo/smoke edit, not a deployable quantization backend.
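For readers unfamiliar with round-to-nearest edits, here is a generic sketch of a quantize-then-dequantize simulation on a flat list of weights. It assumes symmetric per-tensor scaling and is not the quant_rtn implementation; the function name and defaults are illustrative.

```python
def rtn_quant_dequant(weights, bits=8):
    """Round each weight to the nearest point on a symmetric integer grid,
    then map it back to floating point (weights stay dense floats).

    ASSUMPTION: symmetric per-tensor scaling; the built-in edit may differ.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / qmax
    dequantized = []
    for w in weights:
        # Python's round() resolves exact ties to the even integer.
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))  # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized
```

Because the output is still floating point, an edit of this shape changes model behavior without producing packed quantized storage, which fits the description of a demo/smoke edit.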

The second run is a real external BYOE path. The subject checkpoint is materialized outside InvarLock by public_evidence/real_runs/tiny_gpt2_external_magnitude_prune/external_edit_recipe.py, then consumed by invarlock evaluate with --edit-label custom:

uv run invarlock verify \
  public_evidence/real_runs/tiny_gpt2_external_magnitude_prune/evaluation.report.json \
  --profile release \
  --assurance strict

uv run invarlock advanced evidence-pack verify \
  public_evidence/real_runs/tiny_gpt2_external_magnitude_prune/evidence_pack \
  --strict \
  --profile release \
  --report-assurance strict \
  --expected-fingerprint sha256:e01c40a94c89b22306a2670b032f623aa5428351d06e18f9b3e9e6a39b42c41b

That artifact is the concrete real-run evidence for BYOE/custom subjects: the checkpoint weights are not vendored, checkpoint_refs.json records the external edit type and file hashes, and the report records edit_name = custom.
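Since the weights are not vendored, a reviewer who rematerializes the subject checkpoint may want to confirm it matches the recorded file hashes. The sketch below assumes the hashes have already been extracted into a mapping of relative file name to hex SHA-256; the layout of checkpoint_refs.json is not specified here, so that extraction step and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_stream(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def find_hash_mismatches(checkpoint_dir, expected_hashes):
    """Return the sorted file names that are missing or whose hash differs.

    ASSUMPTION: expected_hashes maps relative file name -> hex SHA-256; how
    to derive it from checkpoint_refs.json is left to the reader.
    """
    root = Path(checkpoint_dir)
    problems = []
    for name, expected in sorted(expected_hashes.items()):
        path = root / name
        if not path.is_file() or sha256_stream(path) != expected.lower():
            problems.append(name)
    return problems
```

An empty result means every listed file is present with the expected digest.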

BYOE edit examples

The repository also ships small strict-verifiable BYOE examples for multiple external edit workflows. These fixtures make the verifier boundary explicit: the subject checkpoint is an external reference, plugins.edits is empty, and the report is verified as a baseline-vs-subject comparison rather than as output from a built-in edit plugin.

invarlock verify --profile release --assurance strict \
  public_evidence/byoe_examples/magnitude_prune_byoe/evaluation.report.json

invarlock verify --profile release --assurance strict \
  public_evidence/byoe_examples/lora_merge_byoe/evaluation.report.json

Each example includes checkpoint_refs.json beside the report. The pruning fixture is a dense magnitude-pruned subject reference, and the LoRA fixture is a merged-adapter/fine-tune style subject reference. Both are validation-subject fixtures only; sparse runtime speedups, packed quantized storage, and deployable optimized backend behavior are outside their scope.
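To make the two external edit styles concrete, the following sketch shows generic versions of dense magnitude pruning and a LoRA merge on plain Python lists. These are textbook forms written for illustration, not the recipes behind the fixtures; the function names, the `sparsity` parameter, and the `alpha / r` scaling convention are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights, keeping the dense shape."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def lora_merge(W, A, B, alpha):
    """Fold a low-rank adapter into a base matrix: W + (alpha / r) * B @ A.

    W is d_out x d_in, B is d_out x r, A is r x d_in (nested lists).
    ASSUMPTION: the common alpha / r scaling; other conventions exist.
    """
    r = len(A)
    scale = alpha / r
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]
```

In both cases the result has the same shape as the input, so the subject is an ordinary dense checkpoint. That is why sparse runtime speedups are a separate concern from what these fixtures validate.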

Caught regressions

The caught-regression fixtures keep the naive primary metric acceptable (ratio_vs_baseline = 1.0) while one guard fails. They cover the three non-primary guard families exposed in strict verification:

invarlock verify --profile release --assurance strict \
  public_evidence/caught_regressions/spectral_guard_failure/evaluation.report.json

invarlock verify --profile release --assurance strict \
  public_evidence/caught_regressions/rmt_guard_failure/evaluation.report.json

invarlock verify --profile release --assurance strict \
  public_evidence/caught_regressions/variance_guard_failure/evaluation.report.json

Expected outcome: verification fails. The primary metric (perplexity) is not the cause; a guard/policy predicate is. For the spectral case, the verifier reports:

Release verification requires validation.spectral_stable == true
spectral did not pass

That is the intended strict-verification behavior: guard stability is required even when the summary metric is clean.
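The gating idea can be sketched as a function that ignores the primary metric and reports which required validation flags are not true. Only `validation.spectral_stable` comes from the verifier message above; the helper name, the `primary_metric` nesting in the usage example, and any other flag names a caller passes are hypothetical.

```python
def strict_release_failures(report, required_flags=("spectral_stable",)):
    """List the validation flags that are not exactly True in a report dict.

    ASSUMPTION: flags live under report["validation"], as suggested by the
    message "validation.spectral_stable == true". Flag names other than
    "spectral_stable" are hypothetical.
    """
    validation = report.get("validation", {})
    return [
        f"validation.{name}"
        for name in required_flags
        if validation.get(name) is not True
    ]


# A report with a clean primary metric can still fail the gate.
example_report = {
    "primary_metric": {"ratio_vs_baseline": 1.0},  # hypothetical nesting
    "validation": {"spectral_stable": False},
}
failures = strict_release_failures(example_report)
```

A missing flag counts as a failure in this sketch, which mirrors a strict reading of "requires == true".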

Policy failures

The policy-failure fixtures show non-guard and provenance predicates that can block a strict release:

invarlock verify --profile release --assurance strict \
  public_evidence/policy_failures/invariants_failure/evaluation.report.json

invarlock verify --profile release --assurance strict \
  public_evidence/policy_failures/primary_metric_failure/evaluation.report.json

invarlock verify --profile release --assurance strict \
  public_evidence/policy_failures/runtime_provenance_failure/evaluation.report.json

Expected outcome: each verification fails for its named policy predicate: invariant evidence, primary-metric acceptance, or container runtime provenance.
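When checking several expected-failure fixtures, a small driver can flag any that unexpectedly pass. The sketch below reuses the verify flags shown above and assumes that invarlock verify exits with a non-zero status on failure, which this walkthrough does not state. The helper names are illustrative.

```python
import subprocess

POLICY_FAILURE_FIXTURES = [
    "public_evidence/policy_failures/invariants_failure/evaluation.report.json",
    "public_evidence/policy_failures/primary_metric_failure/evaluation.report.json",
    "public_evidence/policy_failures/runtime_provenance_failure/evaluation.report.json",
]


def verify_argv(report_path):
    """Build the strict release verification command for one report."""
    return [
        "invarlock", "verify",
        "--profile", "release",
        "--assurance", "strict",
        report_path,
    ]


def unexpected_passes(report_paths, runner=subprocess.run):
    """Return the fixtures whose verification exited with status 0.

    ASSUMPTION: a failed verification yields a non-zero exit status.
    """
    passed = []
    for path in report_paths:
        result = runner(verify_argv(path))
        if result.returncode == 0:
            passed.append(path)
    return passed
```

Calling `unexpected_passes(POLICY_FAILURE_FIXTURES)` from the repository root should return an empty list if every fixture fails as intended. The `runner` parameter exists so the logic can be exercised without the CLI installed.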

Applying this to your checkpoint

Start from your own edited checkpoint, produced by a quantization, pruning, distillation, or fine-tuning pipeline. Then either run invarlock evaluate directly or generate an evaluation.report.json from paired run reports:

invarlock report generate \
  --run runs/subject/report.json \
  --baseline-run-report runs/baseline/report.json \
  --format report \
  -o reports/eval

invarlock verify --profile release --assurance strict \
  reports/eval/evaluation.report.json

Keep evaluation.report.json and runtime.manifest.json together. Use invarlock advanced runtime-verify only when you specifically want to inspect the manifest/report binding; use invarlock verify for the full strict verification result.
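A simple pre-flight check can catch a report that has been separated from its manifest before verification is attempted. This sketch assumes the two files sit side by side in one directory, as in the shipped examples; `missing_companions` is a hypothetical helper.

```python
from pathlib import Path


def missing_companions(report_path):
    """Return the paths (report and sibling manifest) that are not present."""
    report = Path(report_path)
    required = [report, report.with_name("runtime.manifest.json")]
    return [str(path) for path in required if not path.is_file()]
```

For example, `missing_companions("reports/eval/evaluation.report.json")` returns an empty list only when both files exist in reports/eval.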