ADR: ProofGraphIR for Cross-Assistant Alignment
Context
Cross-assistant alignment in wonton-soup compares theorem-level artifacts between Lean and Coq runs.
Before ProofGraphIR, graph distance directly mixed:
- Lean wild-type tactic-search traces (small, shallow, tactic-labeled)
- Coq proof-term DAGs (large, deep, structural-edge-labeled)
This produced a high risk of invalid structural conclusions because the compared graph families were not semantically aligned.
Research Findings (March 2, 2026)
From paired Lean↔Coq logic micro runs (84 pairs):
- pre-IR alignment collapse:
  - top1_unique_rate = 0.023810
  - reciprocal top-1 rate = 0.011905
  - top-1 targets concentrated on two Coq theorems
- strong size confound:
  - correlation of pair rank with log Coq graph size ~0.918
- graph-family mismatch:
  - Lean run classified as search traces
  - Coq run classified as proof-term DAGs
After first ProofGraphIR integration:
- retrieval quality improved:
  - MRR: 0.122338 -> 0.321572
  - Recall@1: 0.023810 -> 0.226190
  - Recall@10: 0.309524 -> 0.500000
- collapse reduced:
  - top1_unique_rate: 0.023810 -> 0.392857
  - reciprocal top-1 rate: 0.011905 -> 0.380952
- size confound reduced:
  - rank/log-size correlation: 0.918 -> 0.084
After enabling same-kind benchmark mode (proof_term_graph vs proof_term_graph, solved-only):
- same-kind support became explicit (same_kind pairs = 36, cross_kind pairs = 0)
- same-kind retrieval on this slice:
  - MRR = 0.358735
  - Recall@1 = 0.25
  - Recall@10 = 0.583333
The solved-only benchmark still missed the gate at Recall@10 = 0.277778 (threshold 0.30), so this is a major improvement but not a final result.
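The retrieval figures above can be reproduced from ranked candidate lists with a short helper. The function name and data shapes below are illustrative, not the actual benchmark API: `ranked` maps each query theorem to its candidates ordered by distance, and `gold` maps it to its true counterpart.

```python
def retrieval_metrics(ranked: dict[str, list[str]], gold: dict[str, str],
                      ks: tuple[int, ...] = (1, 10)) -> dict[str, float]:
    """Compute MRR and Recall@k over paired retrieval results."""
    rr_sum = 0.0
    hits = {k: 0 for k in ks}
    for query, target in gold.items():
        candidates = ranked.get(query, [])
        if target in candidates:
            rank = candidates.index(target) + 1  # 1-based rank of the gold pair
            rr_sum += 1.0 / rank
            for k in ks:
                if rank <= k:
                    hits[k] += 1
    n = len(gold)
    out = {"MRR": rr_sum / n}
    for k in ks:
        out[f"Recall@{k}"] = hits[k] / n
    return out
```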
Decision
Adopt ProofGraphIR as the required graph abstraction for cross-assistant alignment.
ProofGraphIR introduces:
- explicit graph kind (search_trace, term_dag, unknown)
- edge-role normalization
- shared operator-ontology profile (apply, bind, branch, …)
- run-relative rank normalization by graph kind
- kind-aware graph distance and weighting
- report-time diagnostics: cross_kind_rate, kind_pair_distribution, and run-level graph-kind distributions
For cross-kind pairs, suppress role/shape channels and rely on normalized structural ranks plus lexical/connective channels.
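A minimal sketch of this kind-aware gating, assuming a simple weighted-channel distance. The channel weights and function names are illustrative (the real implementation lives in analysis/proof_graph_ir.py); only the gating rule itself reflects the decision above.

```python
from enum import Enum

class GraphKind(Enum):
    SEARCH_TRACE = "search_trace"
    TERM_DAG = "term_dag"
    UNKNOWN = "unknown"

def kind_aware_distance(kind_a: GraphKind, kind_b: GraphKind,
                        structural_d: float, rank_d: float,
                        lexical_d: float) -> float:
    """Blend distance channels, gated on graph-kind comparability."""
    if kind_a == kind_b and kind_a is not GraphKind.UNKNOWN:
        # Same-kind pair: role/shape structure is semantically comparable.
        return 0.5 * structural_d + 0.25 * rank_d + 0.25 * lexical_d
    # Cross-kind pair: suppress role/shape channels entirely and rely on
    # run-relative normalized ranks plus lexical/connective signal.
    return 0.5 * rank_d + 0.5 * lexical_d
```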
Paired benchmark now supports:
- deterministic name-obfuscated evaluation mode (--name-obfuscation names)
- by_kind summary and gate thresholds (same_kind, cross_kind)
- theorem-level proof aggregation modes (single, best_of, consensus)
- split gate axes (coverage, quality) with explicit cohort denominators
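The three aggregation modes can be sketched as follows. This is an assumption-laden illustration: the benchmark's exact consensus rule is not specified here, so the mean is used as a simple consensus proxy, and the convention that higher scores are better per proof attempt is assumed.

```python
def aggregate_proof_scores(scores: list[float], mode: str) -> float:
    """Collapse per-attempt scores for one theorem into a single score."""
    if not scores:
        raise ValueError("no proof attempts to aggregate")
    if mode == "single":
        return scores[0]                  # first attempt only
    if mode == "best_of":
        return max(scores)                # best attempt across the run
    if mode == "consensus":
        return sum(scores) / len(scores)  # mean as a simple consensus proxy
    raise ValueError(f"unknown aggregation mode: {mode}")
```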
Options Considered
- Keep raw graph-size + family matching
- Disable graph channel and use text-only matching
- Add kind-aware IR with explicit comparability boundaries (chosen)
Consequences
Positive:
- removes major size-driven artifacts from ranking
- makes comparability assumptions explicit in artifacts
- improves paired benchmark performance without hiding mismatch semantics
Costs and limits:
- cross-kind matching still depends heavily on statement/name lexical signal
- current runs are still mostly search_trace -> term_dag comparisons, not same-family structure comparisons
- cannot yet claim theorem-level proof-shape equivalence across assistants
Claim Boundaries
Allowed:
- “We can compare cross-assistant theorem artifacts under explicit cross-kind semantics.”
- “Retrieval collapse from raw size mismatch is substantially reduced.”
Not allowed:
- “We are measuring identical proof shape across Lean and Coq.”
- “High retrieval alone implies proof-structure invariance.”
Verification Requirements
Cross-assistant reports must include:
- graph-kind distributions for each run
- cross_kind_rate
- kind_pair_distribution
- paired retrieval metrics (Recall@k, MRR)
Any structural claim across assistants must report whether it is same-kind or cross-kind.
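The two kind diagnostics required above follow directly from the per-pair kind labels. The input shape (a list of (kind_a, kind_b) string tuples) is an assumption for illustration:

```python
from collections import Counter

def kind_diagnostics(pairs: list[tuple[str, str]]) -> dict:
    """Report cross_kind_rate and kind_pair_distribution for one run pair."""
    dist = Counter(pairs)
    total = sum(dist.values())
    cross = sum(n for (a, b), n in dist.items() if a != b)
    return {
        "kind_pair_distribution": {f"{a}->{b}": n for (a, b), n in dist.items()},
        "cross_kind_rate": cross / total if total else 0.0,
    }
```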
Follow-On Decisions Needed
- Expand operator-ontology v1 coverage for shared structural semantics.
- Expand motif ontology coverage so cross-assistant action channels are not dominated by lexical spillover.
- Broaden same-kind benchmark support (for example term-dag vs term-dag) so structural claims are not only cross-kind.
References
- analysis/proof_graph_ir.py
- analysis/cross_assistant_alignment.py
- analysis/cross_assistant_paired_benchmark.py
- generated/cross_assistant_paired_benchmark_2026-03-02_v3_gated.json
- generated/cross_assistant_paired_benchmark_2026-03-02_v3_ir1.json
- generated/cross_assistant_alignment_diagnostic_2026-03-02_v3.json
- generated/cross_assistant_alignment_diagnostic_2026-03-02_v3_ir1.json
- Cross-Assistant Alignment (Diagnostic)
- Cross-Assistant Paired Benchmark (Primary Gate)