SPECTER Labs
Technical Docs decisions

ADR: ProofGraphIR for Cross-Assistant Alignment

Context

Cross-assistant alignment in wonton-soup compares theorem-level artifacts between Lean and Coq runs.

Before ProofGraphIR, the graph-distance computation directly mixed two mismatched graph families:

  • Lean wild-type tactic-search traces (small, shallow, tactic-labeled)
  • Coq proof-term DAGs (large, deep, structural-edge-labeled)

This produced a high risk of invalid structural conclusions because the compared graph families were not semantically aligned.

Research Findings (March 2, 2026)

From paired Lean↔Coq logic micro runs (84 pairs):

  • pre-IR alignment collapse:
    • top1_unique_rate = 0.023810
    • reciprocal top-1 rate = 0.011905
    • top-1 targets concentrated on two Coq theorems
  • strong size confound:
    • correlation of pair rank with log Coq graph size ~0.918
  • graph-family mismatch:
    • Lean run classified as search traces
    • Coq run classified as proof-term DAGs
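The two collapse diagnostics above follow directly from the paired top-1 assignments; a minimal sketch of how they can be computed (function and variable names here are illustrative, not the actual analysis code):

```python
from collections import Counter

def collapse_metrics(lean_top1, coq_top1):
    """Collapse diagnostics for paired top-1 retrieval.

    lean_top1: {lean_id: best_coq_id} from the Lean -> Coq direction.
    coq_top1:  {coq_id: best_lean_id} from the Coq -> Lean direction.
    """
    n = len(lean_top1)
    hits = Counter(lean_top1.values())
    # Fraction of queries whose top-1 target is claimed by no other query.
    top1_unique_rate = sum(1 for c in lean_top1.values() if hits[c] == 1) / n
    # Fraction of pairs that agree in both directions (mutual top-1).
    reciprocal_rate = sum(
        1 for l, c in lean_top1.items() if coq_top1.get(c) == l
    ) / n
    return top1_unique_rate, reciprocal_rate

# Degenerate example: every Lean query maps to the same Coq theorem,
# which is exactly the collapse pattern the pre-IR run exhibited.
lean = {f"L{i}": "coq_thm_0" for i in range(4)}
coq = {"coq_thm_0": "L0"}
print(collapse_metrics(lean, coq))  # (0.0, 0.25)
```

When top-1 targets concentrate on a handful of theorems, `top1_unique_rate` drops toward zero even if each individual match looks plausible in isolation.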

After first ProofGraphIR integration:

  • retrieval quality improved:
    • MRR: 0.122338 -> 0.321572
    • Recall@1: 0.023810 -> 0.226190
    • Recall@10: 0.309524 -> 0.500000
  • collapse reduced:
    • top1_unique_rate: 0.023810 -> 0.392857
    • reciprocal top-1 rate: 0.011905 -> 0.380952
  • size confound reduced:
    • rank/log-size correlation: 0.918 -> 0.084
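For reference, the retrieval metrics quoted above follow the standard definitions over the rank of each query's true partner; a self-contained sketch (the real benchmark script may differ in detail):

```python
def retrieval_metrics(ranks, ks=(1, 10)):
    """ranks: for each query, the 1-based rank of its true partner in the
    candidate list, or None if the partner never appears."""
    n = len(ranks)
    # Mean reciprocal rank; absent partners contribute 0.
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    # Recall@k: fraction of queries whose partner ranks within the top k.
    recall = {k: sum(1 for r in ranks if r is not None and r <= k) / n
              for k in ks}
    return mrr, recall

mrr, recall = retrieval_metrics([1, 3, None, 12])
print(round(mrr, 4), recall)  # 0.3542 {1: 0.25, 10: 0.5}
```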

After enabling same-kind benchmark mode (proof_term_graph vs proof_term_graph, solved-only):

  • same-kind support became explicit (same_kind pairs = 36, cross_kind pairs = 0)
  • same-kind retrieval on this slice:
    • MRR = 0.358735
    • Recall@1 = 0.25
    • Recall@10 = 0.583333

The solved-only benchmark still missed its gate at Recall@10 = 0.277778 (threshold 0.30), so this is a major improvement but not the final state.

Decision

Adopt ProofGraphIR as the required graph abstraction for cross-assistant alignment.

ProofGraphIR introduces:

  • explicit graph kind (search_trace, term_dag, unknown)
  • edge-role normalization
  • shared operator-ontology profile (apply, bind, branch, …)
  • run-relative rank normalization by graph kind
  • kind-aware graph distance and weighting
  • report-time diagnostics:
    • cross_kind_rate
    • kind_pair_distribution
    • run-level graph-kind distributions
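The shape of the abstraction can be sketched as follows; field, enum, and method names are illustrative assumptions, with the canonical definitions living in analysis/proof_graph_ir.py:

```python
from dataclasses import dataclass, field
from enum import Enum

class GraphKind(Enum):
    SEARCH_TRACE = "search_trace"
    TERM_DAG = "term_dag"
    UNKNOWN = "unknown"

@dataclass
class ProofGraphIR:
    kind: GraphKind
    # Node labels drawn from the shared operator ontology (apply, bind, branch, ...).
    nodes: list[str] = field(default_factory=list)
    # Edges as (src_index, dst_index, normalized_role) triples.
    edges: list[tuple[int, int, str]] = field(default_factory=list)

    def size_rank(self, same_kind_sizes: list[int]) -> float:
        """Run-relative rank of this graph's size among graphs of the
        SAME kind, normalized to [0, 1]. Ranking within kind is what
        decouples the distance from raw cross-family size gaps."""
        below = sum(1 for s in same_kind_sizes if s < len(self.nodes))
        return below / max(len(same_kind_sizes), 1)
```

For example, a 5-node term DAG ranked against same-kind peers of sizes `[2, 4, 6, 8]` gets `size_rank == 0.5`, regardless of how much larger a typical Coq term DAG is than a Lean search trace.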

For cross-kind pairs, the IR suppresses the role/shape channels and relies on normalized structural ranks plus lexical/connective channels.
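This suppression rule amounts to a kind-aware channel-weighting step before the distance is aggregated; the channel names and weights below are illustrative, sketch only:

```python
def channel_weights(kind_a: str, kind_b: str) -> dict[str, float]:
    """Same-kind pairs may use structural role/shape channels; cross-kind
    pairs fall back to normalized ranks plus lexical signal. The weight
    values are placeholders, not the project's tuned configuration."""
    if kind_a == kind_b:
        return {"role": 0.3, "shape": 0.3, "rank": 0.2, "lexical": 0.2}
    # Cross-kind: zero out channels that are not semantically comparable.
    return {"role": 0.0, "shape": 0.0, "rank": 0.5, "lexical": 0.5}

def graph_distance(channel_dists: dict[str, float],
                   kind_a: str, kind_b: str) -> float:
    """channel_dists: per-channel distances in [0, 1]."""
    w = channel_weights(kind_a, kind_b)
    return sum(w[c] * d for c, d in channel_dists.items())

# Cross-kind pair: large role/shape disagreement is ignored by design.
d = graph_distance(
    {"role": 0.9, "shape": 0.8, "rank": 0.1, "lexical": 0.2},
    "search_trace", "term_dag",
)
print(round(d, 2))  # 0.15
```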

Paired benchmark now supports:

  • deterministic name-obfuscated evaluation mode (--name-obfuscation names)
  • by_kind summary and gate thresholds (same_kind, cross_kind)
  • theorem-level proof aggregation modes (single, best_of, consensus)
  • split gate axes (coverage, quality) with explicit cohort denominators
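The by_kind gate evaluation can be sketched as below, reusing the pair counts reported above; the threshold value comes from this document, but the field names and the "vacuous" outcome label are assumptions about the gate logic:

```python
def evaluate_gates(by_kind: dict, thresholds: dict) -> dict:
    """by_kind: {"same_kind": {"pairs": int, "recall_at_10": float}, ...}.
    Cohorts with a zero denominator are reported as vacuous rather than
    silently passing, keeping the cohort denominators explicit."""
    results = {}
    for kind, stats in by_kind.items():
        if stats["pairs"] == 0:
            results[kind] = "vacuous"  # explicit empty cohort
        else:
            passed = stats["recall_at_10"] >= thresholds[kind]
            results[kind] = "pass" if passed else "fail"
    return results

print(evaluate_gates(
    {"same_kind": {"pairs": 36, "recall_at_10": 0.583333},
     "cross_kind": {"pairs": 0, "recall_at_10": 0.0}},
    {"same_kind": 0.30, "cross_kind": 0.30},
))  # {'same_kind': 'pass', 'cross_kind': 'vacuous'}
```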

Options Considered

  1. Keep raw graph-size + family matching
  2. Disable graph channel and use text-only matching
  3. Add kind-aware IR with explicit comparability boundaries (chosen)

Consequences

Positive:

  • removes major size-driven artifacts from ranking
  • makes comparability assumptions explicit in artifacts
  • improves paired benchmark performance without hiding mismatch semantics

Costs and limits:

  • cross-kind matching still depends heavily on statement/name lexical signal
  • current runs are still mostly search_trace -> term_dag comparisons, not same-family structure comparisons
  • cannot yet claim theorem-level proof-shape equivalence across assistants

Claim Boundaries

Allowed:

  • “We can compare cross-assistant theorem artifacts under explicit cross-kind semantics.”
  • “Retrieval collapse from raw size mismatch is substantially reduced.”

Not allowed:

  • “We are measuring identical proof shape across Lean and Coq.”
  • “High retrieval alone implies proof-structure invariance.”

Verification Requirements

Cross-assistant reports must include:

  • graph-kind distributions for each run
  • cross_kind_rate
  • kind_pair_distribution
  • paired retrieval metrics (Recall@k, MRR)

Any structural claim across assistants must report whether it is same-kind or cross-kind.
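A completeness check along these lines could enforce the requirements mechanically; the JSON key names are assumptions about the report artifact layout, not its actual schema:

```python
# Diagnostics every cross-assistant report must carry (key names assumed).
REQUIRED_KEYS = [
    "graph_kind_distributions",  # per-run graph-kind histograms
    "cross_kind_rate",
    "kind_pair_distribution",
    "paired_retrieval",          # must contain recall_at_k and mrr
]

def missing_report_fields(report: dict) -> list[str]:
    """Return the required diagnostics absent from a report dict."""
    missing = [k for k in REQUIRED_KEYS if k not in report]
    retrieval = report.get("paired_retrieval", {})
    for metric in ("recall_at_k", "mrr"):
        if metric not in retrieval:
            missing.append(f"paired_retrieval.{metric}")
    return missing

# A report carrying only cross_kind_rate fails the check on every other field.
print(missing_report_fields({"cross_kind_rate": 0.0}))
```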

Follow-On Decisions Needed

  1. Expand operator-ontology v1 coverage for shared structural semantics.
  2. Expand motif ontology coverage so cross-assistant action channels are not dominated by lexical spillover.
  3. Extend same-kind benchmark support beyond the initial solved-only proof_term_graph slice so structural claims are not cross-kind only.

References

  • analysis/proof_graph_ir.py
  • analysis/cross_assistant_alignment.py
  • analysis/cross_assistant_paired_benchmark.py
  • generated/cross_assistant_paired_benchmark_2026-03-02_v3_gated.json
  • generated/cross_assistant_paired_benchmark_2026-03-02_v3_ir1.json
  • generated/cross_assistant_alignment_diagnostic_2026-03-02_v3.json
  • generated/cross_assistant_alignment_diagnostic_2026-03-02_v3_ir1.json
  • Cross-Assistant Alignment (Diagnostic)
  • Cross-Assistant Paired Benchmark (Primary Gate)