lean-corpus-extractor
Lean-native corpus extraction for building deterministic items.jsonl inputs for dossiers/wonton-soup corpus artifacts.
Imports requested modules at runtime without depending on Mathlib at build time. Must run under a project environment providing the required .olean files (e.g., via lake env ...).
Environment
Enter the local project shell from addenda/lean-corpus-extractor/:
nix developIf you use direnv, direnv allow is equivalent. The shell auto-runs lake update.
Output format
JSONL, one object per line:
item_id: Lean identifier-safe stable id (no dots)display_name: original fully qualified name (with dots)payload.statement: a Lean declaration template containing{name}wonton-soupreplaces{name}with a unique name per run/seed
payload.source: minimal provenance for the extracted decl
Example:
{"item_id":"Nat_x2eadd_x5fcomm","display_name":"Nat.add_comm","payload":{"statement":"theorem {name} : forall (a b : Nat), a + b = b + a := by\n sorry","source":{"kind":"lean_env","module":"Mathlib.Data.Nat.Basic","qualname":"Nat.add_comm"}}}Usage (Mathlib)
- Build this tool:
cd addenda/lean-corpus-extractor
lake build- Run it under a Lean project that has Mathlib available (for
wonton-soup, that’s the locallean_project/created bysetup_lean.py):
cd dossiers/wonton-soup/lean_project
lake env ../../addenda/lean-corpus-extractor/.lake/build/bin/lean_corpus_extract \
--import Mathlib.Data.Nat.Basic \
--module-prefix Mathlib.Data.Nat.Basic \
--limit 100 \
--out /tmp/mathlib_items.jsonlIf you omit --import, the extractor defaults to Mathlib (slow; loads all of mathlib).
Targeted extraction by theorem name
Use repeatable --name flags to extract only specific declarations:
cd dossiers/wonton-soup/lean_project
lake env ../../addenda/lean-corpus-extractor/.lake/build/bin/lean_corpus_extract \
--import Mathlib \
--name ContinuousLinearMap.isOpenMap \
--name QuotientGroup.quotientKerEquivRange \
--out /tmp/named_items.jsonlIf a requested name is missing in the current Lean environment, the extractor keeps running and prints one stderr line per missing name.