# Lake Jobs (Materialized Datasets)
Lake jobs are a thin, explicit way to:
- select a set of runs from the lake registry,
- optionally build or attach a reference model,
- materialize one or more SQL datasets into files (JSONL or Parquet),
- write a manifest capturing selection + reference provenance.
They are intended as the stable bridge from “raw lake tables” to downstream reporting/viz (for example, `addenda/specter-viz`).

Implementation: `dossiers/wonton-soup/analysis/lake/job.py`.
## Presets

Preset job configs live in `dossiers/wonton-soup/analysis/lake/presets/*.json`.

List them:

```sh
uv run python wonton.py lake job presets
```

Run one:

```sh
uv run python wonton.py lake job run --config dossiers/wonton-soup/analysis/lake/presets/01_runs_overview.json --logs-dir logs
```

Release a compiled site dashboard cohort (quality-gated superset):

```sh
uv run python wonton.py lake job release-site-dashboard \
  --config dossiers/wonton-soup/analysis/lake/presets/72_site_dashboard_superset_v1.json \
  --logs-dir logs
```

This command:
- materializes dashboard payloads with an explicit selection preset,
- writes a versioned snapshot under `site/dashboards/wonton-soup/data/releases/<release_id>/`,
- promotes the release into active site data at `site/dashboards/wonton-soup/data/manifest.json`.

Use `--no-promote` to stage a release snapshot without switching active site data.
## Job Config Schema (v2)

Job configs are JSON objects with `schema_version: 2`:

```json
{
  "schema_version": 2,
  "name": "runs_overview",
  "selection": {
    "provider": ["reprover", "deepseek"],
    "backend": "lean",
    "mode": ["dev", "research"]
  },
  "reference": null,
  "datasets": [
    {
      "name": "runs",
      "query": "SELECT * FROM runs WHERE run_key IN (SELECT run_key FROM selected_runs)",
      "format": "jsonl"
    }
  ]
}
```

### selection
`selection` filters the `runs` table. Supported keys:

- `root_id` (string)
- `provider`, `backend`, `mode`, `corpus`, `goal_sig_scheme` (string or list of strings)

When you pass `--logs-dir` to `wonton.py lake job run`, the job sets `selection.root_id` to that directory’s `root_id` unless you already provided one.

The selection is applied as an AND across fields.
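As a minimal sketch of these semantics (the helper name and parameter style are illustrative assumptions, not the actual implementation): string values become equality tests, lists become `IN (...)` tests, and all fields are ANDed together.

```python
def selection_to_sql(selection: dict) -> tuple[str, list]:
    """Compile a selection dict into a parameterized query over runs.

    Strings -> equality; lists -> IN (...); fields combined with AND.
    """
    clauses, params = [], []
    for field, value in selection.items():
        if isinstance(value, list):
            placeholders = ", ".join("?" for _ in value)
            clauses.append(f"{field} IN ({placeholders})")
            params.extend(value)
        else:
            clauses.append(f"{field} = ?")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT run_key FROM runs WHERE {where}", params


sql, params = selection_to_sql(
    {"provider": ["reprover", "deepseek"], "backend": "lean"}
)
# sql    == "SELECT run_key FROM runs WHERE provider IN (?, ?) AND backend = ?"
# params == ["reprover", "deepseek", "lean"]
```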
### datasets

Each dataset is:

- `name` (string, required)
- `query` (SQL string, required)
- `format` (`jsonl` or `parquet`, default `jsonl`)
- `file` (optional base name; no directories)

Jobs create a temporary `selected_runs(run_key)` table that dataset SQL can use.
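The `selected_runs` handoff can be sketched with an in-memory sqlite3 database (the table shapes and sample rows here are illustrative assumptions; the real lake schema is richer):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (run_key TEXT, provider TEXT, backend TEXT)")
con.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [("r1", "reprover", "lean"), ("r2", "deepseek", "lean"), ("r3", "other", "coq")],
)

# The job populates a temporary selected_runs(run_key) table from the selection...
con.execute("CREATE TEMP TABLE selected_runs (run_key TEXT)")
con.execute(
    "INSERT INTO selected_runs SELECT run_key FROM runs WHERE backend = ?", ("lean",)
)

# ...and each dataset's SQL can then bind to it via a subquery.
cur = con.execute(
    "SELECT * FROM runs WHERE run_key IN (SELECT run_key FROM selected_runs)"
)
rows = cur.fetchall()
cols = [d[0] for d in cur.description]

# Materialize as JSONL: one JSON object per selected run.
lines = [json.dumps(dict(zip(cols, row))) for row in rows]
print(len(lines))  # 2 selected runs (r1, r2)
```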
### reference

`reference` controls reference building and optional K scoring.

Supported patterns:

- Attach an existing reference by id:

  ```json
  {"ref_id": "abc123", "score_k": true}
  ```

- Build a new goal-outcomes reference (from `goal_cache` aggregates), with an explicit run selection:

  ```json
  {
    "build_outcomes": {"alpha": 1.0, "meta": {"note": "baseline reference"}},
    "selection": {"provider": "reprover", "backend": "lean"},
    "score_k": true
  }
  ```

Important invariant:
- If `reference.build_outcomes` is present, `reference.selection` must be a non-empty object. This is enforced in `analysis.lake.job.load_job_config()` to prevent accidental in-sample leakage.
- Rationale for this guardrail: ADR: Explicit Reference Selection for Lake Jobs.
If `score_k` is true, `--logs-dir` must be provided so the scorer can locate per-run directories and read MCTS traces.
## Output Contract

Each job run writes:

- `manifest.json`: resolved selection + resolved reference (including ref id and reference members)
- `job_config.json`: the original config snapshot (schema v2)
- `inputs.json`: the selected run keys
- one file per dataset

Jobs also record a row in the `lake_job_runs` table (for cross-run provenance).
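A minimal sketch of writing the provenance files (file names come from the contract above; the helper name and directory layout are assumptions for illustration):

```python
import json
import tempfile
from pathlib import Path


def write_job_outputs(
    out_dir: Path, manifest: dict, config: dict, run_keys: list[str]
) -> None:
    """Write the provenance trio next to the materialized dataset files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    (out_dir / "job_config.json").write_text(json.dumps(config, indent=2))
    (out_dir / "inputs.json").write_text(json.dumps({"run_keys": run_keys}, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "job"
    write_job_outputs(
        out,
        manifest={"selection": {"backend": "lean"}, "reference": None},
        config={"schema_version": 2},
        run_keys=["r1", "r2"],
    )
    print(sorted(p.name for p in out.iterdir()))
    # ['inputs.json', 'job_config.json', 'manifest.json']
```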
## SQL Convenience: job_context(ref_id)

Jobs create a temporary `job_context(ref_id)` table:

- If a reference is present, it contains a single row with that `ref_id`.
- If no reference is present, the table exists but is empty.

This lets dataset SQL bind deterministically to the job’s chosen reference, e.g.:

```sql
SELECT *
FROM k_reference_score
WHERE ref_id IN (SELECT ref_id FROM job_context)
```