# `aom_staged_chain_campaign` - staged strict-chain cartesian screen/refit

_Group_: **AOM / moment campaign** · _Catalog_: `aom_pop.aom_staged_chain_campaign`
· _Backed by_: `aom_chain_screen_refit_campaign` (pure Python orchestration over
the single `libn4m` runtime)

## Description

`aom_staged_chain_campaign` is the first-class staged-cartesian workflow for AOM /
moment preprocessing selection. It runs several score-only strict-linear
preprocessing screens in sequence — the `compact`, `wide` and `lab` profiles,
focused family plans such as `savgol_focus` / `strict_family_focus`, or an
explicit `stages` list mixing profiles and Ridge / PLS / mixed head plans —
then:

1. merges the per-stage retained candidates (deduplicated by chain/head/param,
   keeping the best screen score and recording every stage a candidate came
   from);
2. keeps the **top global** and **top per-head** rows across all stages;
3. exact-CV **refits** that retained union exactly once; and
4. attaches **preprocessing-impact** and **screen-vs-refit rank** diagnostics, and
   an optional **offline holdout audit**.

It does not add any new numerical kernel: every fit flows through the existing
`aom_chain_score_campaign` (screen), `aom_refit_candidates` (exact-CV refit),
`aom_candidate_preprocessing_impact`, and `aom_candidate_rank_diagnostics`
helpers, which all call the single `libn4m` C/CUDA runtime. The orchestrator is
the staged equivalent of the lab prototype `cartesian.py` / `impact.py`, expressed
over the catalogued n4m moment routes.

## Constraints (load-bearing)

* **Strict-linear only.** Stages screen the strict-linear AOM chain grids
  (`build_aom_strict_chain_grid`). No out-of-moment nonlinear or supervised lifts
  are introduced.
* **No identity selection.** The campaign consumes only `X` / `y` arrays. Stage
  `name` values are cosmetic labels; no dataset, source, id or name is ever read
  or used to pick chains, heads or the winner. Renaming stages cannot change the
  selected model.
* **Train-only production selection.** The production winner is the minimum
  exact-CV refit row (`selection_metric="refit_cv_rmse"`,
  `selection_uses_test_set=False`). A held-out set, when supplied, is scored only
  for the offline audit and is never used for selection.
* **Single infrastructure.** This is Python orchestration; `libn4m` stays the one
  C/binding engine.

## Parameters

| Name | Type | Default | Notes |
|------|------|---------|-------|
| `X`, `y` | arrays | — | Training spectra and target(s); `y` 1-D or `(n, 1)` |
| `stages` | list \| None | `None` | Explicit stage list (profile name or override dict). `None` → use `plan` |
| `plan` | str | `"compact_wide_lab"` | One of `compact`, `wide`, `lab`, `compact_wide`, `compact_lab`, `wide_lab`, `compact_wide_lab`, `savgol_focus`, `strict_family_focus` |
| `cv` | int | `5` | Exact CV folds (screen + refit) |
| `ridge_lambdas` / `pls_components` | seq | small grids | Default head grids (per-stage overridable) |
| `heads` | seq | `("ridge", "pls")` | Default heads (per-stage overridable) |
| `top_k` | int | `50` | Rows each stage keeps in its own screen |
| `refit_top_k` | int \| None | `None`→`top_k` | Global retained rows refit with exact CV |
| `refit_per_head_top_k` | int \| None | `10` | Extra per-head retained rows (`None` disables) |
| `checkpoint_dir` | path \| None | `None` | Directory for one resumable score checkpoint per stage |
| `resume` | bool | `True` | Resume matching stage checkpoints when present |
| `max_chunks_per_run` | int \| None | `None` | Limit new chunks processed per stage in this call |
| `scale_x_values` | seq \| None | `None` | Optional grid such as `[False, True]`; each value runs the same staged campaign and the model config is selected by train exact-CV refit |
| `pls_score_mode` | str | `"cv"` | `"gcv_proxy"` enables the fast PLS screen proxy |
| `moment_policy` / `refit_moment_policy` | str | `"auto"` | Screen / refit moment routing |
| `impact` / `rank_diagnostics` | bool | `True` | Toggle the post-hoc audit reports |
| `X_audit` / `y_audit` | arrays \| None | `None` | Optional **offline** held-out audit |
| `return_stage_screens` | bool | `False` | Include the raw per-stage screen reports |

Stage override dict keys: `name`, `profile`, `heads`, `ridge_lambdas`,
`pls_components`, `top_k`, `max_chains`, `families`, `templates`,
`pls_score_mode`, `moment_policy`, `chain_ordering`, `split_head_scoring`.
Missing keys fall back to the campaign defaults. Unknown keys raise.
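
The fallback rule for stage overrides can be sketched as follows. This is a
minimal illustration, not the library's resolver: `resolve_stage` is a
hypothetical helper, the `ValueError` type is an assumption (the text above only
says unknown keys raise), and defaulting `name` to the profile is also an
assumption.

```python
ALLOWED_STAGE_KEYS = frozenset({
    "name", "profile", "heads", "ridge_lambdas", "pls_components", "top_k",
    "max_chains", "families", "templates", "pls_score_mode", "moment_policy",
    "chain_ordering", "split_head_scoring",
})


def resolve_stage(stage, defaults):
    """Merge one stage entry (profile name or override dict) with defaults."""
    if isinstance(stage, str):
        # A bare profile name is shorthand for {"profile": name}.
        stage = {"profile": stage}
    unknown = sorted(set(stage) - ALLOWED_STAGE_KEYS)
    if unknown:
        # Assumed exception type; the real error class may differ.
        raise ValueError(f"unknown stage override keys: {unknown}")
    resolved = {**defaults, **stage}  # stage values win over defaults
    # Assumed convenience: an unnamed stage is labeled after its profile.
    resolved.setdefault("name", resolved.get("profile", "stage"))
    return resolved
```

For example, `resolve_stage({"profile": "wide", "heads": ("pls",)}, defaults)`
keeps the campaign-level `top_k` while restricting that stage to PLS heads.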

## Returned report

A JSON-friendly dict tagged with `report_schema = "n4m.aom_staged_chain_campaign.v1"`:

| Key | Meaning |
|-----|---------|
| `stages` | Per-stage summaries (`name`, `profile`, `heads`, `n_chains`, `n_top_candidates`, `screen_best`, …) |
| `rows` | Exact-CV refit rows (the retained union), each with `chain`, `head`, `param`, `refit_cv_rmse`, `screen_cv_rmse`, and cross-stage `campaign_stage` / `campaign_stages` |
| `best` / `best_cv` / `best_refit` | Production winner = minimum `refit_cv_rmse` row |
| `best_by_head` | Per-head best refit row |
| `merged_top_candidates` | Cross-stage deduplicated global screen pool |
| `retention` | `refit_top_k`, `refit_per_head_top_k`, and global/per-head union counts |
| `checkpoint_dir` / `max_chunks_per_run` / `screen_complete` / `n_remaining_stage_chunks_total` | Staged resume state; partial screens remain exact-refit-able over currently retained rows |
| `impact` | `aom_candidate_preprocessing_impact` over the refit rows (`refit_cv_rmse`) |
| `rank_diagnostics` | `aom_candidate_rank_diagnostics` (screen `screen_cv_rmse` vs exact `refit_cv_rmse` rank drift / recall) |
| `audit` | Offline holdout report (`audit_only=True`) or `None` |
| `refit` | The full `aom_refit_candidates` report |
| `selection_metric` / `selection_policy` / `selection_uses_test_set` | Selection provenance (`refit_cv_rmse` / `exact_cv_refit_train_only` / `False`) |
| `model_config_grid` / `model_config_summaries` / `selected_model_config` | Present when `scale_x_values` is used; records the train-CV-selected model config and per-config best refit scores |

`rows` is directly consumable by `NativeAOMFixedCandidateRegressor.from_refit_report`.
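
Because every row carries both `screen_cv_rmse` and `refit_cv_rmse`, the
screen-vs-refit agreement can also be inspected by hand. The sketch below is a
rough illustration of what such a comparison measures, assuming only those two
row fields; it is not the implementation of `aom_candidate_rank_diagnostics`,
the helper name `rank_drift` and its output keys are hypothetical, and ties are
broken by row order rather than averaged.

```python
import numpy as np


def rank_drift(rows, top_n=3):
    """Spearman-style rank agreement and top-n recall over refit rows."""
    screen = np.array([r["screen_cv_rmse"] for r in rows], dtype=float)
    refit = np.array([r["refit_cv_rmse"] for r in rows], dtype=float)
    # argsort of argsort yields 0-based ranks (ties broken by position).
    screen_rank = np.argsort(np.argsort(screen))
    refit_rank = np.argsort(np.argsort(refit))
    # Spearman correlation is the Pearson correlation of the ranks.
    spearman = float(np.corrcoef(screen_rank, refit_rank)[0, 1])
    top_screen = set(np.flatnonzero(screen_rank < top_n).tolist())
    top_refit = set(np.flatnonzero(refit_rank < top_n).tolist())
    # Share of the true (refit) top-n that the screen also ranked top-n.
    recall = len(top_screen & top_refit) / max(len(top_refit), 1)
    return {"spearman": spearman, "top_n_recall": recall}
```

A low recall at the `top_n` of interest suggests the screen is dropping rows
that the exact refit would have preferred, which is one signal that a larger
`refit_top_k` may be worth its cost.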

## Python usage

```python
import numpy as np
import n4m

rng = np.random.default_rng(7)
X = rng.standard_normal((64, 256))
y = X[:, 8] - 0.4 * X[:, 19] + 0.05 * rng.standard_normal(64)

# Staged compact -> wide -> lab screen over mixed Ridge/PLS heads.
report = n4m.aom_staged_chain_campaign(
    X, y,
    plan="compact_wide_lab",
    cv=5,
    refit_top_k=20,           # global retained rows
    refit_per_head_top_k=5,   # extra per-head rows
    checkpoint_dir="artifacts/aom_staged_checkpoints",
)

best = report["best"]                       # production winner, exact-CV on train
print(best["head"], best["param"], best["refit_cv_rmse"])
print(report["retention"])                  # how many rows were exact-refit
print(report["impact"]["by_operator"][:3])  # which preprocessing families paid off
print(report["rank_diagnostics"]["spearman_rank_correlation"])
print(report["screen_complete"], report["n_remaining_stage_chunks_total"])

# Materialize the winner with the existing reusable estimator.
model = n4m.NativeAOMFixedCandidateRegressor.from_refit_report(report).fit(X, y)
y_hat = model.predict(X)
```

Model-configuration grid selection, still train-only:

```python
report = n4m.aom_staged_chain_campaign(
    X, y,
    plan="compact",
    scale_x_values=[False, True],
    refit_top_k=12,
    refit_per_head_top_k=2,
)

assert report["selection_uses_test_set"] is False
print(report["selected_model_config"])  # {"scale_x": True/False, ...}
print(report["model_config_summaries"])
```

When a model-config grid is used, `rows` and `best` come from the selected
config, while route/candidate counters such as `n_screen_pls_moment_cv_fits`
sum the work paid across every config.

Focused preprocessing-family plans are useful when `max_chains` is intentionally
small. Start with `savgol_focus` for the fast incremental campaign; use
`strict_family_focus` as a heavier family-audit profile, because its
Gaussian / FCK / Whittaker stages can dominate wall time on some datasets.

```python
# Prioritize SavGol diversity instead of waiting for late lab-profile entries.
report = n4m.aom_staged_chain_campaign(
    X, y,
    plan="savgol_focus",
    max_chains=8,       # applied per focused stage
    refit_top_k=12,
    scale_x_values=[False, True],
)

# Also force strict Gaussian/FCK/Whittaker stages to be screened early.
report = n4m.aom_staged_chain_campaign(
    X, y,
    plan="strict_family_focus",
    max_chains=8,
    refit_top_k=12,
)
```

These plans are fixed source-free stage recipes over the existing strict-linear
lab families. They do not read dataset/source/name/id metadata and they still
select the final row only by train exact-CV refit.

Sklearn estimator form (same train-CV selection, no held-out audit inputs):

```python
model = n4m.NativeAOMStagedChainCampaignRegressor(
    plan="compact_wide_lab",
    cv=5,
    refit_top_k=20,
    refit_per_head_top_k=5,
    checkpoint_dir="artifacts/aom_staged_checkpoints",
).fit(X, y)

y_hat = model.predict(X)
diag = model.get_diagnostics()
assert diag["selection_uses_test_set"] is False
print(model.selected_head_, model.selected_param_, model.selected_cv_rmse_)
```

Fast SavGol-focused reusable preset:

```python
model = n4m.NativeAOMSavgolFocusRegressor(
    cv=5,
    checkpoint_dir="artifacts/aom_savgol_focus_checkpoints",
).fit(X, y)

diag = model.get_diagnostics()
assert diag["plan"] == "savgol_focus"
assert diag["selection_uses_test_set"] is False
print(diag["selected_stage"], diag["selected_model_config"])
```

The preset delegates to the same staged campaign engine with
`plan="savgol_focus"`, `max_chains=6`, `top_k=10`, `refit_top_k=8`,
`refit_per_head_top_k=2`, `scale_x_values=[False, True]` and
`split_head_scoring="auto"` by default. On a CUDA build it also defaults to the
one-GPU PLS route knobs used in the local benchmark
(`cuda_pls_parallel_folds=True`, `cuda_pls_min_device_features=1`,
`backend_min_cuda_product=1`). Use the generic staged estimator when you need
custom plans or family templates.

Cost-safe strict-family audit preset:

```python
model = n4m.NativeAOMStrictFamilyLiteRegressor(
    cv=5,
    checkpoint_dir="artifacts/aom_strict_family_lite_checkpoints",
).fit(X, y)

diag = model.get_diagnostics()
assert diag["plan"] == "strict_family_focus"
assert diag["selection_uses_test_set"] is False
print(diag["selected_stage"], diag["n_refit_candidates"])
```

`NativeAOMStrictFamilyLiteRegressor` exercises the broader
`strict_family_focus` stage recipe, but defaults to a small audit budget:
`max_chains=2`, `top_k=6`, `refit_top_k=4`, `refit_per_head_top_k=1`,
`scale_x=False`, no `scale_x_values` grid and `split_head_scoring="auto"`.
It is meant to sample SavGol, Norris-Williams, finite-difference, Gaussian, FCK
and Whittaker behavior without paying for the heavier strict-family benchmark
profile.

Custom stages (e.g. a Ridge-only compact pass then a PLS-only wide pass):

```python
report = n4m.aom_staged_chain_campaign(
    X, y,
    stages=[
        {"name": "ridge_compact", "profile": "compact", "heads": ("ridge",)},
        {"name": "pls_wide", "profile": "wide", "heads": ("pls",),
         "pls_score_mode": "gcv_proxy"},
    ],
    refit_per_head_top_k=5,
)
```

Offline audit (test ranking only — never used for selection):

```python
report = n4m.aom_staged_chain_campaign(
    X_train, y_train,
    plan="compact_wide",
    X_audit=X_test, y_audit=y_test,   # offline audit only
)
assert report["audit"]["audit_only"] is True
assert report["selection_uses_test_set"] is False
# report["best"] is identical whether or not the audit set is supplied.
```

The same callable is exported from `n4m`, `n4m.aom` and `n4m.moment`.
The reusable sklearn estimator is exported as
`NativeAOMStagedChainCampaignRegressor` from `n4m`, `n4m.sklearn`, `n4m.aom`
and `n4m.moment`.

Resume is stage-local: each stage delegates to
`aom_chain_score_campaign(..., checkpoint_path=...)`. With
`max_chunks_per_run`, the report may have `screen_complete=False`; the retained
rows available so far are still exact-CV refit on train and can be reused as a
partial audit. A later call with the same data/configuration and
`checkpoint_dir` resumes the remaining chunks.
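
A budgeted resume loop built on this behavior might look like the sketch below.
It is an assumption-laden illustration: `run_until_complete` is a hypothetical
wrapper, the campaign callable is injected so the loop can be exercised without
the native runtime, and it relies only on the `screen_complete` report key and
the `checkpoint_dir` / `resume` / `max_chunks_per_run` parameters described
above.

```python
def run_until_complete(campaign, X, y, *, checkpoint_dir,
                       max_chunks_per_run, max_calls=100, **kwargs):
    """Call a staged campaign repeatedly until every stage screen finishes."""
    for _ in range(max_calls):
        report = campaign(
            X, y,
            checkpoint_dir=checkpoint_dir,   # same directory on every call
            resume=True,                     # pick up matching checkpoints
            max_chunks_per_run=max_chunks_per_run,
            **kwargs,                        # must stay identical across calls
        )
        if report["screen_complete"]:
            return report
        # Each partial report is still exact-CV refit on the retained rows,
        # so it could be logged here as an interim audit.
    raise RuntimeError(f"screen still incomplete after {max_calls} calls")
```

Passing `n4m.aom_staged_chain_campaign` as `campaign` would apply the loop to
the real runner; keeping the data and configuration unchanged between calls
matters because resume only applies to matching stage checkpoints.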

## Reused building blocks

| Step | Helper |
|------|--------|
| Strict-chain grids | `build_aom_strict_chain_grid` / `iter_aom_strict_chain_grid` |
| Per-stage screen | `aom_chain_score_campaign` (chunked, streaming) |
| Global + per-head retention | `aom_screen_refit_candidate_pool` semantics |
| Exact-CV refit | `aom_refit_candidates` |
| Preprocessing impact | `aom_candidate_preprocessing_impact` |
| Screen-vs-refit rank | `aom_candidate_rank_diagnostics` |
| Offline audit | `aom_evaluate_candidates` |
| Report save/load | `aom_save_candidate_report` / `aom_load_candidate_report` |

## Workflow / benchmark note

This is the staged runner the AOM benchmark campaign calls for: run
compact/wide/lab strict-chain screens with Ridge, PLS and mixed heads, retain the
top global and per-head candidates, exact-refit the retained rows, and read
`impact` / `rank_diagnostics` to decide whether a wider cartesian budget is
justified before comparing against the robust AOM / TabPFN baselines. Because
each stage screens in chunks of `chain_chunk_size` and only keeps its own
`top_k`, large `lab` cartesians stream rather than materialize every scored
candidate.
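
The streaming claim can be illustrated with a bounded heap: candidates are
scored chunk by chunk and only the best `top_k` are ever held in memory. This is
a generic sketch of the technique, not the code inside
`aom_chain_score_campaign`; `stream_top_k` is a hypothetical helper and the
`(score, candidate)` pair layout is assumed for the example.

```python
import heapq
from itertools import islice


def stream_top_k(scored_candidates, top_k, chunk_size):
    """Keep the top_k lowest-score candidates from a chunked stream."""
    if top_k <= 0:
        return []
    heap = []      # min-heap on negated score, so heap[0] is the worst kept row
    counter = 0    # tie-breaker so candidates themselves are never compared
    stream = iter(scored_candidates)
    while True:
        chunk = list(islice(stream, chunk_size))  # one chunk in memory at a time
        if not chunk:
            break
        for score, candidate in chunk:
            entry = (-score, counter, candidate)
            counter += 1
            if len(heap) < top_k:
                heapq.heappush(heap, entry)
            elif -score > heap[0][0]:             # strictly better than the worst
                heapq.heapreplace(heap, entry)
    # Return best-first (ascending score).
    return [(-neg, cand) for neg, _, cand in sorted(heap, reverse=True)]
```

Memory stays proportional to `top_k + chunk_size` regardless of how many chains
the profile enumerates, which is the property that lets a large `lab` cartesian
run without materializing every scored row.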

### Timing smoke

`benchmarks/cross_binding/bench_aom_staged_chain_campaign_timing.py` records the
end-to-end wall-clock cost of the staged campaign on synthetic data, one CSV row
per `--plans` entry, with the retention / selection / impact / rank-diagnostics
and refit route / CUDA counters pulled from the report. It accepts the small
screen controls (`--plans`, `--max-chains`, `--top-k`, `--refit-top-k`,
`--refit-per-head-top-k`, `--chain-chunk-size`, `--heads`, `--components`,
`--ridge-lambdas`, `--repeats`) and the campaign's CPU/GPU knobs
(`--cuda-pls-parallel-folds`, `--cuda-pls-min-device-features`,
`--cuda-pls-many-batched`, `--backend-min-cuda-product`, `--moment-policy`).

This measures **orchestration and exact-refit timing only** — the staged screen,
cross-stage retention, the single exact-CV refit of the retained union, and the
post-hoc diagnostics. It is **not** a benchmark of the future fused IKPLS
grinder; the per-candidate exact CV it times is the target that grinder must
beat, not the grinder itself.

```bash
PYTHONPATH=bindings/python/src N4M_LIB_PATH=build/dev-release/cpp/src/libn4m.so \
  python benchmarks/cross_binding/bench_aom_staged_chain_campaign_timing.py \
  --plans compact,compact_wide --max-chains 16 --top-k 24 --refit-top-k 8 \
  --output /tmp/aom_staged_chain_campaign_timing.csv
```

Tiny one-GPU CUDA smoke used for release readiness:

```bash
CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH=bindings/python/src \
N4M_LIB_PATH=build/cuda-on/cpp/src/libn4m.so \
  python benchmarks/cross_binding/bench_aom_staged_chain_campaign_timing.py \
  --output benchmarks/cross_binding/aom_staged_chain_campaign_timing_cuda_smoke.csv \
  --repeats 1 --plans compact --n-samples 96 --n-features 128 --cv 3 \
  --heads pls --components 1 --ridge-lambdas 0.1 --max-chains 4 \
  --chain-chunk-size 2 --top-k 4 --refit-top-k 3 \
  --refit-per-head-top-k 1 --moment-policy auto \
  --cuda-pls-min-device-features 1 --cuda-pls-parallel-folds
```

The smoke is expected to keep `selection_uses_test_set=False` and show nonzero
`n_pls_moment_cuda_device_cv_fits` with zero host PLS CV fits.