# `aom_chain_sweep_run` - user-defined native AOM chain sweep

_Group_: **Diagnostic / AOM** · _ABI_: `n4m_aom_chain_sweep_run`

## Description

`aom_chain_sweep_run` is the configurable native preprocessing-campaign
surface. Instead of selecting the built-in `compact` or `wide` AOM bank, the
caller supplies the chain list directly.

Current ABI v1 is intentionally restricted to strict-linear, shape-preserving
operators:

- `identity` / `raw`
- `detrend` / `detrend_poly`
- `savgol_smooth`
- `savgol_derivative`
- `norris_williams` / `nw`
- `finite_difference`
- `whittaker`
- `fck`

Stateful or train-fitted preprocessings such as SNV, MSC, EMSC, OSC/EPO and
baseline families are rejected in this path. They need fold-local fitting and
remain in the Python reference estimator layer.

## Python Usage

The dedicated AOM facade is available as `n4m.aom`; it aliases the same native
runtime as the top-level functions and `n4m.sklearn` classes:

```python
import n4m.aom as aom

res = aom.aom_chain_sweep_run(X, y, chains, heads=("ridge", "pls"))
inventory = aom.available_methods()
```

```python
import n4m

chains = [
    ["identity"],
    [("detrend", [1])],
    [("savgol_smooth", [5, 2])],
    [("detrend", [1]), ("savgol_derivative", [7, 2, 1])],
    [("savgol_smooth", [5, 2]), ("finite_difference", [1])],
]

res = n4m.aom_chain_sweep_run(
    X,
    y,
    chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    scale_x=False,
    moment_policy="auto",
)
```

Sklearn-style native estimator over the same descriptor format:

```python
from n4m.sklearn import NativeAOMChainSweepRegressor

model = NativeAOMChainSweepRegressor(
    chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    scale_x=False,
    moment_policy="auto",
).fit(X_train, y_train)

y_pred = model.predict(X_test)
```

Operator specs can be strings, tuples, or dictionaries:

```python
chains = [
    "identity",
    ("detrend", [2]),
    [{"kind": "savgol_derivative", "params": [11, 2, 1]}],
]
```

Use `"identity"` explicitly for a raw chain; empty chains are rejected.
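The three spec forms can be reduced to one canonical `(kind, params)` pair. The sketch below is illustrative only: `normalize_op` and `normalize_chain` are hypothetical names, not part of `n4m`, and treating a bare spec at chain level as a one-operator chain is an assumption inferred from the example above.

```python
def normalize_op(spec):
    """Return a canonical (kind, params) pair for one operator spec (hypothetical helper)."""
    if isinstance(spec, str):
        return spec, []
    if isinstance(spec, dict):
        return spec["kind"], list(spec.get("params", []))
    kind, params = spec  # tuple form: (kind, params)
    return kind, list(params)


def normalize_chain(chain):
    """Accept a bare spec or a list of specs; reject empty chains."""
    # Assumption: a bare string, tuple, or dict means a single-operator chain.
    ops = chain if isinstance(chain, list) else [chain]
    if not ops:
        raise ValueError("empty chains are rejected; use 'identity' explicitly")
    return [normalize_op(op) for op in ops]


chains = [
    "identity",
    ("detrend", [2]),
    [{"kind": "savgol_derivative", "params": [11, 2, 1]}],
]
normalized = [normalize_chain(c) for c in chains]
```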

`aom.available_methods()` returns JSON-safe metadata for the public AOM
surfaces, including global screen/refit presets, the ultra-configurable
campaign helpers, fixed-candidate winner reuse and linear AOM diversity heads.
It is an inventory for tooling and documentation, not a selector and not a
dataset-dependent router.

## C ABI Descriptor

```c
n4m_aom_chain_sweep_run(
    ctx, cfg, X, Y,
    cv, fold_ids, n_fold_ids,
    chain_offsets, n_chain_offsets,
    op_kinds, n_op_kinds,
    param_offsets, n_param_offsets,
    params, n_params,
    ridge_lambdas, n_ridge_lambdas,
    pls_components, n_pls_components,
    heads_mask,
    out_result)
```

Flat descriptor rules:

- `chain_offsets`: length `n_chains + 1`, monotonic, first `0`, last `n_ops`
- `op_kinds`: length `n_ops`, values from `n4m_operator_kind_t`
- `param_offsets`: length `n_ops + 1`, monotonic, first `0`, last `n_params`
- `params`: flat double parameter payload

Example for three chains:

- chain 0: `identity`
- chain 1: `detrend(1)`
- chain 2: `savgol_smooth(5,2) -> finite_difference(1)`

```c
int32_t chain_offsets[] = {0, 1, 2, 4};
int32_t op_kinds[] = {
    N4M_OP_IDENTITY,
    N4M_OP_DETREND_POLY,
    N4M_OP_SAVGOL_SMOOTH,
    N4M_OP_FINITE_DIFFERENCE,
};
int32_t param_offsets[] = {0, 0, 1, 3, 4};
double params[] = {1.0, 5.0, 2.0, 1.0};
```
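The same flattening can be sketched in Python to check the offsets by hand. `flatten_chains` is a hypothetical helper, and operator names stand in for the `n4m_operator_kind_t` enum values, which are not listed here.

```python
def flatten_chains(chains):
    """Flatten chains of (kind, params) pairs into offset-indexed arrays."""
    chain_offsets, op_kinds, param_offsets, params = [0], [], [0], []
    for chain in chains:
        for kind, op_params in chain:
            op_kinds.append(kind)
            params.extend(float(p) for p in op_params)
            param_offsets.append(len(params))  # end of this operator's parameters
        chain_offsets.append(len(op_kinds))    # end of this chain's operators
    return chain_offsets, op_kinds, param_offsets, params


chains = [
    [("identity", [])],
    [("detrend_poly", [1])],
    [("savgol_smooth", [5, 2]), ("finite_difference", [1])],
]
chain_offsets, op_kinds, param_offsets, params = flatten_chains(chains)
```

Each offset array has one more entry than the items it indexes, so operator `i` owns `params[param_offsets[i]:param_offsets[i + 1]]`, and an operator without parameters contributes an empty slice.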

## Outputs

Outputs match `aom_sweep_run`:

- `candidate_scores` `(n_candidates, 5)`: `candidate_id`, `chain_id`,
  `head_id`, `param`, `cv_rmse`
- `chain_offsets`, `op_kinds`, `param_offsets`, `chain_params`: flat
  descriptor of the validated strict-linear chain bank. For
  `aom_chain_sweep_run`, this echoes the caller-provided descriptor after
  native validation; for `aom_sweep_run`, it serializes the selected built-in
  profile.
- `candidate_routes` `(n_candidates)`: per-candidate scoring route code,
  `0=materialized`, `1=dense_operator_moment`, `2=banded_operator_moment`,
  `3=structured_operator_moment`.
- selected `oof_predictions`, final `predictions`, `coefficients`,
  `input_coefficients`, `intercept`, `x_mean`, `x_scale`, `y_mean`
- `fold_ids`
- scalars including `selected_chain_id`, `selected_head_id`,
  `selected_param`, `selected_cv_rmse`, `n_chains`, `n_candidates`,
  `n_operator_moment_candidates`,
  `n_ridge_operator_moment_candidates`,
  `n_pls_operator_moment_candidates`,
  `n_banded_operator_moment_candidates`,
  `n_structured_operator_moment_candidates`,
  `n_dense_operator_moment_candidates`, `n_materialized_candidates`,
  `n_ridge_materialized_candidates`, `n_pls_materialized_candidates`,
  `n_moment_prefix_cache_hits`, `n_moment_prefix_cache_misses`,
  `n_pls_moment_cv_fits`, `n_pls_materialized_cv_fits`,
  `n_pls_moment_score_batch_calls`,
  `n_pls_moment_score_batch_jobs`, `n_pls_gcv_proxy_candidates`,
  `n_pls_gcv_proxy_fits`, `n_pls_gcv_proxy_batch_calls`,
  `n_pls_gcv_proxy_batch_jobs`,
  `n_pls_moment_final_fits`, `n_pls_materialized_final_fits`,
  `aom_pls_score_mode`, and `score_only`

The scalar `profile` is `-1` for caller-provided chains.

`coefficients` are in the selected transformed-chain feature space.
`input_coefficients` are folded back into the original feature space, so
`X_new @ input_coefficients + intercept` reproduces the selected native model
without replaying the chain in Python.
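The fold-back relies on linearity: if a chain acts as `z = x @ A`, then `z @ b = x @ (A @ b)`. The toy sketch below checks that identity with a 3-feature operator. The matrix, coefficients and helper names are illustrative assumptions, and centering or scaling through `x_mean` / `x_scale` is deliberately ignored.

```python
def apply_operator(x, A):
    """Transform one spectrum (row vector) with a linear operator: z = x @ A."""
    return [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]


def fold_coefficients(A, coefficients):
    """Fold transformed-space coefficients back to input space: A @ b."""
    return [sum(a * b for a, b in zip(row, coefficients)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy operator: first difference with a zero-padded first column (illustrative).
A = [
    [0.0, -1.0, 0.0],
    [0.0, 1.0, -1.0],
    [0.0, 0.0, 1.0],
]
coefficients = [0.5, 2.0, -1.0]  # model fitted in the transformed space
intercept = 0.25
input_coefficients = fold_coefficients(A, coefficients)

x_new = [1.0, 3.0, 6.0]
pred_chain = dot(apply_operator(x_new, A), coefficients) + intercept
pred_folded = dot(x_new, input_coefficients) + intercept
```

Both predictions agree, which is why a folded model can predict without replaying the chain.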

`moment_policy="auto"` is the default and enables guarded exact
operator-moment scoring. Use `moment_policy="materialized"` or `"legacy"` to
force the legacy materialized-chain route for every chain/head. This is useful
when comparing route timings or when a small-cell workload is faster without
moment transforms.

Use `moment_policy="force_moments"` when the candidate screen must be
moment-only. Any chain/head/regime that would need a materialized fallback
returns `UNSUPPORTED` instead of being silently screened outside the moment
route. Python also accepts `"moments_only"`, `"operator_moments_only"`, and
`"strict_moments"`. The selected chain can still be materialized once after
ranking to expose OOF/final predictions and `input_coefficients`.

When the operator-moment route is used, repeated strict-linear chain prefixes
are cached for bounded medium-width grids. This is an exact reuse of
transformed all-sample and held-out moment sets; it does not affect ranking.
The cache is visible through `n_moment_prefix_cache_hits`,
`n_moment_prefix_cache_misses`, and, in `aom_chain_score_campaign`,
`moment_prefix_cache_hit_fraction`.
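As a rough mental model of the hit and miss counters, the sketch below memoizes every chain prefix within one call. It is a simplified assumption, not the native cache: the real counters may be counted per moment set or bounded by grid width.

```python
def count_prefix_cache(chains):
    """Count hits/misses when every chain prefix is memoized within one call."""
    cache = set()
    hits = misses = 0
    for chain in chains:
        for end in range(1, len(chain) + 1):
            prefix = tuple(chain[:end])
            if prefix in cache:
                hits += 1
            else:
                cache.add(prefix)
                misses += 1
    return hits, misses


chains = [
    ("detrend(1)",),
    ("detrend(1)", "savgol_smooth(5,2)"),
    ("detrend(1)", "savgol_smooth(5,2)", "finite_difference(1)"),
    ("savgol_smooth(5,2)", "finite_difference(1)"),
]
hits, misses = count_prefix_cache(chains)
hit_fraction = hits / (hits + misses)
```

The last chain shares operators with earlier chains but not a prefix, so it contributes only misses.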

Use `score_only=True` for broad chain-ranking campaigns when no selected model
artifact is needed yet. The result keeps `candidate_scores`, selected ids,
route counters, `fold_ids` and chain descriptors; model-output matrices are
empty `0 x 0` matrices and scalar `score_only` is `1`. This avoids
selected-model refits and OOF/model output buffers in both operator-moment and
materialized candidate-screen routes. Materialized routes still pay fold-local
scoring fits, so this is not yet a replacement for batched IKPLS or a fully
fused CUDA grinder.

The PLS fit counters expose that residual cost: `n_pls_moment_cv_fits` and
`n_pls_materialized_cv_fits` count CV fits in the screen, and
`n_pls_moment_final_fits` / `n_pls_materialized_final_fits` count selected
final refits only when model outputs are requested. For PLS-only exact-CV
operator-moment screens, the native scorer batches eligible chains through one
internal score-only dispatch, preserving exact fold-CV scores while avoiding a
separate native PLS scoring call per chain. This is the exact screen path; it
is distinct from the cheaper `gcv_proxy` first pass below. The
`n_pls_moment_score_batch_calls` and `n_pls_moment_score_batch_jobs` counters
report how many native many-chain exact dispatches were used and how many
chain-fold jobs they contained.

Use `pls_score_mode="gcv_proxy"` only for explicit first-pass PLS screens. It
requires `score_only=True` and stays inside operator moments; if a requested
chain/head cannot be scored through moments, the call fails instead of falling
back to materialized scoring. PLS candidate scores then use a deterministic
PLS1 GCV RMSE proxy from all-sample transformed moments, so PLS rows expose
`score_metric="pls_gcv_proxy_rmse"` and `n_pls_gcv_proxy_*` counters. This is
not exact fold CV; use it to cheaply retain/rank many chains, then refit or
evaluate selected rows with the default `pls_score_mode="cv"` path. For
PLS-only operator-moment screens, the native proxy path also batches eligible
chains in one internal score-only dispatch and skips held-out moment
transforms, because the proxy only uses all-sample moments. The
`n_pls_gcv_proxy_fits` counter reports one proxy fit per chain, while
`n_pls_gcv_proxy_batch_calls` and `n_pls_gcv_proxy_batch_jobs` report the
many-chain dispatch shape.
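For orientation, the generic textbook GCV estimate divides the training RMSE by `1 - dof / n`. The sketch below shows only that generic form. How the native proxy defines its residual sum of squares and effective degrees of freedom is not stated here, so `gcv_rmse` and its `dof` argument are assumptions for illustration.

```python
import math


def gcv_rmse(rss, n_samples, dof):
    """Textbook GCV RMSE: sqrt(RSS / n) / (1 - dof / n)."""
    if not 0 <= dof < n_samples:
        raise ValueError("dof must satisfy 0 <= dof < n_samples")
    return math.sqrt(rss / n_samples) / (1.0 - dof / n_samples)


# Made-up numbers: residual sum of squares 4.0 on 100 samples, 20 effective dof.
proxy = gcv_rmse(rss=4.0, n_samples=100, dof=20)
```

No held-out data enters the formula, which matches the statement that the proxy needs only all-sample moments.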

Python helpers:

- `n4m.decode_aom_chains(res)` decodes the flat descriptor into operator
  chains.
- `n4m.aom_candidate_table(res, sort=True)` attaches the decoded chain to each
  candidate score row for top-k campaign reports, including `score_route_id`
  and the readable `score_route` label. PLS proxy rows also expose
  `score_metric="pls_gcv_proxy_rmse"`; exact-CV rows keep
  `score_metric="cv_rmse"`.
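What these helpers do can be approximated in a few lines. The sketch below is not the `n4m` implementation: `decode_chains` and `candidate_table` are hypothetical stand-ins, operator names replace enum values, and the score rows are made-up numbers following the `candidate_scores` column order.

```python
def decode_chains(chain_offsets, op_kinds, param_offsets, params):
    """Rebuild chains of (kind, params) pairs from the flat descriptor."""
    chains = []
    for c in range(len(chain_offsets) - 1):
        chain = []
        for op in range(chain_offsets[c], chain_offsets[c + 1]):
            lo, hi = param_offsets[op], param_offsets[op + 1]
            chain.append((op_kinds[op], params[lo:hi]))
        chains.append(chain)
    return chains


def candidate_table(candidate_scores, chains, sort=True):
    """Attach the decoded chain to each (candidate_id, chain_id, head_id, param, cv_rmse) row."""
    rows = [
        {"candidate_id": int(cid), "chain_id": int(chain_id), "head_id": int(head_id),
         "param": param, "cv_rmse": rmse, "chain": chains[int(chain_id)]}
        for cid, chain_id, head_id, param, rmse in candidate_scores
    ]
    return sorted(rows, key=lambda r: r["cv_rmse"]) if sort else rows


chains = decode_chains(
    [0, 1, 2, 4],
    ["identity", "detrend_poly", "savgol_smooth", "finite_difference"],
    [0, 0, 1, 3, 4],
    [1.0, 5.0, 2.0, 1.0],
)
candidate_scores = [(0, 0, 0, 0.1, 0.42), (1, 1, 0, 0.1, 0.35), (2, 2, 1, 2.0, 0.31)]
table = candidate_table(candidate_scores, chains)
```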

## Campaign Helpers

For larger strict-linear preprocessing screens, Python exposes several
convenience helpers over the same native ABI:

```python
chains = n4m.build_aom_strict_chain_grid(
    "lab",
    max_chains=5000,
)

campaign = n4m.aom_chain_score_campaign(
    X,
    y,
    chains=chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    chain_chunk_size=1024,
    top_k=50,
    moment_policy="auto",
    backend_cuda_available=True,
    backend_min_cuda_product=512 * 512,
    checkpoint_path="reports/aom_lab_campaign_checkpoint.json",
    max_chunks_per_run=10,
)

best = campaign["best"]
print(best["chain"], best["head"], best["param"], best["cv_rmse"])

verified = n4m.aom_refit_candidates(
    X_train,
    y_train,
    campaign,
    top_k=20,
    fold_ids=fold_ids,
    scale_x=False,
)
print(verified["best_cv"]["chain"], verified["best_cv"]["refit_cv_rmse"])

screen_refit = n4m.aom_chain_screen_refit_campaign(
    X_train,
    y_train,
    chains=chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    chain_chunk_size=1024,
    top_k=50,
    refit_top_k=20,
    moment_policy="force_moments",
    pls_score_mode="gcv_proxy",
    backend_cuda_available=True,
    backend_min_cuda_product=512 * 512,
    # Separate file: this proxy screen fingerprints differently from the campaign above.
    checkpoint_path="reports/aom_lab_screen_refit_checkpoint.json",
)
print(screen_refit["best_refit"]["chain"], screen_refit["best_refit"]["refit_cv_rmse"])

from n4m.sklearn import (
    NativeAOMFixedCandidateRegressor,
    NativeAOMScreenRefitRegressor,
)

screen_refit_model = NativeAOMScreenRefitRegressor(
    chains=chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    chain_chunk_size=1024,
    top_k=50,
    refit_top_k=20,
    scale_x=False,
    moment_policy="force_moments",
    pls_score_mode="gcv_proxy",
).fit(X_train, y_train)

y_pred = screen_refit_model.predict(X_test)

model = NativeAOMFixedCandidateRegressor.from_candidate(
    best,
    fold_ids=fold_ids,
    scale_x=False,
).fit(X_train, y_train)

y_pred = model.predict(X_test)

holdout = n4m.aom_evaluate_candidates(
    X_train,
    y_train,
    X_test,
    y_test,
    campaign,
    top_k=20,
    fold_ids=fold_ids,
    scale_x=False,
)

print(holdout["best_eval"]["chain"], holdout["best_eval"]["eval_rmse"])
rank_diag = n4m.aom_candidate_rank_diagnostics(holdout, cutoffs=(1, 5, 10, 20))

n4m.aom_save_candidate_report("reports/aom_topk_eval.json", holdout)
n4m.aom_save_candidate_report("reports/aom_topk_eval.csv", holdout)

rows = n4m.aom_load_candidate_report("reports/aom_topk_eval.csv")
summary = n4m.aom_candidate_operator_summary(rows)

model = NativeAOMFixedCandidateRegressor.from_candidate(
    rows[0],
    fold_ids=fold_ids,
    scale_x=False,
).fit(X_train, y_train)

best_pls_model = NativeAOMFixedCandidateRegressor.from_campaign(
    campaign,
    head="pls",
    fold_ids=fold_ids,
    scale_x=False,
).fit(X_train, y_train)
```

`build_aom_strict_chain_grid("compact")` and `"wide"` reproduce the native
built-in chain banks. `"lab"` / `"cartesian"` builds a deterministic broader
strict-linear grid with multiple Savitzky-Golay smooth/derivative variants,
Norris-Williams, finite differences, Gaussian/FCK kernels and Whittaker chains.
Custom `families` and `templates` can define larger cartesian screens without
routing by dataset identity. The AOM `gaussian` family is the strict fixed
zero-padding banded variant used by the moment screen; the full
`n4m.sklearn.Gaussian` / `pp_gaussian` transformer remains the SciPy-compatible
preprocessing surface.

Use `iter_aom_strict_chain_grid(...)` when the same deterministic grid should
be consumed incrementally instead of materialized as one list. It accepts the
same grid arguments plus `start`, `stop`, `chunk_size` and `with_ids`; ids are
stable after de-duplication and `include_identity` filtering, so checkpointed
campaign launchers can resume by chain-id ranges without changing scores.
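The resume-by-id idea can be sketched with a plain generator. This is an assumption-laden toy, not `iter_aom_strict_chain_grid`: `iter_chain_grid` is a hypothetical name, operators are placeholder strings, `None` marks a skipped stage, and identity handling is left out.

```python
from itertools import islice, product


def iter_chain_grid(families, start=0, stop=None, chunk_size=None, with_ids=False):
    """Yield de-duplicated chains in a deterministic order, optionally chunked."""
    def unique_chains():
        seen = set()
        for combo in product(*families):
            chain = tuple(op for op in combo if op is not None)
            if chain and chain not in seen:
                seen.add(chain)
                yield chain

    # Ids are assigned after de-duplication, so they stay stable across runs.
    items = islice(enumerate(unique_chains()), start, stop)
    if not with_ids:
        items = (chain for _, chain in items)
    if chunk_size is None:
        yield from items
        return
    while True:
        chunk = list(islice(items, chunk_size))
        if not chunk:
            return
        yield chunk


families = [
    [None, "detrend(1)"],
    [None, "savgol_smooth(5,2)", "savgol_smooth(7,2)"],
]
full = list(iter_chain_grid(families, with_ids=True))
resumed = list(iter_chain_grid(families, start=2, chunk_size=2, with_ids=True))
```

Resuming at `start=2` yields the same ids and chains as the tail of the full pass, which is the property a checkpointed launcher depends on.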

`aom_chain_score_campaign` always calls
`aom_chain_sweep_run(..., score_only=True)` and aggregates a global top-k over
chunks. It also keeps `top_candidates_by_head` and `best_by_head`, so a broad
mixed Ridge/PLS campaign can inspect the best preprocessing chains per model
head even when the global top-k is dominated by one head. It also keeps
`top_candidates_by_score_route` and `best_by_score_route`, so CPU/GPU audits
can inspect the best candidates scored through materialized, dense, banded or
structured moment routes. These per-head and per-route lists are audit outputs
only; they do not alter the global `top_candidates` order or the native scores.
Reports also expose `moment_backend_recommendations`, keyed by requested head,
using the same launch-planning policy as `moment_screen_backend_recommendation`.
That diagnostic uses only `n_samples`, `n_features`, `head`, `cuda_available`,
`backend_min_cuda_product`, plus the explicit PLS CUDA threshold and
many-batched flag; pass `backend_cuda_available=True` from an external launcher
when a CUDA build is available but the current process has not loaded it yet.
Use `backend_min_cuda_product` to reproduce or override the source-free launch
threshold in campaign reports without changing candidate scores.
The backend recommendation is not part of checkpoint fingerprints and does not
change candidate scoring or ranking.
The report also sums the route counters, so a campaign can state how many rows
used operator moments versus materialized fallback. Passing
`pls_score_mode="gcv_proxy"` to the campaign applies the explicit PLS proxy
screen described above and fingerprints checkpoints separately from exact-CV
campaigns. This helper is for reproducible ranking and inspection; it is not a
fused batched IKPLS or custom CUDA grinder.
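The chunked top-k aggregation can be pictured as a running merge. The sketch below is a simplified assumption with hypothetical helper names and made-up scores; the tie-break on `chain_id` is also an assumption.

```python
import heapq


def merge_top_k(current, chunk_rows, k):
    """Keep the k lowest-RMSE rows seen so far (ties broken by chain_id)."""
    return heapq.nsmallest(k, current + chunk_rows,
                           key=lambda r: (r["cv_rmse"], r["chain_id"]))


def aggregate(chunks, k):
    """Maintain a global top-k and a separate per-head top-k across chunks."""
    top, by_head = [], {}
    for rows in chunks:
        top = merge_top_k(top, rows, k)
        for head in sorted({r["head"] for r in rows}):
            head_rows = [r for r in rows if r["head"] == head]
            by_head[head] = merge_top_k(by_head.get(head, []), head_rows, k)
    return top, by_head


chunks = [
    [{"chain_id": 0, "head": "ridge", "cv_rmse": 0.30},
     {"chain_id": 0, "head": "pls", "cv_rmse": 0.50}],
    [{"chain_id": 1, "head": "ridge", "cv_rmse": 0.20},
     {"chain_id": 1, "head": "pls", "cv_rmse": 0.45}],
    [{"chain_id": 2, "head": "ridge", "cv_rmse": 0.25},
     {"chain_id": 2, "head": "pls", "cv_rmse": 0.60}],
]
top, by_head = aggregate(chunks, k=2)
```

Here the global top-2 is all Ridge, yet the per-head list still surfaces the best PLS chain, which is the audit use case described above.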

For very large cartesian screens, pass `chain_ordering="prefix"` to
`aom_chain_score_campaign` or `aom_chain_screen_refit_campaign` to sort the
chain list by operator-prefix key before chunking. This does not change native
candidate scores: top rows keep their original `chain_id` and also expose
`ordered_chain_id` for audit. It only improves the chance that chains sharing a
strict-linear prefix land in the same native call and hit the per-call
moment-prefix cache. The default `chain_ordering="input"` preserves caller order.
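A minimal sketch of prefix-aware packing follows, assuming the prefix key is simply the operator sequence compared lexicographically. The actual key is not documented here, and `order_by_prefix` is a hypothetical name.

```python
def order_by_prefix(chains):
    """Sort chains by operator sequence while keeping the original chain_id."""
    ordered = sorted(enumerate(chains), key=lambda item: item[1])
    return [
        {"ordered_chain_id": new_id, "chain_id": old_id, "chain": chain}
        for new_id, (old_id, chain) in enumerate(ordered)
    ]


chains = [
    ("detrend(1)", "savgol_smooth(5,2)"),
    ("whittaker(10)",),
    ("detrend(1)", "finite_difference(1)"),
    ("whittaker(10)", "finite_difference(1)"),
]
ordered = order_by_prefix(chains)

# Chunk the ordered list; chains sharing a first operator now sit together.
chunk_size = 2
ordered_chunks = [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
first_ops = [{row["chain"][0] for row in chunk} for chunk in ordered_chunks]
```

With input order, each chunk of two would mix a `detrend` chain with a `whittaker` chain; after ordering, every chunk contains a single shared first operator.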

For mixed Ridge/PLS campaigns, pass `split_head_scoring="auto"` to score each
chunk as two native score-only calls, Ridge-only then PLS-only, and merge the
candidate rows before top-k aggregation. This preserves the `(chain_id, head,
param)` scores and ranking semantics, but lets *both* halves use their native
head-homogeneous batch path: a single mixed call uses none of the batched fast
paths, so splitting turns on the Ridge moment score batch
(`n_ridge_moment_score_batch_calls`/`_jobs`) and the PLS exact or GCV-proxy
batch (`n_pls_moment_score_batch_calls`/`_jobs` for `pls_score_mode="cv"`,
`n_pls_gcv_proxy_batch_calls`/`_jobs` for `pls_score_mode="gcv_proxy"`).
Reports expose `n_split_head_chunks` and `n_chunk_score_calls`.

The lower-level campaign helpers (`aom_chain_score_campaign` /
`aom_chain_screen_refit_campaign`) default to `split_head_scoring="off"` for a
backwards-compatible launch shape. The sklearn screen/refit estimators default
to `"auto"`: `NativeAOMScreenRefitRegressor` (whose default heads are the mixed
`("ridge", "pls")` pair) and its `NativeAOMMomentScreenRefitRegressor` preset.
For single-head screens `"auto"` is inert and `n_split_head_chunks` stays `0`.

Use `n4m.aom_moment_screen_refit_campaign` when you want the same fast moment
profile as a function instead of an estimator. It wraps
`aom_chain_screen_refit_campaign` with `moment_policy="force_moments"`,
`chain_ordering="prefix"`, `split_head_scoring="auto"`,
`pls_score_mode="gcv_proxy"`, `refit_per_head_top_k=10`, and
`refit_execution="auto"`, while still accepting explicit chains, folds, grids,
CUDA flags, checkpoints and refit budgets. The combined report keeps the normal
`n4m.aom_chain_screen_refit_campaign.v1` schema and adds
`campaign_preset="moment_fast_screen_refit"`.

On CUDA builds, pass `cuda_pls_parallel_folds=True` to `aom_chain_sweep_run`,
`aom_chain_score_campaign`, `aom_refit_candidates`,
`aom_chain_screen_refit_campaign`, or the native sklearn screen/refit
wrappers to run eligible exact PLS1 moment jobs in bounded stream-parallel
batches on the selected single GPU. This preserves exact CV scores and reports
`n_pls_moment_cuda_parallel_fold_batches` plus
`n_pls_moment_cuda_parallel_fold_jobs`. It is a scheduling option over the
current exact moment jobs, not fused IKPLS.

An experimental many-job CUDA scheduler is also available for profiling with
`cuda_pls_many_batched=True` or the `N4M_CUDA_PLS_MANY_BATCHED=1`
environment fallback. It tiles independent exact PLS1 moment jobs on one GPU,
batches the dominant `p^2` operations with `cublasDgemmStridedBatched`, and
uses a small native CUDA kernel for per-job sign normalization while preserving
the same scores. If both CUDA PLS schedulers are requested,
`cuda_pls_many_batched=True` is tried before `cuda_pls_parallel_folds=True`.
It is not the default because current smoke timings did not beat the legacy
sequential-many workspace path. Use `N4M_CUDA_PLS_MANY_LEGACY=1` to force the
legacy non-batched path even when an explicit flag or env opt-in is set, and
`N4M_CUDA_PLS_BATCH_MAX_BYTES=<bytes>` to cap the experimental tile memory.

Pass `cuda_pls_min_device_features=<positive int>` to the same calls to change
the CUDA PLS1 moment device-route threshold from the default 1024 features.
This is useful for controlled CPU/CUDA crossover sweeps on medium-width NIRS
datasets. The value is included in campaign fingerprints, reports and sklearn
diagnostics, so checkpoint resume and benchmark CSVs do not mix different
GPU-route configurations.

Campaign and per-chunk reports include normalized timing and route metrics:
`chains_per_second`, `candidates_per_second`, `ms_per_chain`,
`ms_per_candidate`, `operator_moment_candidate_fraction`,
`materialized_candidate_fraction`, and route-specific Ridge/PLS plus
dense/banded/structured fractions. They also include `pls_cv_fits_per_chain`
and `pls_cv_fits_per_candidate`, derived from exact-CV PLS fit counters, plus
`pls_gcv_proxy_fits_per_chain` and `pls_gcv_proxy_fits_per_candidate` when the
proxy screen is enabled. These fields are derived from elapsed chunk times and
native route counters, and are intended for CPU/GPU campaign comparison and
for spotting chunks that leave the operator-moment route or pay excess
fold-local PLS fitting.

`benchmarks/cross_binding/bench_aom_screen_refit_scaling.py` gives the focused
timing for proxy screen plus exact-CV refit as `refit_top_k` increases; use it
to size retained-candidate budgets and to compare future batched IKPLS/CUDA
work against the current exact refit path. Pass `--head ridge` to the same
benchmark to measure grouped and batched exact-CV refit over Ridge lambda
grids. Pass `--head mixed --refit-per-head-top-k K` to measure the mixed
Ridge/PLS workflow that exact-refits the union of global top rows and per-head
top rows. Pass `--chain-ordering prefix` to measure prefix-aware chunk packing
and compare the emitted screen prefix-cache hit counters. Pass
`--split-head-scoring auto` on mixed screens to measure the PLS-only batched
subcall path separately from the historical single mixed call. On CUDA builds,
pass `--cuda-pls-parallel-folds` to time the bounded stream-parallel exact
PLS1 moment scheduling path and inspect the emitted CUDA-parallel
batches/jobs counters. Pass `--cuda-pls-min-device-features 256` or another
positive threshold to test medium-width PLS device routing explicitly.
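The normalized timing and route metrics listed earlier in this section are simple ratios. The sketch below shows plausible definitions inferred from the field names; the exact native formulas are an assumption, and the counter values are made up.

```python
def chunk_metrics(n_chains, n_candidates, n_moment, n_materialized, elapsed_s):
    """Derive rate and route-fraction metrics from counters and elapsed time."""
    return {
        "chains_per_second": n_chains / elapsed_s,
        "candidates_per_second": n_candidates / elapsed_s,
        "ms_per_chain": 1000.0 * elapsed_s / n_chains,
        "ms_per_candidate": 1000.0 * elapsed_s / n_candidates,
        "operator_moment_candidate_fraction": n_moment / n_candidates,
        "materialized_candidate_fraction": n_materialized / n_candidates,
    }


metrics = chunk_metrics(
    n_chains=1024,
    n_candidates=6144,  # e.g. 1024 chains x (3 ridge lambdas + 3 PLS components)
    n_moment=4608,
    n_materialized=1536,
    elapsed_s=2.0,
)
```

A chunk whose `materialized_candidate_fraction` is well above the campaign average is a candidate for closer inspection.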

When `checkpoint_path` is provided, the campaign writes a JSON checkpoint after
each completed chunk and resumes it by default on the next call. The
checkpoint contains the current global, per-head and per-route top-k rows,
per-chunk route counters and a fingerprint of the chain grid, folds,
hyperparameters and `X/y` contents. A mismatched checkpoint raises instead of
mixing scores from different screens. When a partial checkpoint is resumed,
top-k rows are filtered to the chunks actually present in the checkpoint before
new chunks are appended. This is intended for long 50k/200k-chain ranking runs
where process or GPU interruptions should not force a full restart.
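The fingerprint check can be illustrated with the standard library. This is a sketch under stated assumptions: the hashing scheme, the checkpoint layout and the names `campaign_fingerprint` and `resume_or_start` are hypothetical, not the `n4m` checkpoint format.

```python
import hashlib
import json


def campaign_fingerprint(chains, fold_ids, hyperparams, X, y):
    """Hash everything that determines the scores (hypothetical scheme)."""
    payload = json.dumps(
        {"chains": chains, "fold_ids": fold_ids, "hyperparams": hyperparams,
         "X": X, "y": y},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def resume_or_start(checkpoint, fingerprint):
    """Return a usable state, refusing to mix scores from different screens."""
    if checkpoint is None:
        return {"fingerprint": fingerprint, "completed_chunks": [], "top_rows": []}
    if checkpoint["fingerprint"] != fingerprint:
        raise ValueError("checkpoint fingerprint mismatch")
    return checkpoint


chains = [[["detrend", [1]]], [["savgol_smooth", [5, 2]]]]
X, y, fold_ids = [[1.0, 2.0], [3.0, 4.0]], [0.5, 1.5], [0, 1]
fp_cv = campaign_fingerprint(chains, fold_ids, {"pls_score_mode": "cv"}, X, y)
fp_proxy = campaign_fingerprint(chains, fold_ids, {"pls_score_mode": "gcv_proxy"}, X, y)
state = resume_or_start(None, fp_cv)
```

Because the two score modes hash differently, resuming the exact-CV state with the proxy fingerprint raises instead of merging rows.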

Use `max_chunks_per_run` to advance a long campaign incrementally. For
example, a scheduler can run ten chunks, persist the checkpoint, then relaunch
the same call later. The returned report includes `complete`,
`n_remaining_chunks` and `processed_chunks_this_run`. The chunk budget itself
is not part of the checkpoint fingerprint, so it can be changed between
relaunches without invalidating the campaign.
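The relaunch pattern looks roughly like the loop below. `run_campaign` is a hypothetical stand-in that only tracks chunk counts in memory; the report keys mirror the fields named above.

```python
def run_campaign(chunks, checkpoint=None, max_chunks_per_run=None):
    """Process up to max_chunks_per_run chunks, then report progress."""
    state = checkpoint or {"completed": 0}
    remaining = len(chunks) - state["completed"]
    budget = remaining if max_chunks_per_run is None else min(remaining, max_chunks_per_run)
    for index in range(state["completed"], state["completed"] + budget):
        _ = chunks[index]  # a real launcher would score this chunk here
    new_state = {"completed": state["completed"] + budget}
    report = {
        "complete": new_state["completed"] == len(chunks),
        "n_remaining_chunks": len(chunks) - new_state["completed"],
        "processed_chunks_this_run": budget,
    }
    return report, new_state


chunks = list(range(25))
first, state = run_campaign(chunks, max_chunks_per_run=10)
second, state = run_campaign(chunks, checkpoint=state, max_chunks_per_run=10)
# The budget can change between relaunches without invalidating the state.
third, state = run_campaign(chunks, checkpoint=state, max_chunks_per_run=20)
```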

`NativeAOMFixedCandidateRegressor` is the reuse surface for a selected row. It
fits exactly one decoded chain/head/parameter candidate through the same native
ABI and stores folded `input_coefficients`, so `predict(X_new)` does not replay
Python preprocessing objects. Use `from_candidate(row)` for an explicit row,
or `from_campaign(report, head="ridge"|"pls", rank=0)` to reuse the global
winner or a per-head campaign winner directly. Use
`from_refit_report(verified, rank=0)` after `aom_refit_candidates`, or
directly after `aom_chain_screen_refit_campaign`, to reuse the best exact-CV
row from a second-pass report. `rank` is zero-based inside the chosen global,
per-head, or refit-CV ordering.

By default the fixed-candidate estimator uses `fit_mode="cv"` and recomputes
the one-candidate exact CV score. When the row already has a verified exact-CV
score, pass `fit_mode="final_only"` and `precomputed_cv_rmse=...` to fit the
selected chain/head/parameter on all rows without CV replay. The underlying
native endpoint is `n4m.aom_chain_fixed_fit_run`; it returns final predictions,
folded input-space coefficients and intercept, but no OOF predictions or fold
ids because it is not a ranking/CV endpoint.
This endpoint is catalogued as `aom_pop.aom_chain_fixed_fit`.
The cross-binding timing benchmark reports this individual-winner reuse cost
as `native_aom_chain_fixed_fit_pls` and
`native_aom_chain_fixed_fit_ridge` rows in
`benchmarks/cross_binding/aom_sweep_timing.csv` and the matching CUDA smoke
CSV.

`n4m.aom_refit_candidates` is the train-only verification helper for broad
score-only screens. It refits each decoded row as a single exact native
candidate with `pls_score_mode="cv"` and reports `refit_cv_rmse`, `oof_rmse`,
`train_rmse`, screen score metadata and exact refit route/fitting counters.
This is the intended second pass after `pls_score_mode="gcv_proxy"` screens:
the proxy can retain many candidates cheaply, then this helper re-ranks the
retained rows by exact CV without using a holdout/test set.

Use `n4m.aom_refit_execution_plan(candidates, top_k=...,
auto_max_extra_fraction=...)` before the refit to audit the execution cost of
each exact score mode without touching `X` or `y`. It reports
`n_refit_groups`, `n_refit_scored_candidates`, and
`n_refit_extra_scored_candidates` for `individual`, `grouped_score`,
`batched_score`, and `union_batched_score`, plus the `recommended_mode` used by
`execution_mode="auto"`.

Use `execution_mode="grouped_score"` when only exact CV scores are needed:
rows sharing the same decoded chain/head are scored together, so multiple PLS
components or Ridge lambdas avoid redundant fold-local fits. The ranking is
still exact CV; grouped rows do not include per-candidate prediction arrays.

Use `execution_mode="batched_score"` to keep the same exact-CV scores while
batching multiple retained chains that share the same head and retained
parameter set into one native `aom_chain_sweep_run` call. This can reduce
Python/native call overhead and lets native strict-linear prefix caches span
retained chains. It still reports scores only; use `individual` when
per-candidate train/OOF prediction arrays are required.

Use `execution_mode="union_batched_score"` to batch all retained chains for a
head with the union of retained parameters for that head. This may score extra
chain/parameter pairs that are not returned as refit rows; the report exposes
`n_refit_scored_candidates` and `n_refit_extra_scored_candidates` so that
surplus is explicit. It can help when the parameter grid is small relative to
Python/native call overhead.

Use `execution_mode="auto"` when no prediction arrays are needed. It uses the
same plan as `aom_refit_execution_plan`: it selects `union_batched_score` only
when that reduces native refit groups and the extra scored candidates are no
more than `auto_max_extra_fraction * n_retained_candidates`; otherwise it uses
`batched_score`, which never scores unretained parameters.
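The `auto` rule can be sketched as a small planning function. The grouping model below is an assumption drawn from the descriptions above, `refit_execution_plan` is a hypothetical name, and the report keys are illustrative rather than the `aom_refit_execution_plan` schema.

```python
from collections import defaultdict


def refit_execution_plan(candidates, auto_max_extra_fraction=1.0):
    """Plan from retained (chain, head, param) rows without touching X or y."""
    params_by_chain_head = defaultdict(set)
    for chain, head, param in candidates:
        params_by_chain_head[(chain, head)].add(param)
    n_retained = sum(len(p) for p in params_by_chain_head.values())

    # batched_score: one group per (head, identical retained parameter set).
    batched_groups = {(head, frozenset(p))
                      for (_, head), p in params_by_chain_head.items()}

    # union_batched_score: one group per head over the union of parameters.
    union_params, chains_by_head = defaultdict(set), defaultdict(set)
    for (chain, head), p in params_by_chain_head.items():
        union_params[head] |= p
        chains_by_head[head].add(chain)
    union_scored = sum(len(chains_by_head[h]) * len(union_params[h]) for h in union_params)
    extra = union_scored - n_retained

    use_union = (len(union_params) < len(batched_groups)
                 and extra <= auto_max_extra_fraction * n_retained)
    return {
        "n_retained": n_retained,
        "n_batched_groups": len(batched_groups),
        "n_union_groups": len(union_params),
        "n_union_extra": extra,
        "recommended_mode": "union_batched_score" if use_union else "batched_score",
    }


retained = [("c0", "pls", 2), ("c0", "pls", 4), ("c1", "pls", 4),
            ("c2", "pls", 1), ("c2", "pls", 2)]
plan_default = refit_execution_plan(retained)
plan_strict = refit_execution_plan(retained, auto_max_extra_fraction=0.5)
```

With five retained rows, the union mode needs one group instead of three but scores four extra pairs, so it is chosen at a fraction of `1.0` and rejected at `0.5`.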

`n4m.aom_chain_screen_refit_campaign` is the one-call version of that workflow:
it runs the chunked score-only campaign, then exact-CV refits the retained
`refit_top_k` rows. The combined report exposes `screen`, `refit`,
`best_screen`, `best_refit`, `screen_complete`, top-level `rows` and
`best_cv`, so it can be passed directly to
`NativeAOMFixedCandidateRegressor.from_refit_report`. If `max_chunks_per_run`
or an incomplete checkpoint leaves the screen partial, the helper still refits
the current top rows and marks `screen_complete=False`.
Set `refit_per_head_top_k` to include each head's best screen rows in the
exact-CV refit pool in addition to the global `refit_top_k` rows. This is useful
for mixed Ridge/PLS campaigns where PLS may be screened by a GCV proxy while
Ridge rows use exact CV. The helper deduplicates candidates by decoded
chain/head/parameter and reports `n_refit_global_candidates`,
`n_refit_per_head_candidates`, `n_refit_per_head_extra_candidates` and
`n_refit_union_candidates`.
By default it uses `refit_execution="auto"` and
`refit_auto_max_extra_fraction=1.0`, so the second pass can choose
`union_batched_score` when the plan says the reduced native calls justify the
bounded extra exact scores. If `return_predictions=True`, auto mode falls back
to individual replay because score-only batched modes do not return per-row
prediction arrays.
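The refit pool described above is a de-duplicated union of global and per-head top rows. The sketch below is one plausible reading: `build_refit_pool` is hypothetical, the exact meaning of each counter is an assumption, and the scores are made up.

```python
def candidate_key(row):
    """Identity of a candidate: decoded chain, head and parameter."""
    return (row["chain"], row["head"], row["param"])


def build_refit_pool(screen_rows, refit_top_k, refit_per_head_top_k):
    ranked = sorted(screen_rows, key=lambda r: r["score"])
    pool = {candidate_key(r): r for r in ranked[:refit_top_k]}
    n_global = len(pool)
    n_per_head = n_extra = 0
    for head in sorted({r["head"] for r in ranked}):
        head_rows = [r for r in ranked if r["head"] == head]
        for row in head_rows[:refit_per_head_top_k]:
            n_per_head += 1
            if candidate_key(row) not in pool:  # not already retained globally
                pool[candidate_key(row)] = row
                n_extra += 1
    counts = {
        "n_refit_global_candidates": n_global,
        "n_refit_per_head_candidates": n_per_head,
        "n_refit_per_head_extra_candidates": n_extra,
        "n_refit_union_candidates": len(pool),
    }
    return list(pool.values()), counts


screen_rows = [
    {"chain": "a", "head": "ridge", "param": 0.1, "score": 0.20},
    {"chain": "b", "head": "ridge", "param": 0.1, "score": 0.22},
    {"chain": "c", "head": "ridge", "param": 1.0, "score": 0.24},
    {"chain": "a", "head": "pls", "param": 2, "score": 0.30},
    {"chain": "b", "head": "pls", "param": 4, "score": 0.35},
]
pool, counts = build_refit_pool(screen_rows, refit_top_k=2, refit_per_head_top_k=1)
```

The global top-2 is all Ridge, so the per-head budget is what brings the best PLS row into exact verification.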

`NativeAOMScreenRefitRegressor` is the sklearn-style estimator form of the
same workflow. Its `fit` runs the two-pass campaign, stores
`campaign_report_`, `screen_report_` and `refit_report_`, then fits the chosen
verified row as a reusable fixed candidate through final-only native fit.
`predict(X_new)` uses the final folded input-space coefficients and does not
replay Python preprocessing objects. `get_diagnostics()` separates
screen/refit/final counters; after exact-CV refit, the `final_*` fields should
show zero final CV fits and only the selected all-row fit needed to build the
reusable model.

Reusable sklearn presets wrap the same estimator for the common end-user
workflows:

```python
from n4m.sklearn import (
    NativeAOMMomentScreenRefitRegressor,
    NativeAOMMomentPLSScreenRefitRegressor,
    NativeAOMMomentPLSExactScreenRefitRegressor,
    NativeAOMMomentRidgeScreenRefitRegressor,
)

mixed_model = NativeAOMMomentScreenRefitRegressor(
    profile="lab",
    max_chains=5000,
    ridge_lambdas=(1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0),
    pls_components=(1, 2, 3, 4, 6, 8),
    top_k=100,
    refit_top_k=50,
    refit_per_head_top_k=25,
    fold_ids=fold_ids,
).fit(X_train, y_train)

pls_model = NativeAOMMomentPLSScreenRefitRegressor(
    profile="lab",
    max_chains=5000,
    pls_components=(1, 2, 3, 4, 6, 8),
    top_k=100,
    refit_top_k=25,
    fold_ids=fold_ids,
).fit(X_train, y_train)

pls_exact_model = NativeAOMMomentPLSExactScreenRefitRegressor(
    profile="lab",
    max_chains=5000,
    pls_components=(1, 2, 3, 4, 6, 8),
    top_k=100,
    refit_top_k=25,
    fold_ids=fold_ids,
).fit(X_train, y_train)

ridge_model = NativeAOMMomentRidgeScreenRefitRegressor(
    profile="lab",
    max_chains=5000,
    ridge_lambdas=(1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0),
    top_k=100,
    refit_top_k=25,
    fold_ids=fold_ids,
).fit(X_train, y_train)
```

`NativeAOMMomentScreenRefitRegressor` is the mixed global preset. It fixes
`heads=("ridge", "pls")`, uses exact Ridge CV and
`pls_score_mode="gcv_proxy"` for the first pass, then exact-CV refits the
retained union of the global screen top rows and the per-head screen top rows.
The per-head inclusion is controlled by `refit_per_head_top_k`; it is a
train-only retention budget for exact verification, not a new score.

`NativeAOMMomentPLSScreenRefitRegressor` fixes `heads=("pls",)`,
`ridge_lambdas=()`, `pls_score_mode="gcv_proxy"`,
`moment_policy="force_moments"` and `chain_ordering="prefix"`, then exact-CV
refits retained rows with `pls_score_mode="cv"`.
`NativeAOMMomentPLSExactScreenRefitRegressor` fixes the same PLS-only moment
surface but uses `pls_score_mode="cv"` for the first-pass screen too; it is the
auditable exact-screen preset when proxy recall is the question.
`NativeAOMMomentRidgeScreenRefitRegressor` fixes `heads=("ridge",)`,
`pls_components=()`, `moment_policy="force_moments"` and the same prefix-aware
chunk ordering. All presets keep `profile`, custom
`chains`/`families`/`templates`, checkpointing, incremental
`max_chunks_per_run`, top-k budgets and exact-refit execution parameters
configurable. Because these presets are strict moment presets, they raise
`UNSUPPORTED` when the current fold geometry or chain/head regime would leave
the operator-moment route; use the generic
`NativeAOMScreenRefitRegressor(moment_policy="auto", ...)` when a production
run should allow guarded materialized fallbacks.

`n4m.aom_evaluate_candidates` is an explicit analysis helper for comparing
screen or refit rank against a caller-provided holdout/test split. It refits
each decoded candidate on `X_train, y_train`, predicts `X_eval`, and reports
`screen_cv_rmse`, `refit_cv_rmse`, `eval_rmse`, `eval_r2`, `cv_rank`,
`eval_rank`, and `rank_delta`. The eval set is not used to alter the fit,
choose a route, or select by dataset identity.

`n4m.aom_candidate_rank_diagnostics(report_or_rows)` turns a holdout report
into screen-recall metrics. It compares the screen score, `screen_cv_rmse` by
default, against `eval_rmse`, and reports Spearman rank correlation,
mean/median/max absolute rank drift, the eval rank of the screen winner, the
screen rank of the eval winner, and top-k overlap/recall for caller-provided
cutoffs. It can also consume rows reloaded by
`n4m.aom_load_candidate_report`.
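The main recall metrics are easy to reproduce by hand. The sketch below assumes no tied scores and uses zero-based ranks; `rank_diagnostics` is a hypothetical helper, the output keys are illustrative, and the scores are made up.

```python
def ranks(values):
    """Zero-based rank of each value, lowest score first (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, index in enumerate(order):
        out[index] = rank
    return out


def rank_diagnostics(screen_scores, eval_scores, cutoffs=(1, 3)):
    n = len(screen_scores)
    s, e = ranks(screen_scores), ranks(eval_scores)
    drift = [abs(a - b) for a, b in zip(s, e)]
    spearman = 1.0 - 6.0 * sum(d * d for d in drift) / (n * (n * n - 1))
    overlap = {}
    for k in cutoffs:
        top_screen = {i for i in range(n) if s[i] < k}
        top_eval = {i for i in range(n) if e[i] < k}
        overlap[k] = len(top_screen & top_eval) / k
    return {
        "spearman": spearman,
        "mean_abs_rank_drift": sum(drift) / n,
        "max_abs_rank_drift": max(drift),
        "eval_rank_of_screen_winner": e[s.index(0)],
        "screen_rank_of_eval_winner": s[e.index(0)],
        "top_k_overlap": overlap,
    }


screen_cv_rmse = [0.20, 0.22, 0.25, 0.30, 0.40]
eval_rmse = [0.24, 0.21, 0.26, 0.45, 0.33]
diag = rank_diagnostics(screen_cv_rmse, eval_rmse)
```

In this toy case the screen winner is only second on the holdout, so top-1 overlap is zero while top-3 overlap is complete.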

`n4m.aom_candidate_report_records(report)` flattens campaign or holdout
candidate rows into JSON-safe dictionaries. `n4m.aom_save_candidate_report`
writes those rows as `.json`, `.jsonl` / `.ndjson`, or `.csv` without requiring
pandas. Prediction arrays produced by `return_predictions=True` are omitted by
default; pass `include_predictions=True` only for small reports. CSV exports
include `chain_json`, a compact JSON encoding of the decoded strict-linear
preprocessing chain, so a saved top-k row can be refit later with
`NativeAOMFixedCandidateRegressor.from_candidate(row)`.

`n4m.aom_load_candidate_report(path)` reads `.json`, `.jsonl` / `.ndjson`, or
`.csv` candidate reports and restores rows as refittable dictionaries. In
particular, CSV rows recover `chain` from `chain_json` and convert the standard
rank/id/score fields back to numeric types.
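
The CSV round trip described above can be pictured with the standard library alone. This is a minimal sketch, assuming a row layout with `rank`, `head`, `cv_rmse` and `chain` keys and a list-of-lists JSON encoding; only the `chain_json` column name is taken from this page, and the real on-disk layout may differ.

```python
import csv
import io
import json

# Hypothetical candidate row; field names other than `chain_json` are assumptions.
row = {
    "rank": 1,
    "head": "pls",
    "cv_rmse": 0.3125,
    "chain": [["detrend", [1]], ["savgol_derivative", [7, 2, 1]]],
}

# Write: the nested chain becomes one compact JSON string in a single column.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["rank", "head", "cv_rmse", "chain_json"])
writer.writeheader()
writer.writerow({
    "rank": row["rank"],
    "head": row["head"],
    "cv_rmse": row["cv_rmse"],
    "chain_json": json.dumps(row["chain"], separators=(",", ":")),
})

# Read: CSV yields strings only, so decode the chain and cast numeric fields.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
restored = {
    "rank": int(loaded["rank"]),
    "head": loaded["head"],
    "cv_rmse": float(loaded["cv_rmse"]),
    "chain": json.loads(loaded["chain_json"]),
}
```

Keeping the chain in a single JSON column is what lets a flat CSV row carry a nested preprocessing chain without extra files.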

`n4m.aom_candidate_operator_summary(report_or_rows)` groups already-scored
candidate rows by model head, preprocessing operator, operator/head pair,
chain length, and scoring route when route labels are present. It reports
count, best score, mean/median score and rank statistics, scored by `eval_rmse`
when present and otherwise by `cv_rmse`, `refit_cv_rmse` or `screen_cv_rmse`. This is an
analysis surface for pruning or expanding future preprocessing grids; it does
not alter candidate scores or select by dataset identity.
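
The grouping idea can be sketched in a few lines of plain Python. The row layout (`head`, `chain` as a list of operator names), the helper names and the per-row score fallback order are assumptions made for illustration; the library's own summary may be organised differently.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical scored rows; values are invented for illustration.
rows = [
    {"head": "ridge", "chain": ["detrend", "savgol_smooth"], "cv_rmse": 0.40},
    {"head": "pls", "chain": ["savgol_smooth"], "cv_rmse": 0.30},
    {"head": "pls", "chain": ["identity"], "cv_rmse": 0.50},
    {"head": "ridge", "chain": ["identity"], "cv_rmse": 0.42},
]

# Assumed preference order, mirroring the score fields named in the text.
SCORE_KEYS = ("eval_rmse", "cv_rmse", "refit_cv_rmse", "screen_cv_rmse")


def pick_score(row):
    """Return the first score field present in the row."""
    for key in SCORE_KEYS:
        if key in row:
            return row[key]
    raise KeyError("row has no recognised score field")


def summarize(rows, key_fn):
    """Group scores by every key that key_fn yields for a row."""
    groups = defaultdict(list)
    for row in rows:
        for key in key_fn(row):
            groups[key].append(pick_score(row))
    return {
        key: {
            "count": len(scores),
            "best": min(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for key, scores in groups.items()
    }


by_head = summarize(rows, lambda r: [r["head"]])
by_operator = summarize(rows, lambda r: set(r["chain"]))
by_chain_length = summarize(rows, lambda r: [len(r["chain"])])
```

A row contributes to every operator in its chain, so operator counts can sum to more than the number of rows.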

`n4m.aom_candidate_preprocessing_impact(report_or_rows)` is the more detailed
post-hoc impact view. It groups scored rows by inferred preprocessing stage,
operator, concrete option such as `savgol_smooth(7,2)`, position in the chain
and head/stage combinations. When an identity-chain baseline is present, it
also reports best-score improvement versus identity. This is for understanding
which preprocessing options deserve more cartesian budget; it does not rerank
or select candidates.
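
To make the "improvement versus identity" idea concrete, here is a small sketch that labels each concrete option in the `savgol_smooth(7,2)` style, tracks the best score per option and chain position, and subtracts it from the identity baseline. The row layout and helper names are hypothetical, and the helper's actual grouping may be richer than this.

```python
# Hypothetical scored rows: each chain is a list of (operator, params) steps.
rows = [
    {"chain": [("identity", ())], "cv_rmse": 0.50},
    {"chain": [("savgol_smooth", (7, 2))], "cv_rmse": 0.44},
    {"chain": [("detrend", (1,)), ("savgol_smooth", (7, 2))], "cv_rmse": 0.38},
    {"chain": [("detrend", (1,)), ("savgol_smooth", (5, 2))], "cv_rmse": 0.41},
]


def option_label(op, params):
    """Format a concrete option, e.g. savgol_smooth(7,2)."""
    if not params:
        return op
    return f"{op}({','.join(str(p) for p in params)})"


# Baseline: best score among chains that consist of a single identity step.
baseline = min(
    r["cv_rmse"] for r in rows if [op for op, _ in r["chain"]] == ["identity"]
)

# Best score seen for each (concrete option, position in chain) pair.
best = {}
for row in rows:
    for position, (op, params) in enumerate(row["chain"]):
        key = (option_label(op, params), position)
        best[key] = min(best.get(key, float("inf")), row["cv_rmse"])

# Positive values mean the option's best chain beat the identity baseline.
improvement = {key: baseline - score for key, score in best.items()}
```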

`n4m.aom_candidate_route_summary(report_or_rows)` is the route-coverage audit.
It consumes campaign, refit, holdout or reloaded candidate rows and reports the
materialized vs dense/banded/structured operator-moment counts and fractions
for the rows it received, globally, by head and by chain. When the input is a
campaign/refit report with aggregate counters, it also adds `reported_total`
for the full scored/refit candidate set, so a `top_k` report can distinguish
retained-row coverage from full-screen coverage. Use `all_operator_moment`,
`reported_total["all_operator_moment"]` and `materialized_or_unknown_chains` to
verify whether a broad preprocessing screen actually stayed in the moment
routes before reusing or expanding that grid. It is an audit surface only; it
does not rerank candidates or change routing.
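
The kind of bookkeeping this audit performs can be sketched as follows. The `route` key, its label strings and `chain_id` are assumptions for illustration; `all_operator_moment` and `materialized_or_unknown_chains` reuse the names mentioned above, but their exact semantics in the library may differ.

```python
from collections import Counter

# Hypothetical candidate rows with an assumed per-row route label.
rows = [
    {"head": "ridge", "chain_id": 0, "route": "banded_moment"},
    {"head": "pls", "chain_id": 0, "route": "banded_moment"},
    {"head": "ridge", "chain_id": 1, "route": "dense_moment"},
    {"head": "pls", "chain_id": 1, "route": "materialized"},
    {"head": "ridge", "chain_id": 2, "route": "structured_moment"},
]

# Assumed label set for the operator-moment routes.
MOMENT_ROUTES = {"dense_moment", "banded_moment", "structured_moment"}

# Global counts and fractions over the rows received.
counts = Counter(row["route"] for row in rows)
fractions = {route: count / len(rows) for route, count in counts.items()}

# Per-head breakdown of the same counts.
by_head = {}
for row in rows:
    by_head.setdefault(row["head"], Counter())[row["route"]] += 1

# True only if every row stayed on an operator-moment route.
all_operator_moment = all(row["route"] in MOMENT_ROUTES for row in rows)

# Chains with at least one row that left the moment routes.
materialized_or_unknown_chains = sorted(
    {row["chain_id"] for row in rows if row["route"] not in MOMENT_ROUTES}
)
```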

## CUDA Facade Smoke

The AOM and moment Python facades can be checked against the CUDA build with:

```bash
CUDA_VISIBLE_DEVICES=0 python benchmarks/cross_binding/aom_moment_cuda_facade_smoke.py
```

The smoke test loads `build/cuda-on`, runs `n4m.moment.sweep_run` and
`n4m.aom.aom_chain_sweep_run` on a wide PLS1 moment case, and fails if the
reported PLS CV route is host or materialized instead of CUDA-device moments.

## Backend Status

The method builds and tests in CPU and CUDA-enabled libn4m configurations. It
uses exact operator-moment scoring when a chain can be represented cheaply in
moment space. Dense transforms represent a chain by its feature-space operator
matrix and apply `x_sum A`, `A' X'X A`, and `A' X'Y`; they are guarded by
`p <= n_train` or the medium dense cap `p <= 48` with strictly positive Ridge
lambdas.

Local linear operators (`identity`, Savitzky-Golay
smooth/derivative, Norris-Williams, finite difference, Gaussian and FCK) also
use a banded descriptor, avoiding dense chain matrices. The banded route is
enabled up to `p <= 256` for Ridge scoring and `p <= 1024` for compatible
single-target NIPALS PLS1 scoring. Chains containing `detrend_poly` use an
exact structured low-rank projection transform in moment space and can compose
with those banded local operators under the same wide guards. Chains containing
`whittaker` use an exact structured pentadiagonal solve for
`(I + lambda D2'D2)^-1` and can also compose with the banded local operators.

On CPU builds, `auto` routes Ridge rows with `p > n_train` through the exact
materialized dual-Ridge scorer because that is cheaper than feature-space
moment Ridge in this geometry. CPU `auto` also routes compatible PLS1 rows
through the exact materialized prefix scorer when `min_train < 4p`. CUDA builds
keep the operator-moment route in those cells.
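
The dense moment-space transform mentioned above rests on a simple identity: for a linear feature-space operator `A`, the moments of `X A` follow from the raw moments without forming `X A`. The NumPy sketch below checks this on random data, using a 3-point moving average as a stand-in operator. The operator, shapes and variable names are illustrative choices, not the native implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 6
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, 1))

# Stand-in strict-linear operator: 3-point moving average as a banded
# p x p matrix acting on the feature axis (rows of X are multiplied by A).
A = np.zeros((p, p))
for j in range(p):
    for i in range(max(0, j - 1), min(p, j + 2)):
        A[i, j] = 1.0 / 3.0

# Raw moments, accumulated once from the untransformed data.
x_sum = X.sum(axis=0)
xtx = X.T @ X
xty = X.T @ Y

# Moments of the transformed data, obtained without materializing X @ A.
x_sum_t = x_sum @ A       # x_sum A
xtx_t = A.T @ xtx @ A     # A' X'X A
xty_t = A.T @ xty         # A' X'Y

# Reference: materialize the transformed matrix and recompute its moments.
XA = X @ A
moments_match = (
    np.allclose(x_sum_t, XA.sum(axis=0))
    and np.allclose(xtx_t, XA.T @ XA)
    and np.allclose(xty_t, XA.T @ Y)
)
```

Because the transformed moments are `p x p` and `p x 1`, their cost depends on `p` rather than on the number of rows, which is consistent with the `p`-based guards described above.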

Unsupported moment routes fall back per chain/head to the materialized native
sweep in `auto`, or return `UNSUPPORTED` in `force_moments`. Selected chains
are always materialized once to populate public OOF/final predictions. Batched
IKPLS, fully fused operator-moment updates for all regimes and custom CUDA
kernels are future acceleration layers.