aom_chain_sweep_run - user-defined native AOM chain sweep¶
Group: Diagnostic / AOM · ABI: n4m_aom_chain_sweep_run
Description¶
aom_chain_sweep_run is the configurable native preprocessing-campaign
surface. Instead of selecting the built-in compact or wide AOM bank, the
caller supplies the chain list directly.
Current ABI v1 is intentionally restricted to strict-linear, shape-preserving operators:
identity / raw
detrend / detrend_poly
savgol_smooth
savgol_derivative
norris_williams / nw
finite_difference
whittaker
fck
Stateful or train-fitted preprocessing steps such as SNV, MSC, EMSC, OSC/EPO and the baseline families are rejected on this path. They need fold-local fitting and remain in the Python reference estimator layer.
Python Usage¶
The dedicated AOM facade is available as n4m.aom; it aliases the same native
runtime as the top-level functions and n4m.sklearn classes:
import n4m.aom as aom
res = aom.aom_chain_sweep_run(X, y, chains, heads=("ridge", "pls"))
inventory = aom.available_methods()
import n4m
chains = [
    ["identity"],
    [("detrend", [1])],
    [("savgol_smooth", [5, 2])],
    [("detrend", [1]), ("savgol_derivative", [7, 2, 1])],
    [("savgol_smooth", [5, 2]), ("finite_difference", [1])],
]
res = n4m.aom_chain_sweep_run(
    X,
    y,
    chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    scale_x=False,
    moment_policy="auto",
)
Sklearn-style native estimator over the same descriptor format:
from n4m.sklearn import NativeAOMChainSweepRegressor
model = NativeAOMChainSweepRegressor(
    chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    scale_x=False,
    moment_policy="auto",
).fit(X_train, y_train)
y_pred = model.predict(X_test)
Operator specs can be strings, tuples, or dictionaries:
chains = [
    "identity",
    ("detrend", [2]),
    [{"kind": "savgol_derivative", "params": [11, 2, 1]}],
]
Use "identity" explicitly for a raw chain; empty chains are rejected.
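The three accepted spec forms can be pictured as one canonical shape. The sketch below is a plain-Python illustration, not the n4m implementation: `normalize_op` and `normalize_chain` are hypothetical helper names, and the rule that a bare string, tuple, or dict counts as a one-operator chain is an assumption read off the example above.

```python
def normalize_op(spec):
    """Return a (kind, params) tuple for one operator spec (hypothetical helper)."""
    if isinstance(spec, str):
        return (spec, [])
    if isinstance(spec, dict):
        return (spec["kind"], [float(v) for v in spec.get("params", [])])
    kind, params = spec  # tuple form: (kind, params)
    return (kind, [float(v) for v in params])


def normalize_chain(chain):
    """Accept a bare operator spec or a list of specs; reject empty chains."""
    ops = chain if isinstance(chain, list) else [chain]
    if not ops:
        raise ValueError("empty chain; use 'identity' for a raw chain")
    return [normalize_op(op) for op in ops]


chains = [
    "identity",
    ("detrend", [2]),
    [{"kind": "savgol_derivative", "params": [11, 2, 1]}],
]
normalized = [normalize_chain(chain) for chain in chains]
print(normalized)
```

Every chain becomes a non-empty list of `(kind, params)` pairs, which is the shape the flat C descriptor encodes.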
aom.available_methods() returns JSON-safe metadata for the public AOM
surfaces, including global screen/refit presets, the ultra-configurable
campaign helpers, fixed-candidate winner reuse and linear AOM diversity heads.
It is an inventory for tooling and documentation, not a selector and not a
dataset-dependent router.
C ABI Descriptor¶
n4m_aom_chain_sweep_run(
    ctx, cfg, X, Y,
    cv, fold_ids, n_fold_ids,
    chain_offsets, n_chain_offsets,
    op_kinds, n_op_kinds,
    param_offsets, n_param_offsets,
    params, n_params,
    ridge_lambdas, n_ridge_lambdas,
    pls_components, n_pls_components,
    heads_mask,
    out_result)
Flat descriptor rules:
chain_offsets: length n_chains + 1, monotonic, first 0, last n_ops
op_kinds: length n_ops, values from n4m_operator_kind_t
param_offsets: length n_ops + 1, monotonic, first 0, last n_params
params: flat double parameter payload
Example for three chains:
chain 0: identity
chain 1: detrend(1)
chain 2: savgol_smooth(5,2) -> finite_difference(1)
int32_t chain_offsets[] = {0, 1, 2, 4};
int32_t op_kinds[] = {
    N4M_OP_IDENTITY,
    N4M_OP_DETREND_POLY,
    N4M_OP_SAVGOL_SMOOTH,
    N4M_OP_FINITE_DIFFERENCE,
};
int32_t param_offsets[] = {0, 0, 1, 3, 4};
double params[] = {1.0, 5.0, 2.0, 1.0};
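The offset arrays follow the usual CSR-style layout. As a minimal sketch, the hypothetical helper `flatten_chains` below rebuilds the same arrays from a nested chain list in plain Python. It keeps operator kinds as strings because the numeric values of `n4m_operator_kind_t` are not listed here.

```python
def flatten_chains(chains):
    """Flatten [(kind, params), ...] chains into the four descriptor arrays."""
    chain_offsets, op_kinds, param_offsets, params = [0], [], [0], []
    for chain in chains:
        for kind, op_params in chain:
            op_kinds.append(kind)
            params.extend(float(v) for v in op_params)
            param_offsets.append(len(params))  # one entry per operator
        chain_offsets.append(len(op_kinds))    # one entry per chain
    return chain_offsets, op_kinds, param_offsets, params


chains = [
    [("identity", [])],
    [("detrend_poly", [1])],
    [("savgol_smooth", [5, 2]), ("finite_difference", [1])],
]
chain_offsets, op_kinds, param_offsets, params = flatten_chains(chains)
print(chain_offsets)  # [0, 1, 2, 4]
print(param_offsets)  # [0, 0, 1, 3, 4]
print(params)         # [1.0, 5.0, 2.0, 1.0]
```

The repeated `0` at the start of `param_offsets` is how a parameter-free operator such as `identity` is encoded: its parameter slice is empty.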
Outputs¶
Outputs match aom_sweep_run:
candidate_scores (n_candidates, 5): candidate_id, chain_id, head_id, param, cv_rmse
chain_offsets, op_kinds, param_offsets, chain_params: flat descriptor of the validated strict-linear chain bank. For aom_chain_sweep_run, this echoes the caller-provided descriptor after native validation; for aom_sweep_run, it serializes the selected built-in profile.
candidate_routes (n_candidates): per-candidate scoring route code, 0=materialized, 1=dense_operator_moment, 2=banded_operator_moment, 3=structured_operator_moment.
selected oof_predictions, final predictions, coefficients, input_coefficients, intercept, x_mean, x_scale, y_mean
fold_ids
scalars including selected_chain_id, selected_head_id, selected_param, selected_cv_rmse, n_chains, n_candidates, n_operator_moment_candidates, n_ridge_operator_moment_candidates, n_pls_operator_moment_candidates, n_banded_operator_moment_candidates, n_structured_operator_moment_candidates, n_dense_operator_moment_candidates, n_materialized_candidates, n_ridge_materialized_candidates, n_pls_materialized_candidates, n_moment_prefix_cache_hits, n_moment_prefix_cache_misses, n_pls_moment_cv_fits, n_pls_materialized_cv_fits, n_pls_moment_score_batch_calls, n_pls_moment_score_batch_jobs, n_pls_gcv_proxy_candidates, n_pls_gcv_proxy_fits, n_pls_gcv_proxy_batch_calls, n_pls_gcv_proxy_batch_jobs, n_pls_moment_final_fits, n_pls_materialized_final_fits, aom_pls_score_mode, and score_only
The scalar profile is -1 for caller-provided chains.
coefficients are in the selected transformed-chain feature space.
input_coefficients are folded back into the original feature space, so
X_new @ input_coefficients + intercept reproduces the selected native model
without replaying the chain in Python.
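The fold-back works because a strict-linear, shape-preserving chain acts as a fixed feature-space matrix. The sketch below assumes a toy 4 x 4 operator `A` and made-up coefficients, and it ignores centering and scaling (`x_mean`, `x_scale`), treating them as absorbed into the intercept. It shows only the algebra, not the native code path.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


X = [[1.0, 2.0, 4.0, 7.0],
     [0.5, 1.5, 1.0, 3.0],
     [2.0, 2.0, 3.0, 5.0]]

# Illustrative shape-preserving first difference: (X @ A)[:, j] = X[:, j] - X[:, j-1],
# with the first output column set to zero.
A = [[0.0, -1.0, 0.0, 0.0],
     [0.0, 1.0, -1.0, 0.0],
     [0.0, 0.0, 1.0, -1.0],
     [0.0, 0.0, 0.0, 1.0]]

coefficients = [[0.0], [0.3], [-0.2], [0.5]]  # made-up, transformed-chain space
intercept = 0.1

# Fold the chain into the coefficients: (X A) b == X (A b).
input_coefficients = matmul(A, coefficients)

pred_chain = [row[0] + intercept for row in matmul(matmul(X, A), coefficients)]
pred_folded = [row[0] + intercept for row in matmul(X, input_coefficients)]
print(pred_chain)
print(pred_folded)
```

Both prediction lists agree up to floating-point rounding, which is why prediction does not need to replay the chain.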
moment_policy="auto" is the default and enables guarded exact
operator-moment scoring. Use moment_policy="materialized" or "legacy" to
force the legacy materialized-chain route for every chain/head. This is useful
when comparing route timings or when a small-cell workload is faster without
moment transforms.
Use moment_policy="force_moments" when the candidate screen must be
moment-only. Any chain/head/regime that would need a materialized fallback
returns UNSUPPORTED instead of being silently screened outside the moment
route. Python also accepts "moments_only", "operator_moments_only", and
"strict_moments". The selected chain can still be materialized once after
ranking to expose OOF/final predictions and input_coefficients.
When the operator-moment route is used, repeated strict-linear chain prefixes
are cached for bounded medium-width grids. This is an exact reuse of
transformed all-sample and held-out moment sets; it does not affect ranking.
The cache is visible through n_moment_prefix_cache_hits,
n_moment_prefix_cache_misses, and, in aom_chain_score_campaign,
moment_prefix_cache_hit_fraction.
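The idea behind the prefix cache can be sketched in a few lines. This toy assumes operators are hashable `(kind, value)` pairs and uses a short list of numbers as a stand-in for the transformed moment sets. The function name and the hit/miss accounting are illustrative and not taken from the native implementation.

```python
def transform_with_prefix_cache(chains, apply_op, base):
    """Apply each chain to `base`, reusing results for shared operator prefixes."""
    cache = {}
    hits = misses = 0
    results = []
    for chain in chains:
        state, key = base, ()
        for op in chain:
            key = key + (op,)
            if key in cache:
                hits += 1
            else:
                misses += 1
                cache[key] = apply_op(state, op)
            state = cache[key]
        results.append(state)
    return results, hits, misses


def apply_op(state, op):
    """Toy operators standing in for strict-linear moment transforms."""
    kind, value = op
    if kind == "scale":
        return [v * value for v in state]
    return [v + value for v in state]  # "shift"


chains = [
    (("scale", 2), ("shift", 1)),
    (("scale", 2), ("shift", 3)),
    (("scale", 2), ("shift", 1), ("scale", 3)),
]
results, hits, misses = transform_with_prefix_cache(chains, apply_op, [1, 2])
print(results, hits, misses)
```

Because a cached prefix yields the same state as recomputing it, the reuse changes cost only, never the results.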
Use score_only=True for broad chain-ranking campaigns when no selected model
artifact is needed yet. The result keeps candidate_scores, selected ids,
route counters, fold_ids and chain descriptors; model-output matrices are
empty 0 x 0 matrices and scalar score_only is 1. This avoids
selected-model refits and OOF/model output buffers in both operator-moment and
materialized candidate-screen routes. Materialized routes still pay fold-local
scoring fits, so this is not yet a replacement for batched IKPLS or a fully
fused CUDA grinder.
The PLS fit counters expose that residual cost: n_pls_moment_cv_fits and
n_pls_materialized_cv_fits count CV fits in the screen, and
n_pls_moment_final_fits / n_pls_materialized_final_fits count selected
final refits only when model outputs are requested. For PLS-only exact-CV
operator-moment screens, the native scorer batches eligible chains through one
internal score-only dispatch, preserving exact fold-CV scores while avoiding a
separate native PLS scoring call per chain. This is the exact screen path; it
is distinct from the cheaper gcv_proxy first pass below. The
n_pls_moment_score_batch_calls and n_pls_moment_score_batch_jobs counters
report how many native many-chain exact dispatches were used and how many
chain-fold jobs they contained.
Use pls_score_mode="gcv_proxy" only for explicit first-pass PLS screens. It
requires score_only=True and stays inside operator moments; if a requested
chain/head cannot be scored through moments, the call fails instead of falling
back to materialized scoring. PLS candidate scores then use a deterministic
PLS1 GCV RMSE proxy from all-sample transformed moments, so PLS rows expose
score_metric="pls_gcv_proxy_rmse" and n_pls_gcv_proxy_* counters. This is
not exact fold CV; use it to cheaply retain/rank many chains, then refit or
evaluate selected rows with the default pls_score_mode="cv" path. For
PLS-only operator-moment screens, the native proxy path also batches eligible
chains in one internal score-only dispatch and skips held-out moment
transforms, because the proxy only uses all-sample moments. The
n_pls_gcv_proxy_fits counter reports one proxy fit per chain, while
n_pls_gcv_proxy_batch_calls and n_pls_gcv_proxy_batch_jobs report the
many-chain dispatch shape.
Python helpers:
n4m.decode_aom_chains(res) decodes the flat descriptor into operator chains.
n4m.aom_candidate_table(res, sort=True) attaches the decoded chain to each candidate score row for top-k campaign reports, including score_route_id and the readable score_route label. PLS proxy rows also expose score_metric="pls_gcv_proxy_rmse"; exact-CV rows keep score_metric="cv_rmse".
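What these helpers do can be approximated in plain Python. The sketch below is not the n4m implementation: `decode_chains` and `attach_chains` are hypothetical stand-ins, operator kinds are kept as strings, and the score rows are made-up values in the documented `candidate_scores` column order.

```python
def decode_chains(chain_offsets, op_kinds, param_offsets, params):
    """Rebuild nested [(kind, params), ...] chains from the flat descriptor."""
    chains = []
    for c in range(len(chain_offsets) - 1):
        ops = []
        for op in range(chain_offsets[c], chain_offsets[c + 1]):
            lo, hi = param_offsets[op], param_offsets[op + 1]
            ops.append((op_kinds[op], list(params[lo:hi])))
        chains.append(ops)
    return chains


def attach_chains(candidate_scores, chains, sort=True):
    """Turn score rows into dicts carrying their decoded chain."""
    rows = [
        {"candidate_id": int(cid), "chain_id": int(ch), "head_id": int(head),
         "param": param, "cv_rmse": score, "chain": chains[int(ch)]}
        for cid, ch, head, param, score in candidate_scores
    ]
    return sorted(rows, key=lambda r: r["cv_rmse"]) if sort else rows


decoded = decode_chains(
    [0, 1, 2, 4],
    ["identity", "detrend_poly", "savgol_smooth", "finite_difference"],
    [0, 0, 1, 3, 4],
    [1.0, 5.0, 2.0, 1.0],
)
# Made-up rows: candidate_id, chain_id, head_id, param, cv_rmse.
scores = [(0, 0, 0, 0.1, 0.42), (1, 1, 0, 0.1, 0.35), (2, 2, 1, 2.0, 0.31)]
table = attach_chains(scores, decoded)
print(table[0]["chain"], table[0]["cv_rmse"])
```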
Campaign Helpers¶
For larger strict-linear preprocessing screens, Python exposes two convenience helpers over the same native ABI:
chains = n4m.build_aom_strict_chain_grid(
    "lab",
    max_chains=5000,
)
campaign = n4m.aom_chain_score_campaign(
    X,
    y,
    chains=chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    chain_chunk_size=1024,
    top_k=50,
    moment_policy="auto",
    backend_cuda_available=True,
    backend_min_cuda_product=512 * 512,
    checkpoint_path="reports/aom_lab_campaign_checkpoint.json",
    max_chunks_per_run=10,
)
best = campaign["best"]
print(best["chain"], best["head"], best["param"], best["cv_rmse"])
verified = n4m.aom_refit_candidates(
    X_train,
    y_train,
    campaign,
    top_k=20,
    fold_ids=fold_ids,
    scale_x=False,
)
print(verified["best_cv"]["chain"], verified["best_cv"]["refit_cv_rmse"])
screen_refit = n4m.aom_chain_screen_refit_campaign(
    X_train,
    y_train,
    chains=chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    chain_chunk_size=1024,
    top_k=50,
    refit_top_k=20,
    moment_policy="force_moments",
    pls_score_mode="gcv_proxy",
    backend_cuda_available=True,
    backend_min_cuda_product=512 * 512,
    checkpoint_path="reports/aom_lab_campaign_checkpoint.json",
)
print(screen_refit["best_refit"]["chain"], screen_refit["best_refit"]["refit_cv_rmse"])
from n4m.sklearn import (
    NativeAOMFixedCandidateRegressor,
    NativeAOMScreenRefitRegressor,
)
screen_refit_model = NativeAOMScreenRefitRegressor(
    chains=chains,
    fold_ids=fold_ids,
    ridge_lambdas=[0.01, 0.1, 1.0],
    pls_components=[1, 2, 4],
    heads=("ridge", "pls"),
    chain_chunk_size=1024,
    top_k=50,
    refit_top_k=20,
    scale_x=False,
    moment_policy="force_moments",
    pls_score_mode="gcv_proxy",
).fit(X_train, y_train)
y_pred = screen_refit_model.predict(X_test)
model = NativeAOMFixedCandidateRegressor.from_candidate(
    best,
    fold_ids=fold_ids,
    scale_x=False,
).fit(X_train, y_train)
y_pred = model.predict(X_test)
holdout = n4m.aom_evaluate_candidates(
    X_train,
    y_train,
    X_test,
    y_test,
    campaign,
    top_k=20,
    fold_ids=fold_ids,
    scale_x=False,
)
print(holdout["best_eval"]["chain"], holdout["best_eval"]["eval_rmse"])
rank_diag = n4m.aom_candidate_rank_diagnostics(holdout, cutoffs=(1, 5, 10, 20))
n4m.aom_save_candidate_report("reports/aom_topk_eval.json", holdout)
n4m.aom_save_candidate_report("reports/aom_topk_eval.csv", holdout)
rows = n4m.aom_load_candidate_report("reports/aom_topk_eval.csv")
summary = n4m.aom_candidate_operator_summary(rows)
model = NativeAOMFixedCandidateRegressor.from_candidate(
    rows[0],
    fold_ids=fold_ids,
    scale_x=False,
).fit(X_train, y_train)
best_pls_model = NativeAOMFixedCandidateRegressor.from_campaign(
    campaign,
    head="pls",
    fold_ids=fold_ids,
    scale_x=False,
).fit(X_train, y_train)
build_aom_strict_chain_grid("compact") and "wide" reproduce the native
built-in chain banks. "lab" / "cartesian" builds a deterministic broader
strict-linear grid with multiple Savitzky-Golay smooth/derivative variants,
Norris-Williams, finite differences, Gaussian/FCK kernels and Whittaker chains.
Custom families and templates can define larger cartesian screens without
routing by dataset identity. The AOM gaussian family is the strict fixed
zero-padding banded variant used by the moment screen; the full
n4m.sklearn.Gaussian / pp_gaussian transformer remains the SciPy-compatible
preprocessing surface.
Use iter_aom_strict_chain_grid(...) when the same deterministic grid should
be consumed incrementally instead of materialized as one list. It accepts the
same grid arguments plus start, stop, chunk_size and with_ids; ids are
stable after de-duplication and include_identity filtering, so checkpointed
campaign launchers can resume by chain-id ranges without changing scores.
aom_chain_score_campaign always calls
aom_chain_sweep_run(..., score_only=True) and aggregates a global top-k over
chunks. It also keeps top_candidates_by_head and best_by_head, so a broad
mixed Ridge/PLS campaign can inspect the best preprocessing chains per model
head even when the global top-k is dominated by one head. It also keeps
top_candidates_by_score_route and best_by_score_route, so CPU/GPU audits
can inspect the best candidates scored through materialized, dense, banded or
structured moment routes. These per-head and per-route lists are audit outputs
only; they do not alter the global top_candidates order or the native scores.
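The chunked aggregation described above can be pictured with the standard library alone. This is a sketch under assumptions: `merge_top_k` is a hypothetical name, rows are plain dicts with made-up scores, and only the global and per-head lists are modelled (per-route lists would follow the same pattern).

```python
import heapq


def merge_top_k(chunks, top_k):
    """Keep a global top-k and a per-head top-k while streaming over chunks."""
    def score(row):
        return row["cv_rmse"]

    top, by_head = [], {}
    for rows in chunks:
        top = heapq.nsmallest(top_k, top + rows, key=score)
        for head in {r["head"] for r in rows}:
            head_rows = [r for r in rows if r["head"] == head]
            by_head[head] = heapq.nsmallest(
                top_k, by_head.get(head, []) + head_rows, key=score)
    return {
        "top_candidates": top,
        "best": top[0],
        "top_candidates_by_head": by_head,
        "best_by_head": {head: rows[0] for head, rows in by_head.items()},
    }


chunks = [
    [{"chain_id": 0, "head": "ridge", "cv_rmse": 0.30},
     {"chain_id": 0, "head": "pls", "cv_rmse": 0.50}],
    [{"chain_id": 1, "head": "ridge", "cv_rmse": 0.20},
     {"chain_id": 1, "head": "pls", "cv_rmse": 0.45}],
]
report = merge_top_k(chunks, top_k=2)
print(report["best"], report["best_by_head"]["pls"])
```

In this toy run the global top-2 is all Ridge, while `best_by_head["pls"]` still surfaces the best PLS row. That is the situation the per-head lists are meant for.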
Reports also expose moment_backend_recommendations, keyed by requested head,
using the same launch-planning policy as moment_screen_backend_recommendation.
That diagnostic uses only n_samples, n_features, head, cuda_available,
backend_min_cuda_product, plus the explicit PLS CUDA threshold and
many-batched flag; pass backend_cuda_available=True from an external launcher
when a CUDA build is available but the current process has not loaded it yet.
Use backend_min_cuda_product to reproduce or override the source-free launch
threshold in campaign reports without changing candidate scores.
The backend recommendation is not part of checkpoint fingerprints and does not
change candidate scoring or ranking.
The report also sums the route counters, so a campaign can state how many rows
used operator moments versus materialized fallback. Passing
pls_score_mode="gcv_proxy" to the campaign applies the explicit PLS proxy
screen described above and fingerprints checkpoints separately from exact-CV
campaigns. This helper is for reproducible ranking and inspection; it is not a
fused batched IKPLS or custom CUDA grinder.
For very large cartesian screens, pass chain_ordering="prefix" to
aom_chain_score_campaign or aom_chain_screen_refit_campaign to sort the
chain list by operator-prefix key before chunking. This does not change native
candidate scores: top rows keep their original chain_id and also expose
ordered_chain_id for audit. It only improves the chance that chains sharing a
strict-linear prefix land in the same native call and hit the per-call
moment-prefix cache. The default chain_ordering="input" preserves caller order.
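A minimal sketch of prefix ordering follows, assuming the sort key is simply the chain's operator sequence. The actual key used by the campaign helpers is not specified here, and `order_by_prefix` is a hypothetical name.

```python
def order_by_prefix(chains):
    """Sort chains lexicographically by operator sequence, keeping both ids."""
    def prefix_key(item):
        _, chain = item
        return [(kind, tuple(params)) for kind, params in chain]

    ordered = sorted(enumerate(chains), key=prefix_key)
    return [
        {"ordered_chain_id": new_id, "chain_id": old_id, "chain": chain}
        for new_id, (old_id, chain) in enumerate(ordered)
    ]


chains = [
    [("savgol_smooth", [5, 2]), ("finite_difference", [1])],
    [("detrend", [1])],
    [("savgol_smooth", [5, 2])],
    [("detrend", [1]), ("savgol_derivative", [7, 2, 1])],
]
ordered = order_by_prefix(chains)
print([row["chain_id"] for row in ordered])
```

Chains that share a leading operator end up adjacent, so a fixed-size chunk is more likely to contain both a prefix and its extensions.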
For mixed Ridge/PLS campaigns, pass split_head_scoring="auto" to score each
chunk as two native score-only calls, Ridge-only then PLS-only, and merge the
candidate rows before top-k aggregation. This preserves the (chain_id, head, param) scores and ranking semantics, but lets both halves use their native
head-homogeneous batch path: a single mixed call uses none of the batched fast
paths, so splitting turns on the Ridge moment score batch
(n_ridge_moment_score_batch_calls/_jobs) and the PLS exact or GCV-proxy
batch (n_pls_moment_score_batch_calls/_jobs for pls_score_mode="cv",
n_pls_gcv_proxy_batch_calls/_jobs for pls_score_mode="gcv_proxy").
Reports expose n_split_head_chunks and n_chunk_score_calls.
The lower-level campaign helpers (aom_chain_score_campaign /
aom_chain_screen_refit_campaign) default to split_head_scoring="off" for a
backwards-compatible launch shape. The sklearn screen/refit estimators default
to "auto": NativeAOMScreenRefitRegressor (whose default heads are the mixed
("ridge", "pls") pair) and its NativeAOMMomentScreenRefitRegressor preset.
For single-head screens "auto" is inert and n_split_head_chunks stays 0.
Use n4m.aom_moment_screen_refit_campaign when you want the same fast moment
profile as a function instead of an estimator. It wraps
aom_chain_screen_refit_campaign with moment_policy="force_moments",
chain_ordering="prefix", split_head_scoring="auto",
pls_score_mode="gcv_proxy", refit_per_head_top_k=10, and
refit_execution="auto", while still accepting explicit chains, folds, grids,
CUDA flags, checkpoints and refit budgets. The combined report keeps the normal
n4m.aom_chain_screen_refit_campaign.v1 schema and adds
campaign_preset="moment_fast_screen_refit".
On CUDA builds, pass cuda_pls_parallel_folds=True to aom_chain_sweep_run,
aom_chain_score_campaign, aom_refit_candidates,
aom_chain_screen_refit_campaign, or the native sklearn screen/refit
wrappers to run eligible exact PLS1 moment jobs in bounded stream-parallel
batches on the selected single GPU. This preserves exact CV scores and reports
n_pls_moment_cuda_parallel_fold_batches plus
n_pls_moment_cuda_parallel_fold_jobs. It is a scheduling option over the
current exact moment jobs, not fused IKPLS.
An experimental many-job CUDA scheduler is also available for profiling with
cuda_pls_many_batched=True or the N4M_CUDA_PLS_MANY_BATCHED=1
environment fallback. It tiles independent exact PLS1 moment jobs on one GPU,
batches the dominant p^2 operations with cublasDgemmStridedBatched, and
uses a small native CUDA kernel for per-job sign normalization while preserving
the same scores. If both CUDA PLS schedulers are requested,
cuda_pls_many_batched=True is tried before cuda_pls_parallel_folds=True.
It is not the default because current smoke timings did not beat the legacy
sequential-many workspace path. Use N4M_CUDA_PLS_MANY_LEGACY=1 to force the
legacy non-batched path even when an explicit flag or env opt-in is set, and
N4M_CUDA_PLS_BATCH_MAX_BYTES=<bytes> to cap the experimental tile memory.
Pass cuda_pls_min_device_features=<positive int> to the same calls to change
the CUDA PLS1 moment device-route threshold from the default 1024 features.
This is useful for controlled CPU/CUDA crossover sweeps on medium-width NIRS
datasets. The value is included in campaign fingerprints, reports and sklearn
diagnostics, so checkpoint resume and benchmark CSVs do not mix different
GPU-route configurations.
Campaign and per-chunk reports include normalized timing and route metrics:
chains_per_second, candidates_per_second, ms_per_chain,
ms_per_candidate, operator_moment_candidate_fraction,
materialized_candidate_fraction, and route-specific Ridge/PLS plus
dense/banded/structured fractions. They also include pls_cv_fits_per_chain
and pls_cv_fits_per_candidate, derived from exact-CV PLS fit counters, plus
pls_gcv_proxy_fits_per_chain and pls_gcv_proxy_fits_per_candidate when the
proxy screen is enabled. These fields are derived from elapsed chunk times and
native route counters, and are intended for CPU/GPU campaign comparison and
for spotting chunks that leave the operator-moment route or pay excess
fold-local PLS fitting.
benchmarks/cross_binding/bench_aom_screen_refit_scaling.py gives the focused
timing for proxy screen plus exact-CV refit as refit_top_k increases; use it
to size retained-candidate budgets and to compare future batched IKPLS/CUDA
work against the current exact refit path. Pass --head ridge to the same
benchmark to measure grouped and batched exact-CV refit over Ridge lambda
grids. Pass --head mixed --refit-per-head-top-k K to measure the mixed
Ridge/PLS workflow that exact-refits the union of global top rows and per-head
top rows. Pass --chain-ordering prefix to measure prefix-aware chunk packing
and compare the emitted screen prefix-cache hit counters. Pass
--split-head-scoring auto on mixed screens to measure the PLS-only batched
subcall path separately from the historical single mixed call. On CUDA builds,
pass --cuda-pls-parallel-folds to time the bounded stream-parallel exact
PLS1 moment scheduling path and inspect the emitted CUDA-parallel
batches/jobs counters. Pass --cuda-pls-min-device-features 256 or another
positive threshold to test medium-width PLS device routing explicitly.
When checkpoint_path is provided, the campaign writes a JSON checkpoint after
each completed chunk and resumes it by default on the next call. The
checkpoint contains the current global, per-head and per-route top-k rows,
per-chunk route counters and a fingerprint of the chain grid, folds,
hyperparameters and X/y contents. A mismatched checkpoint raises instead of
mixing scores from different screens. When a partial checkpoint is resumed,
top-k rows are filtered to the chunks actually present in the checkpoint before
new chunks are appended. This is intended for long 50k/200k-chain ranking runs
where process or GPU interruptions should not force a full restart.
Use max_chunks_per_run to advance a long campaign incrementally. For
example, a scheduler can run ten chunks, persist the checkpoint, then relaunch
the same call later. The returned report includes complete,
n_remaining_chunks and processed_chunks_this_run. The chunk budget itself
is not part of the checkpoint fingerprint, so it can be changed between
relaunches without invalidating the campaign.
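The checkpoint and chunk-budget behaviour can be sketched with the standard library. Everything here is illustrative: `run_campaign`, the JSON layout, and the SHA-256 fingerprint over a config dict are assumptions, and the real fingerprint also covers folds and X/y contents.

```python
import hashlib
import json
import os
import tempfile


def fingerprint(config):
    """Stable hash of the campaign configuration (illustrative)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_campaign(chunks, config, score_chunk, checkpoint_path, max_chunks_per_run=None):
    """Score pending chunks, writing the checkpoint after each completed chunk."""
    fp = fingerprint(config)
    state = {"fingerprint": fp, "done": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)
        if state["fingerprint"] != fp:
            raise ValueError("checkpoint fingerprint mismatch")
    processed = 0
    for idx, chunk in enumerate(chunks):
        if str(idx) in state["done"]:
            continue  # already scored in an earlier run
        if max_chunks_per_run is not None and processed >= max_chunks_per_run:
            break  # budget is not part of the fingerprint
        state["done"][str(idx)] = score_chunk(chunk)
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)
        processed += 1
    remaining = len(chunks) - len(state["done"])
    return {"complete": remaining == 0, "n_remaining_chunks": remaining,
            "processed_chunks_this_run": processed, "scores": state["done"]}


chunks = [[1, 2], [3], [4, 5, 6]]  # stand-ins for chain chunks
config = {"chains": chunks, "ridge_lambdas": [0.1]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    first = run_campaign(chunks, config, sum, path, max_chunks_per_run=2)
    second = run_campaign(chunks, config, sum, path, max_chunks_per_run=2)
    try:
        run_campaign(chunks, {**config, "ridge_lambdas": [1.0]}, sum, path)
        mismatch_raised = False
    except ValueError:
        mismatch_raised = True
print(first["n_remaining_chunks"], second["complete"], mismatch_raised)
```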
NativeAOMFixedCandidateRegressor is the reuse surface for a selected row. It
fits exactly one decoded chain/head/parameter candidate through the same native
ABI and stores folded input_coefficients, so predict(X_new) does not replay
Python preprocessing objects. Use from_candidate(row) for an explicit row,
or from_campaign(report, head="ridge"|"pls", rank=0) to reuse the global
winner or a per-head campaign winner directly. Use
from_refit_report(verified, rank=0) after aom_refit_candidates, or
directly after aom_chain_screen_refit_campaign, to reuse the best exact-CV
row from a second-pass report. rank is zero-based inside the chosen global,
per-head, or refit-CV ordering.
By default the fixed-candidate estimator uses fit_mode="cv" and recomputes
the one-candidate exact CV score. When the row already has a verified exact-CV
score, pass fit_mode="final_only" and precomputed_cv_rmse=... to fit the
selected chain/head/parameter on all rows without CV replay. The underlying
native endpoint is n4m.aom_chain_fixed_fit_run; it returns final predictions,
folded input-space coefficients and intercept, but no OOF predictions or fold
ids because it is not a ranking/CV endpoint.
This endpoint is catalogued as aom_pop.aom_chain_fixed_fit.
The cross-binding timing benchmark reports this individual-winner reuse cost
as native_aom_chain_fixed_fit_pls and
native_aom_chain_fixed_fit_ridge rows in
benchmarks/cross_binding/aom_sweep_timing.csv and the matching CUDA smoke
CSV.
n4m.aom_refit_candidates is the train-only verification helper for broad
score-only screens. It refits each decoded row as a single exact native
candidate with pls_score_mode="cv" and reports refit_cv_rmse, oof_rmse,
train_rmse, screen score metadata and exact refit route/fitting counters.
This is the intended second pass after pls_score_mode="gcv_proxy" screens:
the proxy can retain many candidates cheaply, then this helper re-ranks the
retained rows by exact CV without using a holdout/test set.
Use n4m.aom_refit_execution_plan(candidates, top_k=..., auto_max_extra_fraction=...) before the refit to audit the execution cost of
each exact score mode without touching X or y. It reports
n_refit_groups, n_refit_scored_candidates, and
n_refit_extra_scored_candidates for individual, grouped_score,
batched_score, and union_batched_score, plus the recommended_mode used by
execution_mode="auto".
Use execution_mode="grouped_score" when only exact CV scores are needed:
rows sharing the same decoded chain/head are scored together, so multiple PLS
components or Ridge lambdas avoid redundant fold-local fits. The ranking is
still exact CV; grouped rows do not include per-candidate prediction arrays.
Use execution_mode="batched_score" to keep the same exact-CV scores while
batching multiple retained chains that share the same head and retained
parameter set into one native aom_chain_sweep_run call. This can reduce
Python/native call overhead and lets native strict-linear prefix caches span
retained chains. It still reports scores only; use individual when
per-candidate train/OOF prediction arrays are required.
Use execution_mode="union_batched_score" to batch all retained chains for a
head with the union of retained parameters for that head. This may score extra
chain/parameter pairs that are not returned as refit rows; the report exposes
n_refit_scored_candidates and n_refit_extra_scored_candidates so that
surplus is explicit. It can help when the parameter grid is small relative to
Python/native call overhead.
Use execution_mode="auto" when no prediction arrays are needed. It uses the
same plan as aom_refit_execution_plan: it selects union_batched_score only
when that reduces native refit groups and the extra scored candidates are no
more than auto_max_extra_fraction * n_retained_candidates; otherwise it uses
batched_score, which never scores unretained parameters.
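The selection rule can be written out as a small planning function. This sketch is one reading of the text above, not the library's planner: `refit_plan` is a hypothetical name, and the grouping definitions (one batched group per head and retained parameter set, one union group per head) are assumptions inferred from the mode descriptions.

```python
def refit_plan(rows, auto_max_extra_fraction=1.0):
    """Compare batched_score and union_batched_score costs for retained rows."""
    n_retained = len(rows)
    params_of = {}
    for r in rows:
        params_of.setdefault((r["chain_id"], r["head"]), set()).add(r["param"])

    # batched_score: chains sharing head and retained parameter set form one group.
    batched_groups = {(head, frozenset(p)) for (_, head), p in params_of.items()}

    # union_batched_score: one group per head, all chains x union of parameters.
    per_head = {}
    for (chain_id, head), p in params_of.items():
        entry = per_head.setdefault(head, {"chains": set(), "params": set()})
        entry["chains"].add(chain_id)
        entry["params"] |= p
    union_scored = sum(len(e["chains"]) * len(e["params"]) for e in per_head.values())
    extra = union_scored - n_retained

    use_union = (len(per_head) < len(batched_groups)
                 and extra <= auto_max_extra_fraction * n_retained)
    return {
        "batched_score": {"n_refit_groups": len(batched_groups),
                          "n_refit_scored_candidates": n_retained,
                          "n_refit_extra_scored_candidates": 0},
        "union_batched_score": {"n_refit_groups": len(per_head),
                                "n_refit_scored_candidates": union_scored,
                                "n_refit_extra_scored_candidates": extra},
        "recommended_mode": "union_batched_score" if use_union else "batched_score",
    }


rows = [
    {"chain_id": 0, "head": "pls", "param": 2},
    {"chain_id": 0, "head": "pls", "param": 4},
    {"chain_id": 1, "head": "pls", "param": 2},
    {"chain_id": 2, "head": "pls", "param": 4},
    {"chain_id": 0, "head": "ridge", "param": 0.1},
    {"chain_id": 1, "head": "ridge", "param": 0.1},
]
plan = refit_plan(rows, auto_max_extra_fraction=1.0)
strict_plan = refit_plan(rows, auto_max_extra_fraction=0.25)
print(plan["recommended_mode"], strict_plan["recommended_mode"])
```

With six retained rows, the union mode needs two groups instead of four at the cost of two extra scored candidates. It is accepted under the default budget and rejected under the tighter one.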
n4m.aom_chain_screen_refit_campaign is the one-call version of that workflow:
it runs the chunked score-only campaign, then exact-CV refits the retained
refit_top_k rows. The combined report exposes screen, refit,
best_screen, best_refit, screen_complete, top-level rows and
best_cv, so it can be passed directly to
NativeAOMFixedCandidateRegressor.from_refit_report. If max_chunks_per_run
or an incomplete checkpoint leaves the screen partial, the helper still refits
the current top rows and marks screen_complete=False.
Set refit_per_head_top_k to include each head’s best screen rows in the
exact-CV refit pool in addition to the global refit_top_k rows. This is useful
for mixed Ridge/PLS campaigns where PLS may be screened by a GCV proxy while
Ridge rows use exact CV. The helper deduplicates candidates by decoded
chain/head/parameter and reports n_refit_global_candidates,
n_refit_per_head_candidates, n_refit_per_head_extra_candidates and
n_refit_union_candidates.
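A sketch of how such a refit pool could be assembled is shown below. The helper name `build_refit_pool`, the use of `chain_id` as a stand-in for the decoded chain, and the exact meaning attached to each counter are assumptions for illustration.

```python
def build_refit_pool(screen_rows, refit_top_k, refit_per_head_top_k):
    """Union of global top rows and per-head top rows, de-duplicated."""
    ranked = sorted(screen_rows, key=lambda r: r["score"])
    global_rows = ranked[:refit_top_k]
    per_head_rows = []
    for head in sorted({r["head"] for r in ranked}):
        per_head_rows += [r for r in ranked if r["head"] == head][:refit_per_head_top_k]

    pool, seen = [], set()
    for r in global_rows + per_head_rows:
        key = (r["chain_id"], r["head"], r["param"])
        if key not in seen:
            seen.add(key)
            pool.append(r)
    counts = {
        "n_refit_global_candidates": len(global_rows),
        "n_refit_per_head_candidates": len(per_head_rows),
        "n_refit_per_head_extra_candidates": len(pool) - len(global_rows),
        "n_refit_union_candidates": len(pool),
    }
    return pool, counts


screen_rows = [
    {"chain_id": 0, "head": "ridge", "param": 0.1, "score": 0.20},
    {"chain_id": 1, "head": "ridge", "param": 0.1, "score": 0.22},
    {"chain_id": 2, "head": "ridge", "param": 1.0, "score": 0.25},
    {"chain_id": 0, "head": "pls", "param": 2, "score": 0.40},
    {"chain_id": 1, "head": "pls", "param": 4, "score": 0.45},
]
pool, counts = build_refit_pool(screen_rows, refit_top_k=2, refit_per_head_top_k=1)
print(counts)
```

Here the global top-2 is all Ridge, and the per-head budget adds the best PLS row so it still gets an exact-CV refit.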
By default it uses refit_execution="auto" and
refit_auto_max_extra_fraction=1.0, so the second pass can choose
union_batched_score when the plan says the reduced native calls justify the
bounded extra exact scores. If return_predictions=True, auto mode falls back
to individual replay because score-only batched modes do not return per-row
prediction arrays.
NativeAOMScreenRefitRegressor is the sklearn-style estimator form of the
same workflow. Its fit runs the two-pass campaign, stores
campaign_report_, screen_report_ and refit_report_, then fits the chosen
verified row as a reusable fixed candidate through final-only native fit.
predict(X_new) uses the final folded input-space coefficients and does not
replay Python preprocessing objects. get_diagnostics() separates
screen/refit/final counters; after exact-CV refit, the final_* fields should
show zero final CV fits and only the selected all-row fit needed to build the
reusable model.
Reusable sklearn presets wrap the same estimator for the common end-user workflows:
from n4m.sklearn import (
    NativeAOMMomentScreenRefitRegressor,
    NativeAOMMomentPLSScreenRefitRegressor,
    NativeAOMMomentPLSExactScreenRefitRegressor,
    NativeAOMMomentRidgeScreenRefitRegressor,
)
mixed_model = NativeAOMMomentScreenRefitRegressor(
    profile="lab",
    max_chains=5000,
    ridge_lambdas=(1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0),
    pls_components=(1, 2, 3, 4, 6, 8),
    top_k=100,
    refit_top_k=50,
    refit_per_head_top_k=25,
    fold_ids=fold_ids,
).fit(X_train, y_train)
pls_model = NativeAOMMomentPLSScreenRefitRegressor(
    profile="lab",
    max_chains=5000,
    pls_components=(1, 2, 3, 4, 6, 8),
    top_k=100,
    refit_top_k=25,
    fold_ids=fold_ids,
).fit(X_train, y_train)
pls_exact_model = NativeAOMMomentPLSExactScreenRefitRegressor(
    profile="lab",
    max_chains=5000,
    pls_components=(1, 2, 3, 4, 6, 8),
    top_k=100,
    refit_top_k=25,
    fold_ids=fold_ids,
).fit(X_train, y_train)
ridge_model = NativeAOMMomentRidgeScreenRefitRegressor(
    profile="lab",
    max_chains=5000,
    ridge_lambdas=(1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0),
    top_k=100,
    refit_top_k=25,
    fold_ids=fold_ids,
).fit(X_train, y_train)
NativeAOMMomentScreenRefitRegressor is the mixed global preset. It fixes
heads=("ridge", "pls"), uses exact Ridge CV and
pls_score_mode="gcv_proxy" for the first pass, then exact-CV refits the
retained union of the global screen top rows and the per-head screen top rows.
The per-head inclusion is controlled by refit_per_head_top_k; it is a
train-only retention budget for exact verification, not a new score.
NativeAOMMomentPLSScreenRefitRegressor fixes heads=("pls",),
ridge_lambdas=(), pls_score_mode="gcv_proxy",
moment_policy="force_moments" and chain_ordering="prefix", then exact-CV
refits retained rows with pls_score_mode="cv".
NativeAOMMomentPLSExactScreenRefitRegressor fixes the same PLS-only moment
surface but uses pls_score_mode="cv" for the first-pass screen too; it is the
auditable exact-screen preset when proxy recall is the question.
NativeAOMMomentRidgeScreenRefitRegressor fixes heads=("ridge",),
pls_components=(), moment_policy="force_moments" and the same prefix-aware
chunk ordering. All presets keep profile, custom
chains/families/templates, checkpointing, incremental
max_chunks_per_run, top-k budgets and exact-refit execution parameters
configurable. Because these presets are strict moment presets, they raise
UNSUPPORTED when the current fold geometry or chain/head regime would leave
the operator-moment route; use the generic
NativeAOMScreenRefitRegressor(moment_policy="auto", ...) when a production
run should allow guarded materialized fallbacks.
n4m.aom_evaluate_candidates is an explicit analysis helper for comparing
screen or refit rank against a caller-provided holdout/test split. It refits
each decoded candidate on X_train, y_train, predicts X_eval, and reports
screen_cv_rmse, refit_cv_rmse, eval_rmse, eval_r2, cv_rank,
eval_rank, and rank_delta. The eval set is not used to alter the fit,
choose a route, or select by dataset identity.
n4m.aom_candidate_rank_diagnostics(report_or_rows) turns a holdout report
into screen-recall metrics. It compares the screen score, screen_cv_rmse by
default, against eval_rmse, and reports Spearman rank correlation,
mean/median/max absolute rank drift, the eval rank of the screen winner, the
screen rank of the eval winner, and top-k overlap/recall for caller-provided
cutoffs. It can also consume rows reloaded by
n4m.aom_load_candidate_report.
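The reported metrics are standard rank statistics. A stdlib-only sketch follows, assuming no tied scores (so the closed-form Spearman formula applies) and using made-up rows. `rank_diagnostics` and its output key names are illustrative, not the library's schema.

```python
def ranks(values):
    """Zero-based rank of each value, smallest first (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def rank_diagnostics(rows, cutoffs=(1, 2)):
    """Compare screen ranking against holdout ranking."""
    n = len(rows)
    cv_rank = ranks([r["screen_cv_rmse"] for r in rows])
    eval_rank = ranks([r["eval_rmse"] for r in rows])
    drift = [abs(a - b) for a, b in zip(cv_rank, eval_rank)]
    d2 = sum(d * d for d in drift)
    overlap = {}
    for k in cutoffs:
        top_cv = {i for i in range(n) if cv_rank[i] < k}
        top_eval = {i for i in range(n) if eval_rank[i] < k}
        overlap[k] = len(top_cv & top_eval) / k
    return {
        "spearman": 1.0 - 6.0 * d2 / (n * (n * n - 1)),
        "mean_abs_rank_drift": sum(drift) / n,
        "max_abs_rank_drift": max(drift),
        "eval_rank_of_screen_winner": eval_rank[cv_rank.index(0)],
        "screen_rank_of_eval_winner": cv_rank[eval_rank.index(0)],
        "top_k_overlap": overlap,
    }


rows = [
    {"screen_cv_rmse": 0.10, "eval_rmse": 0.15},
    {"screen_cv_rmse": 0.20, "eval_rmse": 0.35},
    {"screen_cv_rmse": 0.30, "eval_rmse": 0.25},
    {"screen_cv_rmse": 0.40, "eval_rmse": 0.45},
]
diag = rank_diagnostics(rows, cutoffs=(1, 2))
print(diag)
```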
n4m.aom_candidate_report_records(report) flattens campaign or holdout
candidate rows into JSON-safe dictionaries. n4m.aom_save_candidate_report
writes those rows as .json, .jsonl / .ndjson, or .csv without requiring
pandas. Prediction arrays produced by return_predictions=True are omitted by
default; pass include_predictions=True only for small reports. CSV exports
include chain_json, a compact JSON encoding of the decoded strict-linear
preprocessing chain, so a saved top-k row can be refit later with
NativeAOMFixedCandidateRegressor.from_candidate(row).
n4m.aom_load_candidate_report(path) reads .json, .jsonl / .ndjson, or
.csv candidate reports and restores rows as refittable dictionaries. In
particular, CSV rows recover chain from chain_json and convert the standard
rank/id/score fields back to numeric types.
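The CSV round trip relies on packing the nested chain into a single JSON cell. A minimal stdlib sketch of that idea is shown below. The column set and the `to_csv` / `from_csv` helper names are assumptions; the real report has more fields.

```python
import csv
import io
import json

FIELDS = ["rank", "chain_id", "head", "param", "cv_rmse", "chain_json"]


def to_csv(rows):
    """Write rows to CSV text, encoding the chain as compact JSON."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for r in rows:
        flat = {k: r[k] for k in FIELDS if k != "chain_json"}
        flat["chain_json"] = json.dumps(r["chain"], separators=(",", ":"))
        writer.writerow(flat)
    return buf.getvalue()


def from_csv(text):
    """Read rows back, restoring numeric types and the nested chain."""
    rows = []
    for r in csv.DictReader(io.StringIO(text)):
        rows.append({
            "rank": int(r["rank"]),
            "chain_id": int(r["chain_id"]),
            "head": r["head"],
            "param": float(r["param"]),
            "cv_rmse": float(r["cv_rmse"]),
            "chain": json.loads(r["chain_json"]),
        })
    return rows


rows = [{
    "rank": 0, "chain_id": 3, "head": "pls", "param": 4.0, "cv_rmse": 0.123,
    "chain": [["detrend", [1.0]], ["savgol_derivative", [7.0, 2.0, 1.0]]],
}]
restored = from_csv(to_csv(rows))
print(restored[0]["chain"])
```

The `csv` module quotes the JSON cell, so commas and double quotes inside `chain_json` survive the round trip.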
n4m.aom_candidate_operator_summary(report_or_rows) groups already-scored
candidate rows by model head, preprocessing operator, operator/head pair,
chain length, and scoring route when route labels are present. It reports
count, best score, mean/median score and rank stats using eval_rmse when
present, otherwise cv_rmse, refit_cv_rmse or screen_cv_rmse. This is an
analysis surface for pruning or expanding future preprocessing grids; it does
not alter candidate scores or select by dataset identity.
n4m.aom_candidate_preprocessing_impact(report_or_rows) is the more detailed
post-hoc impact view. It groups scored rows by inferred preprocessing stage,
operator, concrete option such as savgol_smooth(7,2), position in the chain
and head/stage combinations. When an identity-chain baseline is present, it
also reports best-score improvement versus identity. This is for understanding
which preprocessing options deserve more cartesian budget; it does not rerank
or select candidates.
n4m.aom_candidate_route_summary(report_or_rows) is the route-coverage audit.
It consumes campaign, refit, holdout or reloaded candidate rows and reports the
materialized vs dense/banded/structured operator-moment counts and fractions
for the rows it received, globally, by head and by chain. When the input is a
campaign/refit report with aggregate counters, it also adds reported_total
for the full scored/refit candidate set, so a top_k report can distinguish
retained-row coverage from full-screen coverage. Use all_operator_moment,
reported_total["all_operator_moment"] and materialized_or_unknown_chains to
verify whether a broad preprocessing screen actually stayed in the moment
routes before reusing or expanding that grid. It is an audit surface only; it
does not rerank candidates or change routing.
CUDA Facade Smoke¶
The AOM and moment Python facades can be checked against the CUDA build with:
CUDA_VISIBLE_DEVICES=0 python benchmarks/cross_binding/aom_moment_cuda_facade_smoke.py
The smoke loads build/cuda-on, runs n4m.moment.sweep_run and
n4m.aom.aom_chain_sweep_run on a wide PLS1 moment case, and fails if the
reported PLS CV route is host or materialized instead of CUDA-device moments.
Backend Status¶
The method builds and tests in CPU and CUDA-enabled libn4m configurations. It
uses exact operator-moment scoring when a chain can be represented cheaply in
moment space. Dense transforms represent a chain by its feature-space operator
matrix and apply x_sum A, A' X'X A, and A' X'Y; they are guarded by
p <= n_train or the medium dense cap p <= 48 with strictly positive Ridge
lambdas. Local linear operators (identity, Savitzky-Golay
smooth/derivative, Norris-Williams, finite difference, Gaussian and FCK) also
use a banded descriptor, avoiding dense chain matrices. The banded route is
enabled up to p <= 256 for Ridge scoring and p <= 1024 for compatible
single-target NIPALS PLS1 scoring. Chains containing detrend_poly use an
exact structured low-rank projection transform in moment space and can compose
with those banded local operators under the same wide guards. Chains containing
whittaker use an exact structured pentadiagonal solve for
(I + lambda D2'D2)^-1 and can also compose with the banded local operators.
On CPU builds, auto routes Ridge rows with p > n_train through the exact
materialized dual-Ridge scorer because that is cheaper than feature-space
moment Ridge in this geometry. CPU auto also routes compatible PLS1 rows
through the exact materialized prefix scorer when min_train < 4p. CUDA builds
keep the operator-moment route in those cells.
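The dense moment transform rests on a simple identity: for transformed data Z = X A, the moments of Z can be obtained from the moments of X without forming Z. The sketch below checks this on a tiny made-up example in plain Python. It illustrates the algebra only and does not reflect the native routing or guards.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


X = [[1.0, 2.0, 0.5],
     [0.0, 1.0, 2.0],
     [3.0, 1.0, 1.0],
     [2.0, 0.5, 4.0]]
Y = [[1.0], [0.5], [2.0], [1.5]]
A = [[1.0, -1.0, 0.0],   # illustrative feature-space operator (p x p)
     [0.0, 1.0, -1.0],
     [0.0, 0.0, 1.0]]

# Moments of the raw data.
x_sum = [[sum(col) for col in zip(*X)]]   # 1 x p
xtx = matmul(transpose(X), X)             # p x p
xty = matmul(transpose(X), Y)             # p x 1

# Moment-space transforms: x_sum A, A' X'X A, A' X'Y.
moment_sum = matmul(x_sum, A)
moment_gram = matmul(matmul(transpose(A), xtx), A)
moment_cross = matmul(transpose(A), xty)

# Reference: materialize Z = X A and compute its moments directly.
Z = matmul(X, A)
direct_sum = [[sum(col) for col in zip(*Z)]]
direct_gram = matmul(transpose(Z), Z)
direct_cross = matmul(transpose(Z), Y)
print(moment_gram)
print(direct_gram)
```

The moment-space side only touches p x p and p x 1 objects, which is why the route is attractive when the number of rows dominates and the operator is cheap to represent.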
Unsupported moment routes fall back per chain/head to the materialized native
sweep in auto, or return UNSUPPORTED in force_moments. Selected chains
are always materialized once to populate public OOF/final predictions. Batched
IKPLS, fully fused operator-moment updates for all regimes and custom CUDA
kernels are future acceleration layers.