# `aom_preprocess`: AOM (Adaptive Operator Mixture) preprocessing bank

_Group_: **Diagnostic** · _Registry tolerance_: `5.0`

## Description

AOM operator-bank preprocessing primitive (`aom_pop.aom_preprocessing`)

> **Registry note**: `nirs4all-methods` exposes this primitive directly as
> `n4m.aom_preprocess` and `n4m.aom.aom_preprocess`. In-tree
> `nirs4all.operators.models.sklearn.aom_pls` remains the sanctioned reference
> provider for qualitative AOM parity.

### Parameters

| Name | Type | Default | Notes |
|------|------|---------|-------|
| `operators` | sequence of AOM operator specs | compact strict-linear bank | Same syntax as `n4m.aom_pls` / `n4m.aom_chain_sweep_run`: strings, integer enum ids, `(kind, params)` tuples or `{"kind": ..., "params": ...}` mappings. |
| `gating_mode` | `{"soft", "hard", 1, 0}` | `"soft"` | `soft` averages all operator outputs; `hard` selects the first operator deterministically. |
| `y` | array-like or `None` | `None` | Optional response matrix passed to the native fit path for operators that need supervised fit state. |

## Explanations

### Bibliographic source

Beurier, G., Reiter, R., Noûs, C., Rouan, L. & Cornet, D. (2026). *Reframing preprocessing selection as model-internal calibration in near-infrared spectroscopy: a large-scale benchmark of operator-adaptive PLS and Ridge models*. arXiv:2605.13587. https://arxiv.org/abs/2605.13587

The paper introduces operator-adaptive PLS (AOM-PLS / POP-PLS) and the benchmark on 50+ NIRS datasets that the git-pinned oracle `nirs4all.operators.models.sklearn.aom_pls` is calibrated against.

### Mathematical principle

`aom_preprocess` is the **operator-bank primitive** that AOM-PLS and POP-PLS build on.
Given the centered spectral matrix $\mathbf{X} \in \mathbb{R}^{n\times p}$ and a finite bank of strict-linear operators $\{\mathbf{A}_b\}_{b=1}^{M} \subset \mathbb{R}^{p\times p}$ (matrices fully determined by the wavelength grid), `aom_preprocess` materializes the $M$ preprocessed views $\mathbf{X}_b = \mathbf{X}\mathbf{A}_b^{\top}$ and gates them.

The direct `n4m_aom_preprocess_fit` surface currently supports the reusable strict-linear operator subset covered by the smoke tests: identity, first-degree polynomial detrending, Savitzky-Golay smoothing, Savitzky-Golay first derivative, Norris-Williams, finite difference, Gaussian smoothing, Whittaker smoothing and FCK. Strict chains and model-scoring diversity are handled by the AOM sweep/campaign operator-moment paths. SNV / MSC / EMSC / ASLS / OSC remain excluded from the moment contract because they depend on per-sample normalization, $\mathbf{y}$, or a reference spectrum.

Two gating modes are supported:

* **soft** ($\texttt{gating\_mode}=1$): equal-weight average

  $$\mathbf{X}_{\text{AOM}}^{\text{soft}} \;=\; \frac{1}{M}\sum_{b=1}^{M}\mathbf{X}\mathbf{A}_b^{\top}.$$

* **hard** ($\texttt{gating\_mode}=0$): deterministic first-operator selection,

  $$\mathbf{X}_{\text{AOM}}^{\text{hard}} \;=\; \mathbf{X}\mathbf{A}_{1}^{\top}.$$

Both modes preserve the **cross-covariance identity** exploited by the AOM/POP selectors: with $\mathbf{S} = \mathbf{X}^{\top}\mathbf{Y}$ and any $\mathbf{A}_b$ in the bank,

$$\bigl(\mathbf{X}\mathbf{A}_b^{\top}\bigr)^{\top}\mathbf{Y} \;=\; \mathbf{A}_b\,\mathbf{S},$$

so a downstream PLS step can score the whole bank by $M$ cheap $O(pq)$ left actions instead of $M$ full $O(np)$ matrix products.
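The two gating modes and the cross-covariance identity above can be checked numerically in plain NumPy. The sketch below is illustrative only: it does not call `n4m`, and the two-operator bank (identity plus a first-difference matrix) and the helper names `first_difference_operator` and `gate` are assumptions chosen for brevity, not the library's bank or API.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 8, 2                      # samples, wavelengths, responses
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                     # centered spectral matrix
Y = rng.normal(size=(n, q))


def first_difference_operator(p):
    """Hypothetical p x p operator: output i is x[i+1] - x[i]; last output is 0."""
    A = np.zeros((p, p))
    idx = np.arange(p - 1)
    A[idx, idx] = -1.0
    A[idx, idx + 1] = 1.0
    return A


# Toy bank of strict-linear operators A_b (assumption: not the n4m default bank).
bank = [np.eye(p), first_difference_operator(p)]

# Preprocessed views X_b = X A_b^T.
views = [X @ A.T for A in bank]


def gate(views, mode):
    """Soft gating averages all views; hard gating keeps the first one."""
    if mode == "soft":
        return np.mean(views, axis=0)
    if mode == "hard":
        return views[0]
    raise ValueError(f"unknown gating mode: {mode!r}")


X_soft = gate(views, "soft")
X_hard = gate(views, "hard")

# Cross-covariance identity: (X A_b^T)^T Y == A_b (X^T Y) for every operator.
S = X.T @ Y
for A, X_b in zip(bank, views):
    assert np.allclose(X_b.T @ Y, A @ S)
```

Because every operator is linear, the soft-gated output is itself a single linear operator applied to $\mathbf{X}$, namely the average of the bank matrices.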
The motivation is that **no single preprocessing is best on all calibrations**: different wavelength regions favour different transforms. The AOM-PLS / POP-PLS selectors exploit that by picking, respectively, a global operator (one $b^{\star}$ for the whole model) or a per-component operator (one $b_a$ for each latent direction). Predictions on new spectra reuse the absorbed operator(s) through the recovered original-space coefficients, so there is **no preprocessing replay at predict time**.

### Implementation

Implemented by `n4m_aom_preprocess_fit` via the native C ABI. Python exposes this as `n4m.aom_preprocess` and `n4m.aom.aom_preprocess`; both return the native MethodResult fields as NumPy arrays/scalars:

- `transformed`: final gated transform, shape `(n_samples, n_features)`;
- `operator_outputs`: operator-major matrix of per-operator transformed views;
- `weights`: gating weights;
- `operator_kinds`: integer AOM operator ids;
- `n_operators`, `n_samples`, `n_features`, `mode`.

Reference: git-pinned oracle `nirs4all.operators.models.sklearn.aom_pls` (sanctioned exception).

MATLAB header (`bindings/matlab/+pls4all/aom_preprocess.m`):

```text
pls4all.aom_preprocess  AOM preprocessing fit/transform.
```

### Usage

The `nirs4all-methods` Python package exposes the product surface directly. The lower-level C ABI and legacy `pls4all` examples below dispatch into the same native kernel.
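Before the binding examples, it may help to see why predictions need no preprocessing replay. By linearity, coefficients $\mathbf{B}_b$ fitted on a view $\mathbf{X}\mathbf{A}_b^{\top}$ map back to the original space as $\mathbf{A}_b^{\top}\mathbf{B}_b$. The NumPy sketch below illustrates this under stated assumptions: ordinary least squares stands in for the PLS step, the moving-average operator and the name `moving_average_operator` are hypothetical, and centering is omitted for brevity. It is not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 6
X_train = rng.normal(size=(n, p))
y_train = rng.normal(size=(n, 1))
X_new = rng.normal(size=(5, p))


def moving_average_operator(p, width=3):
    """Hypothetical strict-linear smoother: each row averages a local window."""
    A = np.zeros((p, p))
    half = width // 2
    for i in range(p):
        lo, hi = max(0, i - half), min(p, i + half + 1)
        A[i, lo:hi] = 1.0 / (hi - lo)
    return A


A = moving_average_operator(p)

# Fit on the preprocessed view X A^T (least squares as a stand-in for PLS).
B_view, *_ = np.linalg.lstsq(X_train @ A.T, y_train, rcond=None)

# Fold the operator into the coefficients: B_orig = A^T B_view.
B_orig = A.T @ B_view

# Predicting with replayed preprocessing and predicting directly agree.
y_replay = (X_new @ A.T) @ B_view
y_direct = X_new @ B_orig
assert np.allclose(y_replay, y_direct)
```

The same folding applies to a soft-gated bank, since an average of linear operators is again a single linear operator.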
**nirs4all-methods Python**

```python
import n4m
import n4m.aom as aom

# X: (n_samples, n_features) spectra; y: optional response array (or None).
res = n4m.aom_preprocess(
    X,
    y,
    operators=[
        "identity",
        ("savgol_smooth", [5, 2]),
        ("detrend_poly", [1]),
        ("savgol_derivative", [5, 2, 1]),
        ("norris_williams", [5, 5, 1]),
        ("finite_difference", [1]),
        ("gaussian", [1.0]),
        ("whittaker", [100.0]),
        ("fck", [1.0]),
    ],
    gating_mode="soft",
)
X_aom = res["transformed"]
operator_views = res["operator_outputs"]
weights = res["weights"]

# Both import paths resolve to the same callable.
assert aom.aom_preprocess is n4m.aom_preprocess
```

For model selection rather than standalone preprocessing, prefer `n4m.aom_pls`, `n4m.pop_pls`, `n4m.aom_sweep_run` or `n4m.aom_chain_sweep_run`; those surfaces fold selected operators back into input-space coefficients for direct prediction reuse.

**Native and compatibility bindings**

::::{tab-set}
:class: pls4all-bindings

:::{tab-item} C ABI · libn4m
:sync: c
:class-label: lang-c

```c
/* C ABI: libn4m */
n4m_context_t* ctx = NULL;
n4m_operator_bank_t* bank = NULL;
n4m_gating_strategy_t* gate = NULL;
n4m_method_result_t* res = NULL;

n4m_context_create(&ctx);
n4m_operator_bank_create(&bank);
n4m_gating_strategy_create(&gate, N4M_GATING_SOFT);
/* add operators to bank with n4m_operator_bank_add */

/* x_view, y_view: caller-prepared matrix views over X and Y (setup not shown) */
n4m_aom_preprocess_fit(ctx, bank, gate, &x_view, &y_view, &res);
/* read transformed/operator_outputs/weights via double-matrix getters */
/* read operator_kinds via n4m_method_result_get_int64_vector */

/* release resources in reverse order of creation */
n4m_method_result_destroy(res);
n4m_gating_strategy_destroy(gate);
n4m_operator_bank_destroy(bank);
n4m_context_destroy(ctx);
```
:::

:::{tab-item} Python · n4m
:sync: python-raw
:class-label: lang-python

```python
import n4m

res = n4m.aom_preprocess(X, y, operators=["identity"], gating_mode="soft")
X_aom = res["transformed"]
operator_kinds = res["operator_kinds"]
```
:::

:::{tab-item} Python · n4m.sklearn
:sync: python-sklearn
:class-label: lang-python

```python
from n4m.sklearn import NativeAOMPLSRegressor

model = NativeAOMPLSRegressor(max_components=2, cv=4).fit(X, y)
yhat = model.predict(X)
```
:::

:::{tab-item} R · pls4all_method()
:sync: r-dispatcher
:class-label: lang-r

```r
library(pls4all)

# Unified low-level dispatcher (May 2026 R cleanup):
res <- pls4all_method("aom_preprocess", X, y, n_components = 2L,
                      params = list(n_operators = 3L, gating_mode = 0L))
# res is a named list with MethodResult arrays/scalars.
# selected_indices / top_k_intervals are 1-based.
```
:::

:::{tab-item} MATLAB · pls4all (MEX)
:sync: matlab-mex
:class-label: lang-matlab

```matlab
res = pls4all.aom_preprocess(X, y, 2);
% see header of bindings/matlab/+pls4all/aom_preprocess.m for full
% parameter surface:
%   res = aom_preprocess(X, Y, n_operators, gating_mode)
yhat = predict(res, Xtest);
```
:::

:::{tab-item} MATLAB · pls4all (classdef)
:sync: matlab-classdef
:class-label: lang-matlab

_No idiomatic classdef wrapper; invoke `pls4all.fit("aom_preprocess", X, y, …)` directly from the unified MEX factory._
:::

::::

**Registry parity references** 📐

:::{card}
:class-card: external-refs

- 📐 **`nirs4all`** (python · python): `nirs4all` in-tree · qualitative (rmse_rel ≤ 5e+00). In-tree nirs4all AOM provider (sanctioned external reference). pls4all's current primitive exposes a small operator-bank preprocessing kernel, while nirs4all exposes the full AOM/POP estimator stack; the parity remains qualitative.
:::

### Benchmarks

Adaptive wall-clock per cell measured against [`full_matrix.csv`](../benchmarks/overview.md). Only backends that implement this method are listed; libraries without the method are omitted.

**Verdict**  ·  ✓ ref / ≈ ref / ~ shape mark a reference-gate pass at strict / relaxed / qualitative tolerance  ·  ✓ bind = pls4all binding agrees with the C++ baseline  ·  ✗ divergent  ·  ⚠ error  ·  — not run. The fastest backend per column is marked 🏆.

**Reference gate**: qualitative (shape/smoke comparison only).
The external library and pls4all do not produce numerically equivalent output for this method (see the MethodSpec notes); the `rmse_rel_tol ≤ 5e+00` budget is set wide on purpose. Treat ~ shape as *“we ran both, both finished”*, not as numerical agreement.

Rows tagged with **📐** are the canonical parity references for this method (declared in [`parity_timing.registry`](../benchmarks/methodology.md)). C++ and external rows show reference parity; pls4all language bindings show binding parity against the C++ backend. Hover the icon for role and tolerance band.

::::{tab-set}
:class: parity-tabs

:::{tab-item} 1 thread
:sync: threads-1

Cell sizes (samples×features): 50×250, 100×50, 100×500, 100×2500, 200×40, 250×50, 500×50, 500×500, 500×2500, 2500×50, 2500×500, 2500×2500, 10000×50, 10000×500.

| Backend | Parity | Timings (export order, empty cells dropped) |
|---------|--------|---------------------------------------------|
| **C++ native · libn4m** | | |
| pls4all.cpp.blas | | 2.49 ms · 14.6 ms 🏆 · 75.4 ms · 1.79 ms · 7.75 ms · 75.8 ms · 390.8 ms · 38.9 ms · 401.3 ms · 2.3 s · 148.1 ms · 1.8 s |
| pls4all.cpp.blas+omp | | 2.44 ms · 15.7 ms · 74.2 ms · 1.77 ms · 9.04 ms · 72.9 ms 🏆 · 381.1 ms · 39.5 ms · 395.5 ms · 2.3 s · 146.4 ms 🏆 · 1.8 s 🏆 |
| pls4all.cpp.omp | | 2.58 ms · 15.5 ms · 74.5 ms · 1.67 ms · 7.13 ms 🏆 · 76.3 ms · 378.3 ms 🏆 · 38.6 ms · 391.1 ms 🏆 · 2.3 s · 155.5 ms · 1.8 s |
| pls4all.cpp.ref | | 1.81 ms 🏆 · 14.8 ms · 71.4 ms 🏆 · 1.66 ms 🏆 · 7.79 ms · 82.6 ms · 381.9 ms · 37.9 ms 🏆 · 393.4 ms · 2.2 s 🏆 · 148.9 ms · 1.8 s |
| **Python · pls4all** | | |
| pls4all.python | | 1.77 ms |
| pls4all.sklearn | ✓ bind | 2.89 ms 🏆 · 1.86 ms · 2.60 ms 🏆 |
| **R · pls4all** | | |
| pls4all.R | ✓ bind | 13.8 ms · 5.56 ms · 15.2 ms |
| pls4all.R.formula | ✓ bind | 24.0 ms · 6.46 ms · 13.0 ms |
| pls4all.R.mdatools | ✓ bind | 22.7 ms · 7.29 ms · 11.0 ms |
| pls4all.R.pls | ✓ bind | 24.9 ms · 6.75 ms · 12.0 ms |
| **MATLAB · pls4all** | | |
| pls4all.matlab | ✗ +6e+00 | 4.67 ms · 2.76 ms · 5.96 ms |
| pls4all.matlab.classdef | ✗ +6e+00 | 10.1 ms · 3.25 ms · 7.83 ms |
| **Python · external** | | |
| 📐 nirs4all | | 1.95 ms |

:::

:::{tab-item} 3 threads
:sync: threads-3

Cell sizes (samples×features): 50×250, 100×50, 100×500, 100×2500, 200×40, 250×50, 500×50, 500×500, 500×2500, 2500×50, 2500×500, 2500×2500, 10000×50, 10000×500.

| Backend | Parity | Timing (only populated cell) |
|---------|--------|------------------------------|
| **C++ native · libn4m** | | |
| pls4all.cpp.blas | ~ shape | 2.79 ms |
| pls4all.cpp.blas+omp | ~ shape | 1.53 ms 🏆 |
| pls4all.cpp.omp | ~ shape | 1.60 ms |
| pls4all.cpp.ref | ~ shape | 1.69 ms |
| **Python · pls4all** | | |
| pls4all.python | ✓ bind | 1.69 ms |
| pls4all.sklearn | ✓ bind | 1.81 ms |
| **R · pls4all** | | |
| pls4all.R | ✓ bind | 5.24 ms |
| pls4all.R.formula | ✓ bind | 8.08 ms |
| pls4all.R.mdatools | ✓ bind | 6.64 ms |
| pls4all.R.pls | ✓ bind | 6.62 ms |
| **MATLAB · pls4all** | | |
| pls4all.matlab | ✗ +6e+00 | 4.16 ms |
| pls4all.matlab.classdef | ✗ +6e+00 | 4.04 ms |
| **Python · external** | | |
| 📐 nirs4all | source | 1.97 ms |

:::

:::{tab-item} 10 threads
:sync: threads-10

Cell sizes (samples×features): 50×250, 100×50, 100×500, 100×2500, 200×40, 250×50, 500×50, 500×500, 500×2500, 2500×50, 2500×500, 2500×2500, 10000×50, 10000×500.

| Backend | Parity | Timing (only populated cell) |
|---------|--------|------------------------------|
| **C++ native · libn4m** | | |
| pls4all.cpp.blas | ~ shape | 1.45 ms |
| pls4all.cpp.blas+omp | ~ shape | 1.50 ms |
| pls4all.cpp.omp | ~ shape | 1.47 ms |
| pls4all.cpp.ref | ~ shape | 1.49 ms |
| **Python · pls4all** | | |
| pls4all.python | ✓ bind | 1.48 ms |
| pls4all.sklearn | ✓ bind | 1.43 ms 🏆 |
| **R · pls4all** | | |
| pls4all.R | ✓ bind | 3.96 ms |
| pls4all.R.formula | ✓ bind | 4.71 ms |
| pls4all.R.mdatools | ✓ bind | 4.81 ms |
| pls4all.R.pls | ✓ bind | 5.05 ms |
| **MATLAB · pls4all** | | |
| pls4all.matlab | ✗ +6e+00 | 2.43 ms |
| pls4all.matlab.classdef | ✗ +6e+00 | 2.67 ms |
| **Python · external** | | |
| 📐 nirs4all | source | 1.74 ms |

:::

::::

---

_See also_: [benchmark overview](../benchmarks/overview.md) · [methods index](index.md) · [interactive dashboard](../landing/dashboard.md)