# Cross-binding benchmark: methodology

## Goal

For each `(algorithm, backend, n, p, threads)` cell, report:

1. **Binding parity**: pls4all binding/core rows compared against the canonical native C++ row for that method and dataset.
2. **Reference parity**: every successful row, including external libraries, compared against the registry-declared method oracle.
3. **Timing**: adaptive wall-clock milliseconds. The reported value may be a single run, a mean, or a median depending on the observed cell cost; the CSV records the choice in `timing_statistic`.
4. **Versions metadata**: language, BLAS implementation, binding / external library versions.

The combination supports two separate claims: pls4all bindings are thin and consistent with the C++ core, and pls4all's method implementations match the external oracle selected for each method.

## Cell composition

| Axis | Values |
|---|---|
| **Algorithms** | Canonical `benchmarks.parity_timing.registry.METHODS` catalog (`--algorithms all`) |
| **Backends** | pls4all bindings + registry-driven external reference columns (`ref.`) |
| **Sizes** | Default 11-size sweep, or one canonical MethodSpec cell per method with `--registry-cells` |
| **Thread counts** | 1, 3, 10 |
| **libn4m build** | `blas-omp` by default (OpenBLAS + OpenMP); `dev-release` available for the single-thread reference column |

The current canonical registry sweep is production-build first: `full_matrix.csv` contains the `cpp` rows for `blas-omp`. Separate native/BLAS-only/OpenMP-only build tiers are only present when a targeted refresh measured them. Therefore a blank-looking C++ sibling tier is run coverage, not a parity contradiction; the dashboard renders it as `NR` and does not invent a divergence δ/J for a cell that was not executed.

`pls4all.registry` is the benchmark registry's canonical pls4all call (`MethodSpec.pls4all_fn`). It is not a public binding/API column, so the dashboard excludes it from the user-facing matrix and score cards.
The public Python columns are `pls4all.python`, `pls4all.sklearn`, and Python externals.

Public binding backends that are part of the matrix but absent from the current CSV snapshot are kept visible as `NR` (`not_run`) rather than being dropped. This makes missing MATLAB/Octave coverage explicit when `matlab_tier1` / `matlab_tier2` have not been executed.

A "skip" record is emitted when an external backend does not implement a given algorithm. In `--reference-backends registry` mode those rows should be rare because unsupported pairs are not scheduled. In legacy `fixed`/`all` audit modes they are expected.

## Timing protocol

Each cell uses the same adaptive protocol in Python, R and Octave/MATLAB:

1. Run #1 is a warmstart at `BASE` and is timed.
2. If run #1 takes more than 5 min, report run #1 and stop.
3. Otherwise, run #2 is the first scored run. From this point on the warmstart is excluded from the score.
4. If run #2 takes more than 30 s, report run #2 alone.
5. If run #2 takes more than 5 s, run one more sample and report the mean of runs #2-#3.
6. If run #2 takes more than 1 s, run up to 10 total executions and report the median of runs #2-#10.
7. If run #2 takes more than 0.1 s, run up to 20 total executions and report the median after the warmstart.
8. Otherwise, run up to 40 total executions and report the median after the warmstart.

`reported_ms` is the score used by the dashboard. `n_runs` is the number of scored samples after excluding the warmstart, except for the one-run warmstart-abort case. `total_runs` includes the warmstart. `median_ms` is kept as a compatibility alias for older renderers and mirrors `reported_ms` under the current `adaptive-v1` timing schema.

The per-cell timeout is only a 24 h guard. Slow cells should stop because of the adaptive protocol, not because of a short timeout.
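The eight rules above can be sketched in Python as follows. This is a minimal illustration, not the orchestrator's code: the function name `adaptive_time`, the injectable `clock` parameter, and the `"single"` / `"mean"` / `"median"` labels stored in `timing_statistic` are assumptions made for the example, and the sketch always spends the full run budget where the protocol says "up to".

```python
import statistics
import time
from typing import Callable, Dict, List


def adaptive_time(
    run_once: Callable[[], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, object]:
    """Time one cell with the adaptive rules; `clock` must return seconds."""

    def timed_ms() -> float:
        start = clock()
        run_once()
        return (clock() - start) * 1000.0

    def result(stat: str, scored: List[float], total_runs: int) -> Dict[str, object]:
        if stat == "mean":
            reported = statistics.fmean(scored)
        elif stat == "median":
            reported = statistics.median(scored)
        else:
            reported = scored[0]
        return {
            "reported_ms": reported,
            "median_ms": reported,      # compatibility alias, mirrors reported_ms
            "timing_statistic": stat,   # label values are assumed for this sketch
            "n_runs": len(scored),      # scored samples only
            "total_runs": total_runs,   # includes the warmstart
        }

    warm_ms = timed_ms()                # run #1: warmstart, timed
    if warm_ms > 5 * 60 * 1000:         # more than 5 min: report run #1 and stop
        return result("single", [warm_ms], total_runs=1)

    first_ms = timed_ms()               # run #2: first scored run
    if first_ms > 30_000:               # more than 30 s: run #2 alone
        return result("single", [first_ms], total_runs=2)
    if first_ms > 5_000:                # more than 5 s: mean of runs #2-#3
        return result("mean", [first_ms, timed_ms()], total_runs=3)

    if first_ms > 1_000:                # more than 1 s: 10 total executions
        budget = 10
    elif first_ms > 100:                # more than 0.1 s: 20 total executions
        budget = 20
    else:                               # otherwise: 40 total executions
        budget = 40
    scored = [first_ms] + [timed_ms() for _ in range(budget - 2)]
    return result("median", scored, total_runs=budget)


# Usage: time a cheap workload and inspect which statistic was chosen.
cell = adaptive_time(lambda: sum(range(10_000)))
print(cell["timing_statistic"], cell["n_runs"], cell["total_runs"])
```

Passing the clock as a parameter keeps the decision logic testable without waiting for real multi-second runs.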
## Determinism

The base seed is `1_234_567_890`, a uint32-safe integer that round-trips losslessly through R/Octave doubles and is accepted as `sklearn`'s `random_state`.

**All backends in the same cell read the same orchestrator-generated CSV** (`benchmarks/cross_binding/data/data_x

_seed.csv`). This is essential because Python NumPy, R `set.seed()` and Octave `randn("state", ...)` produce different streams from the same seed; sharing the CSV bytes is the only way to make cross-language parity meaningful.

## Reference policy

There are two references.

For **binding parity**, each `(algorithm, n, p)` group uses:

1. **`cpp` at 1 thread, `blas-omp` build** when present (default for all algos with a libn4m entry point); else
2. **`python_tier1` at 1 thread** as fallback (covers algos that don't have a direct ctypes path on the C++ side).

The binding reference's predictions are saved to `benchmarks/cross_binding/data/.predictions/*.npy` and compared element-wise to pls4all core/binding rows only.

For **reference parity**, the comparator is the canonical external reference returned by the registry for that method. This is the row that defines whether the implementation matches the literature or established library behavior. External libraries are compared to this oracle too, so library-to-library divergence is visible.

Successful canonical reference rows also refresh a stored oracle snapshot under `benchmarks/cross_binding/data/.reference_oracles/`. `--only-pls4all` runs load that snapshot to keep Gate 2 active even when the external backend is not scheduled. If the snapshot does not exist yet, the row fails with an explicit oracle-missing note.

Dashboard JSON is built from `full_matrix.csv` plus targeted `dashboard_refresh_*.csv` deltas. Those refresh files are not a separate gate policy: they are ordinary orchestrator rows that replace stale cells by exact execution key until the full timing matrix is regenerated.

## Parity tolerance

Binding parity uses strict max-absolute tolerance, normally `1e-6`. Reference parity uses the method's registry tolerance, usually RMSE relative to the oracle prediction or a mask-distance equivalent for selectors.
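To make the two kinds of check concrete, here is a small NumPy sketch of the metrics named above. It is illustrative only: the helper names are hypothetical, the relative RMSE is assumed to be normalised by the RMS of the oracle prediction (the registry may normalise differently), and the Jaccard distance is just one plausible reading of "mask-distance" for selectors.

```python
import numpy as np


def max_abs_diff(pred, ref) -> float:
    """Largest element-wise absolute difference between two prediction arrays."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.max(np.abs(pred - ref)))


def binding_parity_ok(pred, ref, tol: float = 1e-6) -> bool:
    """Strict max-absolute check, with the usual 1e-6 default tolerance."""
    return max_abs_diff(pred, ref) <= tol


def relative_rmse(pred, oracle) -> float:
    """RMSE of (pred - oracle) divided by the RMS of the oracle (assumed form)."""
    pred = np.asarray(pred, dtype=float)
    oracle = np.asarray(oracle, dtype=float)
    rmse = np.sqrt(np.mean((pred - oracle) ** 2))
    scale = np.sqrt(np.mean(oracle ** 2))
    # Fall back to the absolute RMSE when the oracle is identically zero.
    return float(rmse / scale) if scale > 0 else float(rmse)


def jaccard_distance(mask_a, mask_b) -> float:
    """1 - |A and B| / |A or B| for two boolean feature-selection masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # two empty selections are treated as identical
    return float(1.0 - np.logical_and(a, b).sum() / union)


# Usage: a tiny perturbation passes binding parity, a larger one does not.
ref = np.array([1.0, 2.0, 3.0])
print(binding_parity_ok(ref + 5e-7, ref), binding_parity_ok(ref + 1e-3, ref))
```

Max-absolute difference flags a single bad element, which suits thin bindings that should reproduce the core exactly. A relative RMSE is more forgiving of isolated noise, which is why it suits comparisons against independent libraries.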
Per-algorithm overrides exist for inherently noisier algorithms:

| Algorithm | Tolerance | Reason |
|---|---|---|
| `gpr_pls` | 1e-3 | Iterative GP solver, different convergence criteria across libs |
| `bagging_pls`, `boosting_pls`, ensembles | 1e-3 | Stochastic averaging; per-implementation RNG differences |
| `GA`, `PSO`, `VISSA` selectors | not applicable | Stochastic feature selection; per-implementation RNG streams |

Wide selector tolerances are qualitative evidence, not a release-quality oracle. The dashboard therefore distinguishes selector set-overlap (`divergence_metric="jaccard"`) from numeric relative-RMSE δ, and documented RNG/noise/model selector mismatches render as `cross_check`/`BD J` rather than as red numeric failures.

## Thread control

The orchestrator sets the following env vars **before** spawning each backend subprocess:

```
OMP_NUM_THREADS      = N
OPENBLAS_NUM_THREADS = N
MKL_NUM_THREADS      = N
BLIS_NUM_THREADS     = N
BENCH_THREADS        = N
```

In addition:

- Python pls4all sets `Context.num_threads = N` for belt-and-braces.
- Octave bench scripts call `maxNumCompThreads(N)` at start.
- Externals (sklearn, pls::plsr, plsregress) rely on the env vars only.
- MATLAB/libPLS registry references run through `oct2py`; the orchestrator prepends `$PLS4ALL_R_ENV/bin` and sets `OCTAVE_HOME` so the conda-provided Octave is visible from Python.

`OPENBLAS_NUM_THREADS` is set equal to `OMP_NUM_THREADS` (both are N, so the thread budget is not the product OMP×BLAS) to avoid oversubscription.

## Notes on observed parity gaps

The smoke runs surfaced a recurring `0.054` divergence among three backends: `ikpls`, `r_tier2`, `matlab_tier2`. Root cause: those wrappers default to `scale_x=True / scale_y=True` (unit-variance scaling), while `cpp`, `python_tier1`, `python_tier2`, `r_tier1`, `r_pls`, `r_mixomics`, `matlab_tier1`, `matlab_pls` default to `scale_x=False / scale_y=False` (centring only, the spectroscopy convention). This is **not a bug**: both conventions are valid.
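The effect of the two conventions can be reproduced with a toy model. The sketch below fits a one-component PLS1 in plain NumPy with and without unit-variance scaling. It is not the pls4all, `ikpls`, or any other backend's implementation: the function name `pls1_fit_predict` is hypothetical, the sample standard deviation (`ddof=1`) is an assumed choice, and the synthetic data will not reproduce the `0.054` figure.

```python
import numpy as np


def pls1_fit_predict(X, y, scale: bool) -> np.ndarray:
    """In-sample predictions of a one-component PLS1 model (toy sketch).

    scale=False centres X and y only; scale=True also divides by the
    per-column sample standard deviation (unit-variance scaling).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    x_std = X.std(axis=0, ddof=1) if scale else np.ones(X.shape[1])
    y_std = y.std(ddof=1) if scale else 1.0

    Xc = (X - x_mean) / x_std
    yc = (y - y_mean) / y_std

    w = Xc.T @ yc                 # weight vector: covariance direction
    w /= np.linalg.norm(w)
    t = Xc @ w                    # scores
    q = (t @ yc) / (t @ t)        # regression of y on the scores
    return t * q * y_std + y_mean # back to the original y units


# Synthetic data whose columns have very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = X @ np.array([1.0, 0.1, 0.01, 0.001]) + 0.01 * rng.normal(size=50)

pred_centre = pls1_fit_predict(X, y, scale=False)
pred_scaled = pls1_fit_predict(X, y, scale=True)
print("max |difference|:", np.max(np.abs(pred_centre - pred_scaled)))
```

With centring only, the first weight vector is dominated by the high-variance column. With scaling, every column contributes on an equal footing, so the two fits produce different predictions from the same bytes of input.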
Current dashboard payloads use `cross_check` for documented noncanonical API/facade convention cells when the canonical registry/C++ path is already exact, so those timings remain visible without classifying the method as a parity failure. Users should pick the convention matching their reference paper.

## Timeout

Per-cell wall-clock guard: **24 h**. Cells should normally stop through the adaptive timing rules. The guard is only there to catch hangs, OOMs or dependency deadlocks. Guard hits are marked with the ⏳ icon in the rendered Markdown. Empty / failed cells are marked `—`.

## Hardware context

Captured per run in the rendered Markdown header (host platform string, BLAS impl + version, run date). For the headline runs documented in this repo, the host is reproducible from the commit SHA + the `results/full_matrix.csv` `versions_json` column.

## Re-running

```bash
# Complete canonical method/reference matrix, including build + docs render.
# Existing cells in results/full_matrix.csv are skipped by default.
benchmarks/cross_binding/run_overnight.sh

# Exhaustive stress matrix with registry-declared references.
FULL_MATRIX=1 REFERENCE_BACKENDS=registry benchmarks/cross_binding/run_overnight.sh

# Legacy fixed/all audit; unsupported external pairs produce NOT_IMPLEMENTED.
FULL_MATRIX=1 REFERENCE_BACKENDS=all benchmarks/cross_binding/run_overnight.sh

# Include the CUDA libn4m build too when CUDA is available.
FULL_MATRIX=1 LIBP4A_BUILD=all benchmarks/cross_binding/run_overnight.sh

# Same run on the Pages branch (main), then commit/push docs/_static +
# benchmark markdown and trigger the GitHub Pages docs workflow.
PUBLISH_WEB=1 benchmarks/cross_binding/run_overnight.sh

# Exhaustive run, then publish the refreshed dashboard from main.
FULL_MATRIX=1 PUBLISH_WEB=1 benchmarks/cross_binding/run_overnight.sh

# From a work branch, commit/push the web sources but skip live Pages deploy.
PUBLISH_WEB=1 DEPLOY_PAGES=0 benchmarks/cross_binding/run_overnight.sh

# Recompute after a pls4all optimization or dependency update.
FORCE=1 CLEAN_BUILD=1 benchmarks/cross_binding/run_overnight.sh

# Only retry cells that previously failed, preserving successful timings.
RERUN_FAILED=1 benchmarks/cross_binding/run_overnight.sh

# PLS headline sweep only.
python benchmarks/cross_binding/orchestrator.py \
    --algorithms pls --threads 1 3 10 --n-runs 5 \
    --resume-existing \
    --libn4m-build blas-omp --reference-backends registry \
    --out-csv benchmarks/cross_binding/results/full_matrix.csv

# Render
python benchmarks/cross_binding/combine_and_render.py \
    --csvs benchmarks/cross_binding/results/full_matrix.csv \
    --out docs/benchmarks/cross_binding.md
```