Scaling and memory

Note

This page documents the in-memory assumption of the estimators and the practical data-size limits. Tracks ENG-19.

The in-memory assumption

The estimators in process-improve (PCA, PLS, TPLS, MBPCA, MBPLS, and the batch feature pipeline) are in-memory: fit expects the (scaled) data matrix to fit in RAM, and the iterative NIPALS / SVD paths hold a small number of working copies on top of it. There is currently no out-of-core or streaming code path - the whole matrix is materialised as a dense numpy/pandas array.

This is the right trade-off for the data sizes these methods are normally applied to (process and lab data: thousands to low millions of rows, tens to a few hundred columns), where keeping the pandas row/column labels and the full diagnostic suite (SPE, Hotelling’s T², contributions) is worth far more than streaming.

Estimating the memory you need

A dense float64 matrix needs roughly:

bytes ~= n_rows x n_cols x 8

So a 10,000,000 x 200 matrix is about 16 GB for the data alone, before any working copies. As a rule of thumb, budget 2-4x the raw matrix size for a fit, because each of these steps holds at least one extra copy:

  • MCUVScaler / center / scale return a scaled copy.

  • PCA.fit copies X once for the working array; NIPALS deflation works in place on that copy.

  • PLS.fit similarly holds scaled X and Y plus deflation copies.

Example: a 1,000,000 x 100 matrix is ~0.8 GB raw, so expect ~2-3 GB resident during fit - comfortable on a workstation, tight on a laptop.
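The arithmetic above can be wrapped in a small helper for quick sizing checks. This is a minimal sketch, not part of process-improve: the function name `estimate_fit_memory_gb` and its parameters are hypothetical, it assumes 1 GB = 1e9 bytes, and it simply applies the 2-4x rule of thumb to the raw dense size.

```python
def estimate_fit_memory_gb(n_rows, n_cols, bytes_per_value=8, headroom=(2.0, 4.0)):
    """Return (raw_gb, low_gb, high_gb) for a dense matrix.

    Hypothetical helper: raw size is n_rows x n_cols x bytes_per_value,
    and the budget range applies the 2-4x rule of thumb to that raw size.
    """
    raw_gb = n_rows * n_cols * bytes_per_value / 1e9
    low_factor, high_factor = headroom
    return raw_gb, raw_gb * low_factor, raw_gb * high_factor


# The two shapes used as examples on this page.
for n_rows, n_cols in [(1_000_000, 100), (10_000_000, 200)]:
    raw, low, high = estimate_fit_memory_gb(n_rows, n_cols)
    print(f"{n_rows:,} x {n_cols}: raw {raw:.1f} GB, budget {low:.1f}-{high:.1f} GB")
```

Pass `bytes_per_value=4` to size a float32 input, keeping in mind that the estimators promote to float64 internally.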

When this is not the right tool

If your matrix does not fit in RAM (with the 2-4x headroom above), this package cannot currently fit it in a single call. Options today:

  • Down-sample or aggregate to a representative subset for model building, then transform / predict the full data in chunks (transform and predict are far cheaper than fit and can be applied batch-by-batch).

  • Reduce dtype upstream (e.g. float32) if the precision budget allows; this halves the footprint. Note the estimators promote to float64 internally, so this mainly helps the input/transform side.

  • Use an out-of-core PCA from another library (e.g. sklearn.decomposition.IncrementalPCA with partial_fit over chunks, or a dask-backed SVD) for the dimensionality-reduction step, then bring the reduced scores back into this package for the diagnostics.
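The first option (build the model on a subset, then project the full data in chunks) can be sketched as follows. This is an illustration in plain NumPy, not the process-improve API: the SVD-based PCA stands in for the package's own fit, the data is synthetic, and names such as `iter_chunks`, `loadings` and the chunk size are hypothetical. In a real workflow each chunk would be read from disk rather than sliced from an array already in memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a large dataset (3 latent factors plus noise).
n_rows, n_cols, n_components = 20_000, 8, 2
X_full = rng.normal(size=(n_rows, 3)) @ rng.normal(size=(3, n_cols))
X_full += 0.05 * rng.normal(size=(n_rows, n_cols))

# 1. Build the model on a random subset that fits comfortably in memory.
subset_idx = rng.choice(n_rows, size=2_000, replace=False)
X_sub = X_full[subset_idx]
mean = X_sub.mean(axis=0)
std = X_sub.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd((X_sub - mean) / std, full_matrices=False)
loadings = Vt[:n_components].T  # shape (n_cols, n_components)


def iter_chunks(X, chunk_rows):
    """Yield consecutive row blocks of X (hypothetical chunk reader)."""
    for start in range(0, X.shape[0], chunk_rows):
        yield X[start:start + chunk_rows]


# 2. Project the full data chunk by chunk, reusing the subset's
#    centering, scaling and loadings. Only one chunk is scaled at a time.
scores = np.empty((n_rows, n_components))
spe = np.empty(n_rows)
row = 0
for chunk in iter_chunks(X_full, chunk_rows=5_000):
    Z = (chunk - mean) / std            # scale with the subset statistics
    T = Z @ loadings                    # scores for this chunk
    residual = Z - T @ loadings.T       # part not explained by the model
    scores[row:row + len(chunk)] = T
    spe[row:row + len(chunk)] = (residual ** 2).sum(axis=1)
    row += len(chunk)
```

The key design point is that the scaling parameters and loadings come only from the subset, so the chunked projection gives the same scores as projecting everything at once, while peak memory stays near one chunk plus the small score arrays.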

Roadmap

A first-class out-of-core path for PCA (an incremental / chunked fitter, likely behind a [bigdata] optional-dependency extra) is tracked in ENG-19. It is demand-driven: if you have a concrete larger-than-RAM use case, please comment on that issue.