Scaling and memory
Note
This page documents the in-memory assumption of the estimators and the practical data-size limits. Tracks ENG-19.
The in-memory assumption
The estimators in process-improve (PCA, PLS, TPLS, MBPCA, MBPLS, and the
batch feature pipeline) are in-memory: fit expects the (scaled) data
matrix to fit in RAM, and the iterative NIPALS / SVD paths hold a small number
of working copies on top of it. There is currently no out-of-core or
streaming code path - the whole matrix is materialised as a dense
numpy/pandas array.
This is the right trade-off for the data sizes these methods are normally applied to (process and lab data: thousands to low millions of rows, tens to a few hundred columns), where keeping the pandas row/column labels and the full diagnostic suite (SPE, Hotelling’s T², contributions) is worth far more than streaming.
Estimating the memory you need
A dense float64 matrix needs roughly:
bytes ~= n_rows x n_cols x 8
So a 10,000,000 x 200 matrix is about 16 GB just for the data - before
any working copies. As a rule of thumb, budget 2-4x the raw matrix size for
a fit:
- MCUVScaler / center / scale return a scaled copy.
- PCA.fit copies X once for the working array; NIPALS deflation works in place on that copy.
- PLS.fit similarly holds scaled X and Y plus deflation copies.
Example: a 1,000,000 x 100 matrix is ~0.8 GB raw, so expect ~2-3 GB
resident during fit - comfortable on a workstation, tight on a laptop.
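The arithmetic above can be sketched as a small helper. This is only an illustration of the rule of thumb: the function name `estimate_fit_memory_gb` and its parameters are hypothetical and not part of the package, and 1 GB is taken as 1e9 bytes to match the figures quoted above.

```python
def estimate_fit_memory_gb(n_rows, n_cols, itemsize=8, headroom=(2, 4)):
    """Rough memory budget for fitting on a dense matrix.

    Hypothetical helper, not part of process-improve.
    itemsize is bytes per element (8 for float64, 4 for float32).
    headroom is the (low, high) multiplier from the 2-4x rule of thumb.
    Returns (raw_gb, low_gb, high_gb) using 1 GB = 1e9 bytes.
    """
    raw_bytes = n_rows * n_cols * itemsize
    raw_gb = raw_bytes / 1e9
    low, high = headroom
    return raw_gb, raw_gb * low, raw_gb * high


# The 1,000,000 x 100 example from the text.
raw, low, high = estimate_fit_memory_gb(1_000_000, 100)
print(f"raw {raw:.1f} GB, budget {low:.1f}-{high:.1f} GB")
# prints: raw 0.8 GB, budget 1.6-3.2 GB
```

Passing `itemsize=4` gives the raw size of a float32 input, but the fit-time budget should still be computed with 8 bytes per element when the estimator works in float64 internally.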
When this is not the right tool
If your matrix does not fit in RAM (with the 2-4x headroom above), this package
is not currently the right tool for a single fit call. Options today:
- Down-sample or aggregate to a representative subset for model building, then
  transform/predict the full data in chunks (transform and predict are far
  cheaper than fit and can be applied batch-by-batch).
- Reduce dtype upstream (e.g. float32) if the precision budget allows; this
  halves the footprint. Note the estimators promote to float64 internally, so
  this mainly helps the input/transform side.
- Use an out-of-core PCA from another library (e.g.
  sklearn.decomposition.IncrementalPCA with partial_fit over chunks, or a
  dask-backed SVD) for the dimensionality-reduction step, then bring the
  reduced scores back into this package for the diagnostics.
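As a minimal sketch of the chunked approach, the example below fits scikit-learn's `IncrementalPCA` with `partial_fit` over row blocks and then applies `transform` block by block. The helpers `iter_chunks`, `fit_incremental_pca` and `transform_in_chunks` are hypothetical names introduced here, and the in-memory synthetic array only stands in for a source that would really be read from disk in pieces. It does not show the hand-off of the scores back into this package.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def iter_chunks(X, chunk_rows):
    """Yield consecutive row blocks of X (stand-in for reading chunks from disk)."""
    for start in range(0, X.shape[0], chunk_rows):
        yield X[start:start + chunk_rows]


def fit_incremental_pca(chunks, n_components):
    """Fit IncrementalPCA by streaming chunks through partial_fit."""
    ipca = IncrementalPCA(n_components=n_components)
    for chunk in chunks:
        # Each chunk must contain at least n_components rows.
        ipca.partial_fit(chunk)
    return ipca


def transform_in_chunks(model, chunks):
    """Project each chunk separately and stack the (much smaller) score blocks."""
    return np.vstack([model.transform(chunk) for chunk in chunks])


rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))  # synthetic data, assumed already scaled

ipca = fit_incremental_pca(iter_chunks(X, 1_000), n_components=3)
scores = transform_in_chunks(ipca, iter_chunks(X, 1_000))
print(scores.shape)  # (10000, 3)
```

`IncrementalPCA` mean-centres the data but does not scale columns to unit variance, so any scaling has to be applied to each chunk beforehand, using scaling parameters estimated separately (for example from a representative subset).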
Roadmap
A first-class out-of-core path for PCA (an incremental / chunked fitter, likely
behind a [bigdata] optional-dependency extra) is tracked in
ENG-19. It is demand-driven:
if you have a concrete larger-than-RAM use case, please comment on that issue.