Selecting the Number of Components

Choosing the right number of components is critical. Too few components underfit the data (they miss important structure); too many overfit it (they model noise).

PRESS Cross-Validation

The PCA.select_n_components() class method uses the Predicted Residual Error Sum of Squares (PRESS), estimated with K-fold cross-validation:

  1. For each candidate number of components (1, 2, …, max):

       • Split the data into K folds

       • Fit PCA on K-1 folds and predict the held-out fold

       • Compute the prediction error (PRESS)

  2. Apply Wold’s criterion: stop when adding a component does not meaningfully reduce PRESS (ratio > threshold)
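Step 1 can be sketched with NumPy alone. This is a minimal illustration that assumes a plain row-wise K-fold scheme; the helper name kfold_press and its arguments are hypothetical and not part of process_improve, whose internal implementation may differ (for example, by holding out individual elements rather than whole rows). The demo data are synthetic.

```python
import numpy as np


def kfold_press(X, max_components, n_folds=7, seed=0):
    """Hypothetical helper: PRESS for a = 1..max_components (row-wise K-fold)."""
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]
    rng = np.random.default_rng(seed)
    # Shuffle the row indices and split them into K roughly equal folds.
    folds = np.array_split(rng.permutation(n_rows), n_folds)
    press = np.zeros(max_components)
    for test_idx in folds:
        train_mask = np.ones(n_rows, dtype=bool)
        train_mask[test_idx] = False
        X_train, X_test = X[train_mask], X[test_idx]
        # Center with the training mean only, so the held-out fold stays unseen.
        mean = X_train.mean(axis=0)
        # Rows of Vt are the PCA loadings of the training folds.
        _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
        X_test_c = X_test - mean
        for a in range(1, max_components + 1):
            P = Vt[:a].T                      # loadings for a components
            X_hat = X_test_c @ P @ P.T        # reconstruction of held-out rows
            press[a - 1] += np.sum((X_test_c - X_hat) ** 2)
    return press


# Synthetic demo: two real components plus a small amount of noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(70, 2)) @ rng.normal(size=(2, 20))
X_demo += 0.1 * rng.normal(size=(70, 20))

press = kfold_press(X_demo, max_components=5)
press_ratio = press[1:] / press[:-1]  # PRESS_a / PRESS_{a-1} for a = 2..5
print("PRESS:", press)
print("PRESS ratios:", press_ratio)
```

In this sketch the ratio is defined only from the second component onward, because no PRESS value is computed for zero components.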

from process_improve.multivariate.methods import PCA, MCUVScaler

# X is the raw data matrix (rows = observations, columns = variables)
X_scaled = MCUVScaler().fit_transform(X)

result = PCA.select_n_components(
    X_scaled,
    max_components=10,
    cv=7,  # 7-fold cross-validation
    threshold=0.95,  # Wold's criterion threshold
)

print(f"Recommended components: {result.n_components}")
print(f"PRESS values: {result.press}")
print(f"PRESS ratios: {result.press_ratio}")

The result is a Bunch with:

  • n_components: recommended number of components

  • press: PRESS value for each number of components

  • press_ratio: ratio PRESS_a / PRESS_{a-1} (values > threshold suggest overfitting)

  • cv_scores: raw cross-validation scores per fold

Wold’s Criterion

The default threshold of 0.95 means: stop adding components when the PRESS ratio exceeds 0.95. A ratio close to 1.0 means the new component barely improves prediction, so it is likely fitting noise.

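Lower thresholds (e.g., 0.90) are more conservative: a component must reduce PRESS by at least 10% to be kept, so fewer components are selected. Higher thresholds (e.g., 0.98) are more liberal: a 2% reduction is enough, so more components are selected.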
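The effect of the threshold can be illustrated with a small sketch of the stopping rule. The function wold_n_components and the PRESS values are hypothetical, made up for illustration; the sketch also assumes that the first component is always kept, which may not match the library's behaviour.

```python
def wold_n_components(press, threshold=0.95):
    """Hypothetical helper: keep adding components while
    PRESS_a / PRESS_{a-1} <= threshold; stop at the first ratio above it."""
    n_components = 1  # assumption: the first component is always kept
    for a in range(1, len(press)):
        ratio = press[a] / press[a - 1]
        if ratio > threshold:
            break  # component a + 1 does not reduce PRESS enough
        n_components = a + 1
    return n_components


# Made-up PRESS values for 1..6 components.
# Successive ratios: 0.50, 0.40, 0.92, ~0.973, ~0.994
press_values = [100.0, 50.0, 20.0, 18.4, 17.9, 17.8]

for threshold in (0.90, 0.95, 0.98):
    print(threshold, wold_n_components(press_values, threshold))
# 0.9 3
# 0.95 4
# 0.98 5
```

The same PRESS curve yields three, four, or five components depending only on the threshold, so it is worth inspecting the ratios themselves instead of relying on a single cut-off.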