Selecting the Number of Components
Choosing the right number of components is critical. Too few components underfit (they miss important structure); too many overfit (they model noise).
PRESS Cross-Validation
The PCA.select_n_components() class method uses Predicted Residual Error
Sum of Squares (PRESS) with K-fold cross-validation:
1. For each candidate number of components (1, 2, …, max):
   - Split data into K folds
   - Fit PCA on K-1 folds, predict the held-out fold
   - Compute prediction error (PRESS)
2. Apply Wold’s criterion: stop when adding a component does not meaningfully reduce PRESS (ratio > threshold)
from process_improve.multivariate.methods import PCA, MCUVScaler

X_scaled = MCUVScaler().fit_transform(X)
result = PCA.select_n_components(
    X_scaled,
    max_components=10,
    cv=7,            # 7-fold cross-validation
    threshold=0.95,  # Wold's criterion threshold
)
print(f"Recommended components: {result.n_components}")
print(f"PRESS values: {result.press}")
print(f"PRESS ratios: {result.press_ratio}")
The result is a Bunch with:
- n_components: recommended number of components
- press: PRESS value for each number of components
- press_ratio: ratio PRESS_a / PRESS_{a-1} (values > threshold suggest overfitting)
- cv_scores: raw cross-validation scores per fold
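To make the cross-validation loop concrete, here is a minimal NumPy-only sketch of the "fit on K-1 folds, predict the held-out fold" idea. It is an illustration under stated assumptions, not the library's implementation: the function name `press_by_kfold`, its parameters, and the row-wise reconstruction of held-out rows are all hypothetical choices, and `PCA.select_n_components()` may compute PRESS differently internally.

```python
import numpy as np


def press_by_kfold(X, max_components, n_folds=7, seed=0):
    """Accumulate PRESS for 1..max_components over K folds (illustrative only)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    if max_components > min(n_rows - n_rows // n_folds - 1, n_cols):
        raise ValueError("max_components is too large for this data set")

    # Shuffle the row indices once, then cut them into K roughly equal folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_rows), n_folds)

    press = np.zeros(max_components)
    for held_out in folds:
        train_mask = np.ones(n_rows, dtype=bool)
        train_mask[held_out] = False
        X_train, X_test = X[train_mask], X[held_out]

        # Fit PCA on the training rows: centre, then take the SVD.
        mean = X_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)

        X_test_centred = X_test - mean
        for a in range(1, max_components + 1):
            loadings = Vt[:a].T  # shape (n_cols, a)
            # Project held-out rows onto the loadings and reconstruct them.
            reconstructed = X_test_centred @ loadings @ loadings.T
            press[a - 1] += np.sum((X_test_centred - reconstructed) ** 2)
    return press


# Synthetic data (an assumption for the demo): two real directions plus noise.
rng = np.random.default_rng(42)
scores = rng.normal(size=(100, 2))
directions = np.array([[1, 1, 1, 1, 1, 1], [1, -1, 1, -1, 1, -1]], dtype=float)
X_demo = scores @ directions + 0.05 * rng.normal(size=(100, 6))

press_demo = press_by_kfold(X_demo, max_components=5, n_folds=7)
ratio_demo = press_demo[1:] / press_demo[:-1]  # PRESS_a / PRESS_{a-1} for a >= 2
print("PRESS:", press_demo)
print("PRESS ratios (a = 2..5):", ratio_demo)
```

Because the synthetic data contain two underlying directions, PRESS should fall sharply up to two components and only slowly afterwards. Note that this simple row-wise reconstruction can never increase PRESS as components are added, which is one reason the ratio between successive PRESS values, rather than PRESS itself, is the quantity to watch.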
Wold’s Criterion
The default threshold of 0.95 means: stop adding components when the PRESS ratio exceeds 0.95. A ratio close to 1.0 means the new component barely improves prediction, so it is likely fitting noise.
Lower thresholds (e.g., 0.90) are more conservative (fewer components). Higher thresholds (e.g., 0.98) are more liberal (more components).
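The effect of the threshold can be seen in a small sketch of the stopping rule. The helper `wold_n_components` below is hypothetical and written only for illustration. It assumes the PRESS values are given for 1, 2, … components and that at least one component is always kept; how the library treats the first component is not stated here.

```python
import numpy as np


def wold_n_components(press, threshold=0.95):
    """Return the number of components kept by the ratio rule (illustrative)."""
    press = np.asarray(press, dtype=float)
    # ratios[i] compares (i + 2) components against (i + 1) components.
    ratios = press[1:] / press[:-1]
    for i, ratio in enumerate(ratios):
        if ratio > threshold:
            # Component i + 2 did not reduce PRESS enough: keep i + 1.
            return i + 1
    # No ratio exceeded the threshold: keep every candidate component.
    return len(press)


# Made-up PRESS values for 1..4 components (ratios: 0.93, ~0.957, ~0.989).
press_example = [100.0, 93.0, 89.0, 88.0]
for threshold in (0.90, 0.95, 0.98):
    kept = wold_n_components(press_example, threshold)
    print(f"threshold={threshold:.2f} -> {kept} component(s)")
```

With these made-up values, the 0.90 threshold keeps one component, 0.95 keeps two, and 0.98 keeps three, which matches the conservative-versus-liberal behaviour described above.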