Cross-Validation#
Cross-validation is used for two purposes in multivariate analysis:
Component selection - choosing the right number of components (PCA and PLS).
Coefficient uncertainty - obtaining error bars for PLS beta coefficients.
Selecting the Number of Components#
Choosing the right number of components is critical. Too few components underfit (they miss important structure); too many overfit (they model noise).
PRESS Cross-Validation#
The PCA.select_n_components() class method uses Predicted Residual Error
Sum of Squares (PRESS) with K-fold cross-validation:
For each candidate number of components (1, 2, …, max):
  - Split the data into K folds
  - Fit PCA on K-1 folds and predict the held-out fold
  - Compute the prediction error (PRESS)
Apply Wold’s criterion: stop when adding a component does not meaningfully reduce PRESS (ratio > threshold)
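The loop above can be sketched with plain NumPy. This is only an illustration of the row-wise K-fold idea, not the library's implementation: the helper name `pca_press`, the synthetic data, and the choice to predict held-out rows by projecting them onto the training loadings are all assumptions, and the prediction step inside `PCA.select_n_components()` may differ.

```python
import numpy as np

def pca_press(X, max_components, n_folds=7, seed=0):
    """Row-wise K-fold PRESS for 1..max_components (hypothetical helper).

    X is assumed to be centered and scaled already.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    press = np.zeros(max_components)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Loadings come from the training rows only (rows of Vt are loading vectors).
        _, _, Vt = np.linalg.svd(X[train_mask], full_matrices=False)
        for a in range(1, max_components + 1):
            P = Vt[:a].T                       # shape (n_features, a)
            X_hat = X[test_idx] @ P @ P.T      # project held-out rows, then reconstruct
            press[a - 1] += np.sum((X[test_idx] - X_hat) ** 2)
    return press

# Synthetic data with two real components plus noise (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(40, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # mean-center, unit variance

press = pca_press(X, max_components=5)
press_ratio = press[1:] / press[:-1]   # PRESS_a / PRESS_{a-1} for a = 2..5
```

In this simplified variant the held-out residual can never grow as components are added, so it is the size of each PRESS ratio, not its direction, that carries the information.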
from process_improve.multivariate.methods import PCA, MCUVScaler
X_scaled = MCUVScaler().fit_transform(X)
result = PCA.select_n_components(
    X_scaled,
    max_components=10,
    cv=7,            # 7-fold cross-validation
    threshold=0.95,  # Wold's criterion threshold
)
print(f"Recommended components: {result.n_components}")
print(f"PRESS values: {result.press}")
print(f"PRESS ratios: {result.press_ratio}")
The result is a Bunch with:
n_components: recommended number of components
press: PRESS value for each number of components
press_ratio: ratio PRESS_a / PRESS_{a-1} (values > threshold suggest overfitting)
cv_scores: raw cross-validation scores per fold
Wold’s Criterion#
The default threshold of 0.95 means: stop adding components when the PRESS ratio exceeds 0.95. A ratio close to 1.0 means the new component barely improves prediction - it is likely fitting noise.
Lower thresholds (e.g., 0.90) are more conservative (fewer components). Higher thresholds (e.g., 0.98) are more liberal (more components).
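A minimal sketch of the stopping rule shows how the threshold changes the answer. The helper `wold_select` and the PRESS values are hypothetical, and the sketch assumes the first component is always kept; how the library treats the first component is not stated here.

```python
def wold_select(press, threshold=0.95):
    """Return the number of components to keep (hypothetical helper).

    press[i] is the PRESS value for i + 1 components.
    Assumption: the first component is always retained.
    """
    for a in range(1, len(press)):
        ratio = press[a] / press[a - 1]   # PRESS with a + 1 components / PRESS with a
        if ratio > threshold:
            return a                      # the next component barely helps: stop here
    return len(press)                     # every candidate component passed the test

# Made-up PRESS values for 1..5 components; ratios are about 0.60, 0.93, 0.977, 0.994.
press = [100.0, 60.0, 55.8, 54.5, 54.2]

n_conservative = wold_select(press, threshold=0.90)   # 2
n_default = wold_select(press, threshold=0.95)        # 3
n_liberal = wold_select(press, threshold=0.98)        # 4
```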
PLS Component Selection#
PLS.select_n_components() cross-validates a PLS model and reports how it
performs on unseen data, in contrast to the calibration statistics stored on a
fitted model (rmse_, r2_cumulative_), which always improve as
components are added.
from process_improve.multivariate.methods import PLS, MCUVScaler
X_s = MCUVScaler().fit_transform(X)
Y_s = MCUVScaler().fit_transform(Y)
result = PLS.select_n_components(X_s, Y_s, max_components=8, cv=5)
print(f"Recommended components: {result.n_components}")
print(result.rmsecv["total"]) # RMSECV per component count
print(result.r2y_validated["total"]) # Validated R2 of Y
The result is a Bunch with:
n_components: recommended count, the one with the lowest overall RMSECV
rmsecv: root-mean-square error of cross-validation, per Y variable and overall
r2y_validated / r2x_validated: validated explained variance, per variable and overall
press: overall Y prediction error sum of squares per component count
cv_predictions: out-of-fold Y predictions at the recommended count
The cv argument accepts an integer (K-fold) or any scikit-learn splitter
object, such as KFold or LeaveOneOut.
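For a single Y variable, the validation statistics can be computed from out-of-fold predictions roughly as follows. This sketch uses the common textbook definitions (RMSECV as the root of PRESS divided by the number of observations, validated R² as one minus PRESS over the total sum of squares). Whether the library uses exactly these definitions is an assumption, and `cv_metrics` is a hypothetical helper.

```python
import math

def cv_metrics(y_true, y_pred_oof):
    """PRESS, RMSECV and validated R2 for one Y variable (hypothetical helper).

    y_pred_oof holds predictions made while each observation was held out.
    """
    n = len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred_oof))
    mean_y = sum(y_true) / n
    tss = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares about the mean
    return {
        "press": press,
        "rmsecv": math.sqrt(press / n),
        "r2_validated": 1.0 - press / tss,
    }

y = [1.0, 2.0, 3.0, 4.0]
y_oof = [1.5, 1.5, 3.5, 3.5]        # made-up out-of-fold predictions
metrics = cv_metrics(y, y_oof)
# press = 1.0, rmsecv = 0.5, r2_validated = 0.8
```

Unlike the calibration R², the validated value can fall, or even go negative, when extra components make out-of-fold predictions worse.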
PLS Beta Coefficient Error Bars#
For PLS models, model.cross_validate() refits the model on data subsets
and computes confidence intervals for the regression coefficients. This answers
the question: “How reliable is each beta coefficient?”
Three resampling strategies are supported:
Jackknife (cv="loo", default) - leave-one-out resampling. Uses the jackknife variance formula with t-distribution critical values.
K-fold (cv=5) - K-fold cross-validation. Faster for large datasets.
Bootstrap (n_bootstrap=200) - resample with replacement. Uses percentile confidence intervals.
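To make the jackknife option concrete, the sketch below applies the jackknife variance formula to a single least-squares slope, which stands in for one PLS beta coefficient. The helper names, the data, the n - 1 degrees of freedom, and centering the interval on the mean of the leave-one-out estimates are all assumptions for illustration; the library's exact choices may differ.

```python
import math

def slope(x, y):
    """Least-squares slope; a stand-in for a single beta coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def jackknife_ci(x, y, t_crit):
    """Leave-one-out jackknife interval for the slope (hypothetical helper).

    t_crit is the t-distribution critical value supplied by the caller,
    for example from scipy.stats.t.ppf(0.975, df).
    """
    n = len(x)
    # One estimate per left-out observation.
    betas = [slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(n)]
    beta_mean = sum(betas) / n
    # Jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((b - beta_mean) ** 2 for b in betas)
    se = math.sqrt(variance)
    lower, upper = beta_mean - t_crit * se, beta_mean + t_crit * se
    significant = lower > 0 or upper < 0     # interval excludes zero
    return beta_mean, se, lower, upper, significant

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0]   # roughly y = 2x
# 2.306 is the two-sided 95% critical value for 8 degrees of freedom (assumed df = n - 1).
beta_mean, se, lower, upper, significant = jackknife_ci(x, y, t_crit=2.306)
```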
from process_improve.multivariate.methods import PLS, MCUVScaler
scaler_x = MCUVScaler().fit(X)
scaler_y = MCUVScaler().fit(Y)
X_s, Y_s = scaler_x.transform(X), scaler_y.transform(Y)
pls = PLS(n_components=2).fit(X_s, Y_s)
# Jackknife (leave-one-out) cross-validation
cv = pls.cross_validate(X_s, Y_s, cv="loo")
print(cv.significant) # Which betas have CIs excluding zero
print(cv.beta_ci_lower) # Lower 95% CI
print(cv.beta_ci_upper) # Upper 95% CI
print(cv.q_squared) # Cross-validated R² (Q²)
print(cv.rmse_cv) # Cross-validated RMSE
The result is a Bunch with:
beta_mean, beta_std: mean and standard error of betas across resamples
beta_ci_lower, beta_ci_upper: confidence interval bounds
significant: boolean mask - True where the CI excludes zero
beta_samples: raw betas from every resample (n_resamples × K × M)
y_hat_cv: out-of-fold Y predictions (jackknife / K-fold only)
press: Prediction Error Sum of Squares
rmse_cv: cross-validated RMSE per Y variable
q_squared: cross-validated R² (Q²) per Y variable
See Projection to Latent Structures (PLS) for detailed documentation and additional examples.