Model Evaluation and Visualization#

After fitting a PCA or PLS model, the next questions are practical: how many components should the model keep, how well does it predict, and which variables drive it? This page is a worked tour of the evaluation and plotting tools, built around a PLS model that relates a set of process measurements (X) to quality outcomes (Y). Each example assumes X and Y are already scaled, for example with MCUVScaler.

Choosing the Number of Components#

Calibration fit always improves as components are added, so it cannot tell you when to stop. PLS.select_n_components cross-validates the model and reports the root-mean-square error of cross-validation (RMSECV) together with the validated explained variance:

from process_improve.multivariate import PLS

result = PLS.select_n_components(X, Y, max_components=8, cv=5)
print(result.n_components)        # recommended component count
print(result.rmsecv["total"])     # RMSECV per component count

See Cross-Validation for the full description, including PLS.cross_validate for beta-coefficient error bars.

Explained Variance#

Once the model is fitted, explained_variance_plot shows how much variance each component captures, both per component and cumulatively:

model = PLS(n_components=result.n_components).fit(X, Y)
model.explained_variance_plot()

For PCA the bars refer to variance in the X-block; for PLS they refer to the Y-block. The same method is available on a fitted PCA model.

Correlation Loadings#

correlation_loadings_plot places each variable by its correlation with two components’ scores. A variable’s squared distance from the origin is the fraction of its variance explained by those two components, so every variable lies inside the unit circle. Concentric ellipses mark variance-explained thresholds:

model.correlation_loadings_plot(pc_horiz=1, pc_vert=2)

For PLS the X- and Y-variables are overlaid, which reveals how process variables relate to quality outcomes. The ellipse thresholds are configurable. The 50% and 100% ellipses are the convention - the outer ellipse is the unit circle, the inner one marks variables that are well explained - but any fractions work:

model.correlation_loadings_plot(variance_ellipses=(0.75, 0.95))

Observed versus Predicted#

predictions_vs_observed_plot draws a parity plot of the calibration predictions against the observed Y, with a y = x reference line and an RMSE annotation:

model.predictions_vs_observed_plot(y_observed=Y, variable="quality")

Points close to the reference line indicate accurate predictions; systematic departures from it point to model bias.

Regression Coefficients#

coefficient_plot shows the PLS regression coefficients as a bar chart, one bar per X-variable, for a chosen Y-variable:

model.coefficient_plot(variable="quality")

Tall bars mark the X-variables that most strongly drive the prediction. To see how reliable each coefficient is, pair this plot with the cross-validated error bars from PLS.cross_validate (see Cross-Validation).

Comparing Two Data Blocks#

The RV coefficient and its modified form RV2 measure how much common structure two matrices, measured on the same observations, share. They are a multivariate generalization of a squared correlation:

from process_improve.multivariate import rv_coefficient, rv2_coefficient

rv_coefficient(X, Y)     # in [0, 1]; 1 means identical configurations
rv2_coefficient(X, Y)    # modified RV, unbiased for high-dimensional data

Use rv2_coefficient when the blocks have many more variables than observations: the ordinary RV coefficient is biased upwards in that regime and tends towards 1 even for unrelated blocks.