# Principal Component Analysis (PCA)
PCA finds the directions of maximum variance in a single data matrix X. It is the workhorse of multivariate exploration, process monitoring, and outlier detection.
## Mathematical Background
PCA decomposes an (N × K) data matrix X into scores, loadings, and residuals:

\[ \mathbf{X} = \mathbf{T}\mathbf{P}^{\top} + \mathbf{E} \]

where:
- T (N × A) - scores: the coordinates of each observation in the reduced space.
- P (K × A) - loadings: the direction vectors that define the model plane.
- E (N × K) - residuals: what the model does not explain.
- A - the number of components retained.
Geometrically, PCA fits a best-fit hyperplane through the cloud of observations in K-dimensional space. Each observation is projected perpendicularly onto this plane; the projected coordinates are the scores. Minimizing the perpendicular residual distance is equivalent to maximizing the variance captured in the scores - the two formulations yield the same solution.
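Both views can be checked numerically with a short NumPy sketch (an illustration only, not the library's fitting code): the truncated SVD of the centered matrix supplies the loadings and scores, and the residuals it leaves are orthogonal to the model plane.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)          # mean-center the columns

# Truncated SVD: the first A right singular vectors are the loadings P,
# and projecting onto them gives the scores T.
A = 2
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:A].T                     # (K x A) loadings
T = Xc @ P                       # (N x A) scores: perpendicular projections
E = Xc - T @ P.T                 # (N x K) residuals

# Residuals are orthogonal to the model plane, so minimizing them
# and maximizing score variance give the same P.
assert np.allclose(E @ P, 0)
```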
## Preprocessing
Preprocessing decisions have a large influence on the model.
Centering subtracts the column means so that the model explains deviations from average behavior. Mean centering is the default; median centering is a robust alternative whose center is not pulled by outliers, which also makes them stand out more clearly.
Scaling equalizes the influence of variables measured in different units
or ranges. Unit-variance scaling (dividing by the standard deviation) is the
most common choice. When combined with mean centering this is called
autoscaling. Use `MCUVScaler` for sklearn-compatible autoscaling:
```python
from process_improve.multivariate.methods import PCA, MCUVScaler

scaler = MCUVScaler()
X_scaled = scaler.fit_transform(X)
model = PCA(n_components=3).fit(X_scaled)
```
Transformations (log, square root) can reduce the effect of extreme measurements or improve linearity. Consider adding derived columns (interactions, squared terms, ratios) to the raw data matrix - the model will sort out which are informative.
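For example, derived columns can be appended with pandas before fitting (the column names here are illustrative, not from the library):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"temp": [310.0, 320.0, 305.0], "flow": [1.2, 1.5, 1.1]})

# Log-transform a heavy-tailed variable and add interaction and squared
# terms; the model will reveal which derived columns are informative.
X_aug = X.assign(
    log_flow=np.log(X["flow"]),
    temp_x_flow=X["temp"] * X["flow"],
    temp_sq=X["temp"] ** 2,
)
```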
## Interpreting Scores
Scores (`model.scores_`) are the coordinates of each observation in the latent variable space. They reveal the dominant patterns in the data:
- Clusters - observations that are similar in the original K variables will be close together in score space.
- Outliers - observations far from the main cluster may represent unusual operating conditions or measurement errors.
- Time trends - when rows are ordered by time, gradual drifts or sudden shifts become visible as trajectories in the score plot.
- Groups - coloring scores by a categorical variable (grade, shift, equipment ID) can reveal systematic differences.
Use `model.score_plot()` for quick visualization.
## Interpreting Loadings
Loadings (`model.loadings_`) describe how each original variable contributes to each component:
- Variables with large absolute loadings on a component are important for that component.
- Variables that are positively correlated cluster together in the loading plot; variables that are negatively correlated appear on opposite sides.
- The sign of a loading can flip between different runs or software - the sign of the component as a whole is arbitrary, so always interpret loading patterns rather than individual signs.
Use `model.loading_plot()` for quick visualization.
## SPE (Squared Prediction Error)

SPE (`model.spe_`) measures how well each observation conforms to the correlation structure captured by the model. It is the sum of squared residuals for that observation:

\[ \mathrm{SPE}_i = \sum_{k=1}^{K} e_{i,k}^2 \]
SPE = 0 means the observation lies exactly on the model plane.
High SPE means the observation breaks the correlation structure - a new event or fault has introduced variation that the model’s A components do not describe.
Column-wise residuals give the per-variable R² (`model.r2_per_variable_`), showing which variables are well explained and which are not.
Use `model.spe_plot()` and `model.spe_limit()` for monitoring.
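In terms of the decomposition above, SPE is computed row-wise from the residual matrix; a standalone NumPy sketch (independent of the library):

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(30, 5))
Xc -= Xc.mean(axis=0)            # center, as in a fitted model

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                     # loadings of a 2-component model
E = Xc - (Xc @ P) @ P.T          # residual matrix

spe = (E ** 2).sum(axis=1)       # SPE_i = sum_k e_ik^2, one value per row
```

A row lying exactly on the model plane has zero residual and hence SPE = 0.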
## Hotelling's T²

Hotelling's T² (`model.hotellings_t2_`) measures how far each observation's projection is from the center of the model plane:

\[ T_i^2 = \sum_{a=1}^{A} \frac{t_{i,a}^2}{s_a^2} \]

where \(s_a^2\) is the variance of the a-th score column.
T² is always non-negative and, for approximately normal scores, follows a scaled F-distribution, so confidence limits (e.g., 95%) can be computed analytically.
High T² means the observation follows the correlation structure but with unusual magnitude - it is extreme within the model, not outside it.
On a score plot of component a vs b, the T² limit traces an elliptical boundary. This single ellipse replaces examining A(A−1)/2 pairwise scatterplots.
Use `model.t2_plot()` and `model.hotellings_t2_limit()` for monitoring.
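The statistic and one common form of its F-based limit can be sketched with NumPy/SciPy. This is a textbook illustration, not the library's implementation; the exact limit constant varies slightly between references and between model-building and monitoring phases.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
T = rng.normal(size=(40, 2))           # stand-in for the score columns
N, A = T.shape

s2 = T.var(axis=0, ddof=1)             # variance of each score column
t2 = (T ** 2 / s2).sum(axis=1)         # T2_i = sum_a t_ia^2 / s_a^2

# One common 95% limit (assumes approximately normal scores):
conf = 0.95
t2_lim = A * (N - 1) / (N - A) * stats.f.ppf(conf, A, N - A)
```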
## Score Contributions
Score contributions decompose a score-space movement back into the original variable space, answering: “Which variables caused this observation to score where it did?”
Each score value \(t_{i,a}\) can be written as a sum of K contributions \(x_{i,k} \, p_{k,a}\), one per original variable. Plotting these as a bar chart reveals the dominant drivers.
The reference point matters: contributions always measure the difference from one point to another. The default reference is the model center (the origin after centering), but you can compare any two points or groups:
```python
# Why does observation 5 differ from the model center?
contrib = model.score_contributions(model.scores_.iloc[5].values)

# Why do observations 5 and 10 differ from each other?
contrib = model.score_contributions(
    model.scores_.iloc[5].values,
    model.scores_.iloc[10].values,
)

# T²-weighted contributions (scale by 1/sqrt(eigenvalue))
contrib = model.score_contributions(model.scores_.iloc[5].values, weighted=True)
```
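The underlying arithmetic of a contribution is just an elementwise product that sums to the score (a NumPy check with illustrative values):

```python
import numpy as np

p_a = np.array([0.7, -0.1, 0.7])   # loadings of component a (made-up values)
x_i = np.array([1.2, 0.3, -0.5])   # centered/scaled observation i

contrib = x_i * p_a                # per-variable contributions x_ik * p_ka
t_ia = contrib.sum()               # contributions sum to the score t_ia
```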
## Outlier Detection
The `detect_outliers()` method combines two complementary approaches:
- Statistical limits - SPE and T² limits at a specified confidence level.
- Robust ESD test - Generalized Extreme Studentized Deviate test using the median and MAD, which is robust to masking effects, where multiple outliers hide each other.
Results are returned as a list of dicts sorted by severity (most severe first):
```python
outliers = model.detect_outliers(conf_level=0.95)
for o in outliers:
    print(f"{o['observation']}: {o['outlier_types']} (severity={o['severity']})")
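The robust standardization behind the ESD test can be sketched in NumPy: using the median and MAD instead of the mean and standard deviation keeps a gross outlier from inflating the scale estimate and masking itself (or others).

```python
import numpy as np

spe = np.array([0.8, 1.1, 0.9, 1.0, 12.0, 1.2])   # one gross outlier

med = np.median(spe)
mad = np.median(np.abs(spe - med))
robust_z = 0.6745 * (spe - med) / mad   # 0.6745 makes MAD consistent with sigma

# The outlier barely moves med/mad, so its robust z-score is huge,
# whereas it would inflate a mean/std-based z-score and partly hide.
```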
## Model Selection
R² always increases as components are added, so it cannot tell you when to stop. The cross-validated Q² metric decreases once the model begins fitting noise rather than systematic structure. The point where Q² stops improving indicates the useful number of components.
Use `PCA.select_n_components()` for automated selection via PRESS cross-validation with Wold's criterion (see Selecting the Number of Components for details). Remember that the answer is never exact - examine one or two components on either side of the suggestion, and consider the interpretability of each additional component in the context of your application.
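The monotonicity of R² is easy to demonstrate with a NumPy sketch (independent of the library): cumulative explained variance, computed from the singular values of a centered matrix, never decreases as components are added, which is why a cross-validated criterion such as Q² is needed to choose where to stop.

```python
import numpy as np

rng = np.random.default_rng(3)
Xc = rng.normal(size=(60, 6))
Xc -= Xc.mean(axis=0)

S = np.linalg.svd(Xc, compute_uv=False)
r2_cum = np.cumsum(S ** 2) / np.sum(S ** 2)   # cumulative R2 per component

# r2_cum is non-decreasing and reaches 1.0 with all components retained,
# so R2 alone can never signal "stop adding components".
```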
## Process Monitoring
PCA-based process monitoring is one of the most common industrial applications. The workflow:
1. Build a reference model on data from normal, in-control operation.
2. Preprocess new observations using the same centering and scaling learned from the training data (e.g., the same `MCUVScaler` instance).
3. Project new observations into the model with `model.predict(X_new)`.
4. Monitor SPE and T² - compare each new observation's SPE and T² against the control limits:

   ```python
   result = model.predict(X_new)

   spe_lim = model.spe_limit(conf_level=0.95)
   t2_lim = model.hotellings_t2_limit(conf_level=0.95)

   spe_violations = result.spe_ > spe_lim
   t2_violations = result.hotellings_t2_ > t2_lim
   ```

5. Investigate violations using score contributions to identify which original variables are responsible.
A high SPE signals a new type of event (the correlation structure has changed). A high T² signals an extreme version of normal behavior. Both can occur simultaneously.
## Missing Data
Real process data often has missing values - sensor failures, skipped lab analyses, or variables recorded at different frequencies. The PCA implementation supports several iterative algorithms for fitting models in the presence of missing data:
- TSR (Trimmed Scores Regression) - generally the best default choice; uses the available data to estimate scores, then reconstructs missing values.
- NIPALS - the classical algorithm; handles moderate amounts of missing data well.
- SCP - Single Component Projection; a simpler alternative.
Enable missing data handling by passing a settings dictionary:
```python
model = PCA(
    n_components=3,
    missing_data_settings={
        "md_method": "tsr",
        "md_tol": 1e-6,      # convergence tolerance
        "md_max_iter": 200,  # maximum iterations
    },
)
model.fit(X_with_nans)

# Check convergence
print(model.fitting_info_["iterations"])
print(model.fitting_info_["final_error"])
```
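The common idea behind these iterative schemes can be illustrated with a toy EM-style alternation in NumPy (a deliberately simplified sketch, not the library's TSR or NIPALS code): impute, fit a low-rank model, reconstruct the missing cells, and repeat until the imputed values stop changing.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
X[::7, 2] = np.nan                               # knock out some entries
miss = np.isnan(X)

Xf = np.where(miss, np.nanmean(X, axis=0), X)    # start from column means
for _ in range(200):
    mu = Xf.mean(axis=0)
    Xc = Xf - mu
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    recon = (Xc @ Vt[:2].T) @ Vt[:2] + mu        # rank-2 reconstruction
    new = np.where(miss, recon, X)               # refresh only missing cells
    if np.max(np.abs(new - Xf)) < 1e-6:          # convergence check
        Xf = new
        break
    Xf = new
```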
## Troubleshooting
| Symptom | Likely cause and remedy |
|---|---|
| `NaN` or `inf` values after scaling | One or more columns are constant (zero variance) or are exact linear combinations of other columns. Remove constant columns and check for duplicate or perfectly correlated variables. |
| Missing-data algorithm does not converge | Increase `md_max_iter` or loosen `md_tol` in `missing_data_settings`. |
| Shape mismatch in `predict()` | The new data must have the same columns (K) as the training data, in the same order. If you used `MCUVScaler`, transform new observations with the same fitted instance. |
| Very different results with different software | Check that the same preprocessing (centering, scaling) was applied. Also note that the sign of a component can flip - this is normal and does not affect interpretation. |
| First component explains almost all variance | The data may not have been centered. Without centering, the first component captures the mean rather than variation around it. |