Principal Component Analysis (PCA)#

PCA finds the directions of maximum variance in a single data matrix X. It is the workhorse of multivariate exploration, process monitoring, and outlier detection.

Mathematical Background#

PCA decomposes an (N × K) data matrix into scores, loadings, and residuals:

\[\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E}\]

where:

  • T (N × A) - scores: the coordinates of each observation in the reduced space.

  • P (K × A) - loadings: the direction vectors that define the model plane.

  • E (N × K) - residuals: what the model does not explain.

  • A - the number of components retained.

Geometrically, PCA fits a best-fit hyperplane through the cloud of observations in K-dimensional space. Each observation is projected perpendicularly onto this plane; the projected coordinates are the scores. Minimizing the perpendicular residual distance is equivalent to maximizing the variance captured in the scores - the two formulations yield the same solution.
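The decomposition above can be sketched directly with NumPy via the SVD of the centered data. This is an illustration of the math, not the library's implementation (the matrix sizes and seed are arbitrary):

```python
import numpy as np

# A minimal NumPy sketch of X = T P^T + E (not the library's implementation)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))          # N = 50 observations, K = 6 variables
Xc = X - X.mean(axis=0)               # mean-center each column

A = 2                                 # number of components retained
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :A] * s[:A]                  # scores   (N x A)
P = Vt[:A].T                          # loadings (K x A)
E = Xc - T @ P.T                      # residuals (N x K)
```

The residuals are orthogonal to the model plane (E @ P is zero to machine precision), which is the geometric statement that each observation is projected perpendicularly onto the plane.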

Preprocessing#

Preprocessing decisions have a large influence on the model.

Centering subtracts the column means so that the model explains deviations from average behavior. Mean centering is the default; median centering is a robust alternative that also makes unusual observations stand out more clearly.

Scaling equalizes the influence of variables measured in different units or ranges. Unit-variance scaling (dividing by the standard deviation) is the most common choice. When combined with mean centering this is called autoscaling. Use MCUVScaler for sklearn-compatible autoscaling:

from process_improve.multivariate.methods import PCA, MCUVScaler

scaler = MCUVScaler()
X_scaled = scaler.fit_transform(X)
model = PCA(n_components=3).fit(X_scaled)

Transformations (log, square root) can reduce the effect of extreme measurements or improve linearity. Consider adding derived columns (interactions, squared terms, ratios) to the raw data matrix - the model will sort out which are informative.
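Adding derived columns is straightforward with pandas. A small sketch (the column names here are illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements; column names are illustrative only
X = pd.DataFrame({"temp": [350.0, 355.0, 348.0],
                  "press": [1.2, 1.1, 1.3],
                  "flow": [10.0, 12.0, 9.0]})

X_aug = X.assign(
    temp_x_press=X["temp"] * X["press"],   # interaction term
    flow_sq=X["flow"] ** 2,                # squared term
    log_press=np.log(X["press"]),          # log transform
)
```

The augmented matrix is then centered, scaled, and fit exactly like the raw one.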

Interpreting Scores#

Scores (model.scores_) are the coordinates of each observation in the latent variable space. They reveal the dominant patterns in the data:

  • Clusters - observations that are similar in the original K variables will be close together in score space.

  • Outliers - observations far from the main cluster may represent unusual operating conditions or measurement errors.

  • Time trends - when rows are ordered by time, gradual drifts or sudden shifts become visible as trajectories in the score plot.

  • Groups - coloring scores by a categorical variable (grade, shift, equipment ID) can reveal systematic differences.

Use model.score_plot() for quick visualization.
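The cluster behavior can be demonstrated with a synthetic example: two groups that differ systematically across all K variables separate cleanly along the first score. This is a NumPy sketch with made-up data, not library code:

```python
import numpy as np

# Two groups offset in opposite directions across all 5 variables
rng = np.random.default_rng(1)
noise = rng.normal(size=(40, 5))
X = np.vstack([noise[:20] + 3.0, noise[20:] - 3.0])
labels = np.array([0] * 20 + [1] * 20)

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
t1 = U[:, 0] * s[0]                   # first score column

# The groups sit on opposite sides of the score plot along t1
gap = abs(t1[labels == 0].mean() - t1[labels == 1].mean())
```

Coloring t1 (or a t1 vs t2 scatter) by `labels` would show the two clusters directly.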

Interpreting Loadings#

Loadings (model.loadings_) describe how each original variable contributes to each component:

  • Variables with large absolute loadings on a component are important for that component.

  • Variables that are positively correlated cluster together in the loading plot; variables that are negatively correlated appear on opposite sides.

  • The sign of a loading can flip between different runs or software - the sign of the component as a whole is arbitrary, so always interpret loading patterns rather than individual signs.

Use model.loading_plot() for quick visualization.
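Because component signs are arbitrary, a fixed sign convention can make loadings comparable across runs. One common convention, sketched here as a standalone helper (not a library function), flips each component so its largest-magnitude loading is positive; scores must flip together with loadings so the reconstruction is unchanged:

```python
import numpy as np

def align_signs(P, T):
    # Flip each component so its largest-magnitude loading is positive.
    # Scores flip together with loadings, so T @ P.T is unchanged.
    flip = np.sign(P[np.abs(P).argmax(axis=0), np.arange(P.shape[1])])
    return P * flip, T * flip

# Example loadings: the second component's dominant loading is negative
P = np.array([[ 0.8, -0.6],
              [ 0.6, -0.7],
              [ 0.1,  0.4]])
T = np.random.default_rng(0).normal(size=(5, 2))
P_aligned, T_aligned = align_signs(P, T)
```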

SPE (Squared Prediction Error)#

SPE (model.spe_) measures how well each observation conforms to the correlation structure captured by the model. It is the sum of squared residuals for that observation:

\[\text{SPE}_i = \mathbf{e}_i^T \mathbf{e}_i\]

  • SPE = 0 means the observation lies exactly on the model plane.

  • High SPE means the observation breaks the correlation structure - a new event or fault has introduced variation that the model’s A components do not describe.

  • Column-wise residuals give the per-variable R² (model.r2_per_variable_), showing which variables are well explained and which are not.

Use model.spe_plot() and model.spe_limit() for monitoring.
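Both the per-observation SPE and the per-variable R² fall directly out of the residual matrix. A NumPy sketch of the arithmetic (the library exposes these as model.spe_ and model.r2_per_variable_):

```python
import numpy as np

# Computing SPE and per-variable R^2 by hand from the residual matrix
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)

A = 2
P = np.linalg.svd(Xc, full_matrices=False)[2][:A].T   # loadings (K x A)
T = Xc @ P                                            # scores
E = Xc - T @ P.T                                      # residuals

spe = np.sum(E**2, axis=1)             # SPE_i = e_i^T e_i, one per observation
r2_per_var = 1 - E.var(axis=0) / Xc.var(axis=0)       # column-wise R^2
```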

Hotelling’s T²#

Hotelling’s T² (model.hotellings_t2_) measures how far each observation’s projection is from the center of the model plane:

\[T^2_i = \sum_{a=1}^{A} \frac{t_{i,a}^2}{s_a^2}\]

where \(s_a^2\) is the variance of the a-th score column.

  • T² is always non-negative and, suitably scaled, follows an F-distribution, so confidence limits (e.g., 95 %) can be computed analytically.

  • High T² means the observation follows the correlation structure but with unusual magnitude - it is extreme within the model, not outside it.

  • On a score plot of component a vs b, the T² limit traces an elliptical boundary. This single ellipse replaces examining A(A−1)/2 pairwise scatterplots.

Use model.t2_plot() and model.hotellings_t2_limit() for monitoring.
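The T² formula and one common F-based limit can be computed by hand. This is a sketch of the arithmetic using SciPy; the exact limit formula the library uses in hotellings_t2_limit() may differ (the form below is a standard one for new observations):

```python
import numpy as np
from scipy import stats

# T^2 by hand, plus one common F-based 95 % limit for new observations
rng = np.random.default_rng(3)
N, K, A = 40, 5, 2
X = rng.normal(size=(N, K))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :A] * s[:A]                  # scores

s2 = T.var(axis=0, ddof=1)            # variance of each score column
t2 = np.sum(T**2 / s2, axis=1)        # T2_i = sum_a t_ia^2 / s_a^2

lim = A * (N - 1) * (N + 1) / (N * (N - A)) * stats.f.ppf(0.95, A, N - A)
```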

Score Contributions#

Score contributions decompose a score-space movement back into the original variable space, answering: “Which variables caused this observation to score where it did?”

Each score value \(t_{i,a}\) can be written as a sum of K contributions \(x_{i,k} \, p_{k,a}\), one per original variable. Plotting these as a bar chart reveals the dominant drivers.

The reference point matters: contributions always measure the difference from one point to another. The default reference is the model center (the origin after centering), but you can compare any two points or groups:

# Why does observation 5 differ from the model center?
contrib = model.score_contributions(model.scores_.iloc[5].values)

# Why do observations 5 and 10 differ from each other?
contrib = model.score_contributions(
    model.scores_.iloc[5].values,
    model.scores_.iloc[10].values,
)

# T²-weighted contributions (scale by 1/sqrt(eigenvalue))
contrib = model.score_contributions(model.scores_.iloc[5].values, weighted=True)

Outlier Detection#

The detect_outliers() method combines two complementary approaches:

  1. Statistical limits - SPE and T² limits at a specified confidence level

  2. Robust ESD test - Generalized Extreme Studentized Deviate test using median and MAD, which is robust to masking effects where multiple outliers hide each other

Results are returned as a list of dicts sorted by severity (most severe first):

outliers = model.detect_outliers(conf_level=0.95)
for o in outliers:
    print(f"{o['observation']}: {o['outlier_types']} (severity={o['severity']})")

Model Selection#

R² always increases as components are added, so it cannot tell you when to stop. The cross-validated Q² metric decreases once the model begins fitting noise rather than systematic structure. The point where Q² stops improving indicates the useful number of components.

Use PCA.select_n_components() for automated selection via PRESS cross-validation with Wold’s criterion (see Selecting the Number of Components for details). Remember that the answer is never exact - examine one or two components on either side of the suggestion, and consider the interpretability of each additional component in the context of your application.
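A Wold-style stopping rule is simple to express: keep adding components while cross-validated Q² improves by at least some threshold, and stop at the first one that does not. The sketch below is illustrative (the function name, threshold, and Q² values are hypothetical, not the library's implementation):

```python
def suggest_n_components(q2, min_improvement=0.01):
    # Wold-style rule: stop at the first component whose cumulative Q2
    # gain over the previous model falls below `min_improvement`.
    n = 1
    for a in range(1, len(q2)):
        if q2[a] - q2[a - 1] < min_improvement:
            break
        n = a + 1
    return n

# Hypothetical cumulative Q2 values from cross-validation:
# the fourth component adds almost nothing, so three are suggested
q2 = [0.45, 0.62, 0.70, 0.705, 0.69]
n_suggested = suggest_n_components(q2)
```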

Process Monitoring#

PCA-based process monitoring is one of the most common industrial applications. The workflow:

  1. Build a reference model on data from normal, in-control operation.

  2. Preprocess new observations using the same centering and scaling learned from the training data (e.g., the same MCUVScaler instance).

  3. Project new observations into the model with model.predict(X_new).

  4. Monitor SPE and T² - compare each new observation’s SPE and T² against the control limits:

    result = model.predict(X_new)
    
    spe_lim = model.spe_limit(conf_level=0.95)
    t2_lim = model.hotellings_t2_limit(conf_level=0.95)
    
    spe_violations = result.spe_ > spe_lim
    t2_violations = result.hotellings_t2_ > t2_lim
    
  5. Investigate violations using score contributions to identify which original variables are responsible.

A high SPE signals a new type of event (the correlation structure has changed). A high T² signals an extreme version of normal behavior. Both can occur simultaneously.

Missing Data#

Real process data often has missing values - sensor failures, skipped lab analyses, or variables recorded at different frequencies. The PCA implementation supports several iterative algorithms for fitting models in the presence of missing data:

  • TSR (Trimmed Scores Regression) - generally the best default choice; uses the available data to estimate scores, then reconstructs missing values.

  • NIPALS - the classical algorithm; handles moderate amounts of missing data well.

  • SCP - Single Component Projection; a simpler alternative.

Enable missing data handling by passing a settings dictionary:

model = PCA(
    n_components=3,
    missing_data_settings={
        "md_method": "tsr",
        "md_tol": 1e-6,  # convergence tolerance
        "md_max_iter": 200,  # maximum iterations
    },
)
model.fit(X_with_nans)

# Check convergence
print(model.fitting_info_["iterations"])
print(model.fitting_info_["final_error"])

Troubleshooting#

| Symptom | Likely cause and remedy |
| --- | --- |
| LinAlgError: Singular matrix | One or more columns are constant (zero variance) or are exact linear combinations of other columns. Remove constant columns and check for duplicate or perfectly correlated variables. |
| Missing-data algorithm does not converge | Increase md_max_iter or relax md_tol. If the fraction of missing data is very large (>40 %), the algorithm may struggle - try removing rows or columns with excessive missingness first. |
| Shape mismatch in predict() | The new data must have the same columns (K) as the training data, in the same order. If you used MCUVScaler, call scaler.transform(X_new) to apply the training centering/scaling. |
| Very different results with different software | Check that the same preprocessing (centering, scaling) was applied. Also note that the sign of a component can flip - this is normal and does not affect interpretation. |
| First component explains almost all variance | The data may not have been centered. Without centering, the first component captures the mean rather than variation around it. |
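The singular-matrix remedy (drop zero-variance columns, then flag perfectly correlated ones) can be sketched with pandas. The data frame here is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [5.0, 5.0, 5.0],      # constant column -> zero variance
    "c": [2.0, 4.0, 6.0],      # exact multiple of "a" -> perfectly correlated
})

# Drop zero-variance columns
df = df.loc[:, df.std() > 0]

# Flag perfectly correlated pairs (candidates for removal)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
dupes = [col for col in upper.columns if (upper[col] >= 1.0 - 1e-12).any()]
```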