Principal Component Analysis (PCA) =================================== PCA finds the directions of maximum variance in a single data matrix **X**. It is the workhorse of multivariate exploration, process monitoring, and outlier detection. Mathematical Background ----------------------- PCA decomposes an (N × K) data matrix into scores, loadings, and residuals: .. math:: \mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E} where: - **T** (N × A) - scores: the coordinates of each observation in the reduced space. - **P** (K × A) - loadings: the direction vectors that define the model plane. - **E** (N × K) - residuals: what the model does not explain. - A - the number of components retained. Geometrically, PCA fits a best-fit hyperplane through the cloud of observations in K-dimensional space. Each observation is projected perpendicularly onto this plane; the projected coordinates are the scores. Minimizing the perpendicular residual distance is equivalent to maximizing the variance captured in the scores - the two formulations yield the same solution. Preprocessing ------------- Preprocessing decisions have a large influence on the model. **Centering** subtracts the column means so that the model explains *deviations* from average behavior. Mean centering is the default; median centering is a robust alternative that simultaneously highlights outliers. **Scaling** equalizes the influence of variables measured in different units or ranges. Unit-variance scaling (dividing by the standard deviation) is the most common choice. When combined with mean centering this is called *autoscaling*. Use :class:`~process_improve.multivariate.methods.MCUVScaler` for sklearn-compatible autoscaling: .. code-block:: python from process_improve.multivariate.methods import PCA, MCUVScaler scaler = MCUVScaler() X_scaled = scaler.fit_transform(X) model = PCA(n_components=3).fit(X_scaled) **Transformations** (log, square root) can reduce the effect of extreme measurements or improve linearity. Consider adding derived columns (interactions, squared terms, ratios) to the raw data matrix - the model will sort out which are informative. Interpreting Scores ------------------- Scores (``model.scores_``) are the coordinates of each observation in the latent variable space. They reveal the dominant patterns in the data: - **Clusters** - observations that are similar in the original K variables will be close together in score space. - **Outliers** - observations far from the main cluster may represent unusual operating conditions or measurement errors. - **Time trends** - when rows are ordered by time, gradual drifts or sudden shifts become visible as trajectories in the score plot. - **Groups** - coloring scores by a categorical variable (grade, shift, equipment ID) can reveal systematic differences. Use ``model.score_plot()`` for quick visualization. Interpreting Loadings --------------------- Loadings (``model.loadings_``) describe how each original variable contributes to each component: - Variables with **large absolute loadings** on a component are important for that component. - Variables that are **positively correlated** cluster together in the loading plot; variables that are **negatively correlated** appear on opposite sides. - The **sign of a loading** can flip between different runs or software - the sign of the component as a whole is arbitrary, so always interpret loading *patterns* rather than individual signs. Use ``model.loading_plot()`` for quick visualization. SPE (Squared Prediction Error) ------------------------------ SPE (``model.spe_``) measures how well each observation conforms to the correlation structure captured by the model. It is the sum of squared residuals for that observation: .. math:: \text{SPE}_i = \mathbf{e}_i^T \mathbf{e}_i - SPE = 0 means the observation lies exactly on the model plane. - **High SPE** means the observation *breaks* the correlation structure - a new event or fault has introduced variation that the model's A components do not describe. - Column-wise residuals give the per-variable R² (``model.r2_per_variable_``), showing which variables are well explained and which are not. Use ``model.spe_plot()`` and ``model.spe_limit()`` for monitoring. Hotelling's T² -------------- Hotelling's T² (``model.hotellings_t2_``) measures how far each observation's projection is from the center of the model plane: .. math:: T^2_i = \sum_{a=1}^{A} \frac{t_{i,a}^2}{s_a^2} where :math:`s_a^2` is the variance of the *a*-th score column. - T² is always non-negative and follows an F-distribution, so confidence limits (e.g., 95 %) can be computed analytically. - **High T²** means the observation follows the correlation structure but with **unusual magnitude** - it is extreme *within* the model, not outside it. - On a score plot of component *a* vs *b*, the T² limit traces an **elliptical boundary**. This single ellipse replaces examining A(A−1)/2 pairwise scatterplots. Use ``model.t2_plot()`` and ``model.hotellings_t2_limit()`` for monitoring. Score Contributions ------------------- Score contributions decompose a score-space movement back into the original variable space, answering: *"Which variables caused this observation to score where it did?"* Each score value :math:`t_{i,a}` can be written as a sum of K contributions :math:`x_{i,k} \, p_{k,a}`, one per original variable. Plotting these as a bar chart reveals the dominant drivers. The reference point matters: contributions always measure the difference *from* one point *to* another. The default reference is the model center (the origin after centering), but you can compare any two points or groups: .. code-block:: python # Why does observation 5 differ from the model center? contrib = model.score_contributions(model.scores_.iloc[5].values) # Why do observations 5 and 10 differ from each other? contrib = model.score_contributions( model.scores_.iloc[5].values, model.scores_.iloc[10].values, ) # T²-weighted contributions (scale by 1/sqrt(eigenvalue)) contrib = model.score_contributions(model.scores_.iloc[5].values, weighted=True) Outlier Detection ----------------- The ``detect_outliers()`` method combines two complementary approaches: 1. **Statistical limits** - SPE and T² limits at a specified confidence level 2. **Robust ESD test** - Generalized Extreme Studentized Deviate test using median and MAD, which is robust to masking effects where multiple outliers hide each other Results are returned as a list of dicts sorted by severity (most severe first): .. code-block:: python outliers = model.detect_outliers(conf_level=0.95) for o in outliers: print(f"{o['observation']}: {o['outlier_types']} (severity={o['severity']})") Model Selection --------------- R² always increases as components are added, so it cannot tell you when to stop. The cross-validated Q² metric *decreases* once the model begins fitting noise rather than systematic structure. The point where Q² stops improving indicates the useful number of components. Use ``PCA.select_n_components()`` for automated selection via PRESS cross-validation with Wold's criterion (see :doc:`cross_validation` for details). Remember that the answer is never exact - examine one or two components beyond and before the suggestion, and consider the interpretability of each additional component in the context of your application. Process Monitoring ------------------ PCA-based process monitoring is one of the most common industrial applications. The workflow: 1. **Build a reference model** on data from normal, in-control operation. 2. **Preprocess new observations** using the *same* centering and scaling learned from the training data (e.g., the same ``MCUVScaler`` instance). 3. **Project new observations** into the model with ``model.predict(X_new)``. 4. **Monitor SPE and T²** - compare each new observation's SPE and T² against the control limits: .. code-block:: python result = model.predict(X_new) spe_lim = model.spe_limit(conf_level=0.95) t2_lim = model.hotellings_t2_limit(conf_level=0.95) spe_violations = result.spe_ > spe_lim t2_violations = result.hotellings_t2_ > t2_lim 5. **Investigate violations** using score contributions to identify which original variables are responsible. A high SPE signals a *new type* of event (the correlation structure has changed). A high T² signals an *extreme* version of normal behavior. Both can occur simultaneously. Missing Data ------------ Real process data often has missing values - sensor failures, skipped lab analyses, or variables recorded at different frequencies. The PCA implementation supports several iterative algorithms for fitting models in the presence of missing data: - **TSR** (Trimmed Scores Regression) - generally the best default choice; uses the available data to estimate scores, then reconstructs missing values. - **NIPALS** - the classical algorithm; handles moderate amounts of missing data well. - **SCP** - Single Component Projection; a simpler alternative. Enable missing data handling by passing a settings dictionary: .. code-block:: python model = PCA( n_components=3, missing_data_settings={ "md_method": "tsr", "md_tol": 1e-6, # convergence tolerance "md_max_iter": 200, # maximum iterations }, ) model.fit(X_with_nans) # Check convergence print(model.fitting_info_["iterations"]) print(model.fitting_info_["final_error"]) Troubleshooting --------------- .. list-table:: :header-rows: 1 :widths: 30 70 * - Symptom - Likely cause and remedy * - ``LinAlgError: Singular matrix`` - One or more columns are constant (zero variance) or are exact linear combinations of other columns. Remove constant columns and check for duplicate or perfectly correlated variables. * - Missing-data algorithm does not converge - Increase ``md_max_iter`` or relax ``md_tol``. If the fraction of missing data is very large (>40 %), the algorithm may struggle - try removing rows or columns with excessive missingness first. * - Shape mismatch in ``predict()`` - The new data must have the same columns (K) as the training data, in the same order. If you used ``MCUVScaler``, call ``scaler.transform(X_new)`` to apply the training centering/scaling. * - Very different results with different software - Check that the same preprocessing (centering, scaling) was applied. Also note that the sign of a component can flip - this is normal and does not affect interpretation. * - First component explains almost all variance - The data may not have been centered. Without centering, the first component captures the mean rather than variation around it.