Multivariate Analysis#

Models#

PCA#

class process_improve.multivariate.methods.PCA(n_components, *, algorithm='auto', missing_data_settings=None)[source]#

Bases: TransformerMixin, BaseEstimator

Principal Component Analysis with support for missing data.

Parameters:
  • n_components (int) – Number of principal components to extract.

  • algorithm (str, default="auto") – Algorithm to use for fitting the model.

    - "auto": uses SVD when data is complete, NIPALS when data has missing values.
    - "svd": Singular Value Decomposition; requires complete data.
    - "nipals": Non-linear Iterative Partial Least Squares; handles missing data.
    - "tsr": Trimmed Score Regression; handles missing data.

  • missing_data_settings (dict or None, default=None) – Settings for iterative missing data algorithms (NIPALS, TSR). Keys: md_tol (convergence tolerance), md_max_iter (max iterations).

Attributes (after fitting):

  • scores (pd.DataFrame of shape (n_samples, n_components)) – The score matrix (T).

  • loadings (pd.DataFrame of shape (n_features, n_components)) – The loading matrix (P).

  • r2_per_component (pd.Series of length n_components) – Fractional R² explained by each component.

  • r2_cumulative (pd.Series of length n_components) – Cumulative R² after each component.

  • r2_per_variable (pd.DataFrame of shape (n_features, n_components)) – Per-variable cumulative R² after each component.

  • spe (pd.DataFrame of shape (n_samples, n_components)) – Squared Prediction Error (stored as sqrt of row sum-of-squares).

  • hotellings_t2 (pd.DataFrame of shape (n_samples, n_components)) – Cumulative Hotelling’s T² statistic.

  • explained_variance (np.ndarray of shape (n_components,)) – Variance explained by each component.

  • scaling_factor_for_scores (pd.Series of length n_components) – Standard deviation per score (sqrt of explained variance).

  • has_missing_data (bool) – Whether the training data contained missing values.

  • fitting_info (dict) – Timing and iteration info from the fitting algorithm.
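Examples

A minimal usage sketch (X_raw is an illustrative, user-supplied DataFrame of process data; attribute names as documented above):

>>> from process_improve.multivariate.methods import PCA, MCUVScaler
>>> scaler = MCUVScaler().fit(X_raw)
>>> X_scaled = scaler.transform(X_raw)
>>> pca = PCA(n_components=2).fit(X_scaled)
>>> pca.r2_cumulative        # cumulative R² per component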

fit(X, y=None)[source]#

Fit a principal component analysis (PCA) model to the data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data. May contain NaN values for missing data.

  • y (ignored)

Returns:

self

Return type:

PCA

transform(X)[source]#

Project new data onto the fitted PCA model to obtain scores.

Parameters:

X (array-like of shape (n_samples, n_features)) – New data to project. Must have the same number of features as the training data.

Returns:

scores

Return type:

pd.DataFrame of shape (n_samples, n_components)
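Examples

For example (scaler is the scaler fitted on the training data; X_new shares the training columns):

>>> t_new = pca.transform(scaler.transform(X_new))
>>> t_new.shape              # (n_new_samples, n_components)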

fit_transform(X, y=None)[source]#

Fit the model and return the training scores.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data. May contain NaN values for missing data.

  • y (ignored)

Return type:

pd.DataFrame of shape (n_samples, n_components)

predict(X)[source]#

Project new data and compute diagnostics (scores, Hotelling’s T², SPE).

Parameters:

X (array-like of shape (n_samples, n_features))

Returns:

result – With keys scores, hotellings_t2, spe.

Return type:

sklearn.utils.Bunch
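Examples

A short sketch (X_new_scaled is assumed preprocessed the same way as the training data):

>>> result = pca.predict(X_new_scaled)
>>> result.scores            # scores for the new observations
>>> result.spe               # SPE per new observation
>>> result.hotellings_t2     # T² per new observation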

score(X, y=None)[source]#

Negative mean squared reconstruction error (higher is better).

Follows the sklearn convention where higher scores indicate better model fit. This makes PCA compatible with cross_val_score, GridSearchCV, and other sklearn model-selection utilities.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test data to score.

  • y (ignored)

Returns:

score – Negative mean squared reconstruction error.

Return type:

float

Examples

>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(PCA(n_components=2), X_scaled, cv=5)
>>> print(f"Mean CV score: {scores.mean():.4f}")
classmethod select_n_components(X, *, max_components=None, cv=5, threshold=0.95, **pca_kwargs)[source]#

Select the number of components via PRESS cross-validation.

Fits PCA models with 1, 2, …, max_components components, evaluates each with K-fold cross-validation, and recommends the optimal number using Wold’s criterion: stop adding components when PRESS_a / PRESS_{a-1} > threshold.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • max_components (int, optional) – Maximum number of components to evaluate. Default is min(n_samples - 1, n_features).

  • cv (int or sklearn CV splitter, default 5) – Number of cross-validation folds, or an sklearn splitter object (e.g., KFold(n_splits=10, shuffle=True)).

  • threshold (float, default 0.95) – Wold’s criterion threshold. If PRESS_a / PRESS_{a-1} exceeds this value, component a is deemed not significant. Lower values are more aggressive (fewer components).

  • **pca_kwargs – Additional keyword arguments passed to the PCA() constructor (e.g., algorithm="nipals" for data with missing values).

Returns:

result – With keys:

  • n_components - recommended number of components (int)

  • press - PRESS per component count (pd.Series, indexed 1..A_max)

  • press_ratio - PRESS_a / PRESS_{a-1} (pd.Series, indexed 2..A_max)

  • cv_scores - per-fold scores (pd.DataFrame, A_max rows x cv cols)

Return type:

sklearn.utils.Bunch
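Examples

For example, continuing with X_scaled from the class example above:

>>> result = PCA.select_n_components(X_scaled, max_components=8, cv=5)
>>> result.n_components      # recommended number of components
>>> result.press             # PRESS per component count
>>> result.press_ratio       # Wold's criterion ratios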

score_contributions(t_start, t_end=None, components=None, *, weighted=False)[source]#

Contribution of each variable to a score-space movement.

Decomposes the difference (t_end - t_start) back into variable space via the loadings for the selected components.

Parameters:
  • t_start (array-like of shape (n_components,)) – Score vector of the observation of interest. Can be a row from self.scores_ or from predict(X_new).scores.

  • t_end (array-like of shape (n_components,), optional) – Reference point in score space. Default is the model center (a vector of zeros), which is the most common choice.

  • components (list of int, optional) – 1-based component indices to decompose over, matching the model’s column convention. Examples: [2, 3] for a PC2-vs-PC3 score plot, or None (default) for all components - appropriate for Hotelling’s T² contributions.

  • weighted (bool, default False) – If True, scale the score difference by 1/sqrt(explained_variance) per component before back-projecting. This gives contributions to the T² statistic rather than to the Euclidean score distance.

Returns:

contributions – One value per variable; sign indicates direction. Index contains the variable (feature) names.

Return type:

pd.Series of shape (n_features,)

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> # Why does observation 0 differ from the model center?
>>> contrib = pca.score_contributions(pca.scores_.iloc[0].values)
>>> # Which variables drive the difference between obs 0 and obs 5?
>>> contrib = pca.score_contributions(
...     pca.scores_.iloc[0].values, pca.scores_.iloc[5].values
... )
detect_outliers(conf_level=0.95)[source]#

Detect outlier observations using SPE and Hotelling’s T² diagnostics.

Combines two approaches:

  1. Statistical limits - observations exceeding the SPE or T² limit at conf_level are flagged.

  2. Robust ESD test - the generalized ESD test (with robust median/MAD variant) identifies observations that are unusual relative to the rest of the data, even if they fall below the statistical limit.

An observation can be flagged for one or both reasons.

Parameters:

conf_level (float, default 0.95) – Confidence level in [0.8, 0.999]. Controls both the statistical limits and the ESD test’s significance level (alpha = 1 - conf_level).

Returns:

outliers – Sorted from most severe to least. Each dict contains:

  • observation - index label of the observation

  • outlier_types - list of "spe" and/or "hotellings_t2"

  • spe - SPE value for this observation

  • hotellings_t2 - T² value for this observation

  • spe_limit - SPE limit at the given confidence level

  • hotellings_t2_limit - T² limit at the given confidence level

  • severity - max(spe/spe_limit, t2/t2_limit)

Return type:

list of dict

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> outliers = pca.detect_outliers(conf_level=0.95)
>>> for o in outliers:
...     print(f"{o['observation']}: {o['outlier_types']} (severity={o['severity']})")

PLS#

class process_improve.multivariate.methods.PLS(n_components, *, scale=True, max_iter=1000, tol=1.4901161193847656e-08, copy=True, missing_data_settings=None)[source]#

Bases: RegressorMixin, TransformerMixin, BaseEstimator

Projection to Latent Structures (PLS) regression with diagnostics.

Implements PLS via the NIPALS algorithm with production diagnostics: SPE, Hotelling’s T², score contributions, and outlier detection. The API mirrors PCA so that model.scores_, model.spe_, and model.detect_outliers() work identically for both model types.

Parameters:
  • n_components (int) – Number of latent components to extract.

  • scale (bool, default=True) – Whether to scale X and Y to unit variance (sklearn internal scaling). When using MCUVScaler externally, set scale=False to avoid double scaling.

  • max_iter (int, default=1000) – Maximum number of iterations for the NIPALS algorithm.

  • tol (float, default=sqrt(machine epsilon)) – Convergence tolerance for the NIPALS algorithm.

  • copy (bool, default=True) – Whether to copy X and Y before fitting.

  • missing_data_settings (dict or None, default=None) – Settings for missing data algorithms (NIPALS/TSR for PLS). Keys: md_method ("tsr", "scp", "nipals"), md_tol, md_max_iter.

Attributes (after fitting):

  • scores (pd.DataFrame of shape (n_samples, n_components)) – X-block score matrix (T). This is the primary score matrix; equivalent to x_scores in older versions.

  • y_scores (pd.DataFrame of shape (n_samples, n_components)) – Y-block score matrix (U).

  • x_loadings (pd.DataFrame of shape (n_features, n_components)) – X-block loading matrix (P).

  • y_loadings (pd.DataFrame of shape (n_targets, n_components)) – Y-block loading matrix (C).

  • x_weights (pd.DataFrame of shape (n_features, n_components)) – X-block weight matrix (W).

  • y_weights (pd.DataFrame of shape (n_targets, n_components)) – Y-block weight matrix.

  • direct_weights (pd.DataFrame of shape (n_features, n_components)) – Direct (W*) weights: W (P'W)^{-1}. Used for direct projection T = X @ W*.

  • beta_coefficients (pd.DataFrame of shape (n_features, n_targets)) – Regression coefficients linking X directly to Y.

  • predictions (pd.DataFrame of shape (n_samples, n_targets)) – Y predictions from the training data.

  • spe (pd.DataFrame of shape (n_samples, n_components)) – Squared Prediction Error (stored as sqrt of row sum-of-squares).

  • hotellings_t2 (pd.DataFrame of shape (n_samples, n_components)) – Cumulative Hotelling’s T² statistic.

  • r2_per_component (pd.Series of length n_components) – Fractional R² (on Y) explained by each component.

  • r2_cumulative (pd.Series of length n_components) – Cumulative R² (on Y) after each component.

  • r2_per_variable (pd.DataFrame of shape (n_features, n_components)) – Per-variable cumulative R² for X after each component.

  • r2y_per_variable (pd.DataFrame of shape (n_targets, n_components)) – Per-variable R² for Y after each component.

  • rmse (pd.DataFrame of shape (n_targets, n_components)) – Root mean squared error of Y predictions per component.

  • explained_variance (np.ndarray of shape (n_components,)) – Variance explained by each component in X.

  • scaling_factor_for_scores (pd.Series of length n_components) – Standard deviation per score (sqrt of explained variance).

  • has_missing_data (bool) – Whether the training data contained missing values.

  • fitting_info (dict) – Timing and iteration info from the fitting algorithm.

See also

PCA

Principal Component Analysis.

MCUVScaler

Mean-center unit-variance scaler.

References

Abdi, “Partial least squares regression and projection on latent structure regression (PLS Regression)”, 2010, DOI: 10.1002/wics.51

Examples

>>> import pandas as pd
>>> from process_improve.multivariate.methods import PLS, MCUVScaler
>>> X = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 3, 2, 1]})
>>> Y = pd.DataFrame({"y": [2.1, 3.9, 6.2, 7.8]})
>>> pls = PLS(n_components=1)
>>> pls = pls.fit(MCUVScaler().fit_transform(X), MCUVScaler().fit_transform(Y))
>>> pls.scores_.shape
(4, 1)
fit(X, Y)[source]#

Fit a projection to latent structures (PLS) model to the data.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples (rows) and n_features is the number of features (columns).

  • Y (array-like, shape (n_samples, n_targets)) – Training data, where n_samples is the number of samples (rows) and n_targets is the number of target outputs (columns).

Returns:

Model object.

Return type:

PLS

References

Abdi, “Partial least squares regression and projection on latent structure regression (PLS Regression)”, 2010, DOI: 10.1002/wics.51

transform(X, Y=None)[source]#

Project X (and optionally Y) into the latent space.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Data to transform.

  • Y (array-like of shape (n_samples, n_targets), optional) – Ignored. Present for API compatibility with sklearn pipelines.

Returns:

X_scores – Projected X data (scores).

Return type:

pd.DataFrame of shape (n_samples, n_components)

fit_transform(X, Y=None)[source]#

Fit the model and return X scores.

Parameters:
  • X (array-like of shape (n_samples, n_features))

  • Y (array-like of shape (n_samples, n_targets))

Returns:

X_scores

Return type:

pd.DataFrame of shape (n_samples, n_components)

score(X, Y, sample_weight=None)[source]#

Return the R² score for the prediction.

Parameters:
  • X (array-like of shape (n_samples, n_features))

  • Y (array-like of shape (n_samples, n_targets)) – True target values.

  • sample_weight (array-like of shape (n_samples,), optional)

Returns:

score – R² of self.predict(X).y_hat w.r.t. Y.

Return type:

float

predict(X)[source]#

Project new data and compute diagnostics.

Parameters:

X (array-like of shape (n_samples, n_features))

Returns:

result – With keys scores, hotellings_t2, spe, y_hat.

Return type:

sklearn.utils.Bunch

Examples

>>> result = pls.predict(scaler_x.transform(X_new))
>>> result.y_hat           # Predicted Y values
>>> result.spe             # SPE for each new observation
>>> result.hotellings_t2   # T² for each new observation
score_contributions(t_start, t_end=None, components=None, *, weighted=False)[source]#

Contribution of each X variable to a score-space movement.

Identical to PCA.score_contributions but uses the PLS X-loadings (P matrix) for back-projection.

Parameters:
  • t_start (array-like of shape (n_components,)) – Score vector of the observation of interest.

  • t_end (array-like of shape (n_components,), optional) – Reference point in score space. Default is the model center (zeros).

  • components (list of int, optional) – 1-based component indices. Default: all components.

  • weighted (bool, default False) – If True, scale by 1/sqrt(explained_variance) for T² contributions.

Returns:

contributions

Return type:

pd.Series of shape (n_features,)

Examples

>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled)
>>> contrib = pls.score_contributions(pls.scores_.iloc[0].values)
>>> contrib.abs().sort_values(ascending=False).head()  # top contributors
detect_outliers(conf_level=0.95)[source]#

Detect outlier observations using SPE and Hotelling’s T² diagnostics.

Same approach as PCA.detect_outliers: combines statistical limits with the robust generalized ESD test.

Parameters:

conf_level (float, default 0.95) – Confidence level in [0.8, 0.999].

Returns:

outliers – Sorted from most severe to least. Each dict contains observation, outlier_types, spe, hotellings_t2, spe_limit, hotellings_t2_limit, severity.

Return type:

list of dict

Examples

>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled)
>>> outliers = pls.detect_outliers(conf_level=0.95)
>>> for o in outliers:
...     print(f"{o['observation']}: {o['outlier_types']}")
set_score_request(*, sample_weight='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (PLS)

Returns:

self – The updated object.

Return type:

object

TPLS#

class process_improve.multivariate.methods.TPLS(n_components, d_matrix, max_iter=500, skip_f_matrix_preprocessing=False)[source]#

Bases: RegressorMixin, BaseEstimator

TPLS algorithm for T-shaped data structures (we also include standard pre-processing of the data inside this class).

Source: Garcia-Munoz, https://doi.org/10.1016/j.chemolab.2014.02.006, Chem.Intell.Lab.Sys. v133, p 49 to 62, 2014.

We change the notation from the original paper to avoid confusion with a generic “X” matrix, and match symbols that are more natural for our use.

Notation mapping (paper → this code):

  • X^T → D: d_matrix (external), d_mats (internal) - Database of properties

  • X → D^T: transposed D (not used directly)

  • R → F: f_mats - Formula matrices

  • Z → Z: z_mats - Process conditions

  • Y → Y: y_mats - Quality indicators

Notes

  1. Matrices in F, Z and Y must all have the same number of rows.

  2. Columns in F must match the rows in D.

  3. Conditions in Z may be missing (turning it into an L-shaped data structure).

Parameters:
  • n_components (int) – Number of latent components to extract.

  • d_matrix (dict[str, dict[str, pd.DataFrame]]) – A dictionary containing the properties of each group of materials. Keys are group names; values are DataFrames with properties as columns and materials as rows. This “D” matrix is provided once at construction and reused for fitting, prediction and cross-validation.

  • max_iter (int, optional) – The maximum number of iterations for the TPLS algorithm. Default is 500.

  • skip_f_matrix_preprocessing (bool, default=False) – If True, skip the standard preprocessing of the F matrices.

Notes

The input X is a dictionary with 4 keys:

  • D: Database of DataFrames (properties in columns, materials in rows). D = {"Group A": df_props_a, "Group B": df_props_b, ...}

  • F: Formula matrices (rows = blends, columns = materials). F = {"Group A": df_formulas_a, "Group B": df_formulas_b, ...}

  • Z: Process conditions - one row per blend, one column per condition.

  • Y: Product quality indicators - one row per blend, one column per indicator.

n_samples#

The number of samples (rows) in the training data

Type:

int

n_substances#

The number of substances (columns) in the training data, i.e. the number of materials in the F matrix.

Type:

int

Example

>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng()
>>>
>>> n_props_a, n_props_b = 6, 4            # Two groups of properties: A and B.
>>> n_materials_a, n_materials_b = 12, 8   # Number of materials in each group.
>>> n_formulas = 40                        # Number of formulas in matrix F.
>>> n_outputs = 3
>>> n_conditions = 2
>>>
>>> properties = {
...     "Group A": pd.DataFrame(rng.standard_normal((n_materials_a, n_props_a))),
...     "Group B": pd.DataFrame(rng.standard_normal((n_materials_b, n_props_b))),
... }
>>> formulas = {
...     "Group A": pd.DataFrame(rng.standard_normal((n_formulas, n_materials_a))),
...     "Group B": pd.DataFrame(rng.standard_normal((n_formulas, n_materials_b))),
... }
>>> process_conditions = {"Conditions": pd.DataFrame(rng.standard_normal((n_formulas, n_conditions)))}
>>> quality_indicators = {"Quality": pd.DataFrame(rng.standard_normal((n_formulas, n_outputs)))}
>>> all_data = {"Z": process_conditions, "D": properties, "F": formulas, "Y": quality_indicators}
>>> estimator = TPLS(n_components=4, d_matrix=properties).fit(all_data)
fit(X, y=None)[source]#

Fit the preprocessing parameters and also the latent variable model from the training data.

Parameters:
  • X ({dictionary of dataframes}, keys that must be present: "D", "F", "Z", and "Y") – The training input samples. See documentation in the class definition for more information on each matrix.

  • y (None)

Returns:

self – Returns self.

Return type:

object

predict(X)[source]#

Model inference on new data.

This will pre-process the new data and apply those subsequently to the latent variable model.

Example

# Training phase:
estimator = TPLS(n_components=2, d_matrix=d_matrix).fit(training_data)

# Testing/inference phase: at least the "F" block is needed for a new
# prediction; "Z" is optional.
new_data = {"Z": ..., "F": ...}
predictions = estimator.predict(new_data)

Parameters:

X (DataFrameDict) – The input samples.

Returns:

y – A dictionary of prediction objects. More details to come here later. Please ask.

Return type:

dict

display_results(show_cumulative_stats=True)[source]#

Display the results of the model fitting.

Parameters:

show_cumulative_stats (bool)

Return type:

str

score(X, y=None, sample_weight=None)[source]#

Return the R² score on test data.

See RegressorMixin.score for more details.

Parameters:
  • X (DataFrameDict) – Test samples.

  • y (Not used. The X input already contains a "Y" block, which serves as the Y-data.)

  • sample_weight (Not used.)

Returns:

score – \(R^2\) of self.predict(X).

Return type:

float

help()[source]#

Help for the TPLS Estimator.

Data organization#

Quick tips#

Build model:             tpls = TPLS(n_components=2, d_matrix=d_matrix).fit(X)
Get model's predictions: tpls.hat   <-- the hat-matrix, i.e., the predictions
Predict on new data:     tpls.predict(X_new)
See model summary:       tpls.display_results()
This help page:          tpls.help()

Statistical values#

.t_scores_super         Super scores for the entire model [pd.DataFrame]
.hotellings_t2          Hotelling's T2 values for each observation, per component [pd.DataFrame]
.spe                    Squared prediction error for each block [dict of pd.DataFrames]

.hotellings_t2_limit()  Returns the Hotelling's T2 limit for the model [float]
.spe_limit[block]()     Returns the SPE limit for the block; e.g. .spe_limit["Y"]() [float]

TODO: hotellings_t2, hotellings_t2_limit, scaling_factor_for_scores, ellipse_coordinates.

Return type:

str

vip(block=None)[source]#

Return Variable Importance in Projection (VIP) scores for TPLS blocks.

VIP scores are computed during fitting for the D-block (material properties) and F-block (formulation variables) and stored in feature_importance.

Parameters:

block (str or None, default=None) – Which block to return. Must be "D" or "F", or None to return all blocks.

Returns:

If block is None: {"D": {group: pd.Series, ...}, "F": {group: pd.Series, ...}}. If block is "D" or "F": the inner dict {group: pd.Series, ...} for that block, where each pd.Series is indexed by feature names.

Return type:

dict

Raises:

ValueError – If the model is not fitted or block is not "D", "F", or None.

Examples

>>> tpls = TPLS(...).fit(data)
>>> tpls.vip()          # all blocks
>>> tpls.vip("D")       # D-block only → {group_name: pd.Series, ...}
set_score_request(*, sample_weight='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (TPLS)

Returns:

self – The updated object.

Return type:

object

MBPLS#

Multi-block PLS in the hierarchical / superblock formulation of Westerhuis, Kourti & MacGregor (1998). Each X-block is preprocessed independently and weighted by 1/sqrt(K_b) before the inner NIPALS loop, so blocks of unequal width contribute fairly to the consensus super-score.

class process_improve.multivariate.methods.MBPLS(n_components, *, max_iter=500, tol=None)[source]#

Bases: RegressorMixin, BaseEstimator

Multi-block PLS (hierarchical / superblock formulation).

Generic multi-block PLS as described by Westerhuis, Kourti & MacGregor (1998) and Westerhuis & Smilde (2001). Each X-block is preprocessed independently (mean-centred and unit-variance scaled), then divided by sqrt(K_b) so that blocks of unequal width contribute fairly to the super-score.

Parameters:
  • n_components (int) – Number of latent variables to extract.

  • max_iter (int, default=500) – Maximum NIPALS iterations per latent variable.

  • tol (float or None, default=None) – Convergence tolerance on the change in the Y-block score. If None, np.finfo(float).eps ** (6/7) is used (matching the legacy multi-block reference implementation).

Attributes (after fitting):

  • block_names (list[str]) – Ordered list of X-block names (the keys of the input dict).

  • block_widths (dict[str, int]) – Number of variables in each X-block.

  • super_scores (pd.DataFrame, shape (n_samples, n_components)) – Super-block (consensus) X-scores T.

  • super_y_scores (pd.DataFrame, shape (n_samples, n_components)) – Super-block Y-scores U.

  • super_weights (pd.DataFrame, shape (n_blocks, n_components)) – Super-block weights w_super; rows indexed by block name.

  • super_y_loadings (pd.DataFrame, shape (n_targets, n_components)) – Y-block loadings c.

  • block_scores (dict[str, pd.DataFrame]) – Per-block X-scores t_b, each shape (n_samples, n_components).

  • block_weights (dict[str, pd.DataFrame]) – Per-block X-weights w_b, each shape (K_b, n_components). Each column has unit norm.

  • block_loadings (dict[str, pd.DataFrame]) – Per-block X-loadings p_b (used for deflation), each shape (K_b, n_components).

  • predictions (pd.DataFrame, shape (n_samples, n_targets)) – In-sample Y predictions on the original scale.

  • explained_variance (np.ndarray, shape (n_components,)) – Variance of the super-score per component (ddof=1).

  • scaling_factor_for_super_scores (pd.Series) – sqrt(explained_variance_) per component.

  • fitting_info (dict) – Per-component iteration count and timing.

  • has_missing_data (bool) – Whether any X-block or Y had NaN values.

Notes

Block weighting uses the convention \(X_b / \sqrt{K_b}\) so that every block contributes the same total sum of squares to the super-score, regardless of how many variables it has.

References

Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12 (1998), 301-321.

Westerhuis, J. A. & Smilde, A. K. Deflation in multiblock PLS. Journal of Chemometrics, 15 (2001), 485-493.
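Examples

A minimal sketch on synthetic data (random values for illustration only; block names are arbitrary):

>>> import numpy as np
>>> import pandas as pd
>>> from process_improve.multivariate.methods import MBPLS
>>> rng = np.random.default_rng(13)
>>> X = {
...     "Spectra": pd.DataFrame(rng.standard_normal((20, 6))),
...     "Process": pd.DataFrame(rng.standard_normal((20, 3))),
... }
>>> y = pd.DataFrame(rng.standard_normal((20, 2)))
>>> mbpls = MBPLS(n_components=2).fit(X, y)
>>> mbpls.transform(X).shape
(20, 2)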

fit(X, y)[source]#

Fit the multi-block PLS model.

Parameters:
  • X (dict[str, pd.DataFrame]) – X-blocks. Keys are block names; values are DataFrames sharing the same row index (and row count). Each block is preprocessed independently.

  • y (pd.DataFrame) – Y-block. Same row index / row count as the X-blocks.

Return type:

MBPLS

block_spe_limit(block, conf_level=0.95)[source]#

SPE limit for one X-block using the Nomikos & MacGregor chi-square approximation.

Operates on the same scale as block_spe_[block] (sqrt of row sum of squares), so the value can be drawn directly on a SPE plot.

Parameters:
  • block (str) – Name of the X-block.

  • conf_level (float, default 0.95)

Return type:

float
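Examples

For instance (mbpls and X from the class example above):

>>> limit = mbpls.block_spe_limit("Spectra", conf_level=0.95)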

super_spe_limit(conf_level=0.95)[source]#

SPE limit for the merged super-block (sum of per-block SPE squared).

Parameters:

conf_level (float)

Return type:

float

spe_contributions(X)[source]#

Per-variable squared residuals for each X-block (SPE contributions).

For each new observation and each X-block, reconstruct the block as T_super @ P_b^T (matching the deflation step used during fit) and return the squared per-variable residuals. Useful for fault diagnosis: the variable with the largest contribution is the most likely culprit for a high SPE.

Returns:

One DataFrame per block, shape (n_samples, K_b). Values are preprocessed-scale squared residuals (centred and scaled inside the model). Sum across columns equals block_spe_[b].iloc[:, -1] ** 2.

Return type:

dict[str, pd.DataFrame]

Parameters:

X (dict[str, DataFrame])
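Examples

A fault-diagnosis sketch (mbpls and X from the class example above):

>>> contribs = mbpls.spe_contributions(X)
>>> contribs["Spectra"].iloc[0].sort_values(ascending=False).head()  # likely culprits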

score_contributions(t_super_start, t_super_end=None, components=None, *, weighted=False)[source]#

Per-block per-variable contributions to a super-score movement.

The multi-block analogue of PCA.score_contributions(). Decomposes a super-score-space delta back into preprocessed-scale variable-space deltas, one per X-block.

For MBPLS, the super-score at component a is built from the per-block scores (which themselves are weighted regressions of the block on w_b), so the back-projection through w_super[b,a] * w_b[:,a] / sqrt(K_b) gives the variable contribution.

Parameters:
  • t_super_start (array-like, shape (n_components,)) – Super-score row of the observation of interest. Typically a row from self.super_scores_ or from predict(X_new).super_scores.

  • t_super_end (array-like, optional) – Reference point in super-score space. Defaults to the model centre (zeros).

  • components (list of int, optional) – 1-based component indices to decompose over. None (default) uses all components - appropriate for Hotelling’s T² contributions.

  • weighted (bool, default=False) – If True, divide the super-score delta by sqrt(explained_variance_) per component before back-projecting, giving contributions to the T² statistic instead of the Euclidean super-score distance.

Returns:

One Series per X-block (length K_b), indexed by variable (column) name.

Return type:

dict[str, pd.Series]
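Examples

For example, decomposing the super-score of the first observation (using the documented predict output):

>>> result = mbpls.predict(X)
>>> contribs = mbpls.score_contributions(result.super_scores.iloc[0].values)
>>> contribs["Process"].abs().sort_values(ascending=False)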

super_score_plot(pc_horiz=1, pc_vert=2)[source]#

Scatter plot of super-scores for two components.

Parameters:
  • pc_horiz (int)

  • pc_vert (int)

Return type:

Figure

super_weights_bar_plot(component=1)[source]#

Bar plot of super-weights w_super for a single component.

Parameters:

component (int)

Return type:

Figure

predictions_vs_observed_plot(y_observed, variable=None)[source]#

Scatter plot of predicted vs observed Y, with y=x reference and RMSEE annotation.

Parameters:
  • y_observed (pd.DataFrame) – The observed Y on the original scale, same columns as the training Y.

  • variable (str or None, default=None) – If given, plot only that Y-variable. If None, plot the first one.

Return type:

Figure

display_results(show_cumulative=True)[source]#

Format a short text summary of per-block R²X, overall R²Y, iterations and timing.

Parameters:

show_cumulative (bool)

Return type:

str

transform(X)[source]#

Project new data to super-scores using the fitted model.

Parameters:

X (dict[str, DataFrame])

Return type:

DataFrame

predict(X)[source]#

Project new data and predict Y on the original scale.

Returns a sklearn.utils.Bunch with fields super_scores, block_scores (dict[str, DataFrame]) and predictions (DataFrame on original Y scale).

Parameters:

X (dict[str, DataFrame])

Return type:

Bunch

set_score_request(*, sample_weight='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (MBPLS)

Returns:

self – The updated object.

Return type:

object

process_improve.multivariate.methods.randomization_test_mbpls(model, X, y, n_permutations=200, *, seed=None)[source]#

Randomization (permutation) test for component significance in MBPLS.

For each component a, the null hypothesis is “there is no real relationship between X and Y at this component”; the test permutes the rows of y, refits a fresh MBPLS with the same number of components, and recomputes the test statistic. The risk is the fraction of permutations whose statistic equals or exceeds the original model’s.

Statistic: per-component absolute correlation between the super X-score and the super Y-score, |t_super(:,a)' u_super(:,a)| / (||t|| * ||u||), matching the legacy ConnectMV randomization-objective for PLS.

Parameters:
  • model (MBPLS) – A fitted MBPLS model.

  • X (dict[str, DataFrame], DataFrame) – The same training data used to fit model.

  • y (dict[str, DataFrame], DataFrame) – The same training data used to fit model.

  • n_permutations (int, default=200) – Number of Y-row permutations to evaluate.

  • seed (int or None, default=None) – Seed for the permutation RNG (None uses non-reproducible randomness).

Returns:

Indexed by component 1..A with columns:

  • observed : the actual model’s per-component statistic.

  • risk_pct : fraction (in %) of permutations with statistic >= observed. Low values (e.g. < 5%) suggest the component is significant; values near 50% suggest the component is no better than chance.

Return type:

pd.DataFrame

References

Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S. & Faber, K. A randomization test for PLS component selection. J. Chemometrics, 21 (2007), 427-439.
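Examples

A short sketch (mbpls, X and y as in the MBPLS class example; 100 permutations for speed):

>>> from process_improve.multivariate.methods import randomization_test_mbpls
>>> test = randomization_test_mbpls(mbpls, X, y, n_permutations=100, seed=0)
>>> test["risk_pct"]         # low values (< 5%) suggest a significant component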

MBPCA#

Multi-block PCA / consensus-PCA. Same dict-of-DataFrames API as MBPLS; no Y-block.

class process_improve.multivariate.methods.MBPCA(n_components, *, max_iter=500, tol=None)[source]#

Bases: TransformerMixin, BaseEstimator

Multi-block PCA (hierarchical / consensus PCA).

Generic multi-block PCA following the consensus-PCA / hierarchical PCA formulation of Westerhuis, Kourti & MacGregor (1998). Each X-block is preprocessed independently (mean-centred and unit-variance scaled), then divided by sqrt(K_b) so blocks of unequal width contribute fairly to the consensus super-score.

The hierarchical NIPALS loop alternates: (i) regress each block on the super-score to get block loadings and block scores, (ii) collect block scores into a super-block, (iii) regress the super-block to get a new super-score / super-loading, repeat to convergence. After convergence, deflate every block by the super-score and the corresponding block loading scaled by the super-loading element.

Parameters:
  • n_components (int) – Number of components to extract.

  • max_iter (int, default=500)

  • tol (float or None, default=None) – Convergence tolerance on the super-score change. None uses np.finfo(float).eps ** (9/10) (matches the legacy reference).

Attributes (after fitting):

  • block_names (list[str]) – Ordered list of X-block names.

  • block_widths (dict[str, int]) – Number of variables in each X-block (as in MBPLS).

  • super_scores (pd.DataFrame, shape (n_samples, n_components)) – Consensus super-scores.

  • super_loadings (pd.DataFrame, shape (n_blocks, n_components)) – Super-block loadings.

  • block_scores (dict[str, pd.DataFrame]) – Per-block X-scores.

  • r2_x_per_variable (dict[str, pd.DataFrame]) – Per-variable cumulative R²X, per block.

  • r2_x_per_block_cumulative – Cumulative R²X per block.

  • r2_x_per_block_per_component (dict[str, pd.DataFrame]) – R²X per block, per component.

  • block_vip – Per-block VIP scores.

  • super_vip – Super-level VIP scores.

  • block_spe – Per-block SPE.

  • block_hotellings_t2 – Per-block Hotelling’s T².

  • super_hotellings_t2 – Super-level Hotelling’s T².

  • explained_variance – Variance of the super-score per component.

  • scaling_factor_for_super_scores – sqrt(explained_variance) per component.

  • fitting_info – Per-component iteration count and timing.

  • has_missing_data – Whether any X-block had NaN values.

Notes

The deflation step is \(X_b \leftarrow X_b - t_{\rm super}\, (p_b\,p_s[b]\,\sqrt{K_b})^\top\), derived in Westerhuis et al. 1998. The legacy MATLAB mbpca.m had this step marked as broken by the original author; this implementation re-derives it directly from the paper and is independently validated against the pure-numpy reference oracles in the test suite.

References

Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. J. Chemometrics, 12 (1998), 301-321.
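Examples

A minimal sketch with the same dict-of-DataFrames input as MBPLS (random values for illustration only):

>>> import numpy as np
>>> import pandas as pd
>>> from process_improve.multivariate.methods import MBPCA
>>> rng = np.random.default_rng(7)
>>> X = {
...     "Block A": pd.DataFrame(rng.standard_normal((25, 5))),
...     "Block B": pd.DataFrame(rng.standard_normal((25, 8))),
... }
>>> mbpca = MBPCA(n_components=2).fit(X)
>>> mbpca.transform(X).shape
(25, 2)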

fit(X, y=None)[source]#

Fit the multi-block PCA model.

Parameters:
  • X (dict[str, pd.DataFrame]) – X-blocks; keys are block names, values share the same row index.

  • y (ignored)

Return type:

MBPCA

transform(X)[source]#

Project new data to super-scores using the fitted model.

Parameters:

X (dict[str, DataFrame])

Return type:

DataFrame

predict(X)[source]#

Project new data; return super_scores, block_scores, block_spe, hotellings_t2.

Parameters:

X (dict[str, DataFrame])

Return type:

Bunch
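Examples

For example (mbpca and X from the class example above):

>>> result = mbpca.predict(X)
>>> result.block_spe["Block A"]   # SPE per observation for one block
>>> result.super_scores           # consensus super-scores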

block_spe_limit(block, conf_level=0.95)[source]#

SPE limit for one X-block (Nomikos & MacGregor chi-square approximation).

Parameters:
  • block (str) – Name of the X-block.

  • conf_level (float, default 0.95)

Return type:

float

super_spe_limit(conf_level=0.95)[source]#

SPE limit for the merged super-block.

Parameters:

conf_level (float)

Return type:

float

spe_contributions(X)[source]#

Per-variable squared residuals for each X-block (SPE contributions).

Reconstruction matches the MBPCA deflation step: X_b = T_super @ (P_b * p_super[b] * sqrt(K_b))^T summed over components. Returns squared residuals on the preprocessed scale; sum across columns equals block_spe_[b].iloc[:, -1] ** 2.

Parameters:

X (dict[str, DataFrame])

Return type:

dict[str, DataFrame]

score_contributions(t_super_start, t_super_end=None, components=None, *, weighted=False)[source]#

Per-block per-variable contributions to a super-score movement (MBPCA).

Decomposes a super-score-space delta into preprocessed-scale variable contributions per X-block. The MBPCA back-projection mirrors the deflation step used during fit:

\[\text{contrib}_{b,j} = \sum_a (\Delta t_\mathrm{super}[a]) \cdot P_b[j, a] \cdot p_\mathrm{super}[b, a] \cdot \sqrt{K_b}\]

See MBPLS.score_contributions() for the parameter and return descriptions; the API is identical.

Parameters:
  • t_super_start, t_super_end, components, weighted – As in MBPLS.score_contributions().

Return type:

dict[str, Series]

super_score_plot(pc_horiz=1, pc_vert=2)[source]#

Scatter plot of MBPCA super-scores for two components.

Parameters:
  • pc_horiz (int)

  • pc_vert (int)

Return type:

Figure

super_loadings_bar_plot(component=1)[source]#

Bar plot of MBPCA super-loadings for a single component.

Parameters:

component (int)

Return type:

Figure

display_results(show_cumulative=True)[source]#

Format a short text summary of per-block R²X, iterations and timing.

Parameters:

show_cumulative (bool)

Return type:

str

Preprocessing#

class process_improve.multivariate.methods.MCUVScaler[source]#

Bases: BaseEstimator, TransformerMixin

Mean centering and scaling to unit variance (MCUV). The default scaler in sklearn does not handle small datasets accurately, because of the degrees of freedom (ddof) used in the standard deviation; this class uses the sample standard deviation (ddof=1).
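Examples

A round-trip sketch (X is any numeric DataFrame):

>>> from process_improve.multivariate.methods import MCUVScaler
>>> scaler = MCUVScaler().fit(X)
>>> X_mcuv = scaler.transform(X)               # mean-centred, unit variance
>>> X_back = scaler.inverse_transform(X_mcuv)  # back to the original scale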

fit(X)[source]#

Get the centering and scaling object constants.

Parameters:

X (ndarray | DataFrame)

Return type:

MCUVScaler

transform(X)[source]#

Do work of the transformation.

Parameters:

X (ndarray | DataFrame)

Return type:

DataFrame

inverse_transform(X)[source]#

Do the inverse transformation.

Parameters:

X (ndarray | DataFrame)

Return type:

DataFrame

process_improve.multivariate.methods.center(X, func=np.mean, axis=0, extra_output=False)[source]#

Perform centering of data, using a function func (default: np.mean). The function, if supplied, must return a vector with as many columns as the matrix X.

axis [optional; default=0] {integer or None}

This specifies the axis along which the centering vector will be calculated if not provided. The function is applied along the axis: 0=down the columns; 1 = across the rows.

Missing values: The sample mean is computed by taking the sum along the axis, skipping any missing data, and dividing by N = number of values which are present. Values which were missing before, are left as missing after.

Parameters:
  • X (ndarray | DataFrame)

  • func (callable, default=np.mean)

  • axis (int or None, default=0)

  • extra_output (bool, default=False)

Return type:

ndarray | DataFrame

process_improve.multivariate.methods.scale(X, func=np.std, axis=0, extra_output=False, **kwargs)[source]#

Scales the data (does NOT do any centering); scales to unit variance by default.

func [optional; default=np.std] {a function}

The default (np.std) will use NumPy to calculate the sample standard deviation of the data, and use that as scale.

TODO: provide a scaling vector.

The sample standard deviation is computed along the required axis, skipping over any missing data, and dividing by N-1, where N = the number of values present, i.e. not counting missing values.

axis [optional; default=0] {integer}

Transformations are applied on slices of data. This specifies the axis along which the transformation will be applied.

markers [optional; default=None]
    A vector (or slice) used to store indices (the markers) where the variance needs to be replaced with low_variance_replacement.

variance_tolerance [optional; default=1E-7] {floating point}
    A slice is considered to have no variance when the actual variance of that slice is smaller than this value.

low_variance_replacement [optional; default=0.0] {floating point}
    Used to replace values in the output where the markers indicate no or low variance.

Usage#

X = ...  # data matrix
X = scale(center(X))

# Using a robust scale estimator instead of the standard deviation:
from scipy.stats import median_abs_deviation
X = scale(center(X), func=median_abs_deviation)

Parameters:
  • X (ndarray | DataFrame)

  • func (callable, default=np.std)

  • axis (int, default=0)

  • extra_output (bool, default=False)

  • **kwargs

Return type:

ndarray | DataFrame

Plots#

process_improve.multivariate.plots.score_plot(model, pc_horiz=1, pc_vert=2, pc_depth=-1, items_to_highlight=None, settings=None, fig=None)[source]#

Generate a 2-dimensional score plot for the given latent variable model.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.

  • pc_horiz (int, optional) – Which component to plot on the horizontal axis, by default 1 (the first component)

  • pc_vert (int, optional) – Which component to plot on the vertical axis, by default 2 (the second component)

  • pc_depth (int, optional) – If pc_depth >= 1, then a 3D score plot is generated, with this component on the 3rd axis

  • items_to_highlight (dict, optional) –

    Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:

    items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
    

    will highlight the items in items_in_red with the given colour and shape.

  • settings (dict) –

    Default settings:

    {
        "show_ellipse": True,          # bool: show the Hotelling's T2 ellipse
        "ellipse_conf_level": 0.95,    # float: ellipse confidence level (< 1.00)
        "title": "Score plot of ...",  # str: overall plot title
        "show_labels": False,          # bool: add a label for each observation
        "show_legend": True,           # bool: show clickable legend
        "html_image_height": 500,      # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
    }
    

  • fig (Figure | None)

Return type:

Figure

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> pca.score_plot()                          # PC1 vs PC2
>>> pca.score_plot(pc_horiz=1, pc_vert=3)     # PC1 vs PC3
>>> pca.score_plot(pc_horiz=1, pc_vert=2, pc_depth=3)  # 3D
process_improve.multivariate.plots.loading_plot(model, loadings_type='p', pc_horiz=1, pc_vert=2, settings=None, fig=None)[source]#

Generate a 2-dimensional loadings plot for the given latent variable model.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.

  • loadings_type (str, optional) –

    A choice of the following:

    'p'   : (default for PCA) the P (projection) loadings; the only option possible for PCA
    'w'   : the W loadings; suitable for PLS
    'w*'  : (default for PLS) the W* (or R) loadings; suitable for PLS
    'w*c' : the W* loadings (from X-space) with the C loadings from the Y-space; suitable for PLS
    'c'   : the C loadings from the Y-space; suitable for PLS

    For a PCA model, any choice besides 'p' will be ignored.

  • pc_horiz (int, optional) – Which component to plot on the horizontal axis, by default 1 (the first component)

  • pc_vert (int, optional) – Which component to plot on the vertical axis, by default 2 (the second component)

  • settings (dict) –

    Default settings:

    {
        "title": "Loadings plot ...",  # str: overall plot title
        "show_labels": True,           # bool: add a label for each variable
        "html_image_height": 500,      # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
    }
    

  • fig (Figure | None)

Return type:

Figure

Examples

>>> pca.loading_plot()                                 # P loadings, PC1 vs PC2
>>> pls.loading_plot(loadings_type="w*c")              # W* and C loadings
>>> pls.loading_plot(loadings_type="w", pc_vert=3)     # W loadings, PC1 vs PC3
process_improve.multivariate.plots.spe_plot(model, with_a=-1, items_to_highlight=None, settings=None, fig=None)[source]#

Generate a squared prediction error (SPE) plot for the given latent variable model, using with_a latent variables. By default, all fitted latent variables are used.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.

  • with_a (int, optional) – Uses this many latent variables, and therefore shows the SPE after this number of model components. By default, all fitted components are used.

  • items_to_highlight (dict, optional) –

    Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:

    items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
    

    will highlight the items in items_in_red with the given colour and shape.

  • settings (dict) –

    Default settings:

    {
        "show_limit": True,            # bool: show the SPE confidence limit line
        "conf_level": 0.95,            # float: confidence level for limit (< 1.00)
        "title": "SPE plot ...",        # str: overall plot title
        "default_marker": {...},        # dict: e.g. dict(color="darkblue", size=7)
        "show_labels": False,           # bool: add a label for each observation
        "show_legend": False,           # bool: show clickable legend
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
    }
    

  • fig (Figure | None)

Return type:

Figure

Examples

>>> pca.spe_plot()
>>> pca.spe_plot(settings={"conf_level": 0.99, "show_labels": True})
process_improve.multivariate.plots.t2_plot(model, with_a=-1, items_to_highlight=None, settings=None, fig=None)[source]#

Generate a Hotelling’s T² plot for the given latent variable model, using with_a latent variables. By default, all fitted latent variables are used.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.

  • with_a (int, optional) – Uses this many latent variables, and therefore shows the T² after this number of model components. By default, all fitted components are used.

  • items_to_highlight (dict, optional) –

    Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:

    items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
    

    will highlight the items in items_in_red with the given colour and shape.

  • settings (dict) –

    Default settings:

    {
        "show_limit": True,            # bool: show the T2 confidence limit line
        "conf_level": 0.95,            # float: confidence level for limit (< 1.00)
        "title": "T2 plot ...",         # str: overall plot title
        "default_marker": {...},        # dict: e.g. dict(color="darkblue", size=7)
        "show_labels": False,           # bool: add a label for each observation
        "show_legend": False,           # bool: show clickable legend
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
    }
    

  • fig (Figure | None)

Return type:

Figure

Examples

>>> pca.t2_plot()
>>> pca.t2_plot(settings={"conf_level": 0.99, "show_labels": True})