Multivariate Analysis#
Models#
PCA#
- class process_improve.multivariate.methods.PCA(n_components, *, algorithm='auto', missing_data_settings=None)[source]#
Bases: TransformerMixin, BaseEstimator
Principal Component Analysis with support for missing data.
- Parameters:
n_components (int) – Number of principal components to extract.
algorithm (str, default="auto") – Algorithm to use for fitting the model.
"auto": uses SVD when data is complete, NIPALS when data has missing values.
"svd": Singular Value Decomposition. Requires complete data.
"nipals": Non-linear Iterative Partial Least Squares. Handles missing data.
"tsr": Trimmed Score Regression. Handles missing data.
missing_data_settings (dict or None, default=None) – Settings for iterative missing data algorithms (NIPALS, TSR). Keys: md_tol (convergence tolerance), md_max_iter (max iterations).
Attributes (after fitting)
--------------------------
scores (pd.DataFrame of shape (n_samples, n_components)) – The score matrix (T).
loadings (pd.DataFrame of shape (n_features, n_components)) – The loading matrix (P).
r2_per_component (pd.Series of length n_components) – Fractional R² explained by each component.
r2_cumulative (pd.Series of length n_components) – Cumulative R² after each component.
r2_per_variable (pd.DataFrame of shape (n_features, n_components)) – Per-variable cumulative R² after each component.
spe (pd.DataFrame of shape (n_samples, n_components)) – Squared Prediction Error (stored as sqrt of row sum-of-squares).
hotellings_t2 (pd.DataFrame of shape (n_samples, n_components)) – Cumulative Hotelling’s T² statistic.
explained_variance (np.ndarray of shape (n_components,)) – Variance explained by each component.
scaling_factor_for_scores (pd.Series of length n_components) – Standard deviation per score (sqrt of explained variance).
has_missing_data (bool) – Whether the training data contained missing values.
fitting_info (dict) – Timing and iteration info from the fitting algorithm.
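The attribute relationships above (scores and loadings from the SVD route, fractional and cumulative R²) can be sketched with plain numpy. This is an illustrative sketch of the complete-data case only, not the library's implementation; all variable names are local to the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # mean-center, unit-variance

# SVD route ("svd", or "auto" with no missing values)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
n_components = 2
scores = U[:, :n_components] * S[:n_components]    # score matrix T
loadings = Vt[:n_components].T                     # loading matrix P

# Fractional R² per component and its cumulative sum
total_ss = (Xs ** 2).sum()
r2_per_component = (S[:n_components] ** 2) / total_ss
r2_cumulative = np.cumsum(r2_per_component)
```

Each entry of `r2_per_component` is the fraction of the total sum of squares captured by that component, so the cumulative series is monotonically increasing, mirroring the `r2_per_component` and `r2_cumulative` attributes.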
- predict(X)[source]#
Project new data and compute diagnostics (scores, Hotelling’s T², SPE).
- Parameters:
X (array-like of shape (n_samples, n_features))
- Returns:
result – With keys scores, hotellings_t2, spe.
- Return type:
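What the projection step computes can be sketched in numpy for the complete-data case: new observations are projected through the loadings, and SPE is stored as the square root of the residual row sum-of-squares (as noted under the spe attribute above). The matrices here are synthetic and illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.standard_normal((40, 4))
U, S, Vt = np.linalg.svd(X_train - X_train.mean(axis=0), full_matrices=False)
P = Vt[:2].T                                   # loadings from a 2-component fit

X_new = rng.standard_normal((5, 4)) - X_train.mean(axis=0)
t_new = X_new @ P                              # projected scores
residual = X_new - t_new @ P.T                 # part not captured by the model
spe = np.sqrt((residual ** 2).sum(axis=1))     # SPE on the sqrt scale, as stored
```

Because the loading columns are orthonormal, the residual is orthogonal to the model plane, which is what makes SPE a clean "distance off the model" diagnostic.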
- score(X, y=None)[source]#
Negative mean squared reconstruction error (higher is better).
Follows the sklearn convention where higher scores indicate better model fit. This makes PCA compatible with cross_val_score, GridSearchCV, and other sklearn model-selection utilities.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test data to score.
y (ignored)
- Returns:
score – Negative mean squared reconstruction error.
- Return type:
Examples
>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(PCA(n_components=2), X_scaled, cv=5)
>>> print(f"Mean CV score: {scores.mean():.4f}")
- classmethod select_n_components(X, *, max_components=None, cv=5, threshold=0.95, **pca_kwargs)[source]#
Select the number of components via PRESS cross-validation.
Fits PCA models with 1, 2, …, max_components components, evaluates each with K-fold cross-validation, and recommends the optimal number using Wold’s criterion: stop adding components when PRESS_a / PRESS_{a-1} > threshold.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
max_components (int, optional) – Maximum number of components to evaluate. Default is min(n_samples - 1, n_features).
cv (int or sklearn CV splitter, default 5) – Number of cross-validation folds, or an sklearn splitter object (e.g., KFold(n_splits=10, shuffle=True)).
threshold (float, default 0.95) – Wold’s criterion threshold. If PRESS_a / PRESS_{a-1} exceeds this value, component a is deemed not significant. Lower values are more aggressive (fewer components).
**pca_kwargs – Additional keyword arguments passed to the PCA() constructor (e.g., algorithm="nipals" for data with missing values).
- Returns:
result – With keys:
n_components – recommended number of components (int)
press – PRESS per component count (pd.Series, indexed 1..A_max)
press_ratio – PRESS_a / PRESS_{a-1} (pd.Series, indexed 2..A_max)
cv_scores – per-fold scores (pd.DataFrame, A_max rows x cv cols)
- Return type:
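Wold's criterion itself is simple arithmetic on the returned press series. A short sketch with made-up PRESS values (the names press and press_ratio mirror the return keys above; the numbers are hypothetical):

```python
import pandas as pd

# Hypothetical PRESS values per component count (smaller is better)
press = pd.Series([120.0, 60.0, 58.0, 57.5], index=[1, 2, 3, 4], name="press")
press_ratio = (press / press.shift(1)).dropna()   # PRESS_a / PRESS_{a-1}, indexed 2..4

threshold = 0.95
# Wold's criterion: the first component whose ratio exceeds the threshold
# adds too little predictive value; recommend stopping just before it.
exceeded = press_ratio[press_ratio > threshold]
n_components = int(exceeded.index[0] - 1) if len(exceeded) else int(press.index[-1])
```

Here component 2 halves PRESS (ratio 0.5) but component 3 barely improves it (ratio ≈ 0.967 > 0.95), so two components are recommended.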
- score_contributions(t_start, t_end=None, components=None, *, weighted=False)[source]#
Contribution of each variable to a score-space movement.
Decomposes the difference (t_end - t_start) back into variable space via the loadings for the selected components.
- Parameters:
t_start (array-like of shape (n_components,)) – Score vector of the observation of interest. Can be a row from self.scores_ or from predict(X_new).scores.
t_end (array-like of shape (n_components,), optional) – Reference point in score space. Default is the model center (a vector of zeros), which is the most common choice.
components (list of int, optional) – 1-based component indices to decompose over, matching the model’s column convention. Examples: [2, 3] for a PC2-vs-PC3 score plot, or None (default) for all components - appropriate for Hotelling’s T² contributions.
weighted (bool, default False) – If True, scale the score difference by 1/sqrt(explained_variance) per component before back-projecting. This gives contributions to the T² statistic rather than to the Euclidean score distance.
- Returns:
contributions – One value per variable; sign indicates direction. Index contains the variable (feature) names.
- Return type:
pd.Series of shape (n_features,)
Examples
>>> pca = PCA(n_components=3).fit(X_scaled)
>>> # Why does observation 0 differ from the model center?
>>> contrib = pca.score_contributions(pca.scores_.iloc[0].values)
>>> # Which variables drive the difference between obs 0 and obs 5?
>>> contrib = pca.score_contributions(
...     pca.scores_.iloc[0].values, pca.scores_.iloc[5].values
... )
- detect_outliers(conf_level=0.95)[source]#
Detect outlier observations using SPE and Hotelling’s T² diagnostics.
Combines two approaches:
Statistical limits – observations exceeding the SPE or T² limit at conf_level are flagged.
Robust ESD test – the generalized ESD test (with robust median/MAD variant) identifies observations that are unusual relative to the rest of the data, even if they fall below the statistical limit.
An observation can be flagged for one or both reasons.
- Parameters:
conf_level (float, default 0.95) – Confidence level in [0.8, 0.999]. Controls both the statistical limits and the ESD test’s significance level (alpha = 1 - conf_level).
- Returns:
outliers – Sorted from most severe to least. Each dict contains:
observation – index label of the observation
outlier_types – list of "spe" and/or "hotellings_t2"
spe – SPE value for this observation
hotellings_t2 – T² value for this observation
spe_limit – SPE limit at the given confidence level
hotellings_t2_limit – T² limit at the given confidence level
severity – max(spe/spe_limit, t2/t2_limit)
- Return type:
Examples
>>> pca = PCA(n_components=3).fit(X_scaled)
>>> outliers = pca.detect_outliers(conf_level=0.95)
>>> for o in outliers:
...     print(f"{o['observation']}: {o['outlier_types']} (severity={o['severity']})")
PLS#
- class process_improve.multivariate.methods.PLS(n_components, *, scale=True, max_iter=1000, tol=np.float64(1.4901161193847656e-08), copy=True, missing_data_settings=None)[source]#
Bases: RegressorMixin, TransformerMixin, BaseEstimator
Projection to Latent Structures (PLS) regression with diagnostics.
Implements PLS via the NIPALS algorithm with production diagnostics: SPE, Hotelling’s T², score contributions, and outlier detection. The API mirrors PCA so that model.scores_, model.spe_, and model.detect_outliers() work identically for both model types.
- Parameters:
n_components (int) – Number of latent components to extract.
scale (bool, default=True) – Whether to scale X and Y to unit variance (sklearn internal scaling). When using MCUVScaler externally, set scale=False to avoid double scaling.
max_iter (int, default=1000) – Maximum number of iterations for the NIPALS algorithm.
tol (float, default=sqrt(machine epsilon)) – Convergence tolerance for the NIPALS algorithm.
copy (bool, default=True) – Whether to copy X and Y before fitting.
missing_data_settings (dict or None, default=None) – Settings for missing data algorithms (NIPALS/TSR for PLS). Keys: md_method ("tsr", "scp", "nipals"), md_tol, md_max_iter.
Attributes (after fitting)
--------------------------
scores (pd.DataFrame of shape (n_samples, n_components)) – X-block score matrix (T). This is the primary score matrix; equivalent to x_scores in older versions.
y_scores (pd.DataFrame of shape (n_samples, n_components)) – Y-block score matrix (U).
x_loadings (pd.DataFrame of shape (n_features, n_components)) – X-block loading matrix (P).
y_loadings (pd.DataFrame of shape (n_targets, n_components)) – Y-block loading matrix (C).
x_weights (pd.DataFrame of shape (n_features, n_components)) – X-block weight matrix (W).
y_weights (pd.DataFrame of shape (n_targets, n_components)) – Y-block weight matrix.
direct_weights (pd.DataFrame of shape (n_features, n_components)) – Direct (W*) weights: W (P'W)^{-1}. Used for direct projection T = X @ W*.
beta_coefficients (pd.DataFrame of shape (n_features, n_targets)) – Regression coefficients linking X directly to Y.
predictions (pd.DataFrame of shape (n_samples, n_targets)) – Y predictions from the training data.
spe (pd.DataFrame of shape (n_samples, n_components)) – Squared Prediction Error (stored as sqrt of row sum-of-squares).
hotellings_t2 (pd.DataFrame of shape (n_samples, n_components)) – Cumulative Hotelling’s T² statistic.
r2_per_component (pd.Series of length n_components) – Fractional R² (on Y) explained by each component.
r2_cumulative (pd.Series of length n_components) – Cumulative R² (on Y) after each component.
r2_per_variable (pd.DataFrame of shape (n_features, n_components)) – Per-variable cumulative R² for X after each component.
r2y_per_variable (pd.DataFrame of shape (n_targets, n_components)) – Per-variable R² for Y after each component.
rmse (pd.DataFrame of shape (n_targets, n_components)) – Root mean squared error of Y predictions per component.
explained_variance (np.ndarray of shape (n_components,)) – Variance explained by each component in X.
scaling_factor_for_scores (pd.Series of length n_components) – Standard deviation per score (sqrt of explained variance).
has_missing_data (bool) – Whether the training data contained missing values.
fitting_info (dict) – Timing and iteration info from the fitting algorithm.
See also
PCA – Principal Component Analysis.
MCUVScaler – Mean-center unit-variance scaler.
References
Abdi, “Partial least squares regression and projection on latent structure regression (PLS Regression)”, 2010, DOI: 10.1002/wics.51
Examples
>>> import pandas as pd
>>> from process_improve.multivariate.methods import PLS, MCUVScaler
>>> X = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 3, 2, 1]})
>>> Y = pd.DataFrame({"y": [2.1, 3.9, 6.2, 7.8]})
>>> pls = PLS(n_components=1)
>>> pls = pls.fit(MCUVScaler().fit_transform(X), MCUVScaler().fit_transform(Y))
>>> pls.scores_.shape
(4, 1)
- fit(X, Y)[source]#
Fit a projection to latent structures (PLS) model to the data.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples (rows) and n_features is the number of features (columns).
Y (array-like, shape (n_samples, n_targets)) – Training data, where n_samples is the number of samples (rows) and n_targets is the number of target outputs (columns).
- Returns:
Model object.
- Return type:
References
Abdi, “Partial least squares regression and projection on latent structure regression (PLS Regression)”, 2010, DOI: 10.1002/wics.51
- transform(X, Y=None)[source]#
Project X (and optionally Y) into the latent space.
- Parameters:
X (array-like of shape (n_samples, n_features))
Y (array-like of shape (n_samples, n_targets), optional)
- Returns:
X_scores – Projected X data (scores).
- Return type:
pd.DataFrame of shape (n_samples, n_components)
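Projection into the latent space is the direct projection T = X @ W* with W* = W (P'W)^{-1}, as listed under the direct_weights attribute. A numpy sketch with synthetic weight and loading matrices (illustrative only), showing the defining identity P'W* = I that makes one-shot projection equivalent to the component-by-component deflation loop:

```python
import numpy as np

rng = np.random.default_rng(2)
K, A = 5, 2
W = rng.standard_normal((K, A))                # X-block weights (stand-in values)
P = rng.standard_normal((K, A))                # X-block loadings (stand-in values)
W_star = W @ np.linalg.inv(P.T @ W)            # direct weights W* = W (P'W)^{-1}

X = rng.standard_normal((10, K))
T = X @ W_star                                 # one-shot projection, no deflation loop
```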
- predict(X)[source]#
Project new data and compute diagnostics.
- Parameters:
X (array-like of shape (n_samples, n_features))
- Returns:
result – With keys scores, hotellings_t2, spe, y_hat.
- Return type:
Examples
>>> result = pls.predict(scaler_x.transform(X_new))
>>> result.y_hat            # Predicted Y values
>>> result.spe              # SPE for each new observation
>>> result.hotellings_t2    # T² for each new observation
- score_contributions(t_start, t_end=None, components=None, *, weighted=False)[source]#
Contribution of each X variable to a score-space movement.
Identical to PCA.score_contributions but uses the PLS X-loadings (P matrix) for back-projection.
- Parameters:
t_start (array-like of shape (n_components,)) – Score vector of the observation of interest.
t_end (array-like of shape (n_components,), optional) – Reference point in score space. Default is the model center (zeros).
components (list of int, optional) – 1-based component indices. Default: all components.
weighted (bool, default False) – If True, scale by 1/sqrt(explained_variance) for T² contributions.
- Returns:
contributions
- Return type:
pd.Series of shape (n_features,)
Examples
>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled)
>>> contrib = pls.score_contributions(pls.scores_.iloc[0].values)
>>> contrib.abs().sort_values(ascending=False).head()  # top contributors
- detect_outliers(conf_level=0.95)[source]#
Detect outlier observations using SPE and Hotelling’s T² diagnostics.
Same approach as PCA.detect_outliers: combines statistical limits with the robust generalized ESD test.
- Parameters:
conf_level (float, default 0.95) – Confidence level in [0.8, 0.999].
- Returns:
outliers – Sorted from most severe to least. Each dict contains observation, outlier_types, spe, hotellings_t2, spe_limit, hotellings_t2_limit, severity.
- Return type:
Examples
>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled)
>>> outliers = pls.detect_outliers(conf_level=0.95)
>>> for o in outliers:
...     print(f"{o['observation']}: {o['outlier_types']}")
- set_score_request(*, sample_weight='$UNCHANGED$')#
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
TPLS#
- class process_improve.multivariate.methods.TPLS(n_components, d_matrix, max_iter=500, skip_f_matrix_preprocessing=False)[source]#
Bases: RegressorMixin, BaseEstimator
TPLS algorithm for T-shaped data structures (we also include standard pre-processing of the data inside this class).
Source: Garcia-Munoz, https://doi.org/10.1016/j.chemolab.2014.02.006, Chem.Intell.Lab.Sys. v133, p 49 to 62, 2014.
We change the notation from the original paper to avoid confusion with a generic “X” matrix, and match symbols that are more natural for our use.
Notation mapping (paper → this code):
X^T → D: d_matrix (external), d_mats (internal) – Database of properties
X → D^T: transposed D (not used directly)
R → F: f_mats – Formula matrices
Z → Z: z_mats – Process conditions
Y → Y: y_mats – Quality indicators
Notes
1. Matrices in F, Z and Y must all have the same number of rows.
2. Columns in F must be the same as the rows in D.
3. Conditions in Z may be missing (turning it into an L-shaped data structure).
- Parameters:
n_components (int) – Number of components to extract.
d_matrix (dict[str, dict[str, pd.DataFrame]]) – A dictionary containing the properties of each group of materials. Keys are group names; values are DataFrames with properties as columns and materials as rows. This “D” matrix is provided once at construction and reused for fitting, prediction and cross-validation.
max_iter (int, optional) – The maximum number of iterations for the TPLS algorithm. Default is 500.
skip_f_matrix_preprocessing (bool)
Notes
The input X is a dictionary with 4 keys:
D: Database of DataFrames (properties in columns, materials in rows). D = {"Group A": df_props_a, "Group B": df_props_b, ...}
F: Formula matrices (rows = blends, columns = materials). F = {"Group A": df_formulas_a, "Group B": df_formulas_b, ...}
Z: Process conditions – one row per blend, one column per condition.
Y: Product quality indicators – one row per blend, one column per indicator.
- n_substances#
The number of substances (columns) in the training data, i.e. the number of materials in the F matrix.
- Type:
int
Example
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng()
>>> n_props_a, n_props_b = 6, 4  # Two groups of properties: A and B.
>>> n_materials_a, n_materials_b = 12, 8  # Number of materials in each group.
>>> n_formulas = 40  # Number of formulas in matrix F.
>>> n_outputs = 3
>>> n_conditions = 2
>>> properties = {
...     "Group A": pd.DataFrame(rng.standard_normal((n_materials_a, n_props_a))),
...     "Group B": pd.DataFrame(rng.standard_normal((n_materials_b, n_props_b))),
... }
>>> formulas = {
...     "Group A": pd.DataFrame(rng.standard_normal((n_formulas, n_materials_a))),
...     "Group B": pd.DataFrame(rng.standard_normal((n_formulas, n_materials_b))),
... }
>>> process_conditions = {"Conditions": pd.DataFrame(rng.standard_normal((n_formulas, n_conditions)))}
>>> quality_indicators = {"Quality": pd.DataFrame(rng.standard_normal((n_formulas, n_outputs)))}
>>> all_data = {"Z": process_conditions, "D": properties, "F": formulas, "Y": quality_indicators}
>>> estimator = TPLS(n_components=4, d_matrix=properties)
>>> estimator = estimator.fit(all_data)
- fit(X, y=None)[source]#
Fit the preprocessing parameters and also the latent variable model from the training data.
- Parameters:
X ({dictionary of dataframes}, keys that must be present: "D", "F", "Z", and "Y") – The training input samples. See documentation in the class definition for more information on each matrix.
y (None)
- Returns:
self – Returns self.
- Return type:
- predict(X)[source]#
Model inference on new data.
This will pre-process the new data and apply those subsequently to the latent variable model.
Example
# Training phase:
estimator = TPLS(n_components=2, d_matrix=d_matrix).fit(training_data)
# Testing/inference phase:
new_data = {"Z": ..., "F": ...}  # at least the F block is needed for a new prediction; "Z" is optional
predictions = estimator.predict(new_data)
- Parameters:
X (DataFrameDict) – The input samples.
- Returns:
y – An array of prediction objects.
- Return type:
- score(X, y=None, sample_weight=None)[source]#
Return the r2_score on test data.
See RegressorMixin.score for more details.
- Parameters:
X (DataFrameDict) – Test samples.
y (Not used. The X input already contains a “Y” block; that serves as the Y-data.)
sample_weight (Not used.)
- Returns:
score – \(R^2\) of self.predict(X).
- Return type:
- help()[source]#
Help for the TPLS Estimator.
Data organization#
Quick tips#
Build model: tpls = TPLS(n_components=2, d_matrix=d_matrix).fit(X)
Get model’s predictions: tpls.hat  <-- the hat-matrix, i.e., the predictions
Predict on new data: tpls.predict(X_new)
See model summary: tpls.display_results()
This help page: tpls.help()
Statistical values#
.t_scores_super – Super scores for the entire model [pd.DataFrame]
.hotellings_t2 – Hotelling’s T² values for each observation, per component [pd.DataFrame]
.spe – Squared prediction error for each block [dict of pd.DataFrames]
.hotellings_t2_limit() – Returns the Hotelling’s T² limit for the model [float]
.spe_limit[block]() – Return the SPE limit for the block; e.g. .spe_limit["Y"]() [float]
- Return type:
- vip(block=None)[source]#
Return Variable Importance in Projection (VIP) scores for TPLS blocks.
VIP scores are computed during fitting for the D-block (material properties) and F-block (formulation variables) and stored in feature_importance.
- Parameters:
block (str or None, default=None) – Which block to return. Must be "D" or "F", or None to return all blocks.
- Returns:
If block is None: {"D": {group: pd.Series, ...}, "F": {group: pd.Series, ...}}. If block is "D" or "F": the inner dict {group: pd.Series, ...} for that block, where each pd.Series is indexed by feature names.
- Return type:
- Raises:
ValueError – If the model is not fitted or block is not "D", "F", or None.
Examples
>>> tpls = TPLS(...).fit(data)
>>> tpls.vip()     # all blocks
>>> tpls.vip("D")  # D-block only → {group_name: pd.Series, ...}
- set_score_request(*, sample_weight='$UNCHANGED$')#
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
MBPLS#
Multi-block PLS in the hierarchical / superblock formulation of
Westerhuis, Kourti & MacGregor (1998). Each X-block is preprocessed
independently and weighted by 1/sqrt(K_b) before the inner NIPALS
loop, so blocks of unequal width contribute fairly to the consensus
super-score.
- class process_improve.multivariate.methods.MBPLS(n_components, *, max_iter=500, tol=None)[source]#
Bases: RegressorMixin, BaseEstimator
Multi-block PLS (hierarchical / superblock formulation).
Generic multi-block PLS as described by Westerhuis, Kourti & MacGregor (1998) and Westerhuis & Smilde (2001). Each X-block is preprocessed independently (mean-centred and unit-variance scaled), then divided by sqrt(K_b) so that blocks of unequal width contribute fairly to the super-score.
- Parameters:
n_components (int) – Number of latent variables to extract.
max_iter (int, default=500) – Maximum NIPALS iterations per latent variable.
tol (float or None, default=None) – Convergence tolerance on the change in the Y-block score. If None, np.finfo(float).eps ** (6/7) is used (matching the legacy multi-block reference implementation).
Attributes (after fitting)
--------------------------
block_names (list[str]) – Ordered list of X-block names (the keys of the input dict).
block_widths (dict[str, int]) – Number of variables in each X-block.
super_scores (pd.DataFrame, shape (n_samples, n_components)) – Super-block (consensus) X-scores T.
super_y_scores (pd.DataFrame, shape (n_samples, n_components)) – Super-block Y-scores U.
super_weights (pd.DataFrame, shape (n_blocks, n_components)) – Super-block weights w_super; rows indexed by block name.
super_y_loadings (pd.DataFrame, shape (n_targets, n_components)) – Y-block loadings c.
block_scores (dict[str, pd.DataFrame]) – Per-block X-scores t_b, each shape (n_samples, n_components).
block_weights (dict[str, pd.DataFrame]) – Per-block X-weights w_b, each shape (K_b, n_components). Each column has unit norm.
block_loadings (dict[str, pd.DataFrame]) – Per-block X-loadings p_b (used for deflation), each shape (K_b, n_components).
predictions (pd.DataFrame, shape (n_samples, n_targets)) – In-sample Y predictions on the original scale.
explained_variance (np.ndarray, shape (n_components,)) – Variance of the super-score per component (ddof=1).
scaling_factor_for_super_scores (pd.Series) – sqrt(explained_variance_) per component.
fitting_info (dict) – Per-component iteration count and timing.
has_missing_data (bool) – Whether any X-block or Y had NaN values.
Notes
Block weighting uses the convention \(X_b / \sqrt{K_b}\) so that every block contributes the same total sum of squares to the super-score, regardless of how many variables it has.
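The effect of this convention can be checked numerically: after per-block autoscaling, dividing by sqrt(K_b) leaves every block with the same total sum of squares regardless of its width. A small numpy sketch (block names and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
blocks = {"narrow": rng.standard_normal((30, 3)), "wide": rng.standard_normal((30, 12))}

weighted = {}
for name, Xb in blocks.items():
    Xb = (Xb - Xb.mean(axis=0)) / Xb.std(axis=0, ddof=1)   # per-block autoscaling
    weighted[name] = Xb / np.sqrt(Xb.shape[1])             # divide by sqrt(K_b)

# After weighting, each block contributes the same total sum of squares
ss = {name: (Xb ** 2).sum() for name, Xb in weighted.items()}
```

Each autoscaled column has sum of squares n - 1, so a block of width K_b carries K_b (n - 1); the 1/sqrt(K_b) factor cancels the K_b, leaving n - 1 for every block.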
References
Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12 (1998), 301-321.
Westerhuis, J. A. & Smilde, A. K. Deflation in multiblock PLS. Journal of Chemometrics, 15 (2001), 485-493.
- block_spe_limit(block, conf_level=0.95)[source]#
SPE limit for one X-block using the Nomikos & MacGregor chi-square approximation.
Operates on the same scale as block_spe_[block] (sqrt of row sum of squares), so the value can be drawn directly on a SPE plot.
- super_spe_limit(conf_level=0.95)[source]#
SPE limit for the merged super-block (sum of per-block SPE squared).
- spe_contributions(X)[source]#
Per-variable squared residuals for each X-block (SPE contributions).
For each new observation and each X-block, reconstruct the block as T_super @ P_b^T (matching the deflation step used during fit) and return the squared per-variable residuals. Useful for fault diagnosis: the variable with the largest contribution is the most likely culprit for a high SPE.
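A numpy sketch of the reconstruction-and-residual step for one block (synthetic matrices, not the library's code): the row sums of the squared residuals recover the squared block SPE.

```python
import numpy as np

rng = np.random.default_rng(4)
T_super = rng.standard_normal((8, 2))          # super-scores for 8 observations
P_b = rng.standard_normal((6, 2))              # loadings of one X-block (6 variables)
X_b = rng.standard_normal((8, 6))              # preprocessed block data

residual = X_b - T_super @ P_b.T               # reconstruction via the deflation model
contrib = residual ** 2                        # per-variable SPE contributions

# Row sums of the contributions equal the squared SPE for the block
spe = np.sqrt(contrib.sum(axis=1))
```

For diagnosis, `contrib.argmax(axis=1)` per row points at the variable most responsible for that observation's SPE.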
- score_contributions(t_super_start, t_super_end=None, components=None, *, weighted=False)[source]#
Per-block per-variable contributions to a super-score movement.
The multi-block analogue of PCA.score_contributions(). Decomposes a super-score-space delta back into preprocessed-scale variable-space deltas, one per X-block.
For MBPLS, the super-score at component a is built from the per-block scores (which themselves are weighted regressions of the block on w_b), so the back-projection through w_super[b,a] * w_b[:,a] / sqrt(K_b) gives the variable contribution.
- Parameters:
t_super_start (array-like, shape (n_components,)) – Super-score row of the observation of interest. Typically a row from self.super_scores_ or from predict(X_new).super_scores.
t_super_end (array-like, optional) – Reference point in super-score space. Defaults to the model centre (zeros).
components (list of int, optional) – 1-based component indices to decompose over. None (default) uses all components - appropriate for Hotelling’s T² contributions.
weighted (bool, default=False) – If True, divide the super-score delta by sqrt(explained_variance_) per component before back-projecting, giving contributions to the T² statistic instead of the Euclidean super-score distance.
- Returns:
One Series per X-block (length K_b), indexed by variable (column) name.
- Return type:
- super_weights_bar_plot(component=1)[source]#
Bar plot of super-weights w_super for a single component.
- Parameters:
component (int)
- Return type:
Figure
- predictions_vs_observed_plot(y_observed, variable=None)[source]#
Scatter plot of predicted vs observed Y, with y=x reference and RMSEE annotation.
- Parameters:
y_observed (pd.DataFrame) – The observed Y on the original scale, same columns as the training Y.
variable (str or None, default=None) – If given, plot only that Y-variable. If None, plot the first one.
- Return type:
Figure
- display_results(show_cumulative=True)[source]#
Format a short text summary of per-block R²X, overall R²Y, iterations and timing.
- predict(X)[source]#
Project new data and predict Y on the original scale.
Returns a sklearn.utils.Bunch with fields super_scores, block_scores (dict[str, DataFrame]) and predictions (DataFrame on original Y scale).
- set_score_request(*, sample_weight='$UNCHANGED$')#
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- process_improve.multivariate.methods.randomization_test_mbpls(model, X, y, n_permutations=200, *, seed=None)[source]#
Randomization (permutation) test for component significance in MBPLS.
For each component a, the null hypothesis is “there is no real relationship between X and Y at this component”; the test permutes the rows of y, refits a fresh MBPLS with the same number of components, and recomputes the test statistic. The risk is the fraction of permutations whose statistic equals or exceeds the original model’s.
Statistic: per-component absolute correlation between the super X-score and the super Y-score, |t_super(:,a)' u_super(:,a)| / (||t|| * ||u||), matching the legacy ConnectMV randomization objective for PLS.
- Parameters:
model (MBPLS) – A fitted MBPLS model.
X (dict[str, DataFrame] or DataFrame) – The same training data used to fit model.
y (dict[str, DataFrame] or DataFrame) – The same training data used to fit model.
n_permutations (int, default=200) – Number of Y-row permutations to evaluate.
seed (int or None, default=None) – Seed for the permutation RNG (None uses non-reproducible randomness).
- Returns:
Indexed by component 1..A with columns:
observed: the actual model’s per-component statistic.
risk_pct: fraction (in %) of permutations with statistic >= observed. Low values (e.g. < 5%) suggest the component is significant; values near 50% suggest the component is no better than chance.
- Return type:
pd.DataFrame
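The statistic and the risk computation can be sketched in numpy for a single component. This uses synthetic score vectors and skips the refit that the real function performs for every permutation; it only illustrates the arithmetic behind observed and risk_pct.

```python
import numpy as np

rng = np.random.default_rng(5)
t = rng.standard_normal(30)                    # super X-score for one component
u = 0.8 * t + 0.2 * rng.standard_normal(30)    # correlated super Y-score

def statistic(t, u):
    # |t'u| / (||t|| * ||u||): absolute correlation-style objective
    return abs(t @ u) / (np.linalg.norm(t) * np.linalg.norm(u))

observed = statistic(t, u)
perm_stats = [statistic(t, rng.permutation(u)) for _ in range(200)]
risk_pct = 100.0 * np.mean([s >= observed for s in perm_stats])
```

With a strongly related t and u, almost no permutation matches the observed statistic, so risk_pct is near zero and the component would be judged significant.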
References
Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S. & Faber, K. A randomization test for PLS component selection. J. Chemometrics, 21 (2007), 427-439.
MBPCA#
Multi-block PCA / consensus-PCA. Same dict-of-DataFrames API as
MBPLS; no Y-block.
- class process_improve.multivariate.methods.MBPCA(n_components, *, max_iter=500, tol=None)[source]#
Bases: TransformerMixin, BaseEstimator
Multi-block PCA (hierarchical / consensus PCA).
Generic multi-block PCA following the consensus-PCA / hierarchical PCA formulation of Westerhuis, Kourti & MacGregor (1998). Each X-block is preprocessed independently (mean-centred and unit-variance scaled), then divided by sqrt(K_b) so blocks of unequal width contribute fairly to the consensus super-score.
The hierarchical NIPALS loop alternates: (i) regress each block on the super-score to get block loadings and block scores, (ii) collect block scores into a super-block, (iii) regress the super-block to get a new super-score / super-loading, repeat to convergence. After convergence, deflate every block by the super-score and the corresponding block loading scaled by the super-loading element.
- Parameters:
n_components (int)
max_iter (int, default=500)
tol (float or None, default=None) – Convergence tolerance on the super-score change. None uses np.finfo(float).eps ** (9/10) (matches the legacy reference).
Attributes (after fitting)
--------------------------
block_names
block_widths (as MBPLS)
super_loadings (DataFrame, B x A)
block_scores
r2_x_per_variable (dict[str, DataFrame])
r2_x_per_block_cumulative
r2_x_per_block_per_component (dict[str, DataFrame])
block_vip
super_vip
block_spe
block_hotellings_t2
super_hotellings_t2
explained_variance
scaling_factor_for_super_scores
fitting_info
has_missing_data
Notes
The deflation step is \(X_b \leftarrow X_b - t_{\rm super}\, (p_b\,p_s[b]\,\sqrt{K_b})^\top\), derived in Westerhuis et al. 1998. The legacy MATLAB mbpca.m had this step marked as broken by the original author; this implementation re-derives it directly from the paper and is independently validated against the pure-numpy reference oracles in the test suite.
References
Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. J. Chemometrics, 12 (1998), 301-321.
- block_spe_limit(block, conf_level=0.95)[source]#
SPE limit for one X-block (Nomikos & MacGregor chi-square approximation).
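The chi-square approximation matches the first two moments of the SPE distribution. A sketch of the moment-matching step, assuming SPE values are stored as the square root of the row sum-of-squares (as documented above); the limit itself is g · χ²(h) at the confidence level, e.g. via scipy.stats.chi2.ppf:

```python
import numpy as np

def spe_limit_params(spe_values):
    """Moment-matching g and h for the Nomikos-MacGregor SPE limit.

    The squared-error distribution is approximated as g * chi-square(h),
    with g and h matched to the sample mean m and variance v of SPE^2:
        g = v / (2 m),   h = 2 m^2 / v
    The limit is then g * chi2.ppf(conf_level, h) (e.g. scipy.stats.chi2).
    """
    spe2 = np.asarray(spe_values, dtype=float) ** 2   # SPE stored as sqrt of row SS
    m, v = spe2.mean(), spe2.var(ddof=1)
    return v / (2.0 * m), 2.0 * m * m / v

g, h = spe_limit_params([1.0, 2.0, 1.5, 2.5, 1.2])    # illustrative SPE column
```

A useful sanity check: g · h equals the mean of the squared SPE values by construction.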
- spe_contributions(X)[source]#
Per-variable squared residuals for each X-block (SPE contributions).
Reconstruction matches the MBPCA deflation step:
X_b = T_super @ (P_b * p_super[b] * sqrt(K_b))^T, summed over components. Returns squared residuals on the preprocessed scale; sum across columns equals block_spe_[b].iloc[:, -1] ** 2.
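The stated identity can be checked with random matrices standing in for a fitted model (shapes and names here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K_b, A = 10, 5, 2
X_b = rng.normal(size=(N, K_b))        # stand-in preprocessed block
T_super = rng.normal(size=(N, A))      # stand-in super-scores
P_b = rng.normal(size=(K_b, A))        # stand-in block loadings
p_s = rng.normal(size=A)               # stand-in super-loading elements for this block

X_hat = T_super @ (P_b * p_s * np.sqrt(K_b)).T   # reconstruction over all components
E = X_b - X_hat                                  # residuals on the preprocessed scale

contrib = E ** 2                                 # per-variable SPE contributions
spe = np.sqrt((E ** 2).sum(axis=1))              # SPE convention: sqrt of row SS
```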
- score_contributions(t_super_start, t_super_end=None, components=None, *, weighted=False)[source]#
Per-block per-variable contributions to a super-score movement (MBPCA).
Decomposes a super-score-space delta into preprocessed-scale variable contributions per X-block. The MBPCA back-projection mirrors the deflation step used during fit:
\[\text{contrib}_{b,j} = \sum_a (\Delta t_\mathrm{super}[a]) \cdot P_b[j, a] \cdot p_\mathrm{super}[b, a] \cdot \sqrt{K_b}\]
See MBPLS.score_contributions() for the parameter and return descriptions; the API is identical.
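The sum above vectorises directly; a sketch with random stand-ins for the fitted loadings (all names and shapes are illustrative, not the library's attributes):

```python
import numpy as np

rng = np.random.default_rng(2)
A = 3                                            # number of components
P = {"b1": rng.normal(size=(4, A)),              # stand-in block loadings P_b
     "b2": rng.normal(size=(6, A))}
p_super = {"b1": rng.normal(size=A),             # stand-in super-loading rows
           "b2": rng.normal(size=A)}
dt = rng.normal(size=A)                          # super-score delta, t_end - t_start

contrib = {}
for b, P_b in P.items():
    K_b = P_b.shape[0]
    # contrib_{b,j} = sum_a dt[a] * P_b[j, a] * p_super[b][a] * sqrt(K_b)
    contrib[b] = np.sqrt(K_b) * (P_b * p_super[b] * dt).sum(axis=1)
```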
- super_score_plot(pc_horiz=1, pc_vert=2)[source]#
Scatter plot of MBPCA super-scores for two components.
Preprocessing#
- class process_improve.multivariate.methods.MCUVScaler[source]#
Bases:
BaseEstimator, TransformerMixin
Create our own mean centering and scaling to unit variance (MCUV) class. The default scaler in sklearn does not handle small datasets accurately, because of the ddof it uses when computing the standard deviation.
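The ddof point can be seen directly in NumPy: the population formula (ddof=0, which sklearn's StandardScaler uses) understates the sample standard deviation (ddof=1) for small N:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # tiny dataset where the ddof choice matters

pop_std = np.std(x, ddof=0)           # population formula; what StandardScaler uses
sample_std = np.std(x, ddof=1)        # sample formula (divide by N - 1)

# For N = 4 the two differ by a factor of sqrt(4/3), about 15%;
# the gap shrinks as N grows, which is why large datasets mask the issue.
```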
- process_improve.multivariate.methods.center(X, func=<function mean>, axis=0, extra_output=False)[source]#
Perform centering of data, using a function, func (default: np.mean). The function, if supplied, must return a vector with as many columns as the matrix X.
axis [optional; default=0] {integer or None}
This specifies the axis along which the centering vector will be calculated if not provided. The function is applied along the axis: 0=down the columns; 1 = across the rows.
Missing values: The sample mean is computed by taking the sum along the axis, skipping any missing data, and dividing by N = number of values which are present. Values which were missing before, are left as missing after.
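The missing-data convention described above can be sketched with np.nanmean (a simplified stand-in, not the library's center function):

```python
import numpy as np

def center_nan(X):
    """Column-centre X: means skip missing values; missing entries stay missing."""
    X = np.asarray(X, dtype=float)
    mu = np.nanmean(X, axis=0)        # sum of present values / count of present values
    return X - mu

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])
Xc = center_nan(X)   # column means (2.0, 4.0) computed from present values only
```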
- process_improve.multivariate.methods.scale(X, func=<function std>, axis=0, extra_output=False, **kwargs)[source]#
Scales the data (does NOT do any centering); scales to unit variance by default.
- func [optional; default=np.std] {a function}
The default (np.std) will use NumPy to calculate the sample standard deviation of the data, and use that as scale.
TODO: provide a scaling vector.
The sample standard deviation is computed along the required axis, skipping over any missing data, and dividing by N-1, where N = number of values which are present, i.e. not counting missing values.
- axis [optional; default=0] {integer}
Transformations are applied on slices of data. This specifies the axis along which the transformation will be applied.
Commented-out options (present in the source but currently inactive):
markers [optional; default=None]: a vector (or slice) used to store the indices (the markers) where the variance needs to be replaced with low_variance_replacement.
variance_tolerance [optional; default=1E-7] {floating point}: a slice is considered to have no variance when the actual variance of that slice is smaller than this value.
low_variance_replacement [optional; default=0.0] {floating point}: used to replace values in the output where the markers indicate no or low variance.
Usage#
X = ...                 # data matrix
X = scale(center(X))
# NumPy has no np.mad; for a robust scale use e.g. scipy.stats.median_abs_deviation
from scipy.stats import median_abs_deviation
X = scale(center(X), func=median_abs_deviation)
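Tying the commented-out low-variance options together, a simplified stand-in for scale (illustrative only, not the library's implementation) could look like:

```python
import numpy as np

def scale_nan(X, variance_tolerance=1e-7, low_variance_replacement=0.0):
    """Scale columns to unit variance (no centering), skipping missing values.

    Columns whose variance is below variance_tolerance are not divided;
    their output values are set to low_variance_replacement instead.
    """
    X = np.asarray(X, dtype=float)
    s = np.nanstd(X, axis=0, ddof=1)              # N-1, over present values only
    markers = s ** 2 < variance_tolerance         # flag the no-variance columns
    out = X / np.where(markers, 1.0, s)
    out[:, markers] = low_variance_replacement
    return out

X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [np.nan, 5.0]])
Xs = scale_nan(X)    # second column has zero variance -> replaced by 0.0
```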
Plots#
- process_improve.multivariate.plots.score_plot(model, pc_horiz=1, pc_vert=2, pc_depth=-1, items_to_highlight=None, settings=None, fig=None)[source]#
Generate a 2-dimensional score plot for the given latent variable model.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.
pc_horiz (int, optional) – Which component to plot on the horizontal axis, by default 1 (the first component)
pc_vert (int, optional) – Which component to plot on the vertical axis, by default 2 (the second component)
pc_depth (int, optional) – If pc_depth >= 1, then a 3D score plot is generated, with this component on the 3rd axis
items_to_highlight (dict, optional) –
Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:
items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
will highlight the items in items_in_red with the given colour and shape.
settings (dict) –
Default settings:
{
    "show_ellipse": True,                # bool: show the Hotelling's T2 ellipse
    "ellipse_conf_level": 0.95,          # float: ellipse confidence level (< 1.00)
    "title": "Score plot of ...",        # str: overall plot title
    "show_labels": False,                # bool: add a label for each observation
    "show_legend": True,                 # bool: show clickable legend
    "html_image_height": 500,            # int: image height in pixels
    "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
}
fig (Figure | None)
- Return type:
Figure
Examples
>>> pca = PCA(n_components=3).fit(X_scaled)
>>> pca.score_plot()                                   # PC1 vs PC2
>>> pca.score_plot(pc_horiz=1, pc_vert=3)              # PC1 vs PC3
>>> pca.score_plot(pc_horiz=1, pc_vert=2, pc_depth=3)  # 3D
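The items_to_highlight convention can be checked in isolation: each key must round-trip through json.loads (the index names here are made up for illustration):

```python
import json

items_in_red = ["batch_07", "batch_12"]   # illustrative index names
items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}

for key, items in items_to_highlight.items():
    marker_spec = json.loads(key)         # must parse into a Plotly line/marker spec
```

The dict is then passed as score_plot(model, items_to_highlight=items_to_highlight); the same convention applies to spe_plot and t2_plot.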
- process_improve.multivariate.plots.loading_plot(model, loadings_type='p', pc_horiz=1, pc_vert=2, settings=None, fig=None)[source]#
Generate a 2-dimensional loadings plot for the given latent variable model.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.
loadings_type (str, optional) –
- A choice of the following:
'p' : (default for PCA) the P (projection) loadings; the only option possible for PCA
'w' : the W loadings; suitable for PLS
'w*' : (default for PLS) the W* (or R) loadings; suitable for PLS
'w*c' : the W* (from X-space) with C loadings from the Y-space; suitable for PLS
'c' : the C loadings from the Y-space; suitable for PLS
For a PCA model any choice besides 'p' will be ignored.
pc_horiz (int, optional) – Which component to plot on the horizontal axis, by default 1 (the first component)
pc_vert (int, optional) – Which component to plot on the vertical axis, by default 2 (the second component)
settings (dict) –
Default settings:
{
    "title": "Loadings plot ...",        # str: overall plot title
    "show_labels": True,                 # bool: add a label for each variable
    "html_image_height": 500,            # int: image height in pixels
    "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
}
fig (Figure | None)
- Return type:
Figure
Examples
>>> pca.loading_plot()                              # P loadings, PC1 vs PC2
>>> pls.loading_plot(loadings_type="w*c")           # W* and C loadings
>>> pls.loading_plot(loadings_type="w", pc_vert=3)  # W loadings, PC1 vs PC3
- process_improve.multivariate.plots.spe_plot(model, with_a=-1, items_to_highlight=None, settings=None, fig=None)[source]#
Generate a squared prediction error (SPE) plot for the given latent variable model, using with_a latent variables. By default, all latent variables which have already been fitted are used.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.
with_a (int, optional) – Use this many latent variables, and therefore show the SPE after this number of model components. By default the total number of components fitted will be used.
items_to_highlight (dict, optional) –
Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:
items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
will highlight the items in items_in_red with the given colour and shape.
settings (dict) –
Default settings:
{
    "show_limit": True,                  # bool: show the SPE confidence limit line
    "conf_level": 0.95,                  # float: confidence level for limit (< 1.00)
    "title": "SPE plot ...",             # str: overall plot title
    "default_marker": {...},             # dict: e.g. dict(color="darkblue", size=7)
    "show_labels": False,                # bool: add a label for each observation
    "show_legend": False,                # bool: show clickable legend
    "html_image_height": 500,            # int: image height in pixels
    "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
}
fig (Figure | None)
- Return type:
Figure
Examples
>>> pca.spe_plot()
>>> pca.spe_plot(settings={"conf_level": 0.99, "show_labels": True})
- process_improve.multivariate.plots.t2_plot(model, with_a=-1, items_to_highlight=None, settings=None, fig=None)[source]#
Generate a Hotelling's T2 plot for the given latent variable model, using with_a latent variables. By default, all latent variables which have already been fitted are used.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.
with_a (int, optional) – Use this many latent variables, and therefore show the T2 after this number of model components. By default the total number of components fitted will be used.
items_to_highlight (dict, optional) –
Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:
items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
will highlight the items in items_in_red with the given colour and shape.
settings (dict) –
Default settings:
{
    "show_limit": True,                  # bool: show the T2 confidence limit line
    "conf_level": 0.95,                  # float: confidence level for limit (< 1.00)
    "title": "T2 plot ...",              # str: overall plot title
    "default_marker": {...},             # dict: e.g. dict(color="darkblue", size=7)
    "show_labels": False,                # bool: add a label for each observation
    "show_legend": False,                # bool: show clickable legend
    "html_image_height": 500,            # int: image height in pixels
    "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
}
fig (Figure | None)
- Return type:
Figure
Examples
>>> pca.t2_plot()
>>> pca.t2_plot(settings={"conf_level": 0.99, "show_labels": True})
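For reference, Hotelling's T2 per observation is the sum of squared scores, each standardised by its component's score standard deviation (cf. scaling_factor_for_scores in the PCA attributes). A NumPy sketch with random stand-in scores:

```python
import numpy as np

rng = np.random.default_rng(3)
T = rng.normal(size=(50, 3))          # stand-in scores (N observations x A components)
s = T.std(axis=0, ddof=1)             # score standard deviation per component

t2 = ((T / s) ** 2).sum(axis=1)       # Hotelling's T2, cumulative over all A components
```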