Multivariate Analysis#

Models#

PCA#

class process_improve.multivariate.methods.PCA(n_components, *, algorithm='auto', missing_data_settings=None)[source]#

Bases: _LatentVariableModel, TransformerMixin, BaseEstimator

Principal Component Analysis with support for missing data.

Parameters:
  • n_components (int) – Number of principal components to extract.

  • algorithm (str, default="auto") – Algorithm to use for fitting the model. - "auto": Uses SVD when data is complete, NIPALS when data has missing values. - "svd": Singular Value Decomposition. Requires complete data. - "nipals": Non-linear Iterative Partial Least Squares. Handles missing data. - "tsr": Trimmed Score Regression. Handles missing data.

  • missing_data_settings (dict or None, default=None) – Settings for iterative missing data algorithms (NIPALS, TSR). Keys: md_tol (convergence tolerance), md_max_iter (max iterations).

  • fitting) (Attributes (after)

  • --------------------------

  • scores (pd.DataFrame of shape (n_samples, n_components)) – The score matrix (T).

  • loadings (pd.DataFrame of shape (n_features, n_components)) – The loading matrix (P).

  • r2_per_component (pd.Series of length n_components) – Fractional R² explained by each component.

  • r2_cumulative (pd.Series of length n_components) – Cumulative R² after each component.

  • r2_per_variable (pd.DataFrame of shape (n_features, n_components)) – Per-variable cumulative R² after each component.

  • spe (pd.DataFrame of shape (n_samples, n_components)) – Per-row SPE diagnostic; stored as the square root of the row sum-of-squared X-residuals (so it is on the residual scale, not the squared scale).

  • hotellings_t2 (pd.DataFrame of shape (n_samples, n_components)) – Cumulative Hotelling’s T² statistic.

  • explained_variance (np.ndarray of shape (n_components,)) – Variance explained by each component.

  • scaling_factor_for_scores (pd.Series of length n_components) – Standard deviation per score (sqrt of explained variance).

  • has_missing_data (bool) – Whether the training data contained missing values.

  • fitting_info (dict) – Timing and iteration info from the fitting algorithm.

fitting_info_: dict[str, ndarray | int | float]#
scores_#

Expose a private ndarray as a lazily-built, cached pandas.DataFrame (ENG-18).

Declared as a class attribute, e.g.:

scores_   = _LazyFrame("_scores",   index="_sample_index",  columns="_component_names")
loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")

The private ndarray (self._scores) is the source of truth; the public DataFrame is built on first access from the ndarray plus the index/column metadata attributes, cached in self.__dict__["_frame_cache"] (so repeated access returns the same object and is cheap), and excluded from pickling by _LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call .values conversion.

On an unfitted model the backing ndarray is absent, so getattr raises AttributeError - the same “not fitted” signal as before this change, so hasattr / check_is_fitted behave identically.

loadings_#

Expose a private ndarray as a lazily-built, cached pandas.DataFrame (ENG-18).

Declared as a class attribute, e.g.:

scores_   = _LazyFrame("_scores",   index="_sample_index",  columns="_component_names")
loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")

The private ndarray (self._scores) is the source of truth; the public DataFrame is built on first access from the ndarray plus the index/column metadata attributes, cached in self.__dict__["_frame_cache"] (so repeated access returns the same object and is cheap), and excluded from pickling by _LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call .values conversion.

On an unfitted model the backing ndarray is absent, so getattr raises AttributeError - the same “not fitted” signal as before this change, so hasattr / check_is_fitted behave identically.

spe_#

Expose a private ndarray as a lazily-built, cached pandas.DataFrame (ENG-18).

Declared as a class attribute, e.g.:

scores_   = _LazyFrame("_scores",   index="_sample_index",  columns="_component_names")
loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")

The private ndarray (self._scores) is the source of truth; the public DataFrame is built on first access from the ndarray plus the index/column metadata attributes, cached in self.__dict__["_frame_cache"] (so repeated access returns the same object and is cheap), and excluded from pickling by _LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call .values conversion.

On an unfitted model the backing ndarray is absent, so getattr raises AttributeError - the same “not fitted” signal as before this change, so hasattr / check_is_fitted behave identically.

get_feature_names_out(input_features=None)[source]#

Return the output column names of transform().

PCA’s transform produces scores, one column per component, named ["PC1", "PC2", ..., "PC{n_components}"]. The input_features argument is accepted (Pipeline introspection passes it through) but unused: the output column count is the fitted n_components, not the input feature count.

Used by set_output() (sklearn 1.2+) to label the DataFrame view of the scores when set_output(transform="pandas") is on, and by Pipeline introspection.

Return type:

ndarray

fit(X, y=None)[source]#

Fit a principal component analysis (PCA) model to the data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data. May contain NaN values for missing data (the NIPALS / TSR algorithms thread them through; the SVD path rejects them).

  • y (ignored)

Returns:

self

Return type:

PCA

transform(X)[source]#

Project new data onto the fitted PCA model to obtain scores.

Parameters:

X (array-like of shape (n_samples, n_features)) – New data to project. Must have the same number of features as the training data.

Returns:

scores

Return type:

pd.DataFrame of shape (n_samples, n_components)

fit_transform(X, y=None)[source]#

Fit the model and return the training scores.

Parameters:
Return type:

DataFrame

diagnose(X)[source]#

Project new data and compute diagnostics (scores, Hotelling’s T², SPE).

The same logic that historically lived in predict(). The rename (since 1.38.1, #396) matches PLS.diagnose() and clears the predict name for a future return-type contract that does what its sklearn-convention name implies (a regression-style prediction). predict() is kept as a deprecation shim for now.

Parameters:

X (array-like of shape (n_samples, n_features))

Returns:

result – With keys scores, hotellings_t2, spe.

Return type:

sklearn.utils.Bunch

predict(X)[source]#

Forward to diagnose(); emits a DeprecationWarning.

Deprecated since version 1.38.1: Use PCA.diagnose() instead. predict matches the sklearn-convention name (a regression-style prediction), but PCA isn’t a regressor; the historical return is a diagnostics Bunch. The rename aligns with PLS.diagnose() and frees the name for a future contract. Will be removed in 2.0.0.

Parameters:

X (ndarray | DataFrame)

Return type:

Bunch

score(X, y=None)[source]#

Negative mean squared reconstruction error (higher is better).

Follows the sklearn convention where higher scores indicate better model fit. This makes PCA compatible with cross_val_score, GridSearchCV, and other sklearn model-selection utilities.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test data to score.

  • y (ignored)

Returns:

score – Negative mean squared reconstruction error.

Return type:

float

Examples

>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(PCA(n_components=2), X_scaled, cv=5)
>>> print(f"Mean CV score: {scores.mean():.4f}")
classmethod minka_mle(X)[source]#

Minka (2000) automatic-dimensionality estimate for PCA.

Closed-form Bayesian model selection on the PPCA evidence (Minka, T. P. 2000. Automatic Choice of Dimensionality for PCA. NIPS 13, pp. 598-604). Operates only on the covariance eigenvalues of X and is therefore very cheap; in the simulations Minka reports it beats cross-validation. Use it alongside the ekf-CV recommendation from select_n_components() as a fast cross-check.

Internally X is mean-centred before estimation (a PPCA assumption); it is not unit-variance scaled, because dividing each column by its standard deviation compresses the noise eigenvalues to near-zero values the MLE misreads as additional latent signal. If your columns are on wildly different scales, pass the analysis-scale X produced by your own preprocessing (e.g. SNV for spectral data) and accept the centring this method applies.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data matrix.

Returns:

n_components – The MLE estimate of the effective dimensionality.

Return type:

int

References

Minka, T. P. (2000). Automatic Choice of Dimensionality for PCA. Advances in Neural Information Processing Systems, 13, 598-604.

See also

parallel_analysis

Horn (1965) eigenvalue-vs-null retention.

select_n_components

ekf cross-validation; pass return_consensus=True to report all three side by side.

classmethod parallel_analysis(X, *, n_simulations=200, quantile=0.95, scale=True, random_state=None)[source]#

Horn (1965) parallel analysis component-count estimate.

Generates n_simulations random matrices of the same shape as X, computes their eigenvalues, and retains every observed component whose eigenvalue exceeds the quantile of the null distribution at the same rank. Widely regarded in psychometrics as the best simple retention rule for PCA.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Data matrix.

  • n_simulations (int, default 200) – Number of random matrices drawn to build the null eigenvalue distribution.

  • quantile (float, default 0.95) – Quantile of the null eigenvalues used as the retention threshold. Horn’s original proposal was the mean (0.5); the more conservative 95th-percentile threshold is the modern recommendation.

  • scale (bool, default True) – Mean-centre and unit-variance scale X before estimation (matches minka_mle()).

  • random_state (int, optional) – Seed for the null-matrix simulations.

Returns:

result – With keys:

  • n_components - number of components retained (can be 0 on pure noise).

  • observed_eigenvalues - eigenvalues of X after centring/scaling (np.ndarray of length min(n, p)).

  • null_threshold - per-rank quantile of the null eigenvalue distribution (same length as observed_eigenvalues).

Return type:

sklearn.utils.Bunch

References

Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2), 179-185.

See also

minka_mle

closed-form PPCA evidence rule.

select_n_components

ekf cross-validation; pass return_consensus=True to report all three side by side.

classmethod select_n_components(X, *, max_components=None, cv=5, cv_scheme='ekf', n_repeats=1, selection_rule='min', min_q2_increase=0.01, scale_inside_folds=True, n_iter=50, tol=1e-06, random_state=None, return_consensus=False, threshold=None, **pca_kwargs)[source]#

Select the number of PCA components via cross-validation.

Evaluates every component count 1, 2, ..., max_components and recommends one via the configured selection_rule. The default cv_scheme="ekf" is the element-wise k-fold algorithm of Bro, Kjeldahl, Smilde & Kiers (2008, Anal. Bioanal. Chem. 390:1241-1251), which holds out individual cells of X and predicts them via EM-style imputation from a model that never sees their true values. This restores the prediction-independence requirement the legacy row-wise scheme violates, fixing the trivial-fit pathology where PRESS shrinks monotonically with components.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data. Should already be on the analysis scale (e.g. mean-centred and unit-variance via MCUVScaler).

  • max_components (int, optional) – Maximum number of components to evaluate. Default is min(n_samples - 1, n_features).

  • cv (int or sklearn CV splitter, default 5) – For cv_scheme="ekf": the integer number of element-folds (splitter objects are ignored). For cv_scheme="row_wise": either an integer K (fed to KFold) or any sklearn splitter.

  • cv_scheme ({"ekf", "row_wise"}, default "ekf") – "ekf" is the element-wise k-fold scheme of Bro et al. 2008 with EM imputation; the research-recommended default. The legacy "row_wise" scheme is preserved for back-compat but emits a SpecificationWarning because it over-selects (see the warning admonition below).

  • n_repeats (int, default 1) – Repeat the ekf pass with a fresh random fold permutation this many times. Each repeat covers every cell exactly once; n_repeats > 1 narrows the per-component PRESS standard error (helpful when the 1-SE rule sits on a borderline) at roughly linear extra runtime. Ignored under cv_scheme="row_wise".

  • selection_rule ({"min", "1se", "q2_increment"}, default "min") – How the recommended component count is chosen. "min" is the GlobalMin criterion Bro 2008 pairs with ekf - the component count with the lowest pooled PRESS. "1se" is the one- standard-error rule (needs per_fold_press, available under both schemes). "q2_increment" is the Wold’s-R-style cumulative-\(Q^2\) threshold from PR #371; min_q2_increase sets the threshold.

  • min_q2_increase (float, default 0.01) – Threshold used only when selection_rule="q2_increment".

  • scale_inside_folds (bool, default True) – With the default, mean-centring and unit-variance scaling constants are fit on each fold’s in-fold cells and applied to the whole matrix before EM, removing the centring/scaling leakage of the prior implementation. Set to False to reproduce the previous behaviour (column mean recomputed each EM iteration from the imputed matrix, no scaling); this is useful only when X is already pre-scaled. Ignored under cv_scheme="row_wise".

  • n_iter (int and float, default 50 and 1e-6) – EM iteration cap and convergence tolerance for the ekf imputation step. Ignored under cv_scheme="row_wise".

  • tol (int and float, default 50 and 1e-6) – EM iteration cap and convergence tolerance for the ekf imputation step. Ignored under cv_scheme="row_wise".

  • random_state (int, optional) – Seed for the ekf element-fold permutation.

  • threshold (float, optional) – Deprecated. The original Wold PRESS-ratio cutoff. Passing it emits a DeprecationWarning; the value is ignored. Use selection_rule="q2_increment" (and tune min_q2_increase) for a comparable parsimony preference.

  • **pca_kwargs – Additional keyword arguments passed to the PCA() constructor under cv_scheme="row_wise" (e.g. algorithm="nipals"). Ignored under cv_scheme="ekf" because ekf runs its own SVD loop.

  • return_consensus (bool)

Returns:

result – With keys:

  • n_components - recommended number of components (int).

  • press - pooled PRESS per component count (pd.Series, indexed 1..A_max).

  • per_fold_press - per-fold PRESS contributions (pd.DataFrame, A_max rows x n_folds columns).

  • se_press - standard error of the per-fold PRESS curve (pd.Series, indexed 1..A_max). Drives the 1-SE rule.

  • q2_se - the same standard error rescaled onto the Q2 scale (se_press / null_model_ss, pd.Series indexed 1..A_max), i.e. the half-width of a +/-1 SE band around q2.

  • press_ratio - PRESS_a / PRESS_{a-1} for inspection (pd.Series, indexed 2..A_max).

  • q2 - cross-validated \(R^2_X\) per component count (pd.Series, indexed 1..A_max). Computed as 1 - press / (n_samples * n_features * mean_cell_ss), so it is directly comparable to r2_cumulative_ and to PLS’s r2y_validated.

  • cv_scores - alias of per_fold_press under ekf, or per-fold negative MSE from cross_val_score under row-wise (preserved for back-compat).

  • cv_scheme - the scheme used ("ekf" or "row_wise").

  • selection_rule - the rule used to pick n_components.

Return type:

sklearn.utils.Bunch

References

Bro, R., Kjeldahl, K., Smilde, A. K., & Kiers, H. A. L. (2008). Cross-validation of component models: a critical look at current methods. Anal. Bioanal. Chem., 390(5), 1241-1251.

Camacho, J., & Ferrer, A. (2012). Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. J. Chemometrics, 26(7), 361-373.

Warning

cv_scheme="row_wise" is preserved only for back-compat and emits a SpecificationWarning. It suffers from the trivial-fit problem: holding out whole rows and projecting them back via transform() lets the held-out row’s own values reach its prediction, so PRESS shrinks monotonically with the component count and the recommendation tends to run to the maximum. Prefer the default "ekf".

score_contributions(t_start, t_end=None, components=None, *, weighted=False)[source]#

Contribution of each variable to a score-space movement.

Decomposes the difference (t_end - t_start) back into variable space via the loadings for the selected components.

Parameters:
  • t_start (array-like of shape (n_components,)) – Score vector of the observation of interest. Can be a row from self.scores_ or from predict(X_new).scores.

  • t_end (array-like of shape (n_components,), optional) – Reference point in score space. Default is the model center (a vector of zeros), which is the most common choice.

  • components (list of int, optional) – 1-based component indices to decompose over, matching the model’s column convention. Examples: [2, 3] for a PC2-vs-PC3 score plot, or None (default) for all components - appropriate for Hotelling’s T² contributions.

  • weighted (bool, default False) – If True, scale the score difference by 1/sqrt(explained_variance) per component before back-projecting. This gives contributions to the T² statistic rather than to the Euclidean score distance.

Returns:

contributions – One value per variable; sign indicates direction. Index contains the variable (feature) names.

Return type:

pd.Series of shape (n_features,)

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> # Why does observation 0 differ from the model center?
>>> contrib = pca.score_contributions(pca.scores_.iloc[0].values)
>>> # Which variables drive the difference between obs 0 and obs 5?
>>> contrib = pca.score_contributions(
...     pca.scores_.iloc[0].values, pca.scores_.iloc[5].values
... )

See also

observation_contributions

The per-observation counterpart. Despite the similar name, the two answer different questions and are not interchangeable. score_contributions is per-variable: it decomposes a single observation’s movement in score space back onto the original variables, answering “which variables explain why this observation sits where it does?”. It returns one signed value per variable. observation_contributions is per-observation: it reports each observation’s share of a component’s total inertia, answering “which observations most strongly shape this component?”. It returns a non-negative sample-by-component table whose columns each sum to 1. The two are orthogonal views of the same score matrix - one decomposes across variables, the other across observations.

detect_outliers(conf_level=0.95)[source]#

Detect outlier observations using SPE and Hotelling’s T² diagnostics.

Combines two approaches:

  1. Statistical limits - observations exceeding the SPE or T² limit at conf_level are flagged.

  2. Robust ESD test - the generalized ESD test (with robust median/MAD variant) identifies observations that are unusual relative to the rest of the data, even if they fall below the statistical limit.

An observation can be flagged for one or both reasons.

Parameters:

conf_level (float, default 0.95) – Confidence level in [0.8, 0.999]. Controls both the statistical limits and the ESD test’s significance level (alpha = 1 - conf_level).

Returns:

outliers – Sorted from most severe to least. Each dict contains:

  • observation - index label of the observation

  • outlier_types - list of "spe" and/or "hotellings_t2"

  • spe - SPE value for this observation

  • hotellings_t2 - T² value for this observation

  • spe_limit - SPE limit at the given confidence level

  • hotellings_t2_limit - T² limit at the given confidence level

  • severity - max(spe/spe_limit, t2/t2_limit)

Return type:

list of dict

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> outliers = pca.detect_outliers(conf_level=0.95)
>>> for o in outliers:
...     print(f"{o['observation']}: {o['outlier_types']} (severity={o['severity']})")

PLS#

class process_improve.multivariate.methods.PLS(n_components, *, scale=True, max_iter=1000, tol=np.float64(1.4901161193847656e-08), copy=True, missing_data_settings=None)[source]#

Bases: _LatentVariableModel, RegressorMixin, TransformerMixin, BaseEstimator

Projection to Latent Structures (PLS) regression with diagnostics.

Implements PLS via the NIPALS algorithm with production diagnostics: SPE, Hotelling’s T², score contributions, and outlier detection. The API mirrors PCA so that model.scores_, model.spe_, and model.detect_outliers() work identically for both model types.

Parameters:
  • n_components (int) – Number of latent components to extract.

  • scale (bool, default=True) – Whether to scale X and Y to unit variance (sklearn internal scaling). When using MCUVScaler externally, set scale=False to avoid double scaling.

  • max_iter (int, default=1000) – Maximum number of iterations for the NIPALS algorithm.

  • tol (float, default=sqrt(machine epsilon)) – Convergence tolerance for the NIPALS algorithm.

  • copy (bool, default=True) – Whether to copy X and Y before fitting.

  • missing_data_settings (dict or None, default=None) – Settings for missing data algorithms (NIPALS/TSR for PLS). Keys: md_method ("tsr", "scp", "nipals"), md_tol, md_max_iter.

  • fitting) (Attributes (after)

  • --------------------------

  • scores (pd.DataFrame of shape (n_samples, n_components)) – X-block score matrix (T). This is the primary score matrix; equivalent to x_scores in older versions.

  • y_scores (pd.DataFrame of shape (n_samples, n_components)) – Y-block score matrix (U).

  • x_loadings (pd.DataFrame of shape (n_features, n_components)) – X-block loading matrix (P).

  • y_loadings (pd.DataFrame of shape (n_targets, n_components)) – Y-block loading matrix (C).

  • x_weights (pd.DataFrame of shape (n_features, n_components)) – X-block weight matrix (W).

  • y_weights (pd.DataFrame of shape (n_targets, n_components)) – Y-block weight matrix.

  • direct_weights (pd.DataFrame of shape (n_features, n_components)) – Direct (W*) weights: W (P'W)^{-1}. Used for direct projection T = X @ W*.

  • beta_coefficients (pd.DataFrame of shape (n_features, n_targets)) – Regression coefficients linking X directly to Y.

  • predictions (pd.DataFrame of shape (n_samples, n_targets)) – Y predictions from the training data.

  • spe (pd.DataFrame of shape (n_samples, n_components)) – Per-row SPE diagnostic; stored as the square root of the row sum-of-squared X-residuals (so it is on the residual scale, not the squared scale).

  • hotellings_t2 (pd.DataFrame of shape (n_samples, n_components)) – Cumulative Hotelling’s T² statistic.

  • r2_per_component (pd.Series of length n_components) – Fractional R² (on Y) explained by each component.

  • r2_cumulative (pd.Series of length n_components) – Cumulative R² (on Y) after each component.

  • r2_per_variable (pd.DataFrame of shape (n_features, n_components)) – Per-variable cumulative R² for X after each component.

  • r2y_per_variable (pd.DataFrame of shape (n_targets, n_components)) – Per-variable R² for Y after each component.

  • rmse (pd.DataFrame of shape (n_targets, n_components)) – Root mean squared error of Y predictions per component.

  • explained_variance (np.ndarray of shape (n_components,)) – Variance explained by each component in X.

  • scaling_factor_for_scores (pd.Series of length n_components) – Standard deviation per score (sqrt of explained variance).

  • has_missing_data (bool) – Whether the training data contained missing values.

  • fitting_info (dict) – Timing and iteration info from the fitting algorithm.

See also

PCA

Principal Component Analysis.

MCUVScaler

Mean-center unit-variance scaler.

References

Abdi, “Partial least squares regression and projection on latent structure regression (PLS Regression)”, 2010, DOI: 10.1002/wics.51

Examples

>>> import pandas as pd
>>> from process_improve.multivariate.methods import PLS, MCUVScaler
>>> X = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 3, 2, 1]})
>>> Y = pd.DataFrame({"y": [2.1, 3.9, 6.2, 7.8]})
>>> pls = PLS(n_components=1)
>>> pls = pls.fit(MCUVScaler().fit_transform(X), MCUVScaler().fit_transform(Y))
>>> pls.scores_.shape
(4, 1)
get_feature_names_out(input_features=None)[source]#

Return the output column names of transform().

PLS’s transform returns the X scores (T matrix), labelled ["T1", "T2", ..., "T{n_components}"]. The input_features argument is accepted (Pipeline introspection passes it through) but unused: the output column count is the fitted n_components, not the input feature count.

Used by set_output() (sklearn 1.2+) to label the DataFrame view of the scores when set_output(transform="pandas") is on, and by Pipeline introspection.

Return type:

ndarray

predictions_vs_observed_plot(*, y_observed, variable=None, settings=None, fig=None)#

Generate an observed-vs-predicted (parity) plot for a fitted PLS model.

Plots the calibration predictions against the observed Y values, with a y = x reference line and an RMSE annotation. Points lying close to the reference line indicate good predictions.

Parameters:
  • model (PLS object) – A fitted PLS model generated by this library.

  • y_observed (array-like of shape (n_samples, n_targets)) – The observed Y values, on the same scale as the data used to fit the model (for example the scaled Y from MCUVScaler).

  • variable (str, optional) – Which Y-variable to plot. Defaults to the first Y-variable.

  • settings (dict) –

    Default settings:

    {
        "title": "Observed vs predicted ...",  # str: overall plot title
        "marker_color": None,           # str|None: data-marker colour; None uses the theme
        "reference_color": "#9CA3AF",   # str: colour of the y = x line
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 1.0,   # float: width as ratio of height
        "template": "pi_journal",         # str: registered Plotly theme name
    }
    

  • fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.

Return type:

Figure

Examples

>>> pls.predictions_vs_observed_plot(y_observed=Y_scaled)
>>> pls.predictions_vs_observed_plot(y_observed=Y_scaled, variable="quality")
coefficient_plot(variable=None, settings=None, fig=None)#

Generate a bar plot of the PLS regression coefficients.

Shows beta_coefficients_ for one Y-variable: one bar per X-variable, mapping the (preprocessed) X onto the predicted Y. Tall bars mark the X-variables that most strongly drive the prediction.

Parameters:
  • model (PLS object) – A fitted PLS model generated by this library.

  • variable (str, optional) – Which Y-variable’s coefficients to plot. Defaults to the first one.

  • settings (dict) –

    Default settings:

    {
        "title": "Regression coefficients ...",  # str: overall plot title
        "bar_color": None,              # str|None: bar colour; None uses the theme
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
        "template": "pi_journal",         # str: registered Plotly theme name
    }
    

  • fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.

Return type:

Figure

Examples

>>> pls.coefficient_plot()
>>> pls.coefficient_plot(variable="quality")
y_scores_: ndarray | DataFrame#
y_weights_: ndarray | DataFrame#
y_loadings_: ndarray | DataFrame#
fitting_info_: dict[str, ndarray | int | float]#
scores_#

Expose a private ndarray as a lazily-built, cached pandas.DataFrame (ENG-18).

Declared as a class attribute, e.g.:

scores_   = _LazyFrame("_scores",   index="_sample_index",  columns="_component_names")
loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")

The private ndarray (self._scores) is the source of truth; the public DataFrame is built on first access from the ndarray plus the index/column metadata attributes, cached in self.__dict__["_frame_cache"] (so repeated access returns the same object and is cheap), and excluded from pickling by _LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call .values conversion.

On an unfitted model the backing ndarray is absent, so getattr raises AttributeError - the same “not fitted” signal as before this change, so hasattr / check_is_fitted behave identically.

spe_#

Expose a private ndarray as a lazily-built, cached pandas.DataFrame (ENG-18).

Declared as a class attribute, e.g.:

scores_   = _LazyFrame("_scores",   index="_sample_index",  columns="_component_names")
loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")

The private ndarray (self._scores) is the source of truth; the public DataFrame is built on first access from the ndarray plus the index/column metadata attributes, cached in self.__dict__["_frame_cache"] (so repeated access returns the same object and is cheap), and excluded from pickling by _LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call .values conversion.

On an unfitted model the backing ndarray is absent, so getattr raises AttributeError - the same “not fitted” signal as before this change, so hasattr / check_is_fitted behave identically.

x_loadings_#

Expose a private ndarray as a lazily-built, cached pandas.DataFrame (ENG-18).

Declared as a class attribute, e.g.:

scores_   = _LazyFrame("_scores",   index="_sample_index",  columns="_component_names")
loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")

The private ndarray (self._scores) is the source of truth; the public DataFrame is built on first access from the ndarray plus the index/column metadata attributes, cached in self.__dict__["_frame_cache"] (so repeated access returns the same object and is cheap), and excluded from pickling by _LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call .values conversion.

On an unfitted model the backing ndarray is absent, so getattr raises AttributeError - the same “not fitted” signal as before this change, so hasattr / check_is_fitted behave identically.

x_weights_#

Expose a private ndarray as a lazily-built, cached pandas.DataFrame (ENG-18).

Declared as a class attribute, e.g.:

scores_   = _LazyFrame("_scores",   index="_sample_index",  columns="_component_names")
loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")

The private ndarray (self._scores) is the source of truth; the public DataFrame is built on first access from the ndarray plus the index/column metadata attributes, cached in self.__dict__["_frame_cache"] (so repeated access returns the same object and is cheap), and excluded from pickling by _LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call .values conversion.

On an unfitted model the backing ndarray is absent, so getattr raises AttributeError - the same “not fitted” signal as before this change, so hasattr / check_is_fitted behave identically.

fit(X, Y, sample_weight=None)[source]#

Fit a projection to latent structures (PLS) model to the data.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples (rows) and n_features is the number of features (columns).

  • Y (array-like, shape (n_samples, n_targets)) – Training data, where n_samples is the number of samples (rows) and n_targets is the number of target outputs (columns).

  • sample_weight (array-like of shape (n_samples,), optional) – Non-negative row weights for a weighted PLS fit (#394). NIPALS is run on sqrt(w)-rescaled X and Y, which is equivalent to weighting the cross-products X' W u and Y' W t. Loadings, weights and beta are computed correctly; scores are returned on the original sample scale. Zero weights effectively exclude the corresponding rows (sample_weight=[1,1,0,0,1] reproduces the unweighted fit on rows [0,1,4]). Forwarded to score() / r2_score for any caller that also threads it through.

Returns:

Model object.

Return type:

PLS

References

Abdi, “Partial least squares regression and projection on latent structure regression (PLS Regression)”, 2010, DOI: 10.1002/wics.51

transform(X, Y=None)[source]#

Project X (and optionally Y) into the latent space.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Data to transform.

  • Y (array-like of shape (n_samples, n_targets), optional) – Ignored. Present for API compatibility with sklearn pipelines.

Returns:

X_scores – Projected X data (scores).

Return type:

pd.DataFrame of shape (n_samples, n_components)

fit_transform(X, Y=None)[source]#

Fit the model and return X scores.

Parameters:
  • X (array-like of shape (n_samples, n_features))

  • Y (array-like of shape (n_samples, n_targets))

Returns:

X_scores

Return type:

pd.DataFrame of shape (n_samples, n_components)

score(X, Y, sample_weight=None)[source]#

Return the R² score for the prediction.

Parameters:
  • X (array-like of shape (n_samples, n_features))

  • Y (array-like of shape (n_samples, n_targets)) – True target values.

  • sample_weight (array-like of shape (n_samples,), optional)

Returns:

score – R² of self.predict(X) w.r.t. Y.

Return type:

float

predict(X)[source]#

Predict Y for new observations.

Returns just the predicted y_hat so the call satisfies the scikit-learn RegressorMixin contract (and therefore composes inside Pipeline, cross_val_score(), and GridSearchCV). For the rich diagnostic view (scores, Hotelling’s T², SPE, plus y_hat), see diagnose().

Parameters:

X (array-like of shape (n_samples, n_features))

Returns:

y_hat – Predicted target values, indexed by X’s rows and labelled with the target column names captured during fit.

Return type:

pd.DataFrame of shape (n_samples, n_targets)

See also

diagnose

richer per-prediction diagnostics.

Examples

>>> y_pred = pls.predict(X_new)
>>> diag = pls.diagnose(X_new)   # for scores / T² / SPE
diagnose(X)[source]#

Project new data and compute predictions plus diagnostics.

This is the rich view that predict() used to return before 1.35.0: alongside y_hat it reports the X scores, cumulative Hotelling’s T², and SPE for every row of X so the user can flag out-of-model observations and read their predicted Y from one call.

Parameters:

X (array-like of shape (n_samples, n_features))

Returns:

result – With keys scores, hotellings_t2, spe, y_hat.

Return type:

sklearn.utils.Bunch

See also

predict

sklearn-compatible call returning just y_hat.

Examples

>>> result = pls.diagnose(scaler_x.transform(X_new))
>>> result.y_hat           # Predicted Y values
>>> result.spe             # SPE for each new observation
>>> result.hotellings_t2   # T² for each new observation
classmethod select_n_components(X, Y, *, max_components=None, cv=5, n_repeats=None, random_state=None, selection_rule='1se', scale_inside_folds=True, min_q2_increase=0.01, n_permutations=999, alpha=0.01, stability_threshold=0.6, **pls_kwargs)[source]#

Select the number of PLS components via cross-validation.

Fits PLS models on cross-validation training folds and evaluates the out-of-fold prediction error for every component count 1, 2, ..., max_components. Reports per-fold and pooled RMSECV plus the validated cumulative R² curves, and recommends a component count from one of three rules (see selection_rule below).

The defaults are the research-backed combination: the one-standard-error rule on top of repeated, shuffled K-fold CV, with MCUVScaler re-fit inside every training fold so test data never leaks into the centring/scaling estimates.

Unlike the calibration statistics stored on a fitted model (rmse_, r2_cumulative_), the metrics returned here estimate performance on unseen data and are therefore suitable for choosing n_components.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training X. With the default scale_inside_folds=True the raw, unscaled X may be passed; scaling is fit inside every training fold.

  • Y (array-like of shape (n_samples, n_targets)) – Training Y. Same treatment as X under scale_inside_folds.

  • max_components (int, optional) – Maximum number of components to evaluate. Default is the largest value supported by every cross-validation training fold, min(min_fold_size, n_features).

  • cv (int or sklearn CV splitter, default 5) – If an integer, used as the n_splits of a shuffled KFold (or RepeatedKFold when n_repeats > 1). Any sklearn splitter object (e.g. KFold(10, shuffle=True) or LeaveOneOut()) is also accepted and is used as-is (n_repeats is then ignored).

  • n_repeats (int, optional) – Number of times the K-fold split is repeated with a fresh shuffle, used only when cv is an integer. Default 10 (giving a cv * 10 per-fold sample for the 1-SE rule); pass 1 to disable repeats. Repeated K-fold’s standard errors are slightly optimistic because test folds overlap across repeats; that is fine for the 1-SE selection rule but should not be reported as an unbiased generalisation variance.

  • random_state (int, optional) – Seed forwarded to KFold / RepeatedKFold for reproducible shuffling. Ignored when cv is a pre-built splitter.

  • selection_rule ({"1se", "min", "q2_increment", "randomization"}, default "1se") – How the recommended component count is chosen. See SelectionRule for the rule semantics. "1se" is the default; "min" is the argmin RMSECV (the pre-1.28 default, prone to running to the maximum component count); "q2_increment" is the Wold’s-R-style cumulative-Q² threshold; "randomization" is Van der Voet’s (1994) permutation test (uses n_permutations and alpha) that picks the smallest model whose predictive ability is statistically indistinguishable from the reference (argmin RMSECV) one.

  • scale_inside_folds (bool, default True) – When True (the default), fit a fresh MCUVScaler on each training fold’s X and Y, apply it to the held-out rows, fit PLS in scaled space, then inverse-transform the predictions so RMSECV is reported on the original Y scale. This removes the centring / scaling leakage of the prior default. Set to False to keep the pre-1.28 behaviour, in which case X and Y should already be scaled; a SpecificationWarning is emitted.

  • min_q2_increase (float, default 0.01) – Threshold used only when selection_rule="q2_increment": the smallest increase in cumulative validated \(Q^2_Y\) that justifies keeping an extra component.

  • n_permutations (int, default 999) – Used only when selection_rule="randomization": number of sign-flip permutations driving the Van der Voet test.

  • alpha (float, default 0.01) – Used only when selection_rule="randomization": significance level. The smallest component count whose Van der Voet p-value exceeds alpha is recommended. R’s pls::selectNcomp uses the same default; smaller values pick more parsimonious models.

  • stability_threshold (float, default 0.6) – For the per-repeat stability-selection diagnostic ("1se" / "min" rules with n_repeats > 1 only): the recommendation is judged selection_is_stable=True iff the modal vote share in selection_distribution is at least this fraction. Meinshausen & Bühlmann (2010, JRSS-B) suggest 0.6-0.9 for their variable-selection analogue; we default to the permissive end.

  • **pls_kwargs – Additional keyword arguments passed to the PLS() constructor (e.g. missing_data_settings).

Returns:

result – With keys:

  • n_components - recommended number of components (int).

  • rmsecv - pooled RMSECV per component count (pd.DataFrame, indexed 1..A; columns are the Y-variable names plus "total").

  • per_fold_rmsecv - per-fold total RMSECV (pd.DataFrame, indexed 1..A; one column per fold across all repeats). Drives the 1-SE rule.

  • se_rmsecv - standard error of the per-fold RMSECV per component count (pd.Series, indexed 1..A).

  • q2_se - standard error on the Q2 scale (the per-fold total PRESS standard error divided by the total Y sum-of-squares), i.e. the half-width of a +/-1 SE band around r2y_validated["total"] (pd.Series, indexed 1..A).

  • r2y_validated - validated cumulative \(R^2_Y\) (pd.DataFrame, same shape as rmsecv).

  • r2x_validated - validated cumulative \(R^2_X\) (pd.DataFrame, indexed 1..A; columns are the X-variable names plus "total").

  • press - pooled Y prediction error sum of squares per component count (pd.Series, indexed 1..A).

  • cv_predictions - out-of-fold predictions of Y at the recommended component count, on the original Y scale (pd.DataFrame). For repeated K-fold, the first repeat’s held-out predictions are reported so each row appears exactly once.

  • selection_rule - the rule used to pick n_components.

  • randomization_pvalues - per-component Van der Voet right-tail p-values when selection_rule="randomization"; None otherwise.

  • selection_distribution - per-repeat vote share over candidate component counts (pd.Series indexed 1..A). Populated only for selection_rule in {"1se", "min"} and n_repeats > 1; None otherwise. A concentrated distribution signals a confident recommendation; a flat or multi-modal one flags it for review.

  • selection_mode - the most-voted component count, or None when selection_distribution is.

  • selection_is_stable - True iff the modal vote share meets stability_threshold; None when no distribution was computed.

Return type:

sklearn.utils.Bunch

Notes

The pooled RMSECV in rmsecv["total"] is the square root of the total PRESS over all fold-test rows divided by (N_eff * M) where N_eff = N * n_repeats under repeated CV; the per_fold_rmsecv column for fold f is the square root of fold-f’s sum-of-squared residuals over its own test rows.

References

Breiman, Friedman, Olshen & Stone (1984), CART, sec.3.4.3 (1-SE rule). Hastie, Tibshirani & Friedman, ESL, sec.7.10. Kohavi (1995, IJCAI) recommends 10-fold stratified CV for model selection.

Examples

>>> from sklearn.model_selection import KFold
>>> # Default: 1-SE on 10 x 5-fold repeated CV with in-fold scaling.
>>> result = PLS.select_n_components(X, Y, max_components=6, random_state=0)
>>> result.n_components, result.selection_rule
>>> # Opt-in to the older argmin-RMSECV rule:
>>> PLS.select_n_components(X, Y, max_components=6, selection_rule="min")
>>> # Caller-supplied splitter (n_repeats is ignored here):
>>> PLS.select_n_components(X, Y, cv=KFold(10, shuffle=True, random_state=0))
classmethod nested_cv(X, Y, *, max_components=None, outer_cv=5, inner_cv=5, n_inner_repeats=10, selection_rule='1se', scale_inside_folds=True, min_q2_increase=0.01, n_permutations=999, alpha=0.01, random_state=None, **pls_kwargs)[source]#

Nested cross-validation for an honest PLS performance estimate.

Outer loop splits the data into outer-train / outer-test; the inner loop runs select_n_components() on the outer-train (with the configured selection_rule over inner_cv * n_inner_repeats folds) to pick the component count; a final PLS is fit on the outer-train at that count and used to predict the outer-test. The accumulated out-of-fold predictions give RMSEP that is not optimism-biased by the selection decision - the headline number to report when a clean test set is not available.

Parameters:
  • X (array-like) – Training data. Treated as in select_n_components() (raw if scale_inside_folds=True, pre-scaled otherwise).

  • Y (array-like) – Training data. Treated as in select_n_components() (raw if scale_inside_folds=True, pre-scaled otherwise).

  • max_components (int, optional) – Forwarded to the inner select_n_components().

  • outer_cv (int or sklearn splitter, default 5) – Number of outer folds (or a custom splitter).

  • inner_cv (int, default 5) – Number of inner folds passed to select_n_components().

  • n_inner_repeats (int, default 10) – Number of inner-CV repeats per outer fold; the inner random_state is offset by the outer-fold index so each outer fold sees a fresh inner shuffle.

  • selection_rule (str, default "1se") – Selection rule applied inside the inner loop. See SelectionRule.

  • scale_inside_folds (bool, default True) – Mirrors select_n_components(). Also applied to the final outer-train fit, with the test-fold predictions inverse- transformed to the original Y scale before RMSEP accumulates.

  • min_q2_increase (float) – Forwarded to the inner select_n_components() per rule.

  • n_permutations (int) – Forwarded to the inner select_n_components() per rule.

  • alpha (float) – Forwarded to the inner select_n_components() per rule.

  • random_state (int, optional) – Seed for the outer-fold shuffle and the inner CV. The inner seed is offset per outer fold so each outer split sees a fresh shuffled inner CV.

  • **pls_kwargs – Forwarded to PLS for both the inner CV and the final outer-train fits.

Returns:

result – With keys:

  • rmsep - honest held-out RMSEP per Y column plus a "total" entry (pd.Series).

  • q2y - validated \(Q^2_Y\) per Y column plus "total" (pd.Series).

  • cv_predictions - out-of-fold predictions of Y at the per-outer-fold selected component counts (pd.DataFrame on the original Y scale).

  • selected_components_per_fold - list of inner recommendations, one per outer fold.

  • selected_components_distribution - vote share over candidate counts (pd.Series).

Return type:

sklearn.utils.Bunch

Notes

Runtime is roughly outer_cv * inner_cv * n_inner_repeats * max_components PLS fits. With the defaults that is 5 * 5 * 10 * max_components fits per call; for max_components=10 and a moderate dataset that completes in seconds. Drop n_inner_repeats if you need to bring it down further.

Examples

>>> from process_improve.multivariate import PLS
>>> result = PLS.nested_cv(X, Y, max_components=8, random_state=0)
>>> result.rmsep["total"]
score_contributions(t_start, t_end=None, components=None, *, weighted=False)[source]#

Contribution of each X variable to a score-space movement.

Identical to PCA.score_contributions but uses the PLS X-loadings (P matrix) for back-projection.

Parameters:
  • t_start (array-like of shape (n_components,)) – Score vector of the observation of interest.

  • t_end (array-like of shape (n_components,), optional) – Reference point in score space. Default is the model center (zeros).

  • components (list of int, optional) – 1-based component indices. Default: all components.

  • weighted (bool, default False) – If True, scale by 1/sqrt(explained_variance) for T² contributions.

Returns:

contributions

Return type:

pd.Series of shape (n_features,)

Examples

>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled)
>>> contrib = pls.score_contributions(pls.scores_.iloc[0].values)
>>> contrib.abs().sort_values(ascending=False).head()  # top contributors

See also

observation_contributions

The per-observation counterpart. Despite the similar name, the two answer different questions and are not interchangeable. score_contributions is per-variable: it decomposes a single observation’s movement in score space back onto the original X variables, answering “which variables explain why this observation sits where it does?”. It returns one signed value per variable. observation_contributions is per-observation: it reports each observation’s share of a component’s total inertia, answering “which observations most strongly shape this component?”. It returns a non-negative sample-by-component table whose columns each sum to 1. The two are orthogonal views of the same score matrix - one decomposes across variables, the other across observations.

detect_outliers(conf_level=0.95)[source]#

Detect outlier observations using SPE and Hotelling’s T² diagnostics.

Same approach as PCA.detect_outliers: combines statistical limits with the robust generalized ESD test.

Parameters:

conf_level (float, default 0.95) – Confidence level in [0.8, 0.999].

Returns:

outliers – Sorted from most severe to least. Each dict contains observation, outlier_types, spe, hotellings_t2, spe_limit, hotellings_t2_limit, severity.

Return type:

list of dict

Examples

>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled)
>>> outliers = pls.detect_outliers(conf_level=0.95)
>>> for o in outliers:
...     print(f"{o['observation']}: {o['outlier_types']}")
cross_validate(X, Y, *, cv='loo', n_bootstrap=0, conf_level=0.95, random_state=None, show_progress=True, sample_weight=None)[source]#

Cross-validate the PLS model and compute error bars for beta coefficients.

Refits the model on data subsets (jackknife, K-fold, or bootstrap), collects beta_coefficients_ from each refit, and computes confidence intervals. Also returns cross-validated predictions and prediction-error metrics (RMSE, Q²).

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Predictor matrix (same data used for fit).

  • Y (array-like of shape (n_samples, n_targets)) – Response matrix (same data used for fit).

  • cv (int or "loo", default "loo") –

    Cross-validation strategy:

    • "loo" - leave-one-out (jackknife). Produces N resamples.

    • int - number of folds for K-fold CV.

  • n_bootstrap (int, default 0) – If > 0, use bootstrap resampling instead of CV folds. The value specifies the number of bootstrap rounds. Overrides the cv parameter when set.

  • conf_level (float, default 0.95) – Confidence level for the beta-coefficient intervals, in (0, 1).

  • random_state (int or None, default None) – Random seed for reproducibility (K-fold shuffle and bootstrap).

  • show_progress (bool, default True) – Whether to display a tqdm progress bar.

  • sample_weight (ndarray | None)

Returns:

result – Dictionary-like object with the following keys:

Beta-coefficient uncertainty

beta_samplesnp.ndarray of shape (n_resamples, n_features, n_targets)

Raw beta coefficients from every resample.

beta_meanpd.DataFrame of shape (n_features, n_targets)

Mean beta across resamples.

beta_stdpd.DataFrame of shape (n_features, n_targets)

Standard error of the beta coefficients.

beta_ci_lowerpd.DataFrame of shape (n_features, n_targets)

Lower bound of the confidence interval.

beta_ci_upperpd.DataFrame of shape (n_features, n_targets)

Upper bound of the confidence interval.

significantpd.DataFrame of shape (n_features, n_targets)

True where the confidence interval excludes zero.

Prediction metrics

y_hat_cvpd.DataFrame of shape (n_samples, n_targets)

Cross-validated predictions (out-of-fold). Only available for jackknife and K-fold; None for bootstrap.

pressfloat

Prediction Error Sum of Squares (sum over all Y elements). Only for jackknife / K-fold.

rmse_cvpd.Series of length n_targets

Root-mean-square error per Y variable (cross-validated). Only for jackknife / K-fold.

q_squaredpd.Series of length n_targets

Cross-validated R² (Q²) per Y variable. Only for jackknife / K-fold.

Metadata

n_resamplesint

Number of resamples performed.

methodstr

"jackknife", "kfold", or "bootstrap".

conf_levelfloat

The confidence level used.

Return type:

Bunch

Examples

>>> from process_improve.multivariate import PLS, MCUVScaler
>>> scaler_x = MCUVScaler().fit(X)
>>> scaler_y = MCUVScaler().fit(Y)
>>> X_s, Y_s = scaler_x.transform(X), scaler_y.transform(Y)
>>> pls = PLS(n_components=2).fit(X_s, Y_s)
>>> cv_results = pls.cross_validate(X_s, Y_s, cv="loo")
>>> cv_results.beta_mean          # mean beta across LOO resamples
>>> cv_results.significant        # which betas are significantly != 0
>>> cv_results.q_squared          # cross-validated R²
prediction_interval(X, *, conf_level=0.95, cv_result=None)[source]#

Prediction interval for the Y predictions of new observations.

The interval combines the residual error variance with the leverage of each new observation in the latent-variable space. For a new observation the prediction-interval half-width on target m is

t * s_E[m] * sqrt(1 + 1/N + T2_new / (N - 1))

where s_E is the residual error standard deviation, T2_new is the Hotelling’s T² of the new observation, N is the number of calibration samples, and t is the Student-t quantile.

Parameters:
  • X (array-like of shape (n_new, n_features)) – New observations, pre-processed the same way as the training data.

  • conf_level (float, default=0.95) – Confidence level for the interval, in (0.5, 1.0).

  • cv_result (sklearn.utils.Bunch or None, default=None) – The result of cross_validate(). When supplied, its cross-validated RMSE (rmse_cv) is used for the error variance, which is preferable to the optimistic calibration RMSE used otherwise.

Returns:

With keys y_hat (point predictions), lower and upper (prediction-interval bounds) - each a DataFrame of shape (n_new, n_targets) - and conf_level.

Return type:

sklearn.utils.Bunch

set_fit_request(*, sample_weight='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

  • self (PLS)

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (PLS)

Returns:

self – The updated object.

Return type:

object

TPLS#

class process_improve.multivariate.methods.TPLS(n_components, d_matrix, max_iter=500, skip_f_matrix_preprocessing=False)[source]#

Bases: RegressorMixin, BaseEstimator

TPLS algorithm for T-shaped data structures (we also include standard pre-processing of the data inside this class).

Source: Garcia-Munoz, https://doi.org/10.1016/j.chemolab.2014.02.006, Chem.Intell.Lab.Sys. v133, p 49 to 62, 2014.

We change the notation from the original paper to avoid confusion with a generic “X” matrix, and match symbols that are more natural for our use.

Notation mapping (paper → this code):

  • X^T → D: d_matrix (external), d_mats (internal) - Database of properties

  • X → D^T: transposed D (not used directly)

  • R → F: f_mats - Formula matrices

  • Z → Z: z_mats - Process conditions

  • Y → Y: y_mats - Quality indicators

Notes 1. Matrices in F, Z and Y must all have the same number of rows. 2. Columns in F must be the same as the rows in D. 3. Conditions in Z may be missing (turning it into an L-shaped data structure).

Parameters:
  • n_components (int) – A parameter used to specify the number of components.

  • d_matrix (dict[str, pd.DataFrame]) – A dictionary containing the properties of each group of materials. Keys are group names; values are DataFrames with properties as columns and materials as rows. This “D” matrix is provided once at construction and reused for fitting, prediction and cross-validation.

  • max_iter (int, optional) – The maximum number of iterations for the TPLS algorithm. Default is 500.

  • skip_f_matrix_preprocessing (bool, optional) – If True, the F (formula) matrices are used as-is, skipping the internal centering and scaling of the F block. Default is False.

Notes

The input X passed to fit() and predict() is a dictionary with 3 keys:

  • F: Formula matrices (rows = blends, columns = materials). F = {"Group A": df_formulas_a, "Group B": df_formulas_b, ...}

  • Z: Process conditions - one row per blend, one column per condition.

  • Y: Product quality indicators - one row per blend, one column per indicator.

The D matrix (database of material properties) is supplied once at construction via the d_matrix argument; it is not part of X.

n_samples#

The number of samples (rows) in the training data

Type:

int

n_substances#

The number of substances (columns) in the training data, i.e. the number of materials in the F matrix.

Type:

int

Example

>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng()
>>>
>>> n_props_a, n_props_b = 6, 4            # Two groups of properties: A and B.
>>> n_materials_a, n_materials_b = 12, 8   # Number of materials in each group.
>>> n_formulas = 40                        # Number of formulas in matrix F.
>>> n_outputs = 3
>>> n_conditions = 2
>>>
>>> properties = {
>>>     "Group A": pd.DataFrame(rng.standard_normal((n_materials_a, n_props_a))),
>>>     "Group B": pd.DataFrame(rng.standard_normal((n_materials_b, n_props_b))),
>>> }
>>> formulas = {
>>>     "Group A": pd.DataFrame(rng.standard_normal((n_formulas, n_materials_a))),
>>>     "Group B": pd.DataFrame(rng.standard_normal((n_formulas, n_materials_b))),
>>> }
>>> process_conditions = {"Conditions": pd.DataFrame(rng.standard_normal((n_formulas, n_conditions)))}
>>> quality_indicators = {"Quality":    pd.DataFrame(rng.standard_normal((n_formulas, n_outputs)))}
>>> all_data = {"Z": process_conditions, "F": formulas, "Y": quality_indicators}
>>> estimator = TPLS(n_components=4, d_matrix=properties)
>>> estimator.fit(all_data)
hotellings_t2_limit(conf_level=0.95)[source]#

Hotelling’s T2 limit at the given confidence level (see hotellings_t2_limit()).

Parameters:

conf_level (float)

Return type:

float

ellipse_coordinates(score_horiz, score_vert, conf_level=0.95, n_points=100)[source]#

Coordinates of the T2 confidence ellipse (see ellipse_coordinates()).

Parameters:
  • score_horiz (int)

  • score_vert (int)

  • conf_level (float)

  • n_points (int)

Return type:

tuple[ndarray, ndarray]

fit(X, y=None)[source]#

Fit the preprocessing parameters and also the latent variable model from the training data.

Parameters:
  • X ({dictionary of dataframes}, keys that must be present: "F", "Z", and "Y") – The training input samples. See documentation in the class definition for more information on each matrix.

  • y (None)

Returns:

self – Returns self.

Return type:

object

predict(X)[source]#

Forward to diagnose(); emits a DeprecationWarning.

Deprecated since version 1.38.4: Use TPLS.diagnose() instead. predict matches the sklearn-convention name (regression-style ndarray return), but TPLS’s natural return is the rich per-group / per-block diagnostics Bunch and TPLS cannot be placed in a standard sklearn Pipeline anyway (its input is a nested DataFrameDict). The rename aligns with PLS.diagnose() and PCA.diagnose(). Will be removed in 2.0.0.

Parameters:

X (DataFrameDict)

Return type:

Bunch

diagnose(X)[source]#

Model inference on new data.

This will pre-process the new data and apply those subsequently to the latent variable model.

Example

# Training phase: estimator = TPLS(n_components=2).fit(training_data)

# Testing/inference phase: new_data = {“Z”: …, “F”: …} # you need at least the F block for a new prediction. “Z” is optional. predictions = estimator.predict(new_data)

Parameters:

X (DataFrameDict) – The input samples.

Returns:

A bunch with the following fields:

  • hat (dict[str, DataFrame]) : predicted Y per Y-group, on the original (un-scaled) scale, indexed by the observation names of X.

  • t_scores_super (DataFrame, shape (n_new, n_components)) : super-scores for the new observations.

  • spe (dict[str, dict[str, DataFrame]]) : per-component squared prediction error for the "Z" and "F" blocks, keyed first by block name and then by group; each inner DataFrame has shape (n_new, n_components).

  • hotellings_t2 (DataFrame, shape (n_new, n_components)) : cumulative Hotelling’s T² per new observation after each component.

Return type:

sklearn.utils.Bunch

display_results(show_cumulative_stats=True)[source]#

Display the results of the model fitting.

Parameters:

show_cumulative_stats (bool)

Return type:

str

score(X, y=None, sample_weight=None)[source]#

Return r2_score` on test data.

See RegressorMixin.score for more details.

Parameters:
  • X (DataFrameDict) – Test samples.

  • y (Not used. In the X input, there is a already a “Y” block. This will be the Y-data.)

  • sample_weight (Not used.)

Returns:

score\(R^2\) of self.diagnose(X).

Return type:

float

help()[source]#

Help for the TPLS Estimator.

Data organization#

Quick tips#

Build model: tpls = TPLS(n_components=2, d_matrix=d_matrix).fit(X) Get model’s predictions: tpls.hat <– the hat-matrix, i.e., the predictions Predict on new data: tpls.predict(X_new) See model summary: tpls.display_results() This help page: tpls.help()

Statistical values#

.t_scores_super Super scores for the entire model [pd.DataFrame] .hotellings_t2 Hotelling’s T2 values for each observation, per component [pd.DataFrame] .spe Squared prediction error for each block [dict of pd.DataFrames]

.hotellings_t2_limit() Returns the Hotelling’s T2 limit for the model [float] .spe_limit[block]() Return the SPE limit for the block; e.g. .spe_limit[“Y”]() [float]

.vip() Variable importance (VIP) for the D- and F-blocks [dict]

Return type:

str

vip(block=None, method='vip')[source]#

Return Variable Importance in Projection (VIP) scores for TPLS blocks.

VIP scores are computed during fitting for the D-block (material properties) and F-block (formulation variables) and stored in feature_importance.

Parameters:
  • block (str or None, default=None) – Which block to return. Must be "D" or "F", or None to return all blocks.

  • method ({"vip", "deflated"}, default="vip") –

    How importance is measured.

    • "vip" (default): standard VIP on the raw block loadings, as reported by Garcia-Munoz (2014). These are the values stored in feature_importance.

    • "deflated": VIP computed on the direct (rotated) weights that map the original variables onto the scores while accounting for the deflation across components, S(V^T S)^-1 for the D-block and P(P^T P)^-1 for the F-block. This is an alternative importance measure; it does not change feature_importance.

Returns:

If block is None: {"D": {group: pd.Series, ...}, "F": {group: pd.Series, ...}}. If block is "D" or "F": the inner dict {group: pd.Series, ...} for that block, where each pd.Series is indexed by feature names.

Return type:

dict

Raises:

ValueError – If the model is not fitted, block is not "D", "F", or None, or method is not "vip" or "deflated".

Examples

>>> tpls = TPLS(...).fit(data)
>>> tpls.vip()                      # all blocks, standard VIP
>>> tpls.vip("D")                   # D-block only → {group_name: pd.Series, ...}
>>> tpls.vip("D", method="deflated")  # D-block deflated direct-weights importance
property d_block_scaling_: dict[str, float]#

Block-scaling factor applied to each D-block (read-only).

After column-wise centring and auto-scaling, every D-block X_i is additionally divided by sqrt(P_i * M_i) (P_i = number of lots/rows, M_i = number of properties/columns) so that trace(X_i^T X_i) ~= 1, removing bias toward blocks with more lots or properties (Garcia-Munoz, 2014, section 2.1).

Returns:

Mapping of D-block group name to its scalar block-scaling factor.

Return type:

dict[str, float]

Raises:

AttributeError – If the model has not been fitted yet.

set_score_request(*, sample_weight='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (TPLS)

Returns:

self – The updated object.

Return type:

object

MBPLS#

Multi-block PLS in the hierarchical / superblock formulation of Westerhuis, Kourti & MacGregor (1998). Each X-block is preprocessed independently and weighted by 1/sqrt(K_b) before the inner NIPALS loop, so blocks of unequal width contribute fairly to the consensus super-score.

class process_improve.multivariate.methods.MBPLS(n_components, *, max_iter=500, tol=None, algorithm='auto', missing_data_settings=None)[source]#

Bases: _HotellingsT2LimitMixin, RegressorMixin, BaseEstimator

Multi-block PLS (hierarchical / superblock formulation).

Generic multi-block PLS as described by Westerhuis, Kourti & MacGregor (1998) and Westerhuis & Smilde (2001). Each X-block is preprocessed independently (mean-centred and unit-variance scaled), then divided by sqrt(K_b) so that blocks of unequal width contribute fairly to the super-score.

Parameters:
  • n_components (int) – Number of latent variables to extract.

  • max_iter (int, default=500) – Maximum NIPALS iterations per latent variable.

  • tol (float or None, default=None) – Convergence tolerance on the change in the Y-block score. If None, np.finfo(float).eps ** (6/7) is used (matching the legacy multi-block reference implementation).

  • algorithm (str) –

    Algorithm to use for fitting the model.

    • "auto": dense vectorised hierarchical NIPALS when every block (X and Y) is complete; mask-aware NIPALS when any block contains missing values.

    • "dense": dense vectorised hierarchical NIPALS. Raises if any block contains missing values.

    • "nipals": mask-aware hierarchical NIPALS. Always uses the NaN-tolerant inner-loop primitives, even when the data is complete (slower than "dense" but produces equivalent results).

  • missing_data_settings (dict or None, default=None) – Settings for the iterative "nipals" path. Keys: md_tol (convergence tolerance on the score-vector change between iterations), md_max_iter (maximum NIPALS iterations per component). Defaults to {"md_tol": epsqrt, "md_max_iter": 1000}.

  • fitting) (Attributes (after)

  • --------------------------

  • block_names (list[str]) – Ordered list of X-block names (the keys of the input dict).

  • block_widths (dict[str, int]) – Number of variables in each X-block.

  • super_scores (pd.DataFrame, shape (n_samples, n_components)) – Super-block (consensus) X-scores T.

  • super_y_scores (pd.DataFrame, shape (n_samples, n_components)) – Super-block Y-scores U.

  • super_weights (pd.DataFrame, shape (n_blocks, n_components)) – Super-block weights w_super; rows indexed by block name.

  • super_y_loadings (pd.DataFrame, shape (n_targets, n_components)) – Y-block loadings c.

  • block_scores (dict[str, pd.DataFrame]) – Per-block X-scores t_b, each shape (n_samples, n_components).

  • block_weights (dict[str, pd.DataFrame]) – Per-block X-weights w_b, each shape (K_b, n_components). Each column has unit norm.

  • block_loadings (dict[str, pd.DataFrame]) – Per-block X-loadings p_b (used for deflation), each shape (K_b, n_components).

  • predictions (pd.DataFrame, shape (n_samples, n_targets)) – In-sample Y predictions on the original scale.

  • explained_variance (np.ndarray, shape (n_components,)) – Variance of the super-score per component (ddof=1).

  • scaling_factor_for_super_scores (pd.Series) – sqrt(explained_variance_) per component.

  • fitting_info (dict) – Per-component iteration count and timing.

  • has_missing_data (bool) – Whether any X-block or Y had NaN values.

  • algorithm – The resolved algorithm actually used for the fit. With algorithm="auto", this is "dense" for complete data and "nipals" for NaN-containing data.

Notes

Block weighting uses the convention \(X_b / \sqrt{K_b}\) so that every block contributes the same total sum of squares to the super-score, regardless of how many variables it has.

Missing data#

When any X-block or Y contains NaN entries, the "auto" algorithm routes to a mask-aware NIPALS variant. The X-block weights, block scores, block loadings used for deflation, Y-block loadings and Y-block scores are each computed as a regression that uses only the observed entries; the masked sum-of-squares is used as the denominator so missing values neither bias the latent direction nor contribute to the score. The mask is preserved across components automatically because deflation propagates NaN through subtraction. This is the standard skip-NaN NIPALS update; see Walczak & Massart (2001) and Arteaga & Ferrer (2002).

The fit refuses to run if any X-block or Y has a column with all entries missing, or a row with all entries missing for that block; either case leaves the masked denominator at zero. Drop or impute such rows or columns before fitting. Predict-time score estimation for new observations with NaN (Trimmed Score Regression / Projection to the Model Plane) is a separate follow-up.

References

Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12 (1998), 301-321.

Westerhuis, J. A. & Smilde, A. K. Deflation in multiblock PLS. Journal of Chemometrics, 15 (2001), 485-493.

Walczak, B. & Massart, D. L. Dealing with missing data: Part I. Chemom. Intell. Lab. Syst., 58 (2001), 15-27.

Arteaga, F. & Ferrer, A. Dealing with missing data in MSPC: several methods, different interpretations, some examples. J. Chemometrics, 16 (2002), 408-418.

fit(X, y)[source]#

Fit the multi-block PLS model.

Parameters:
  • X (dict[str, pd.DataFrame]) – X-blocks. Keys are block names; values are DataFrames sharing the same row index (and row count). Each block is preprocessed independently.

  • y (pd.DataFrame) – Y-block. Same row index / row count as the X-blocks.

Return type:

MBPLS

block_spe_limit(block, conf_level=0.95)[source]#

SPE limit for one X-block using the Nomikos & MacGregor chi-square approximation.

Operates on the same scale as block_spe_[block] (sqrt of row sum of squares), so the value can be drawn directly on a SPE plot.

Parameters:
Return type:

float

super_spe_limit(conf_level=0.95)[source]#

SPE limit for the merged super-block (sum of per-block SPE squared).

Parameters:

conf_level (float)

Return type:

float

spe_contributions(X)[source]#

Per-variable squared residuals for each X-block (SPE contributions).

For each new observation and each X-block, reconstruct the block as T_super @ P_b^T (matching the deflation step used during fit) and return the squared per-variable residuals. Useful for fault diagnosis: the variable with the largest contribution is the most likely culprit for a high SPE.

Returns:

One DataFrame per block, shape (n_samples, K_b). Values are preprocessed-scale squared residuals (centred and scaled inside the model). Sum across columns equals block_spe_[b].iloc[:, -1] ** 2.

Return type:

dict[str, pd.DataFrame]

Parameters:

X (dict[str, DataFrame])

score_contributions(t_super_start, t_super_end=None, components=None, *, weighted=False)[source]#

Per-block per-variable contributions to a super-score movement.

The multi-block analogue of PCA.score_contributions(). Decomposes a super-score-space delta back into preprocessed-scale variable-space deltas, one per X-block.

For MBPLS, the super-score at component a is built from the per-block scores (which themselves are weighted regressions of the block on w_b), so the back-projection through w_super[b,a] * w_b[:,a] / sqrt(K_b) gives the variable contribution.

Parameters:
  • t_super_start (array-like, shape (n_components,)) – Super-score row of the observation of interest. Typically a row from self.super_scores_ or from predict(X_new).super_scores.

  • t_super_end (array-like, optional) – Reference point in super-score space. Defaults to the model centre (zeros).

  • components (list of int, optional) – 1-based component indices to decompose over. None (default) uses all components - appropriate for Hotelling’s T² contributions.

  • weighted (bool, default=False) – If True, divide the super-score delta by sqrt(explained_variance_) per component before back-projecting, giving contributions to the T² statistic instead of the Euclidean super-score distance.

Returns:

One Series per X-block (length K_b), indexed by variable (column) name.

Return type:

dict[str, pd.Series]

super_score_plot(pc_horiz=1, pc_vert=2)[source]#

Scatter plot of super-scores for two components.

Parameters:
  • pc_horiz (int)

  • pc_vert (int)

Return type:

Figure

super_weights_bar_plot(component=1)[source]#

Bar plot of super-weights w_super for a single component.

Parameters:

component (int)

Return type:

Figure

predictions_vs_observed_plot(y_observed, variable=None)[source]#

Scatter plot of predicted vs observed Y, with y=x reference and RMSEE annotation.

Parameters:
  • y_observed (pd.DataFrame) – The observed Y on the original scale, same columns as the training Y.

  • variable (str or None, default=None) – If given, plot only that Y-variable. If None, plot the first one.

Return type:

Figure

display_results(show_cumulative=True)[source]#

Format a short text summary of per-block R²X, overall R²Y, iterations and timing.

Parameters:

show_cumulative (bool)

Return type:

str

transform(X)[source]#

Project new data to super-scores using the fitted model.

Parameters:

X (dict[str, DataFrame])

Return type:

DataFrame

diagnose(X)[source]#

Project new data and return the full diagnostics Bunch.

Returns a sklearn.utils.Bunch with fields super_scores (DataFrame, n_samples x n_components), block_scores (dict[str, DataFrame]), predictions (DataFrame on original Y scale), block_spe (dict[str, Series], per-block SPE of the new observations) and hotellings_t2 (Series of cumulative Hotelling’s T² over all components, per new observation).

The rename (since 1.38.4, #395) matches PLS.diagnose() and PCA.diagnose; predict() is kept as a deprecation shim.

Parameters:

X (dict[str, DataFrame])

Return type:

Bunch

predict(X)[source]#

Forward to diagnose(); emits a DeprecationWarning.

Deprecated since version 1.38.4: Use MBPLS.diagnose() instead. The sklearn-convention predict name suggests a regression-style ndarray return, but the historical return is the rich diagnostics Bunch. The rename aligns with PLS.diagnose() and PCA.diagnose() and frees the name for a future contract that returns just the predictions field. Will be removed in 2.0.0.

Parameters:

X (dict[str, DataFrame])

Return type:

Bunch

set_score_request(*, sample_weight='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (MBPLS)

Returns:

self – The updated object.

Return type:

object

process_improve.multivariate.methods.randomization_test_mbpls(model, X, y, n_permutations=200, *, seed=None)[source]#

Randomization (permutation) test for component significance in MBPLS.

For each component a, the null hypothesis is “there is no real relationship between X and Y at this component”; the test permutes the rows of y, refits a fresh MBPLS with the same number of components, and recomputes the test statistic. The risk is the fraction of permutations whose statistic equals or exceeds the original model’s.

Statistic: per-component absolute correlation between the super X-score and the super Y-score, |t_super(:,a)' u_super(:,a)| / (||t|| * ||u||), matching the legacy ConnectMV randomization-objective for PLS.

Parameters:
  • model (MBPLS) – A fitted MBPLS model.

  • X (dict[str, DataFrame], DataFrame) – The same training data used to fit model.

  • y (dict[str, DataFrame], DataFrame) – The same training data used to fit model.

  • n_permutations (int, default=200) – Number of Y-row permutations to evaluate.

  • seed (int or None, default=None) – Seed for the permutation RNG (None uses non-reproducible randomness).

Returns:

Indexed by component 1..A with columns:

  • observed : the actual model’s per-component statistic.

  • risk_pct : fraction (in %) of permutations with statistic >= observed. Low values (e.g. < 5%) suggest the component is significant; values near 50% suggest the component is no better than chance.

Return type:

pd.DataFrame

References

Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S. & Faber, K. A randomization test for PLS component selection. J. Chemometrics, 21 (2007), 427-439.

MBPCA#

Multi-block PCA / consensus-PCA. Same dict-of-DataFrames API as MBPLS; no Y-block.

class process_improve.multivariate.methods.MBPCA(n_components, *, max_iter=500, tol=None, algorithm='auto', missing_data_settings=None)[source]#

Bases: _HotellingsT2LimitMixin, TransformerMixin, BaseEstimator

Multi-block PCA (hierarchical / consensus PCA).

Generic multi-block PCA following the consensus-PCA / hierarchical PCA formulation of Westerhuis, Kourti & MacGregor (1998). Each X-block is preprocessed independently (mean-centred and unit-variance scaled), then divided by sqrt(K_b) so blocks of unequal width contribute fairly to the consensus super-score.

The hierarchical NIPALS loop alternates: (i) regress each block on the super-score to get block loadings and block scores, (ii) collect block scores into a super-block, (iii) regress the super-block to get a new super-score / super-loading, repeat to convergence. After convergence, deflate every block by the super-score and the corresponding block loading scaled by the super-loading element.

Parameters:
  • n_components (int)

  • max_iter (int, default=500)

  • tol (float or None, default=None) – Convergence tolerance on the super-score change. None uses np.finfo(float).eps ** (9/10) (matches the legacy reference).

  • algorithm (str, default="auto") –

    Algorithm to use for fitting the model.

    • "auto": dense vectorised hierarchical NIPALS when the data is complete; mask-aware NIPALS (NaN-tolerant) when any block contains missing values.

    • "dense": dense vectorised hierarchical NIPALS. Raises if any block contains missing values.

    • "nipals": mask-aware hierarchical NIPALS. Always uses the NaN-tolerant inner-loop primitives, even when the data is complete (slower than "dense" but produces equivalent results).

  • missing_data_settings (dict or None, default=None) – Settings for the iterative "nipals" path. Keys: md_tol (convergence tolerance on the score-vector change between iterations), md_max_iter (maximum NIPALS iterations per component). Defaults to {"md_tol": epsqrt, "md_max_iter": 1000}.

  • fitting) (Attributes (after)

  • --------------------------

  • block_names

  • MBPLS) (block_widths (as)

  • A) (super_loadings DataFrame (B x)

  • A)

  • block_scores

  • dict[str (r2_x_per_variable)

  • DataFrame]

  • r2_x_per_block_cumulative

  • r2_x_per_block_per_component

  • dict[str

  • DataFrame]

  • block_vip

  • super_vip

  • block_spe

  • block_hotellings_t2

  • super_hotellings_t2

  • explained_variance

  • scaling_factor_for_super_scores

  • fitting_info

  • has_missing_data

  • algorithm

Notes

The deflation step is \(X_b \leftarrow X_b - t_{\rm super}\, (p_b\,p_s[b]\,\sqrt{K_b})^\top\), derived in Westerhuis et al. 1998. The legacy MATLAB mbpca.m had this step marked as broken by the original author; this implementation re-derives it directly from the paper and is independently validated against the pure-numpy reference oracles in the test suite.

Missing data#

When any block contains NaN entries, the "auto" algorithm routes to a mask-aware NIPALS variant. Each per-block projection in the inner loop is computed as a regression that uses only the observed entries; the masked sum-of-squares is used as the denominator so missing values neither bias the loading direction nor contribute to the score. The mask is preserved across components automatically because deflation propagates NaN through subtraction. This is the standard skip-NaN NIPALS update; see Walczak & Massart (2001) and Arteaga & Ferrer (2002).

The fit refuses to run if any block has a column with all entries missing, or any block has a row with all entries missing for that block; either case leaves the masked denominator at zero. Drop or impute such rows or columns before fitting. Predict-time score estimation for new observations with NaN (Trimmed Score Regression / Projection to the Model Plane) is a separate follow-up.

References

Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. J. Chemometrics, 12 (1998), 301-321.

Walczak, B. & Massart, D. L. Dealing with missing data: Part I. Chemom. Intell. Lab. Syst., 58 (2001), 15-27.

Arteaga, F. & Ferrer, A. Dealing with missing data in MSPC: several methods, different interpretations, some examples. J. Chemometrics, 16 (2002), 408-418.

fit(X, y=None)[source]#

Fit the multi-block PCA model.

Parameters:
Return type:

MBPCA

transform(X)[source]#

Project new data to super-scores using the fitted model.

Parameters:

X (dict[str, DataFrame])

Return type:

DataFrame

diagnose(X)[source]#

Project new data; return super_scores, block_scores, block_spe, hotellings_t2.

The rename (since 1.38.4, #395) matches PCA.diagnose() and PLS.diagnose(); predict() is kept as a deprecation shim.

Parameters:

X (dict[str, DataFrame])

Return type:

Bunch

predict(X)[source]#

Forward to diagnose(); emits a DeprecationWarning.

Deprecated since version 1.38.4: Use MBPCA.diagnose() instead. predict matches the sklearn-convention name but MBPCA isn’t a regressor; the historical return is a diagnostics Bunch. The rename aligns with PCA.diagnose(). Will be removed in 2.0.0.

Parameters:

X (dict[str, DataFrame])

Return type:

Bunch

block_spe_limit(block, conf_level=0.95)[source]#

SPE limit for one X-block (Nomikos & MacGregor chi-square approximation).

Parameters:
Return type:

float

super_spe_limit(conf_level=0.95)[source]#

SPE limit for the merged super-block.

Parameters:

conf_level (float)

Return type:

float

spe_contributions(X)[source]#

Per-variable squared residuals for each X-block (SPE contributions).

Reconstruction matches the MBPCA deflation step: X_b = T_super @ (P_b * p_super[b] * sqrt(K_b))^T summed over components. Returns squared residuals on the preprocessed scale; sum across columns equals block_spe_[b].iloc[:, -1] ** 2.

Parameters:

X (dict[str, DataFrame])

Return type:

dict[str, DataFrame]

score_contributions(t_super_start, t_super_end=None, components=None, *, weighted=False)[source]#

Per-block per-variable contributions to a super-score movement (MBPCA).

Decomposes a super-score-space delta into preprocessed-scale variable contributions per X-block. The MBPCA back-projection mirrors the deflation step used during fit:

\[\text{contrib}_{b,j} = \sum_a (\Delta t_\mathrm{super}[a]) \cdot P_b[j, a] \cdot p_\mathrm{super}[b, a] \cdot \sqrt{K_b}\]

See MBPLS.score_contributions() for the parameter and return descriptions; the API is identical.

Parameters:
Return type:

dict[str, Series]

super_score_plot(pc_horiz=1, pc_vert=2)[source]#

Scatter plot of MBPCA super-scores for two components.

Parameters:
  • pc_horiz (int)

  • pc_vert (int)

Return type:

Figure

super_loadings_bar_plot(component=1)[source]#

Bar plot of MBPCA super-loadings for a single component.

Parameters:

component (int)

Return type:

Figure

display_results(show_cumulative=True)[source]#

Format a short text summary of per-block R²X, iterations and timing.

Parameters:

show_cumulative (bool)

Return type:

str

Analysis#

process_improve.multivariate.methods.rv_coefficient(X, Y)[source]#

Compute the RV coefficient between two data blocks.

The RV coefficient (Robert and Escoufier, 1976) measures how much common structure two matrices, measured on the same observations, share. It is a multivariate generalisation of the squared Pearson correlation: it compares the observation-by-observation configuration matrices \(XX^T\) and \(YY^T\) rather than individual variables.

Parameters:
  • X (array-like of shape (n_samples, n_features_x)) – First data block.

  • Y (array-like of shape (n_samples, n_features_y)) – Second data block. Must have the same number of rows as X; the number of columns may differ.

Returns:

The RV coefficient in the range [0, 1]. A value of 1 means the two blocks describe the same configuration of observations up to a rotation and an overall scaling; 0 means no shared structure. nan is returned if either block has no variance.

Return type:

float

Notes

Each column is mean-centred internally, since the RV coefficient is defined on centred data. The blocks are not scaled; scale the columns yourself (for example with MCUVScaler) when the variables have different units.

For high-dimensional data (many more variables than observations) the RV coefficient is biased upwards and tends towards 1 even for unrelated blocks. Use rv2_coefficient() in that regime.

References

Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Journal of the Royal Statistical Society, Series C, 25(3), 257-265.

See also

rv2_coefficient

Modified RV coefficient, unbiased for high-dimensional data.

Examples

>>> rv_coefficient(X, Y)
>>> rv_coefficient(X, X)  # 1.0: a block is perfectly correlated with itself
process_improve.multivariate.methods.rv2_coefficient(X, Y)[source]#

Compute the modified RV coefficient (RV2) between two data blocks.

The modified RV coefficient (Smilde et al., 2009) is a variant of rv_coefficient() that removes the diagonals of the configuration matrices \(XX^T\) and \(YY^T\) before comparing them. This removes the upward bias that makes the ordinary RV coefficient tend towards 1 for high-dimensional data, so RV2 stays near 0 for genuinely unrelated blocks.

Parameters:
  • X (array-like of shape (n_samples, n_features_x)) – First data block.

  • Y (array-like of shape (n_samples, n_features_y)) – Second data block. Must have the same number of rows as X; the number of columns may differ.

Returns:

The modified RV coefficient, in the range [-1, 1]. A value of 1 means the two blocks describe the same configuration of observations; values near 0 mean no shared structure, and small negative values can occur. nan is returned if either block has no variance.

Return type:

float

Notes

Each column is mean-centred internally but the blocks are not scaled; scale the columns yourself (for example with MCUVScaler) when the variables have different units.

References

Smilde, A. K., Kiers, H. A. L., Bijlsma, S., Rubingh, C. M. and van Erk, M. J. (2009). Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics, 25(3), 401-405.

See also

rv_coefficient

The original RV coefficient.

Examples

>>> rv2_coefficient(X, Y)

Preprocessing#

class process_improve.multivariate.methods.MCUVScaler[source]#

Bases: TransformerMixin, BaseEstimator

Mean-centre, unit-variance (MCUV) scaler.

Unlike sklearn.preprocessing.StandardScaler this uses the sample standard deviation (ddof=1), the convention for chemometric data analysis where the population is the training set itself rather than a sampled super-population.

The estimator follows the standard sklearn contract: n_features_in_ and feature_names_in_ are populated by fit; sparse / complex / object dtype / empty input are rejected with sklearn-style errors; NaN values pass through (the chemometric preprocessing pipeline expects to thread missing-data through to the downstream NIPALS estimator).

get_feature_names_out(input_features=None)[source]#

Return the output column names of transform().

MCUVScaler is column-preserving (centring + scaling leave the X column layout unchanged), so the returned names mirror those captured during fit() (or the input_features argument when no feature_names_in_ was captured - the standard sklearn fallback for ndarray-fit estimators).

Used by set_output() (sklearn 1.2+) to label the DataFrame view of the output when set_output(transform="pandas") is on, and by Pipeline introspection.

Return type:

ndarray

fit(X, y=None)[source]#

Compute the column means and sample standard deviations.

y is accepted (and ignored) so the scaler plugs into sklearn.pipeline.Pipeline, which threads y through every step’s fit even when (as for a transformer) it is unused.

Parameters:

X (ndarray | DataFrame)

Return type:

MCUVScaler

transform(X, y=None)[source]#

Mean-centre and unit-variance scale X.

y is accepted (and ignored) for Pipeline compatibility.

Parameters:

X (ndarray | DataFrame)

Return type:

DataFrame

inverse_transform(X)[source]#

Inverse the mean-centring and unit-variance scaling.

Parameters:

X (ndarray | DataFrame)

Return type:

DataFrame

process_improve.multivariate.methods.center(X, func=<function mean>, axis=0, extra_output=False)[source]#

Perform centering of data, using a function, func (default: np.mean). The function, if supplied, must return a vector with as many columns as the matrix X.

axis [optional; default=0] {integer or None}

This specifies the axis along which the centering vector will be calculated if not provided. The function is applied along the axis: 0=down the columns; 1 = across the rows.

Missing values: The sample mean is computed by taking the sum along the axis, skipping any missing data, and dividing by N = number of values which are present. Values which were missing before, are left as missing after.

Parameters:
Return type:

ndarray | DataFrame | tuple[ndarray | DataFrame, ndarray]

process_improve.multivariate.methods.scale(X, func=<function std>, axis=0, extra_output=False, ddof=0, **kwargs)[source]#

Scales the data (does NOT do any centering); scales to unit variance by default.

func [optional; default=np.std] {a function}

The default (np.std) uses NumPy to calculate the standard deviation of the data along the required axis, skipping over any missing data, and uses that as scale.

axis [optional; default=0] {integer}

Transformations are applied on slices of data. This specifies the axis along which the transformation will be applied.

ddof [optional; default=0] {integer}

Delta degrees of freedom, forwarded to np.std when func is the default np.std. The standard deviation is computed by dividing by N - ddof, where N is the number of values which are present. The default (ddof=0) divides by N (the population standard deviation); pass ddof=1 for the sample standard deviation (dividing by N-1).

Note: MCUVScaler uses ddof=1 and is the preferred scaler for fitting PCA / PLS models. Use scale(center(X), ddof=1) here to match it. The ddof argument is ignored when a custom func is supplied (forward your own keyword arguments via **kwargs instead).

Constant (zero-variance) columns are left unchanged: a zero entry in the computed scaling vector is replaced by 1.0 before inversion, mirroring MCUVScaler, so no inf / NaN is introduced.

Usage#

X = … # data matrix X = scale(center(X)) X = scale(center(X), ddof=1) # sample standard deviation, matches MCUVScaler my_scale = np.mad X = scale(center(X), func=my_scale)

returns:
  • scaled (DataMatrix) – The scaled data, returned when extra_output=False (the default).

  • (scaled, scale_vector) (tuple[DataMatrix, np.ndarray]) – When extra_output=True, a tuple of the scaled data and the per-column scaling vector (the reciprocal of func applied along axis, with zero entries replaced by 1.0 to leave constant columns unchanged) is returned instead.

Parameters:
Return type:

ndarray | DataFrame | tuple[ndarray | DataFrame, ndarray]

Diagnostics#

These functions work with fitted PCA and PLS models. Each is also bound as a convenience method on the model after fit().

Note

Two different “contributions” diagnostics. The library has two methods whose names both contain “contributions”; they are not interchangeable and answer different questions about the same fitted score matrix.

  • PCA.score_contributions() (and PLS.score_contributions()) is per-variable and signed. It decomposes a single observation’s movement in score space back onto the original variables, answering “which variables explain why this observation sits where it does?”. It returns one signed value per variable, and it takes an observation’s score vector as input.

  • observation_contributions() is per-observation and non-negative. It reports each observation’s share of a component’s total inertia (\(t_{ia}^2 / \sum_i t_{ia}^2\)), answering “which observations most strongly shape this component?”. It returns a sample-by-component table whose columns each sum to 1, and it takes no input beyond the fitted model.

In short, score_contributions decomposes across variables while observation_contributions decomposes across observations.

process_improve.multivariate.methods.vip(model, n_components=None)[source]#

Calculate Variable Importance in Projection (VIP) scores.

Works with fitted PCA and PLS models. For PCA the principal-component loadings loadings_ are used as the weight matrix; for PLS the X-block weights x_weights_ are used.

The formula is:

\[\begin{split}\\text{VIP}_j = \\sqrt{K \\cdot \\frac{\\sum_{a=1}^{A} r2_a \\cdot w_{ja}^2}{\\sum_{a=1}^{A} r2_a}}\end{split}\]

where \(K\) is the number of features, \(A\) the number of components, \(r2_a\) the fraction of variance explained by component \(a\), and \(w_{ja}\) the weight for feature \(j\) in component \(a\).

Parameters:
  • model (PCA or PLS) – A fitted PCA or PLS model.

  • n_components (int or None, default=None) – Number of components to include. None uses all fitted components.

Returns:

VIP scores indexed by feature names, named "VIP".

Return type:

pd.Series

Raises:

ValueError – If the model is not fitted, if neither x_weights_ nor loadings_ is found, or if n_components is out of range.

Examples

>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled)
>>> pls.vip()          # bound convenience method after fit()
>>> vip(pls)           # or call the standalone function directly
>>> pca = PCA(n_components=3).fit(X_scaled)
>>> pca.vip(n_components=2)
process_improve.multivariate.methods.squared_cosine(model, n_components=None)[source]#

Calculate the squared cosine (cos2): quality of representation of observations.

Works with fitted PCA and PLS models. The squared cosine of observation \(i\) on component \(a\) is the squared score divided by that observation’s total variation budget:

\[\begin{split}\\cos^2_{ia} = \\frac{t_{ia}^2} {\\sum_{a=1}^{A} t_{ia}^2 + \\text{SPE}_i^2}\end{split}\]

where \(t_{ia}\) is the score and \(\\text{SPE}_i\) the residual (squared prediction error) of the observation. Across all components the cos2 values plus the residual fraction sum to 1. A value close to 1 means the observation is well represented on that component. For PCA, whose loadings are orthonormal, the denominator equals the squared distance of the observation from the origin, matching the classical definition.

cos2 complements the existing diagnostics: Hotelling’s T² measures distance within the model plane, SPE measures distance to it, and cos2 reports how much of an observation’s total variation a given component captures.

Parameters:
  • model (PCA or PLS) – A fitted PCA or PLS model.

  • n_components (int or None, default=None) – Number of components to return. None returns all fitted components.

Returns:

cos2 values of shape (n_samples, n_components), indexed by sample.

Return type:

pd.DataFrame

Raises:

ValueError – If the model is not fitted, or if n_components is out of range.

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> pca.squared_cosine()              # bound convenience method after fit()
>>> squared_cosine(pca, n_components=2)  # or call the function directly
process_improve.multivariate.methods.observation_contributions(model, n_components=None)[source]#

Calculate the contribution of each observation to each component.

Works with fitted PCA and PLS models. The contribution of observation \(i\) to component \(a\) is its squared score divided by the sum of squared scores of all observations on that component:

\[\begin{split}\\text{contribution}_{ia} = \\frac{t_{ia}^2}{\\sum_{i=1}^{N} t_{ia}^2}\end{split}\]

Values lie between 0 and 1 and each column sums to 1, so a contribution well above the average \(1/N\) flags an observation that strongly shapes that component.

Note that this is not the same diagnostic as the score_contributions method, despite the similar name. score_contributions is per-variable and signed: it decomposes one observation’s position in score space back onto the original variables (“which variables explain why this observation sits where it does?”). observation_contributions is per-observation and non-negative: it reports each observation’s share of a component’s total inertia (“which observations most strongly shape this component?”). The two are orthogonal views of the same score matrix and are not interchangeable.

Parameters:
  • model (PCA or PLS) – A fitted PCA or PLS model.

  • n_components (int or None, default=None) – Number of components to return. None returns all fitted components.

Returns:

Contributions of shape (n_samples, n_components), indexed by sample. Each column sums to 1.

Return type:

pd.DataFrame

Raises:

ValueError – If the model is not fitted, or if n_components is out of range.

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> pca.observation_contributions()
>>> observation_contributions(pca, n_components=2)

See also

PCA.score_contributions

The per-variable counterpart - decomposes one observation’s score-space position back onto the original variables.

process_improve.multivariate.methods.eigenvalue_summary(model)[source]#

Summarize the variance captured by each component as a tidy table.

Works with fitted PCA and PLS models. Returns one row per component, collecting explained_variance_, r2_per_component_ and r2_cumulative_ into a single table.

Parameters:

model (PCA or PLS) – A fitted PCA or PLS model.

Returns:

Indexed by component, with columns eigenvalue (the variance of the component scores), percent_variance and cumulative_percent. For PCA the percentages refer to variance in X; for PLS they refer to the variance in Y explained by each component.

Return type:

pd.DataFrame

Raises:

ValueError – If the model is not fitted.

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> pca.eigenvalue_summary()
>>> eigenvalue_summary(pca)
process_improve.multivariate.methods.project_variables(model, supplementary_data)[source]#

Project supplementary (passive) variables onto a fitted model.

Works with fitted PCA and PLS models. Supplementary variables are extra columns that did not take part in fitting the model but were measured on the same observations. Each supplementary variable is represented by its correlation with each component’s scores, the standard representation for passive quantitative variables. This is the column-wise counterpart of transform, which projects supplementary rows (new observations).

Parameters:
  • model (PCA or PLS) – A fitted PCA or PLS model.

  • supplementary_data (array-like of shape (n_samples, n_supplementary)) – Passive variables measured on the same observations used to fit the model. Must have the same number of rows as the training data.

Returns:

Correlations of shape (n_supplementary, n_components): the coordinate of each supplementary variable on each component.

Return type:

pd.DataFrame

Raises:

ValueError – If the model is not fitted, or if supplementary_data does not have the same number of rows as the training data.

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> pca.project_variables(passive_columns)
>>> project_variables(pca, passive_columns)

Plots#

process_improve.multivariate.plots.score_plot(model, pc_horiz=1, pc_vert=2, pc_depth=-1, items_to_highlight=None, settings=None, fig=None)[source]#

Generate a 2D or 3D score plot for the given latent variable model.

A 2D scatter on (pc_horiz, pc_vert) is produced by default. Supplying pc_depth >= 1 adds a third score axis and switches the underlying trace to Scatter3d.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.

  • pc_horiz (int, optional) – Which component to plot on the horizontal axis, by default 1 (the first component)

  • pc_vert (int, optional) – Which component to plot on the vertical axis, by default 2 (the second component)

  • pc_depth (int, optional) – If pc_depth >= 1, then a 3D score plot is generated, with this component on the 3rd axis

  • items_to_highlight (dict, optional) –

    Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:

    items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
    

    will highlight the items in items_in_red with the given colour and shape.

  • settings (dict) –

    Default settings:

    {
        "show_ellipse": True,          # bool: show the Hotelling's T2 ellipse
        "ellipse_conf_level": 0.95,    # float: ellipse confidence level (< 1.00)
        "title": "Score plot of ...",  # str: overall plot title
        "show_labels": False,          # bool: add a label for each observation
        "show_legend": True,           # bool: show clickable legend
        "html_image_height": 500,      # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
        "template": "pi_journal",        # str: registered Plotly theme name
    }
    

  • fig (Figure | None)

Return type:

Figure

Examples

>>> pca = PCA(n_components=3).fit(X_scaled)
>>> pca.score_plot()                          # PC1 vs PC2
>>> pca.score_plot(pc_horiz=1, pc_vert=3)     # PC1 vs PC3
>>> pca.score_plot(pc_horiz=1, pc_vert=2, pc_depth=3)  # 3D
process_improve.multivariate.plots.loading_plot(model, loadings_type='p', pc_horiz=1, pc_vert=2, settings=None, fig=None)[source]#

Generate a 2-dimensional loadings for the given latent variable model.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.

  • loadings_type (str, optional) –

    A choice of the following:

    ’p’ : (default for PCA) : the P (projection) loadings: only option possible for PCA ‘w’ : the W loadings: Suitable for PLS ‘w*’ : (default for PLS) the W* (or R) loadings: Suitable for PLS ‘w*c’ : the W* (from X-space) with C loadings from the Y-space: Suitable for PLS ‘c’ : the C loadings from the Y-space: Suitable for PLS

    For PCA model any other choice besides ‘p’ will be ignored.

  • pc_horiz (int, optional) – Which component to plot on the horizontal axis, by default 1 (the first component)

  • pc_vert (int, optional) – Which component to plot on the vertical axis, by default 2 (the second component)

  • settings (dict) –

    Default settings:

    {
        "title": "Loadings plot ...",  # str: overall plot title
        "show_labels": True,           # bool: add a label for each variable
        "html_image_height": 500,      # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
        "template": "pi_journal",        # str: registered Plotly theme name
    }
    

  • fig (Figure | None)

Return type:

Figure

Examples

>>> pca.loading_plot()                                 # P loadings, PC1 vs PC2
>>> pls.loading_plot(loadings_type="w*c")              # W* and C loadings
>>> pls.loading_plot(loadings_type="w", pc_vert=3)     # W loadings, PC1 vs PC3
process_improve.multivariate.plots.spe_plot(model, with_a=-1, items_to_highlight=None, settings=None, fig=None)[source]#

Generate a squared-prediction error (SPE) plot for the given latent variable model using with_a number of latent variables. The default will use the total number of latent variables which have already been fitted.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.

  • with_a (int, optional) – Uses this many number of latent variables, and therefore shows the SPE after this number of model components. By default the total number of components fitted will be used.

  • items_to_highlight (dict, optional) –

    Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:

    items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
    

    will highlight the items in items_in_red with the given colour and shape.

  • settings (dict) –

    Default settings:

    {
        "show_limit": True,            # bool: show the SPE confidence limit line
        "conf_level": 0.95,            # float: confidence level for limit (< 1.00)
        "title": "SPE plot ...",        # str: overall plot title
        "default_marker": {...},        # dict: e.g. dict(symbol="circle", size=7)
        "show_labels": False,           # bool: add a label for each observation
        "show_legend": False,           # bool: show clickable legend
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
        "template": "pi_journal",         # str: registered Plotly theme name
    }
    

  • fig (Figure | None)

Return type:

Figure

Examples

>>> pca.spe_plot()
>>> pca.spe_plot(settings={"conf_level": 0.99, "show_labels": True})
process_improve.multivariate.plots.t2_plot(model, with_a=-1, items_to_highlight=None, settings=None, fig=None)[source]#

Generate a Hotelling’s T2 (T^2) plot for the given latent variable model using with_a number of latent variables. The default will use the total number of latent variables which have already been fitted.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.

  • with_a (int, optional) – Uses this many number of latent variables, and therefore shows the SPE after this number of model components. By default the total number of components fitted will be used.

  • items_to_highlight (dict, optional) –

    Keys are JSON strings parseable by json.loads into a Plotly line specifier; values are lists of index names to highlight. For example:

    items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
    

    will highlight the items in items_in_red with the given colour and shape.

  • settings (dict) –

    Default settings:

    {
        "show_limit": True,            # bool: show the T2 confidence limit line
        "conf_level": 0.95,            # float: confidence level for limit (< 1.00)
        "title": "T2 plot ...",         # str: overall plot title
        "default_marker": {...},        # dict: e.g. dict(symbol="circle", size=7)
        "show_labels": False,           # bool: add a label for each observation
        "show_legend": False,           # bool: show clickable legend
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
        "template": "pi_journal",         # str: registered Plotly theme name
    }
    

  • fig (Figure | None)

Return type:

Figure

Examples

>>> pca.t2_plot()
>>> pca.t2_plot(settings={"conf_level": 0.99, "show_labels": True})
process_improve.multivariate.plots.explained_variance_plot(model, settings=None, fig=None)[source]#

Generate an explained-variance plot for a fitted latent variable model.

Shows the variance explained by each component as bars, with the cumulative variance explained overlaid as a line. For PCA the variance refers to the X-block; for PLS it refers to the Y-block.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A fitted latent variable model generated by this library.

  • settings (dict) –

    Default settings:

    {
        "as_percentage": True,         # bool: y-axis as a percentage, else a fraction
        "title": "Variance explained ...",   # str: overall plot title
        "bar_color": None,              # str|None: bar colour; None uses the theme
        "line_color": None,             # str|None: line colour; None uses the theme
        "show_legend": True,            # bool: show clickable legend
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
        "template": "pi_journal",         # str: registered Plotly theme name
    }
    

  • fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.

Return type:

Figure

Examples

>>> pca.explained_variance_plot()
>>> pls.explained_variance_plot(settings={"as_percentage": False})
process_improve.multivariate.plots.correlation_loadings_plot(model, pc_horiz=1, pc_vert=2, variance_ellipses=(0.5, 1.0), settings=None, fig=None)[source]#

Generate a correlation loadings plot for a fitted latent variable model.

Each variable is placed by its correlation with the scores of two components. A variable’s squared distance from the origin is the fraction of its variance explained by those two components, so every variable lies inside the unit circle. Concentric ellipses mark variance-explained thresholds: a variable beyond the 50% ellipse has at least half of its variance captured by the two components shown.

For PCA the X-variables are shown. For PLS both the X-variables and the Y-variables are overlaid against the X-scores, which reveals how process variables relate to quality variables.

Parameters:
  • model (MVmodel object (PCA, or PLS)) – A fitted latent variable model generated by this library.

  • pc_horiz (int, default 1) – Component shown on the horizontal axis (1-based).

  • pc_vert (int, default 2) – Component shown on the vertical axis (1-based).

  • variance_ellipses (sequence of float, default (0.5, 1.0)) – Variance-explained thresholds, each a fraction in (0, 1], at which to draw a concentric ellipse. The conventional choice is the 50% and 100% ellipses; any other thresholds (for example 0.75 and 0.95) are equally valid.

  • settings (dict) –

    Default settings:

    {
        "title": "Correlation loadings ...",  # str: overall plot title
        "x_marker_color": None,         # str|None: X-variable marker colour; None uses the theme
        "y_marker_color": None,         # str|None: Y-variable marker colour (PLS); None uses the theme
        "ellipse_color": "grey",        # str: colour of the variance ellipses
        "show_labels": True,            # bool: label each variable
        "show_legend": True,            # bool: show clickable legend (PLS only)
        "html_image_height": 600,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 1.0,   # float: width as ratio of height
        "template": "pi_journal",         # str: registered Plotly theme name
    }
    

  • fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.

Return type:

Figure

Examples

>>> pca.correlation_loadings_plot()
>>> pls.correlation_loadings_plot(pc_horiz=1, pc_vert=3)
>>> pca.correlation_loadings_plot(variance_ellipses=(0.75, 0.95))
process_improve.multivariate.plots.predictions_vs_observed_plot(model, *, y_observed, variable=None, settings=None, fig=None)[source]#

Generate an observed-vs-predicted (parity) plot for a fitted PLS model.

Plots the calibration predictions against the observed Y values, with a y = x reference line and an RMSE annotation. Points lying close to the reference line indicate good predictions.

Parameters:
  • model (PLS object) – A fitted PLS model generated by this library.

  • y_observed (array-like of shape (n_samples, n_targets)) – The observed Y values, on the same scale as the data used to fit the model (for example the scaled Y from MCUVScaler).

  • variable (str, optional) – Which Y-variable to plot. Defaults to the first Y-variable.

  • settings (dict) –

    Default settings:

    {
        "title": "Observed vs predicted ...",  # str: overall plot title
        "marker_color": None,           # str|None: data-marker colour; None uses the theme
        "reference_color": "#9CA3AF",   # str: colour of the y = x line
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 1.0,   # float: width as ratio of height
        "template": "pi_journal",         # str: registered Plotly theme name
    }
    

  • fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.

Return type:

Figure

Examples

>>> pls.predictions_vs_observed_plot(y_observed=Y_scaled)
>>> pls.predictions_vs_observed_plot(y_observed=Y_scaled, variable="quality")
process_improve.multivariate.plots.coefficient_plot(model, variable=None, settings=None, fig=None)[source]#

Generate a bar plot of the PLS regression coefficients.

Shows beta_coefficients_ for one Y-variable: one bar per X-variable, mapping the (preprocessed) X onto the predicted Y. Tall bars mark the X-variables that most strongly drive the prediction.

Parameters:
  • model (PLS object) – A fitted PLS model generated by this library.

  • variable (str, optional) – Which Y-variable’s coefficients to plot. Defaults to the first one.

  • settings (dict) –

    Default settings:

    {
        "title": "Regression coefficients ...",  # str: overall plot title
        "bar_color": None,              # str|None: bar colour; None uses the theme
        "html_image_height": 500,       # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,  # float: width as ratio of height
        "template": "pi_journal",         # str: registered Plotly theme name
    }
    

  • fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.

Return type:

Figure

Examples

>>> pls.coefficient_plot()
>>> pls.coefficient_plot(variable="quality")