Multivariate Analysis#
Models#
PCA#
- class process_improve.multivariate.methods.PCA(n_components, *, algorithm='auto', missing_data_settings=None)[source]#
Bases:
_LatentVariableModel,TransformerMixin,BaseEstimatorPrincipal Component Analysis with support for missing data.
- Parameters:
n_components (int) – Number of principal components to extract.
algorithm (str, default="auto") – Algorithm to use for fitting the model. -
"auto": Uses SVD when data is complete, NIPALS when data has missing values. -"svd": Singular Value Decomposition. Requires complete data. -"nipals": Non-linear Iterative Partial Least Squares. Handles missing data. -"tsr": Trimmed Score Regression. Handles missing data.missing_data_settings (dict or None, default=None) – Settings for iterative missing data algorithms (NIPALS, TSR). Keys:
md_tol(convergence tolerance),md_max_iter(max iterations).fitting) (Attributes (after)
--------------------------
scores (pd.DataFrame of shape (n_samples, n_components)) – The score matrix (T).
loadings (pd.DataFrame of shape (n_features, n_components)) – The loading matrix (P).
r2_per_component (pd.Series of length n_components) – Fractional R² explained by each component.
r2_cumulative (pd.Series of length n_components) – Cumulative R² after each component.
r2_per_variable (pd.DataFrame of shape (n_features, n_components)) – Per-variable cumulative R² after each component.
spe (pd.DataFrame of shape (n_samples, n_components)) – Per-row SPE diagnostic; stored as the square root of the row sum-of-squared X-residuals (so it is on the residual scale, not the squared scale).
hotellings_t2 (pd.DataFrame of shape (n_samples, n_components)) – Cumulative Hotelling’s T² statistic.
explained_variance (np.ndarray of shape (n_components,)) – Variance explained by each component.
scaling_factor_for_scores (pd.Series of length n_components) – Standard deviation per score (sqrt of explained variance).
has_missing_data (bool) – Whether the training data contained missing values.
fitting_info (dict) – Timing and iteration info from the fitting algorithm.
- scores_#
Expose a private ndarray as a lazily-built, cached
pandas.DataFrame(ENG-18).Declared as a class attribute, e.g.:
scores_ = _LazyFrame("_scores", index="_sample_index", columns="_component_names") loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")
The private ndarray (
self._scores) is the source of truth; the publicDataFrameis built on first access from the ndarray plus the index/column metadata attributes, cached inself.__dict__["_frame_cache"](so repeated access returns the same object and is cheap), and excluded from pickling by_LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call.valuesconversion.On an unfitted model the backing ndarray is absent, so
getattrraisesAttributeError- the same “not fitted” signal as before this change, sohasattr/check_is_fittedbehave identically.
- loadings_#
Expose a private ndarray as a lazily-built, cached
pandas.DataFrame(ENG-18).Declared as a class attribute, e.g.:
scores_ = _LazyFrame("_scores", index="_sample_index", columns="_component_names") loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")
The private ndarray (
self._scores) is the source of truth; the publicDataFrameis built on first access from the ndarray plus the index/column metadata attributes, cached inself.__dict__["_frame_cache"](so repeated access returns the same object and is cheap), and excluded from pickling by_LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call.valuesconversion.On an unfitted model the backing ndarray is absent, so
getattrraisesAttributeError- the same “not fitted” signal as before this change, sohasattr/check_is_fittedbehave identically.
- spe_#
Expose a private ndarray as a lazily-built, cached
pandas.DataFrame(ENG-18).Declared as a class attribute, e.g.:
scores_ = _LazyFrame("_scores", index="_sample_index", columns="_component_names") loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")
The private ndarray (
self._scores) is the source of truth; the publicDataFrameis built on first access from the ndarray plus the index/column metadata attributes, cached inself.__dict__["_frame_cache"](so repeated access returns the same object and is cheap), and excluded from pickling by_LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call.valuesconversion.On an unfitted model the backing ndarray is absent, so
getattrraisesAttributeError- the same “not fitted” signal as before this change, sohasattr/check_is_fittedbehave identically.
- get_feature_names_out(input_features=None)[source]#
Return the output column names of
transform().PCA’stransformproduces scores, one column per component, named["PC1", "PC2", ..., "PC{n_components}"]. Theinput_featuresargument is accepted (Pipeline introspection passes it through) but unused: the output column count is the fittedn_components, not the input feature count.Used by
set_output()(sklearn 1.2+) to label theDataFrameview of the scores whenset_output(transform="pandas")is on, and by Pipeline introspection.- Return type:
- diagnose(X)[source]#
Project new data and compute diagnostics (scores, Hotelling’s T², SPE).
The same logic that historically lived in
predict(). The rename (since 1.38.1, #396) matchesPLS.diagnose()and clears thepredictname for a future return-type contract that does what its sklearn-convention name implies (a regression-style prediction).predict()is kept as a deprecation shim for now.- Parameters:
X (array-like of shape (n_samples, n_features))
- Returns:
result – With keys
scores,hotellings_t2,spe.- Return type:
- predict(X)[source]#
Forward to
diagnose(); emits aDeprecationWarning.Deprecated since version 1.38.1: Use
PCA.diagnose()instead.predictmatches the sklearn-convention name (a regression-style prediction), but PCA isn’t a regressor; the historical return is a diagnostics Bunch. The rename aligns withPLS.diagnose()and frees the name for a future contract. Will be removed in 2.0.0.
- score(X, y=None)[source]#
Negative mean squared reconstruction error (higher is better).
Follows the sklearn convention where higher scores indicate better model fit. This makes PCA compatible with
cross_val_score,GridSearchCV, and other sklearn model-selection utilities.- Parameters:
X (array-like of shape (n_samples, n_features)) – Test data to score.
y (ignored)
- Returns:
score – Negative mean squared reconstruction error.
- Return type:
Examples
>>> from sklearn.model_selection import cross_val_score >>> scores = cross_val_score(PCA(n_components=2), X_scaled, cv=5) >>> print(f"Mean CV score: {scores.mean():.4f}")
- classmethod minka_mle(X)[source]#
Minka (2000) automatic-dimensionality estimate for PCA.
Closed-form Bayesian model selection on the PPCA evidence (Minka, T. P. 2000. Automatic Choice of Dimensionality for PCA. NIPS 13, pp. 598-604). Operates only on the covariance eigenvalues of
Xand is therefore very cheap; in the simulations Minka reports it beats cross-validation. Use it alongside the ekf-CV recommendation fromselect_n_components()as a fast cross-check.Internally
Xis mean-centred before estimation (a PPCA assumption); it is not unit-variance scaled, because dividing each column by its standard deviation compresses the noise eigenvalues to near-zero values the MLE misreads as additional latent signal. If your columns are on wildly different scales, pass the analysis-scaleXproduced by your own preprocessing (e.g. SNV for spectral data) and accept the centring this method applies.- Parameters:
X (array-like of shape (n_samples, n_features)) – Data matrix.
- Returns:
n_components – The MLE estimate of the effective dimensionality.
- Return type:
References
Minka, T. P. (2000). Automatic Choice of Dimensionality for PCA. Advances in Neural Information Processing Systems, 13, 598-604.
See also
parallel_analysisHorn (1965) eigenvalue-vs-null retention.
select_n_componentsekf cross-validation; pass
return_consensus=Trueto report all three side by side.
- classmethod parallel_analysis(X, *, n_simulations=200, quantile=0.95, scale=True, random_state=None)[source]#
Horn (1965) parallel analysis component-count estimate.
Generates
n_simulationsrandom matrices of the same shape asX, computes their eigenvalues, and retains every observed component whose eigenvalue exceeds thequantileof the null distribution at the same rank. Widely regarded in psychometrics as the best simple retention rule for PCA.- Parameters:
X (array-like of shape (n_samples, n_features)) – Data matrix.
n_simulations (int, default 200) – Number of random matrices drawn to build the null eigenvalue distribution.
quantile (float, default 0.95) – Quantile of the null eigenvalues used as the retention threshold. Horn’s original proposal was the mean (0.5); the more conservative 95th-percentile threshold is the modern recommendation.
scale (bool, default True) – Mean-centre and unit-variance scale
Xbefore estimation (matchesminka_mle()).random_state (int, optional) – Seed for the null-matrix simulations.
- Returns:
result – With keys:
n_components- number of components retained (can be 0 on pure noise).observed_eigenvalues- eigenvalues ofXafter centring/scaling (np.ndarray of lengthmin(n, p)).null_threshold- per-rankquantileof the null eigenvalue distribution (same length asobserved_eigenvalues).
- Return type:
References
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2), 179-185.
See also
minka_mleclosed-form PPCA evidence rule.
select_n_componentsekf cross-validation; pass
return_consensus=Trueto report all three side by side.
- classmethod select_n_components(X, *, max_components=None, cv=5, cv_scheme='ekf', n_repeats=1, selection_rule='min', min_q2_increase=0.01, scale_inside_folds=True, n_iter=50, tol=1e-06, random_state=None, return_consensus=False, threshold=None, **pca_kwargs)[source]#
Select the number of PCA components via cross-validation.
Evaluates every component count
1, 2, ..., max_componentsand recommends one via the configuredselection_rule. The defaultcv_scheme="ekf"is the element-wise k-fold algorithm of Bro, Kjeldahl, Smilde & Kiers (2008, Anal. Bioanal. Chem. 390:1241-1251), which holds out individual cells ofXand predicts them via EM-style imputation from a model that never sees their true values. This restores the prediction-independence requirement the legacy row-wise scheme violates, fixing the trivial-fit pathology where PRESS shrinks monotonically with components.- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data. Should already be on the analysis scale (e.g. mean-centred and unit-variance via
MCUVScaler).max_components (int, optional) – Maximum number of components to evaluate. Default is
min(n_samples - 1, n_features).cv (int or sklearn CV splitter, default 5) – For
cv_scheme="ekf": the integer number of element-folds (splitter objects are ignored). Forcv_scheme="row_wise": either an integer K (fed toKFold) or any sklearn splitter.cv_scheme ({"ekf", "row_wise"}, default "ekf") –
"ekf"is the element-wise k-fold scheme of Bro et al. 2008 with EM imputation; the research-recommended default. The legacy"row_wise"scheme is preserved for back-compat but emits aSpecificationWarningbecause it over-selects (see the warning admonition below).n_repeats (int, default 1) – Repeat the ekf pass with a fresh random fold permutation this many times. Each repeat covers every cell exactly once;
n_repeats > 1narrows the per-component PRESS standard error (helpful when the 1-SE rule sits on a borderline) at roughly linear extra runtime. Ignored undercv_scheme="row_wise".selection_rule ({"min", "1se", "q2_increment"}, default "min") – How the recommended component count is chosen.
"min"is the GlobalMin criterion Bro 2008 pairs with ekf - the component count with the lowest pooled PRESS."1se"is the one- standard-error rule (needsper_fold_press, available under both schemes)."q2_increment"is the Wold’s-R-style cumulative-\(Q^2\) threshold from PR #371;min_q2_increasesets the threshold.min_q2_increase (float, default 0.01) – Threshold used only when
selection_rule="q2_increment".scale_inside_folds (bool, default True) – With the default, mean-centring and unit-variance scaling constants are fit on each fold’s in-fold cells and applied to the whole matrix before EM, removing the centring/scaling leakage of the prior implementation. Set to
Falseto reproduce the previous behaviour (column mean recomputed each EM iteration from the imputed matrix, no scaling); this is useful only whenXis already pre-scaled. Ignored undercv_scheme="row_wise".n_iter (int and float, default 50 and 1e-6) – EM iteration cap and convergence tolerance for the ekf imputation step. Ignored under
cv_scheme="row_wise".tol (int and float, default 50 and 1e-6) – EM iteration cap and convergence tolerance for the ekf imputation step. Ignored under
cv_scheme="row_wise".random_state (int, optional) – Seed for the ekf element-fold permutation.
threshold (float, optional) – Deprecated. The original Wold PRESS-ratio cutoff. Passing it emits a
DeprecationWarning; the value is ignored. Useselection_rule="q2_increment"(and tunemin_q2_increase) for a comparable parsimony preference.**pca_kwargs – Additional keyword arguments passed to the
PCA()constructor undercv_scheme="row_wise"(e.g.algorithm="nipals"). Ignored undercv_scheme="ekf"because ekf runs its own SVD loop.return_consensus (bool)
- Returns:
result – With keys:
n_components- recommended number of components (int).press- pooled PRESS per component count (pd.Series, indexed1..A_max).per_fold_press- per-fold PRESS contributions (pd.DataFrame,A_maxrows xn_foldscolumns).se_press- standard error of the per-fold PRESS curve (pd.Series, indexed1..A_max). Drives the 1-SE rule.q2_se- the same standard error rescaled onto the Q2 scale (se_press / null_model_ss, pd.Series indexed1..A_max), i.e. the half-width of a +/-1 SE band aroundq2.press_ratio-PRESS_a / PRESS_{a-1}for inspection (pd.Series, indexed2..A_max).q2- cross-validated \(R^2_X\) per component count (pd.Series, indexed1..A_max). Computed as1 - press / (n_samples * n_features * mean_cell_ss), so it is directly comparable tor2_cumulative_and to PLS’sr2y_validated.cv_scores- alias ofper_fold_pressunder ekf, or per-fold negative MSE fromcross_val_scoreunder row-wise (preserved for back-compat).cv_scheme- the scheme used ("ekf"or"row_wise").selection_rule- the rule used to pickn_components.
- Return type:
References
Bro, R., Kjeldahl, K., Smilde, A. K., & Kiers, H. A. L. (2008). Cross-validation of component models: a critical look at current methods. Anal. Bioanal. Chem., 390(5), 1241-1251.
Camacho, J., & Ferrer, A. (2012). Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. J. Chemometrics, 26(7), 361-373.
Warning
cv_scheme="row_wise"is preserved only for back-compat and emits aSpecificationWarning. It suffers from the trivial-fit problem: holding out whole rows and projecting them back viatransform()lets the held-out row’s own values reach its prediction, so PRESS shrinks monotonically with the component count and the recommendation tends to run to the maximum. Prefer the default"ekf".
- score_contributions(t_start, t_end=None, components=None, *, weighted=False)[source]#
Contribution of each variable to a score-space movement.
Decomposes the difference (t_end - t_start) back into variable space via the loadings for the selected components.
- Parameters:
t_start (array-like of shape (n_components,)) – Score vector of the observation of interest. Can be a row from
self.scores_or frompredict(X_new).scores.t_end (array-like of shape (n_components,), optional) – Reference point in score space. Default is the model center (a vector of zeros), which is the most common choice.
components (list of int, optional) – 1-based component indices to decompose over, matching the model’s column convention. Examples:
[2, 3]for a PC2-vs-PC3 score plot, orNone(default) for all components - appropriate for Hotelling’s T² contributions.weighted (bool, default False) – If True, scale the score difference by 1/sqrt(explained_variance) per component before back-projecting. This gives contributions to the T² statistic rather than to the Euclidean score distance.
- Returns:
contributions – One value per variable; sign indicates direction. Index contains the variable (feature) names.
- Return type:
pd.Series of shape (n_features,)
Examples
>>> pca = PCA(n_components=3).fit(X_scaled) >>> # Why does observation 0 differ from the model center? >>> contrib = pca.score_contributions(pca.scores_.iloc[0].values) >>> # Which variables drive the difference between obs 0 and obs 5? >>> contrib = pca.score_contributions( ... pca.scores_.iloc[0].values, pca.scores_.iloc[5].values ... )
See also
observation_contributionsThe per-observation counterpart. Despite the similar name, the two answer different questions and are not interchangeable.
score_contributionsis per-variable: it decomposes a single observation’s movement in score space back onto the original variables, answering “which variables explain why this observation sits where it does?”. It returns one signed value per variable.observation_contributionsis per-observation: it reports each observation’s share of a component’s total inertia, answering “which observations most strongly shape this component?”. It returns a non-negative sample-by-component table whose columns each sum to 1. The two are orthogonal views of the same score matrix - one decomposes across variables, the other across observations.
- detect_outliers(conf_level=0.95)[source]#
Detect outlier observations using SPE and Hotelling’s T² diagnostics.
Combines two approaches:
Statistical limits - observations exceeding the SPE or T² limit at
conf_levelare flagged.Robust ESD test - the generalized ESD test (with robust median/MAD variant) identifies observations that are unusual relative to the rest of the data, even if they fall below the statistical limit.
An observation can be flagged for one or both reasons.
- Parameters:
conf_level (float, default 0.95) – Confidence level in [0.8, 0.999]. Controls both the statistical limits and the ESD test’s significance level (alpha = 1 - conf_level).
- Returns:
outliers – Sorted from most severe to least. Each dict contains:
observation- index label of the observationoutlier_types- list of"spe"and/or"hotellings_t2"spe- SPE value for this observationhotellings_t2- T² value for this observationspe_limit- SPE limit at the given confidence levelhotellings_t2_limit- T² limit at the given confidence levelseverity- max(spe/spe_limit, t2/t2_limit)
- Return type:
Examples
>>> pca = PCA(n_components=3).fit(X_scaled) >>> outliers = pca.detect_outliers(conf_level=0.95) >>> for o in outliers: ... print(f"{o['observation']}: {o['outlier_types']} (severity={o['severity']})")
PLS#
- class process_improve.multivariate.methods.PLS(n_components, *, scale=True, max_iter=1000, tol=np.float64(1.4901161193847656e-08), copy=True, missing_data_settings=None)[source]#
Bases:
_LatentVariableModel,RegressorMixin,TransformerMixin,BaseEstimatorProjection to Latent Structures (PLS) regression with diagnostics.
Implements PLS via the NIPALS algorithm with production diagnostics: SPE, Hotelling’s T², score contributions, and outlier detection. The API mirrors
PCAso thatmodel.scores_,model.spe_, andmodel.detect_outliers()work identically for both model types.- Parameters:
n_components (int) – Number of latent components to extract.
scale (bool, default=True) – Whether to scale X and Y to unit variance (sklearn internal scaling). When using
MCUVScalerexternally, setscale=Falseto avoid double scaling.max_iter (int, default=1000) – Maximum number of iterations for the NIPALS algorithm.
tol (float, default=sqrt(machine epsilon)) – Convergence tolerance for the NIPALS algorithm.
copy (bool, default=True) – Whether to copy X and Y before fitting.
missing_data_settings (dict or None, default=None) – Settings for missing data algorithms (NIPALS/TSR for PLS). Keys:
md_method("tsr","scp","nipals"),md_tol,md_max_iter.fitting) (Attributes (after)
--------------------------
scores (pd.DataFrame of shape (n_samples, n_components)) – X-block score matrix (T). This is the primary score matrix; equivalent to
x_scoresin older versions.y_scores (pd.DataFrame of shape (n_samples, n_components)) – Y-block score matrix (U).
x_loadings (pd.DataFrame of shape (n_features, n_components)) – X-block loading matrix (P).
y_loadings (pd.DataFrame of shape (n_targets, n_components)) – Y-block loading matrix (C).
x_weights (pd.DataFrame of shape (n_features, n_components)) – X-block weight matrix (W).
y_weights (pd.DataFrame of shape (n_targets, n_components)) – Y-block weight matrix.
direct_weights (pd.DataFrame of shape (n_features, n_components)) – Direct (W*) weights:
W (P'W)^{-1}. Used for direct projectionT = X @ W*.beta_coefficients (pd.DataFrame of shape (n_features, n_targets)) – Regression coefficients linking X directly to Y.
predictions (pd.DataFrame of shape (n_samples, n_targets)) – Y predictions from the training data.
spe (pd.DataFrame of shape (n_samples, n_components)) – Per-row SPE diagnostic; stored as the square root of the row sum-of-squared X-residuals (so it is on the residual scale, not the squared scale).
hotellings_t2 (pd.DataFrame of shape (n_samples, n_components)) – Cumulative Hotelling’s T² statistic.
r2_per_component (pd.Series of length n_components) – Fractional R² (on Y) explained by each component.
r2_cumulative (pd.Series of length n_components) – Cumulative R² (on Y) after each component.
r2_per_variable (pd.DataFrame of shape (n_features, n_components)) – Per-variable cumulative R² for X after each component.
r2y_per_variable (pd.DataFrame of shape (n_targets, n_components)) – Per-variable R² for Y after each component.
rmse (pd.DataFrame of shape (n_targets, n_components)) – Root mean squared error of Y predictions per component.
explained_variance (np.ndarray of shape (n_components,)) – Variance explained by each component in X.
scaling_factor_for_scores (pd.Series of length n_components) – Standard deviation per score (sqrt of explained variance).
has_missing_data (bool) – Whether the training data contained missing values.
fitting_info (dict) – Timing and iteration info from the fitting algorithm.
See also
PCAPrincipal Component Analysis.
MCUVScalerMean-center unit-variance scaler.
References
Abdi, “Partial least squares regression and projection on latent structure regression (PLS Regression)”, 2010, DOI: 10.1002/wics.51
Examples
>>> import pandas as pd >>> from process_improve.multivariate.methods import PLS, MCUVScaler >>> X = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 3, 2, 1]}) >>> Y = pd.DataFrame({"y": [2.1, 3.9, 6.2, 7.8]}) >>> pls = PLS(n_components=1) >>> pls = pls.fit(MCUVScaler().fit_transform(X), MCUVScaler().fit_transform(Y)) >>> pls.scores_.shape (4, 1)
- get_feature_names_out(input_features=None)[source]#
Return the output column names of
transform().PLS’stransformreturns the X scores (T matrix), labelled["T1", "T2", ..., "T{n_components}"]. Theinput_featuresargument is accepted (Pipeline introspection passes it through) but unused: the output column count is the fittedn_components, not the input feature count.Used by
set_output()(sklearn 1.2+) to label theDataFrameview of the scores whenset_output(transform="pandas")is on, and by Pipeline introspection.- Return type:
- predictions_vs_observed_plot(*, y_observed, variable=None, settings=None, fig=None)#
Generate an observed-vs-predicted (parity) plot for a fitted PLS model.
Plots the calibration predictions against the observed Y values, with a
y = xreference line and an RMSE annotation. Points lying close to the reference line indicate good predictions.- Parameters:
model (PLS object) – A fitted PLS model generated by this library.
y_observed (array-like of shape (n_samples, n_targets)) – The observed Y values, on the same scale as the data used to fit the model (for example the scaled Y from
MCUVScaler).variable (str, optional) – Which Y-variable to plot. Defaults to the first Y-variable.
settings (dict) –
Default settings:
{ "title": "Observed vs predicted ...", # str: overall plot title "marker_color": None, # str|None: data-marker colour; None uses the theme "reference_color": "#9CA3AF", # str: colour of the y = x line "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 1.0, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.
- Return type:
Figure
Examples
>>> pls.predictions_vs_observed_plot(y_observed=Y_scaled) >>> pls.predictions_vs_observed_plot(y_observed=Y_scaled, variable="quality")
- coefficient_plot(variable=None, settings=None, fig=None)#
Generate a bar plot of the PLS regression coefficients.
Shows
beta_coefficients_for one Y-variable: one bar per X-variable, mapping the (preprocessed) X onto the predicted Y. Tall bars mark the X-variables that most strongly drive the prediction.- Parameters:
model (PLS object) – A fitted PLS model generated by this library.
variable (str, optional) – Which Y-variable’s coefficients to plot. Defaults to the first one.
settings (dict) –
Default settings:
{ "title": "Regression coefficients ...", # str: overall plot title "bar_color": None, # str|None: bar colour; None uses the theme "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 16/9, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.
- Return type:
Figure
Examples
>>> pls.coefficient_plot() >>> pls.coefficient_plot(variable="quality")
- scores_#
Expose a private ndarray as a lazily-built, cached
pandas.DataFrame(ENG-18).Declared as a class attribute, e.g.:
scores_ = _LazyFrame("_scores", index="_sample_index", columns="_component_names") loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")
The private ndarray (
self._scores) is the source of truth; the publicDataFrameis built on first access from the ndarray plus the index/column metadata attributes, cached inself.__dict__["_frame_cache"](so repeated access returns the same object and is cheap), and excluded from pickling by_LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call.valuesconversion.On an unfitted model the backing ndarray is absent, so
getattrraisesAttributeError- the same “not fitted” signal as before this change, sohasattr/check_is_fittedbehave identically.
- spe_#
Expose a private ndarray as a lazily-built, cached
pandas.DataFrame(ENG-18).Declared as a class attribute, e.g.:
scores_ = _LazyFrame("_scores", index="_sample_index", columns="_component_names") loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")
The private ndarray (
self._scores) is the source of truth; the publicDataFrameis built on first access from the ndarray plus the index/column metadata attributes, cached inself.__dict__["_frame_cache"](so repeated access returns the same object and is cheap), and excluded from pickling by_LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call.valuesconversion.On an unfitted model the backing ndarray is absent, so
getattrraisesAttributeError- the same “not fitted” signal as before this change, sohasattr/check_is_fittedbehave identically.
- x_loadings_#
Expose a private ndarray as a lazily-built, cached
pandas.DataFrame(ENG-18).Declared as a class attribute, e.g.:
scores_ = _LazyFrame("_scores", index="_sample_index", columns="_component_names") loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")
The private ndarray (
self._scores) is the source of truth; the publicDataFrameis built on first access from the ndarray plus the index/column metadata attributes, cached inself.__dict__["_frame_cache"](so repeated access returns the same object and is cheap), and excluded from pickling by_LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call.valuesconversion.On an unfitted model the backing ndarray is absent, so
getattrraisesAttributeError- the same “not fitted” signal as before this change, sohasattr/check_is_fittedbehave identically.
- x_weights_#
Expose a private ndarray as a lazily-built, cached
pandas.DataFrame(ENG-18).Declared as a class attribute, e.g.:
scores_ = _LazyFrame("_scores", index="_sample_index", columns="_component_names") loadings_ = _LazyFrame("_loadings", index="_feature_names", columns="_component_names")
The private ndarray (
self._scores) is the source of truth; the publicDataFrameis built on first access from the ndarray plus the index/column metadata attributes, cached inself.__dict__["_frame_cache"](so repeated access returns the same object and is cheap), and excluded from pickling by_LatentVariableModel.__getstate__(). Internal math reads the ndarray directly and avoids the per-call.valuesconversion.On an unfitted model the backing ndarray is absent, so
getattrraisesAttributeError- the same “not fitted” signal as before this change, sohasattr/check_is_fittedbehave identically.
- fit(X, Y, sample_weight=None)[source]#
Fit a projection to latent structures (PLS) model to the data.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples (rows) and n_features is the number of features (columns).
Y (array-like, shape (n_samples, n_targets)) – Training data, where n_samples is the number of samples (rows) and n_targets is the number of target outputs (columns).
sample_weight (array-like of shape (n_samples,), optional) – Non-negative row weights for a weighted PLS fit (#394). NIPALS is run on
sqrt(w)-rescaled X and Y, which is equivalent to weighting the cross-productsX' W uandY' W t. Loadings, weights and beta are computed correctly; scores are returned on the original sample scale. Zero weights effectively exclude the corresponding rows (sample_weight=[1,1,0,0,1]reproduces the unweighted fit on rows[0,1,4]). Forwarded toscore()/r2_scorefor any caller that also threads it through.
- Returns:
Model object.
- Return type:
References
Abdi, “Partial least squares regression and projection on latent structure regression (PLS Regression)”, 2010, DOI: 10.1002/wics.51
- transform(X, Y=None)[source]#
Project X (and optionally Y) into the latent space.
- Parameters:
- Returns:
X_scores – Projected X data (scores).
- Return type:
pd.DataFrame of shape (n_samples, n_components)
- predict(X)[source]#
Predict Y for new observations.
Returns just the predicted
y_hatso the call satisfies the scikit-learnRegressorMixincontract (and therefore composes insidePipeline,cross_val_score(), andGridSearchCV). For the rich diagnostic view (scores, Hotelling’s T², SPE, plusy_hat), seediagnose().- Parameters:
X (array-like of shape (n_samples, n_features))
- Returns:
y_hat – Predicted target values, indexed by
X’s rows and labelled with the target column names captured duringfit.- Return type:
pd.DataFrame of shape (n_samples, n_targets)
See also
diagnosericher per-prediction diagnostics.
Examples
>>> y_pred = pls.predict(X_new) >>> diag = pls.diagnose(X_new) # for scores / T² / SPE
- diagnose(X)[source]#
Project new data and compute predictions plus diagnostics.
This is the rich view that
predict()used to return before 1.35.0: alongsidey_hatit reports the X scores, cumulative Hotelling’s T², and SPE for every row ofXso the user can flag out-of-model observations and read their predicted Y from one call.- Parameters:
X (array-like of shape (n_samples, n_features))
- Returns:
result – With keys
scores,hotellings_t2,spe,y_hat.- Return type:
See also
predictsklearn-compatible call returning just
y_hat.
Examples
>>> result = pls.diagnose(scaler_x.transform(X_new)) >>> result.y_hat # Predicted Y values >>> result.spe # SPE for each new observation >>> result.hotellings_t2 # T² for each new observation
- classmethod select_n_components(X, Y, *, max_components=None, cv=5, n_repeats=None, random_state=None, selection_rule='1se', scale_inside_folds=True, min_q2_increase=0.01, n_permutations=999, alpha=0.01, stability_threshold=0.6, **pls_kwargs)[source]#
Select the number of PLS components via cross-validation.
Fits PLS models on cross-validation training folds and evaluates the out-of-fold prediction error for every component count
1, 2, ..., max_components. Reports per-fold and pooled RMSECV plus the validated cumulative R² curves, and recommends a component count from one of three rules (seeselection_rulebelow).The defaults are the research-backed combination: the one-standard-error rule on top of repeated, shuffled K-fold CV, with
MCUVScalerre-fit inside every training fold so test data never leaks into the centring/scaling estimates.Unlike the calibration statistics stored on a fitted model (
rmse_,r2_cumulative_), the metrics returned here estimate performance on unseen data and are therefore suitable for choosingn_components.- Parameters:
X (array-like of shape (n_samples, n_features)) – Training X. With the default
scale_inside_folds=Truethe raw, unscaled X may be passed; scaling is fit inside every training fold.Y (array-like of shape (n_samples, n_targets)) – Training Y. Same treatment as
Xunderscale_inside_folds.max_components (int, optional) – Maximum number of components to evaluate. Default is the largest value supported by every cross-validation training fold,
min(min_fold_size, n_features).cv (int or sklearn CV splitter, default 5) – If an integer, used as the
n_splitsof a shuffledKFold(orRepeatedKFoldwhenn_repeats > 1). Any sklearn splitter object (e.g.KFold(10, shuffle=True)orLeaveOneOut()) is also accepted and is used as-is (n_repeatsis then ignored).n_repeats (int, optional) – Number of times the K-fold split is repeated with a fresh shuffle, used only when
cvis an integer. Default10(giving acv * 10per-fold sample for the 1-SE rule); pass1to disable repeats. Repeated K-fold’s standard errors are slightly optimistic because test folds overlap across repeats; that is fine for the 1-SE selection rule but should not be reported as an unbiased generalisation variance.random_state (int, optional) – Seed forwarded to
KFold/RepeatedKFoldfor reproducible shuffling. Ignored whencvis a pre-built splitter.selection_rule ({"1se", "min", "q2_increment", "randomization"}, default "1se") – How the recommended component count is chosen. See
SelectionRulefor the rule semantics."1se"is the default;"min"is the argmin RMSECV (the pre-1.28 default, prone to running to the maximum component count);"q2_increment"is the Wold’s-R-style cumulative-Q² threshold;"randomization"is Van der Voet’s (1994) permutation test (usesn_permutationsandalpha) that picks the smallest model whose predictive ability is statistically indistinguishable from the reference (argmin RMSECV) one.scale_inside_folds (bool, default True) – When True (the default), fit a fresh
MCUVScaleron each training fold’s X and Y, apply it to the held-out rows, fit PLS in scaled space, then inverse-transform the predictions so RMSECV is reported on the original Y scale. This removes the centring / scaling leakage of the prior default. Set to False to keep the pre-1.28 behaviour, in which caseXandYshould already be scaled; aSpecificationWarningis emitted.min_q2_increase (float, default 0.01) – Threshold used only when
selection_rule="q2_increment": the smallest increase in cumulative validated \(Q^2_Y\) that justifies keeping an extra component.n_permutations (int, default 999) – Used only when
selection_rule="randomization": number of sign-flip permutations driving the Van der Voet test.alpha (float, default 0.01) – Used only when
selection_rule="randomization": significance level. The smallest component count whose Van der Voet p-value exceedsalphais recommended. R’spls::selectNcompuses the same default; smaller values pick more parsimonious models.stability_threshold (float, default 0.6) – For the per-repeat stability-selection diagnostic (
"1se"/"min"rules withn_repeats > 1only): the recommendation is judgedselection_is_stable=Trueiff the modal vote share inselection_distributionis at least this fraction. Meinshausen & Bühlmann (2010, JRSS-B) suggest 0.6-0.9 for their variable-selection analogue; we default to the permissive end.**pls_kwargs – Additional keyword arguments passed to the
PLS()constructor (e.g.missing_data_settings).
- Returns:
result – With keys:
n_components- recommended number of components (int).rmsecv- pooled RMSECV per component count (pd.DataFrame, indexed1..A; columns are the Y-variable names plus"total").per_fold_rmsecv- per-fold total RMSECV (pd.DataFrame, indexed1..A; one column per fold across all repeats). Drives the 1-SE rule.se_rmsecv- standard error of the per-fold RMSECV per component count (pd.Series, indexed1..A).q2_se- standard error on the Q2 scale (the per-fold total PRESS standard error divided by the total Y sum-of-squares), i.e. the half-width of a +/-1 SE band aroundr2y_validated["total"](pd.Series, indexed1..A).r2y_validated- validated cumulative \(R^2_Y\) (pd.DataFrame, same shape asrmsecv).r2x_validated- validated cumulative \(R^2_X\) (pd.DataFrame, indexed1..A; columns are the X-variable names plus"total").press- pooled Y prediction error sum of squares per component count (pd.Series, indexed1..A).cv_predictions- out-of-fold predictions of Y at the recommended component count, on the original Y scale (pd.DataFrame). For repeated K-fold, the first repeat’s held-out predictions are reported so each row appears exactly once.selection_rule- the rule used to pickn_components.randomization_pvalues- per-component Van der Voet right-tail p-values whenselection_rule="randomization";Noneotherwise.selection_distribution- per-repeat vote share over candidate component counts (pd.Series indexed1..A). Populated only forselection_rule in {"1se", "min"}andn_repeats > 1;Noneotherwise. A concentrated distribution signals a confident recommendation; a flat or multi-modal one flags it for review.selection_mode- the most-voted component count, orNonewhenselection_distributionis.selection_is_stable-Trueiff the modal vote share meetsstability_threshold;Nonewhen no distribution was computed.
- Return type:
Notes
The pooled RMSECV in
rmsecv["total"]is the square root of the total PRESS over all fold-test rows divided by(N_eff * M)whereN_eff = N * n_repeatsunder repeated CV; theper_fold_rmsecvcolumn for fold f is the square root of fold-f’s sum-of-squared residuals over its own test rows.References
Breiman, Friedman, Olshen & Stone (1984), CART, sec.3.4.3 (1-SE rule). Hastie, Tibshirani & Friedman, ESL, sec.7.10. Kohavi (1995, IJCAI) recommends 10-fold stratified CV for model selection.
Examples
>>> from sklearn.model_selection import KFold >>> # Default: 1-SE on 10 x 5-fold repeated CV with in-fold scaling. >>> result = PLS.select_n_components(X, Y, max_components=6, random_state=0) >>> result.n_components, result.selection_rule >>> # Opt-in to the older argmin-RMSECV rule: >>> PLS.select_n_components(X, Y, max_components=6, selection_rule="min") >>> # Caller-supplied splitter (n_repeats is ignored here): >>> PLS.select_n_components(X, Y, cv=KFold(10, shuffle=True, random_state=0))
- classmethod nested_cv(X, Y, *, max_components=None, outer_cv=5, inner_cv=5, n_inner_repeats=10, selection_rule='1se', scale_inside_folds=True, min_q2_increase=0.01, n_permutations=999, alpha=0.01, random_state=None, **pls_kwargs)[source]#
Nested cross-validation for an honest PLS performance estimate.
Outer loop splits the data into outer-train / outer-test; the inner loop runs
select_n_components()on the outer-train (with the configuredselection_ruleoverinner_cv * n_inner_repeatsfolds) to pick the component count; a final PLS is fit on the outer-train at that count and used to predict the outer-test. The accumulated out-of-fold predictions give RMSEP that is not optimism-biased by the selection decision - the headline number to report when a clean test set is not available.- Parameters:
X (array-like) – Training data. Treated as in
select_n_components()(raw ifscale_inside_folds=True, pre-scaled otherwise).Y (array-like) – Training data. Treated as in
select_n_components()(raw ifscale_inside_folds=True, pre-scaled otherwise).max_components (int, optional) – Forwarded to the inner
select_n_components().outer_cv (int or sklearn splitter, default 5) – Number of outer folds (or a custom splitter).
inner_cv (int, default 5) – Number of inner folds passed to
select_n_components().n_inner_repeats (int, default 10) – Number of inner-CV repeats per outer fold; the inner
random_stateis offset by the outer-fold index so each outer fold sees a fresh inner shuffle.selection_rule (str, default "1se") – Selection rule applied inside the inner loop. See
SelectionRule.scale_inside_folds (bool, default True) – Mirrors
select_n_components(). Also applied to the final outer-train fit, with the test-fold predictions inverse- transformed to the original Y scale before RMSEP accumulates.min_q2_increase (float) – Forwarded to the inner
select_n_components()per rule.n_permutations (int) – Forwarded to the inner
select_n_components()per rule.alpha (float) – Forwarded to the inner
select_n_components()per rule.random_state (int, optional) – Seed for the outer-fold shuffle and the inner CV. The inner seed is offset per outer fold so each outer split sees a fresh shuffled inner CV.
**pls_kwargs – Forwarded to
PLSfor both the inner CV and the final outer-train fits.
- Returns:
result – With keys:
rmsep- honest held-out RMSEP per Y column plus a"total"entry (pd.Series).q2y- validated \(Q^2_Y\) per Y column plus"total"(pd.Series).cv_predictions- out-of-fold predictions of Y at the per-outer-fold selected component counts (pd.DataFrame on the original Y scale).selected_components_per_fold- list of inner recommendations, one per outer fold.selected_components_distribution- vote share over candidate counts (pd.Series).
- Return type:
Notes
Runtime is roughly
outer_cv * inner_cv * n_inner_repeats * max_componentsPLS fits. With the defaults that is 5 * 5 * 10 * max_components fits per call; formax_components=10and a moderate dataset that completes in seconds. Dropn_inner_repeatsif you need to bring it down further.Examples
>>> from process_improve.multivariate import PLS >>> result = PLS.nested_cv(X, Y, max_components=8, random_state=0) >>> result.rmsep["total"]
- score_contributions(t_start, t_end=None, components=None, *, weighted=False)[source]#
Contribution of each X variable to a score-space movement.
Identical to
PCA.score_contributionsbut uses the PLS X-loadings (P matrix) for back-projection.- Parameters:
t_start (array-like of shape (n_components,)) – Score vector of the observation of interest.
t_end (array-like of shape (n_components,), optional) – Reference point in score space. Default is the model center (zeros).
components (list of int, optional) – 1-based component indices. Default: all components.
weighted (bool, default False) – If True, scale by 1/sqrt(explained_variance) for T² contributions.
- Returns:
contributions
- Return type:
pd.Series of shape (n_features,)
Examples
>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled) >>> contrib = pls.score_contributions(pls.scores_.iloc[0].values) >>> contrib.abs().sort_values(ascending=False).head() # top contributors
See also
observation_contributionsThe per-observation counterpart. Despite the similar name, the two answer different questions and are not interchangeable.
score_contributionsis per-variable: it decomposes a single observation’s movement in score space back onto the original X variables, answering “which variables explain why this observation sits where it does?”. It returns one signed value per variable.observation_contributionsis per-observation: it reports each observation’s share of a component’s total inertia, answering “which observations most strongly shape this component?”. It returns a non-negative sample-by-component table whose columns each sum to 1. The two are orthogonal views of the same score matrix - one decomposes across variables, the other across observations.
- detect_outliers(conf_level=0.95)[source]#
Detect outlier observations using SPE and Hotelling’s T² diagnostics.
Same approach as
PCA.detect_outliers: combines statistical limits with the robust generalized ESD test.- Parameters:
conf_level (float, default 0.95) – Confidence level in [0.8, 0.999].
- Returns:
outliers – Sorted from most severe to least. Each dict contains
observation,outlier_types,spe,hotellings_t2,spe_limit,hotellings_t2_limit,severity.- Return type:
Examples
>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled) >>> outliers = pls.detect_outliers(conf_level=0.95) >>> for o in outliers: ... print(f"{o['observation']}: {o['outlier_types']}")
- cross_validate(X, Y, *, cv='loo', n_bootstrap=0, conf_level=0.95, random_state=None, show_progress=True, sample_weight=None)[source]#
Cross-validate the PLS model and compute error bars for beta coefficients.
Refits the model on data subsets (jackknife, K-fold, or bootstrap), collects
beta_coefficients_from each refit, and computes confidence intervals. Also returns cross-validated predictions and prediction-error metrics (RMSE, Q²).- Parameters:
X (array-like of shape (n_samples, n_features)) – Predictor matrix (same data used for
fit).Y (array-like of shape (n_samples, n_targets)) – Response matrix (same data used for
fit).cv (int or
"loo", default"loo") –Cross-validation strategy:
"loo"- leave-one-out (jackknife). Produces N resamples.int- number of folds for K-fold CV.
n_bootstrap (int, default 0) – If > 0, use bootstrap resampling instead of CV folds. The value specifies the number of bootstrap rounds. Overrides the
cvparameter when set.conf_level (float, default 0.95) – Confidence level for the beta-coefficient intervals, in (0, 1).
random_state (int or None, default None) – Random seed for reproducibility (K-fold shuffle and bootstrap).
show_progress (bool, default True) – Whether to display a
tqdmprogress bar.sample_weight (ndarray | None)
- Returns:
result – Dictionary-like object with the following keys:
Beta-coefficient uncertainty
- beta_samplesnp.ndarray of shape (n_resamples, n_features, n_targets)
Raw beta coefficients from every resample.
- beta_meanpd.DataFrame of shape (n_features, n_targets)
Mean beta across resamples.
- beta_stdpd.DataFrame of shape (n_features, n_targets)
Standard error of the beta coefficients.
- beta_ci_lowerpd.DataFrame of shape (n_features, n_targets)
Lower bound of the confidence interval.
- beta_ci_upperpd.DataFrame of shape (n_features, n_targets)
Upper bound of the confidence interval.
- significantpd.DataFrame of shape (n_features, n_targets)
Truewhere the confidence interval excludes zero.
Prediction metrics
- y_hat_cvpd.DataFrame of shape (n_samples, n_targets)
Cross-validated predictions (out-of-fold). Only available for jackknife and K-fold;
Nonefor bootstrap.- pressfloat
Prediction Error Sum of Squares (sum over all Y elements). Only for jackknife / K-fold.
- rmse_cvpd.Series of length n_targets
Root-mean-square error per Y variable (cross-validated). Only for jackknife / K-fold.
- q_squaredpd.Series of length n_targets
Cross-validated R² (Q²) per Y variable. Only for jackknife / K-fold.
Metadata
- n_resamplesint
Number of resamples performed.
- methodstr
"jackknife","kfold", or"bootstrap".- conf_levelfloat
The confidence level used.
- Return type:
Examples
>>> from process_improve.multivariate import PLS, MCUVScaler >>> scaler_x = MCUVScaler().fit(X) >>> scaler_y = MCUVScaler().fit(Y) >>> X_s, Y_s = scaler_x.transform(X), scaler_y.transform(Y) >>> pls = PLS(n_components=2).fit(X_s, Y_s) >>> cv_results = pls.cross_validate(X_s, Y_s, cv="loo") >>> cv_results.beta_mean # mean beta across LOO resamples >>> cv_results.significant # which betas are significantly != 0 >>> cv_results.q_squared # cross-validated R²
- prediction_interval(X, *, conf_level=0.95, cv_result=None)[source]#
Prediction interval for the Y predictions of new observations.
The interval combines the residual error variance with the leverage of each new observation in the latent-variable space. For a new observation the prediction-interval half-width on target
mist * s_E[m] * sqrt(1 + 1/N + T2_new / (N - 1))where
s_Eis the residual error standard deviation,T2_newis the Hotelling’s T² of the new observation,Nis the number of calibration samples, andtis the Student-t quantile.- Parameters:
X (array-like of shape (n_new, n_features)) – New observations, pre-processed the same way as the training data.
conf_level (float, default=0.95) – Confidence level for the interval, in (0.5, 1.0).
cv_result (sklearn.utils.Bunch or None, default=None) – The result of
cross_validate(). When supplied, its cross-validated RMSE (rmse_cv) is used for the error variance, which is preferable to the optimistic calibration RMSE used otherwise.
- Returns:
With keys
y_hat(point predictions),lowerandupper(prediction-interval bounds) - each a DataFrame of shape (n_new, n_targets) - andconf_level.- Return type:
- set_fit_request(*, sample_weight='$UNCHANGED$')#
Configure whether metadata should be requested to be passed to the
fitmethod.Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with
enable_metadata_routing=True(seesklearn.set_config()). Please check the User Guide on how the routing mechanism works.The options for each parameter are:
True: metadata is requested, and passed tofitif provided. The request is ignored if metadata is not provided.False: metadata is not requested and the meta-estimator will not pass it tofit.None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (
sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.Added in version 1.3.
- set_score_request(*, sample_weight='$UNCHANGED$')#
Configure whether metadata should be requested to be passed to the
scoremethod.Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with
enable_metadata_routing=True(seesklearn.set_config()). Please check the User Guide on how the routing mechanism works.The options for each parameter are:
True: metadata is requested, and passed toscoreif provided. The request is ignored if metadata is not provided.False: metadata is not requested and the meta-estimator will not pass it toscore.None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (
sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.Added in version 1.3.
TPLS#
- class process_improve.multivariate.methods.TPLS(n_components, d_matrix, max_iter=500, skip_f_matrix_preprocessing=False)[source]#
Bases:
RegressorMixin,BaseEstimatorTPLS algorithm for T-shaped data structures (we also include standard pre-processing of the data inside this class).
Source: Garcia-Munoz, https://doi.org/10.1016/j.chemolab.2014.02.006, Chem.Intell.Lab.Sys. v133, p 49 to 62, 2014.
We change the notation from the original paper to avoid confusion with a generic “X” matrix, and match symbols that are more natural for our use.
Notation mapping (paper → this code):
X^T → D:
d_matrix(external),d_mats(internal) - Database of propertiesX → D^T: transposed D (not used directly)
R → F:
f_mats- Formula matricesZ → Z:
z_mats- Process conditionsY → Y:
y_mats- Quality indicators
Notes 1. Matrices in F, Z and Y must all have the same number of rows. 2. Columns in F must be the same as the rows in D. 3. Conditions in Z may be missing (turning it into an L-shaped data structure).
- Parameters:
n_components (int) – A parameter used to specify the number of components.
d_matrix (dict[str, pd.DataFrame]) – A dictionary containing the properties of each group of materials. Keys are group names; values are DataFrames with properties as columns and materials as rows. This “D” matrix is provided once at construction and reused for fitting, prediction and cross-validation.
max_iter (int, optional) – The maximum number of iterations for the TPLS algorithm. Default is 500.
skip_f_matrix_preprocessing (bool, optional) – If True, the F (formula) matrices are used as-is, skipping the internal centering and scaling of the F block. Default is False.
Notes
The input
Xpassed tofit()andpredict()is a dictionary with 3 keys:F: Formula matrices (rows = blends, columns = materials).F = {"Group A": df_formulas_a, "Group B": df_formulas_b, ...}Z: Process conditions - one row per blend, one column per condition.Y: Product quality indicators - one row per blend, one column per indicator.
The
Dmatrix (database of material properties) is supplied once at construction via thed_matrixargument; it is not part ofX.- n_substances#
The number of substances (columns) in the training data, i.e. the number of materials in the F matrix.
- Type:
Example
>>> import numpy as np >>> import pandas as pd >>> rng = np.random.default_rng() >>> >>> n_props_a, n_props_b = 6, 4 # Two groups of properties: A and B. >>> n_materials_a, n_materials_b = 12, 8 # Number of materials in each group. >>> n_formulas = 40 # Number of formulas in matrix F. >>> n_outputs = 3 >>> n_conditions = 2 >>> >>> properties = { >>> "Group A": pd.DataFrame(rng.standard_normal((n_materials_a, n_props_a))), >>> "Group B": pd.DataFrame(rng.standard_normal((n_materials_b, n_props_b))), >>> } >>> formulas = { >>> "Group A": pd.DataFrame(rng.standard_normal((n_formulas, n_materials_a))), >>> "Group B": pd.DataFrame(rng.standard_normal((n_formulas, n_materials_b))), >>> } >>> process_conditions = {"Conditions": pd.DataFrame(rng.standard_normal((n_formulas, n_conditions)))} >>> quality_indicators = {"Quality": pd.DataFrame(rng.standard_normal((n_formulas, n_outputs)))} >>> all_data = {"Z": process_conditions, "F": formulas, "Y": quality_indicators} >>> estimator = TPLS(n_components=4, d_matrix=properties) >>> estimator.fit(all_data)
- hotellings_t2_limit(conf_level=0.95)[source]#
Hotelling’s T2 limit at the given confidence level (see
hotellings_t2_limit()).
- ellipse_coordinates(score_horiz, score_vert, conf_level=0.95, n_points=100)[source]#
Coordinates of the T2 confidence ellipse (see
ellipse_coordinates()).
- fit(X, y=None)[source]#
Fit the preprocessing parameters and also the latent variable model from the training data.
- Parameters:
X ({dictionary of dataframes}, keys that must be present: "F", "Z", and "Y") – The training input samples. See documentation in the class definition for more information on each matrix.
y (None)
- Returns:
self – Returns self.
- Return type:
- predict(X)[source]#
Forward to
diagnose(); emits aDeprecationWarning.Deprecated since version 1.38.4: Use
TPLS.diagnose()instead.predictmatches the sklearn-convention name (regression-style ndarray return), but TPLS’s natural return is the rich per-group / per-block diagnostics Bunch and TPLS cannot be placed in a standard sklearn Pipeline anyway (its input is a nestedDataFrameDict). The rename aligns withPLS.diagnose()andPCA.diagnose(). Will be removed in 2.0.0.- Parameters:
X (DataFrameDict)
- Return type:
- diagnose(X)[source]#
Model inference on new data.
This will pre-process the new data and apply those subsequently to the latent variable model.
Example
# Training phase: estimator = TPLS(n_components=2).fit(training_data)
# Testing/inference phase: new_data = {“Z”: …, “F”: …} # you need at least the F block for a new prediction. “Z” is optional. predictions = estimator.predict(new_data)
- Parameters:
X (DataFrameDict) – The input samples.
- Returns:
A bunch with the following fields:
hat(dict[str, DataFrame]) : predicted Y per Y-group, on the original (un-scaled) scale, indexed by the observation names ofX.t_scores_super(DataFrame, shape(n_new, n_components)) : super-scores for the new observations.spe(dict[str, dict[str, DataFrame]]) : per-component squared prediction error for the"Z"and"F"blocks, keyed first by block name and then by group; each inner DataFrame has shape(n_new, n_components).hotellings_t2(DataFrame, shape(n_new, n_components)) : cumulative Hotelling’s T² per new observation after each component.
- Return type:
- score(X, y=None, sample_weight=None)[source]#
Return r2_score` on test data.
See RegressorMixin.score for more details.
- Parameters:
X (DataFrameDict) – Test samples.
y (Not used. In the X input, there is a already a “Y” block. This will be the Y-data.)
sample_weight (Not used.)
- Returns:
score – \(R^2\) of
self.diagnose(X).- Return type:
- help()[source]#
Help for the TPLS Estimator.
Data organization#
Quick tips#
Build model: tpls = TPLS(n_components=2, d_matrix=d_matrix).fit(X) Get model’s predictions: tpls.hat <– the hat-matrix, i.e., the predictions Predict on new data: tpls.predict(X_new) See model summary: tpls.display_results() This help page: tpls.help()
Statistical values#
.t_scores_super Super scores for the entire model [pd.DataFrame] .hotellings_t2 Hotelling’s T2 values for each observation, per component [pd.DataFrame] .spe Squared prediction error for each block [dict of pd.DataFrames]
.hotellings_t2_limit() Returns the Hotelling’s T2 limit for the model [float] .spe_limit[block]() Return the SPE limit for the block; e.g. .spe_limit[“Y”]() [float]
.vip() Variable importance (VIP) for the D- and F-blocks [dict]
- Return type:
- vip(block=None, method='vip')[source]#
Return Variable Importance in Projection (VIP) scores for TPLS blocks.
VIP scores are computed during fitting for the D-block (material properties) and F-block (formulation variables) and stored in
feature_importance.- Parameters:
block (str or None, default=None) – Which block to return. Must be
"D"or"F", orNoneto return all blocks.method ({"vip", "deflated"}, default="vip") –
How importance is measured.
"vip"(default): standard VIP on the raw block loadings, as reported by Garcia-Munoz (2014). These are the values stored infeature_importance."deflated": VIP computed on the direct (rotated) weights that map the original variables onto the scores while accounting for the deflation across components,S(V^T S)^-1for the D-block andP(P^T P)^-1for the F-block. This is an alternative importance measure; it does not changefeature_importance.
- Returns:
If block is
None:{"D": {group: pd.Series, ...}, "F": {group: pd.Series, ...}}. If block is"D"or"F": the inner dict{group: pd.Series, ...}for that block, where eachpd.Seriesis indexed by feature names.- Return type:
- Raises:
ValueError – If the model is not fitted, block is not
"D","F", orNone, or method is not"vip"or"deflated".
Examples
>>> tpls = TPLS(...).fit(data) >>> tpls.vip() # all blocks, standard VIP >>> tpls.vip("D") # D-block only → {group_name: pd.Series, ...} >>> tpls.vip("D", method="deflated") # D-block deflated direct-weights importance
- property d_block_scaling_: dict[str, float]#
Block-scaling factor applied to each D-block (read-only).
After column-wise centring and auto-scaling, every D-block
X_iis additionally divided bysqrt(P_i * M_i)(P_i= number of lots/rows,M_i= number of properties/columns) so thattrace(X_i^T X_i) ~= 1, removing bias toward blocks with more lots or properties (Garcia-Munoz, 2014, section 2.1).- Returns:
Mapping of D-block group name to its scalar block-scaling factor.
- Return type:
- Raises:
AttributeError – If the model has not been fitted yet.
- set_score_request(*, sample_weight='$UNCHANGED$')#
Configure whether metadata should be requested to be passed to the
scoremethod.Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with
enable_metadata_routing=True(seesklearn.set_config()). Please check the User Guide on how the routing mechanism works.The options for each parameter are:
True: metadata is requested, and passed toscoreif provided. The request is ignored if metadata is not provided.False: metadata is not requested and the meta-estimator will not pass it toscore.None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (
sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.Added in version 1.3.
MBPLS#
Multi-block PLS in the hierarchical / superblock formulation of
Westerhuis, Kourti & MacGregor (1998). Each X-block is preprocessed
independently and weighted by 1/sqrt(K_b) before the inner NIPALS
loop, so blocks of unequal width contribute fairly to the consensus
super-score.
- class process_improve.multivariate.methods.MBPLS(n_components, *, max_iter=500, tol=None, algorithm='auto', missing_data_settings=None)[source]#
Bases:
_HotellingsT2LimitMixin,RegressorMixin,BaseEstimatorMulti-block PLS (hierarchical / superblock formulation).
Generic multi-block PLS as described by Westerhuis, Kourti & MacGregor (1998) and Westerhuis & Smilde (2001). Each X-block is preprocessed independently (mean-centred and unit-variance scaled), then divided by
sqrt(K_b)so that blocks of unequal width contribute fairly to the super-score.- Parameters:
n_components (int) – Number of latent variables to extract.
max_iter (int, default=500) – Maximum NIPALS iterations per latent variable.
tol (float or None, default=None) – Convergence tolerance on the change in the Y-block score. If
None,np.finfo(float).eps ** (6/7)is used (matching the legacy multi-block reference implementation).algorithm (str) –
Algorithm to use for fitting the model.
"auto": dense vectorised hierarchical NIPALS when every block (X and Y) is complete; mask-aware NIPALS when any block contains missing values."dense": dense vectorised hierarchical NIPALS. Raises if any block contains missing values."nipals": mask-aware hierarchical NIPALS. Always uses the NaN-tolerant inner-loop primitives, even when the data is complete (slower than"dense"but produces equivalent results).
missing_data_settings (dict or None, default=None) – Settings for the iterative
"nipals"path. Keys:md_tol(convergence tolerance on the score-vector change between iterations),md_max_iter(maximum NIPALS iterations per component). Defaults to{"md_tol": epsqrt, "md_max_iter": 1000}.fitting) (Attributes (after)
--------------------------
block_names (list[str]) – Ordered list of X-block names (the keys of the input dict).
block_widths (dict[str, int]) – Number of variables in each X-block.
super_scores (pd.DataFrame, shape (n_samples, n_components)) – Super-block (consensus) X-scores
T.super_y_scores (pd.DataFrame, shape (n_samples, n_components)) – Super-block Y-scores
U.super_weights (pd.DataFrame, shape (n_blocks, n_components)) – Super-block weights
w_super; rows indexed by block name.super_y_loadings (pd.DataFrame, shape (n_targets, n_components)) – Y-block loadings
c.block_scores (dict[str, pd.DataFrame]) – Per-block X-scores
t_b, each shape(n_samples, n_components).block_weights (dict[str, pd.DataFrame]) – Per-block X-weights
w_b, each shape(K_b, n_components). Each column has unit norm.block_loadings (dict[str, pd.DataFrame]) – Per-block X-loadings
p_b(used for deflation), each shape(K_b, n_components).predictions (pd.DataFrame, shape (n_samples, n_targets)) – In-sample Y predictions on the original scale.
explained_variance (np.ndarray, shape (n_components,)) – Variance of the super-score per component (ddof=1).
scaling_factor_for_super_scores (pd.Series) –
sqrt(explained_variance_)per component.fitting_info (dict) – Per-component iteration count and timing.
has_missing_data (bool) – Whether any X-block or Y had NaN values.
algorithm – The resolved algorithm actually used for the fit. With
algorithm="auto", this is"dense"for complete data and"nipals"for NaN-containing data.
Notes
Block weighting uses the convention \(X_b / \sqrt{K_b}\) so that every block contributes the same total sum of squares to the super-score, regardless of how many variables it has.
Missing data#
When any X-block or Y contains NaN entries, the
"auto"algorithm routes to a mask-aware NIPALS variant. The X-block weights, block scores, block loadings used for deflation, Y-block loadings and Y-block scores are each computed as a regression that uses only the observed entries; the masked sum-of-squares is used as the denominator so missing values neither bias the latent direction nor contribute to the score. The mask is preserved across components automatically because deflation propagates NaN through subtraction. This is the standard skip-NaN NIPALS update; see Walczak & Massart (2001) and Arteaga & Ferrer (2002).The fit refuses to run if any X-block or Y has a column with all entries missing, or a row with all entries missing for that block; either case leaves the masked denominator at zero. Drop or impute such rows or columns before fitting. Predict-time score estimation for new observations with NaN (Trimmed Score Regression / Projection to the Model Plane) is a separate follow-up.
References
Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12 (1998), 301-321.
Westerhuis, J. A. & Smilde, A. K. Deflation in multiblock PLS. Journal of Chemometrics, 15 (2001), 485-493.
Walczak, B. & Massart, D. L. Dealing with missing data: Part I. Chemom. Intell. Lab. Syst., 58 (2001), 15-27.
Arteaga, F. & Ferrer, A. Dealing with missing data in MSPC: several methods, different interpretations, some examples. J. Chemometrics, 16 (2002), 408-418.
- block_spe_limit(block, conf_level=0.95)[source]#
SPE limit for one X-block using the Nomikos & MacGregor chi-square approximation.
Operates on the same scale as
block_spe_[block](sqrt of row sum of squares), so the value can be drawn directly on a SPE plot.
- super_spe_limit(conf_level=0.95)[source]#
SPE limit for the merged super-block (sum of per-block SPE squared).
- spe_contributions(X)[source]#
Per-variable squared residuals for each X-block (SPE contributions).
For each new observation and each X-block, reconstruct the block as
T_super @ P_b^T(matching the deflation step used during fit) and return the squared per-variable residuals. Useful for fault diagnosis: the variable with the largest contribution is the most likely culprit for a high SPE.
- score_contributions(t_super_start, t_super_end=None, components=None, *, weighted=False)[source]#
Per-block per-variable contributions to a super-score movement.
The multi-block analogue of
PCA.score_contributions(). Decomposes a super-score-space delta back into preprocessed-scale variable-space deltas, one per X-block.For MBPLS, the super-score at component a is built from the per-block scores (which themselves are weighted regressions of the block on
w_b), so the back-projection throughw_super[b,a] * w_b[:,a] / sqrt(K_b)gives the variable contribution.- Parameters:
t_super_start (array-like, shape (n_components,)) – Super-score row of the observation of interest. Typically a row from
self.super_scores_or frompredict(X_new).super_scores.t_super_end (array-like, optional) – Reference point in super-score space. Defaults to the model centre (zeros).
components (list of int, optional) – 1-based component indices to decompose over.
None(default) uses all components - appropriate for Hotelling’s T² contributions.weighted (bool, default=False) – If
True, divide the super-score delta bysqrt(explained_variance_)per component before back-projecting, giving contributions to the T² statistic instead of the Euclidean super-score distance.
- Returns:
One Series per X-block (length
K_b), indexed by variable (column) name.- Return type:
- super_weights_bar_plot(component=1)[source]#
Bar plot of super-weights
w_superfor a single component.- Parameters:
component (int)
- Return type:
Figure
- predictions_vs_observed_plot(y_observed, variable=None)[source]#
Scatter plot of predicted vs observed Y, with y=x reference and RMSEE annotation.
- Parameters:
y_observed (pd.DataFrame) – The observed Y on the original scale, same columns as the training Y.
variable (str or None, default=None) – If given, plot only that Y-variable. If
None, plot the first one.
- Return type:
Figure
- display_results(show_cumulative=True)[source]#
Format a short text summary of per-block R²X, overall R²Y, iterations and timing.
- diagnose(X)[source]#
Project new data and return the full diagnostics Bunch.
Returns a
sklearn.utils.Bunchwith fieldssuper_scores(DataFrame, n_samples x n_components),block_scores(dict[str, DataFrame]),predictions(DataFrame on original Y scale),block_spe(dict[str, Series], per-block SPE of the new observations) andhotellings_t2(Series of cumulative Hotelling’s T² over all components, per new observation).The rename (since 1.38.4, #395) matches
PLS.diagnose()and PCA.diagnose;predict()is kept as a deprecation shim.
- predict(X)[source]#
Forward to
diagnose(); emits aDeprecationWarning.Deprecated since version 1.38.4: Use
MBPLS.diagnose()instead. The sklearn-conventionpredictname suggests a regression-style ndarray return, but the historical return is the rich diagnostics Bunch. The rename aligns withPLS.diagnose()andPCA.diagnose()and frees the name for a future contract that returns just thepredictionsfield. Will be removed in 2.0.0.
- set_score_request(*, sample_weight='$UNCHANGED$')#
Configure whether metadata should be requested to be passed to the
scoremethod.Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with
enable_metadata_routing=True(seesklearn.set_config()). Please check the User Guide on how the routing mechanism works.The options for each parameter are:
True: metadata is requested, and passed toscoreif provided. The request is ignored if metadata is not provided.False: metadata is not requested and the meta-estimator will not pass it toscore.None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (
sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.Added in version 1.3.
- process_improve.multivariate.methods.randomization_test_mbpls(model, X, y, n_permutations=200, *, seed=None)[source]#
Randomization (permutation) test for component significance in MBPLS.
For each component
a, the null hypothesis is “there is no real relationship between X and Y at this component”; the test permutes the rows ofy, refits a fresh MBPLS with the same number of components, and recomputes the test statistic. The risk is the fraction of permutations whose statistic equals or exceeds the original model’s.Statistic: per-component absolute correlation between the super X-score and the super Y-score,
|t_super(:,a)' u_super(:,a)| / (||t|| * ||u||), matching the legacy ConnectMV randomization-objective for PLS.- Parameters:
model (MBPLS) – A fitted MBPLS model.
X (dict[str, DataFrame], DataFrame) – The same training data used to fit
model.y (dict[str, DataFrame], DataFrame) – The same training data used to fit
model.n_permutations (int, default=200) – Number of Y-row permutations to evaluate.
seed (int or None, default=None) – Seed for the permutation RNG (
Noneuses non-reproducible randomness).
- Returns:
Indexed by component
1..Awith columns:observed: the actual model’s per-component statistic.risk_pct: fraction (in %) of permutations with statistic>= observed. Low values (e.g. < 5%) suggest the component is significant; values near 50% suggest the component is no better than chance.
- Return type:
pd.DataFrame
References
Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S. & Faber, K. A randomization test for PLS component selection. J. Chemometrics, 21 (2007), 427-439.
MBPCA#
Multi-block PCA / consensus-PCA. Same dict-of-DataFrames API as
MBPLS; no Y-block.
- class process_improve.multivariate.methods.MBPCA(n_components, *, max_iter=500, tol=None, algorithm='auto', missing_data_settings=None)[source]#
Bases:
_HotellingsT2LimitMixin,TransformerMixin,BaseEstimatorMulti-block PCA (hierarchical / consensus PCA).
Generic multi-block PCA following the consensus-PCA / hierarchical PCA formulation of Westerhuis, Kourti & MacGregor (1998). Each X-block is preprocessed independently (mean-centred and unit-variance scaled), then divided by
sqrt(K_b)so blocks of unequal width contribute fairly to the consensus super-score.The hierarchical NIPALS loop alternates: (i) regress each block on the super-score to get block loadings and block scores, (ii) collect block scores into a super-block, (iii) regress the super-block to get a new super-score / super-loading, repeat to convergence. After convergence, deflate every block by the super-score and the corresponding block loading scaled by the super-loading element.
- Parameters:
n_components (int)
max_iter (int, default=500)
tol (float or None, default=None) – Convergence tolerance on the super-score change.
Noneusesnp.finfo(float).eps ** (9/10)(matches the legacy reference).algorithm (str, default="auto") –
Algorithm to use for fitting the model.
"auto": dense vectorised hierarchical NIPALS when the data is complete; mask-aware NIPALS (NaN-tolerant) when any block contains missing values."dense": dense vectorised hierarchical NIPALS. Raises if any block contains missing values."nipals": mask-aware hierarchical NIPALS. Always uses the NaN-tolerant inner-loop primitives, even when the data is complete (slower than"dense"but produces equivalent results).
missing_data_settings (dict or None, default=None) – Settings for the iterative
"nipals"path. Keys:md_tol(convergence tolerance on the score-vector change between iterations),md_max_iter(maximum NIPALS iterations per component). Defaults to{"md_tol": epsqrt, "md_max_iter": 1000}.fitting) (Attributes (after)
--------------------------
block_names
MBPLS) (block_widths (as)
A) (super_loadings DataFrame (B x)
A)
block_scores
dict[str (r2_x_per_variable)
DataFrame]
r2_x_per_block_cumulative
r2_x_per_block_per_component
dict[str
DataFrame]
block_vip
super_vip
block_spe
block_hotellings_t2
super_hotellings_t2
explained_variance
scaling_factor_for_super_scores
fitting_info
has_missing_data
algorithm
Notes
The deflation step is \(X_b \leftarrow X_b - t_{\rm super}\, (p_b\,p_s[b]\,\sqrt{K_b})^\top\), derived in Westerhuis et al. 1998. The legacy MATLAB
mbpca.mhad this step marked as broken by the original author; this implementation re-derives it directly from the paper and is independently validated against the pure-numpy reference oracles in the test suite.Missing data#
When any block contains NaN entries, the
"auto"algorithm routes to a mask-aware NIPALS variant. Each per-block projection in the inner loop is computed as a regression that uses only the observed entries; the masked sum-of-squares is used as the denominator so missing values neither bias the loading direction nor contribute to the score. The mask is preserved across components automatically because deflation propagates NaN through subtraction. This is the standard skip-NaN NIPALS update; see Walczak & Massart (2001) and Arteaga & Ferrer (2002).The fit refuses to run if any block has a column with all entries missing, or any block has a row with all entries missing for that block; either case leaves the masked denominator at zero. Drop or impute such rows or columns before fitting. Predict-time score estimation for new observations with NaN (Trimmed Score Regression / Projection to the Model Plane) is a separate follow-up.
References
Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. J. Chemometrics, 12 (1998), 301-321.
Walczak, B. & Massart, D. L. Dealing with missing data: Part I. Chemom. Intell. Lab. Syst., 58 (2001), 15-27.
Arteaga, F. & Ferrer, A. Dealing with missing data in MSPC: several methods, different interpretations, some examples. J. Chemometrics, 16 (2002), 408-418.
- diagnose(X)[source]#
Project new data; return super_scores, block_scores, block_spe, hotellings_t2.
The rename (since 1.38.4, #395) matches
PCA.diagnose()andPLS.diagnose();predict()is kept as a deprecation shim.
- predict(X)[source]#
Forward to
diagnose(); emits aDeprecationWarning.Deprecated since version 1.38.4: Use
MBPCA.diagnose()instead.predictmatches the sklearn-convention name but MBPCA isn’t a regressor; the historical return is a diagnostics Bunch. The rename aligns withPCA.diagnose(). Will be removed in 2.0.0.
- block_spe_limit(block, conf_level=0.95)[source]#
SPE limit for one X-block (Nomikos & MacGregor chi-square approximation).
- spe_contributions(X)[source]#
Per-variable squared residuals for each X-block (SPE contributions).
Reconstruction matches the MBPCA deflation step:
X_b = T_super @ (P_b * p_super[b] * sqrt(K_b))^Tsummed over components. Returns squared residuals on the preprocessed scale; sum across columns equalsblock_spe_[b].iloc[:, -1] ** 2.
- score_contributions(t_super_start, t_super_end=None, components=None, *, weighted=False)[source]#
Per-block per-variable contributions to a super-score movement (MBPCA).
Decomposes a super-score-space delta into preprocessed-scale variable contributions per X-block. The MBPCA back-projection mirrors the deflation step used during fit:
\[\text{contrib}_{b,j} = \sum_a (\Delta t_\mathrm{super}[a]) \cdot P_b[j, a] \cdot p_\mathrm{super}[b, a] \cdot \sqrt{K_b}\]See
MBPLS.score_contributions()for the parameter and return descriptions; the API is identical.
- super_score_plot(pc_horiz=1, pc_vert=2)[source]#
Scatter plot of MBPCA super-scores for two components.
Analysis#
- process_improve.multivariate.methods.rv_coefficient(X, Y)[source]#
Compute the RV coefficient between two data blocks.
The RV coefficient (Robert and Escoufier, 1976) measures how much common structure two matrices, measured on the same observations, share. It is a multivariate generalisation of the squared Pearson correlation: it compares the observation-by-observation configuration matrices \(XX^T\) and \(YY^T\) rather than individual variables.
- Parameters:
- Returns:
The RV coefficient in the range [0, 1]. A value of 1 means the two blocks describe the same configuration of observations up to a rotation and an overall scaling; 0 means no shared structure.
nanis returned if either block has no variance.- Return type:
Notes
Each column is mean-centred internally, since the RV coefficient is defined on centred data. The blocks are not scaled; scale the columns yourself (for example with
MCUVScaler) when the variables have different units.For high-dimensional data (many more variables than observations) the RV coefficient is biased upwards and tends towards 1 even for unrelated blocks. Use
rv2_coefficient()in that regime.References
Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Journal of the Royal Statistical Society, Series C, 25(3), 257-265.
See also
rv2_coefficientModified RV coefficient, unbiased for high-dimensional data.
Examples
>>> rv_coefficient(X, Y) >>> rv_coefficient(X, X) # 1.0: a block is perfectly correlated with itself
- process_improve.multivariate.methods.rv2_coefficient(X, Y)[source]#
Compute the modified RV coefficient (RV2) between two data blocks.
The modified RV coefficient (Smilde et al., 2009) is a variant of
rv_coefficient()that removes the diagonals of the configuration matrices \(XX^T\) and \(YY^T\) before comparing them. This removes the upward bias that makes the ordinary RV coefficient tend towards 1 for high-dimensional data, so RV2 stays near 0 for genuinely unrelated blocks.- Parameters:
- Returns:
The modified RV coefficient, in the range [-1, 1]. A value of 1 means the two blocks describe the same configuration of observations; values near 0 mean no shared structure, and small negative values can occur.
nanis returned if either block has no variance.- Return type:
Notes
Each column is mean-centred internally but the blocks are not scaled; scale the columns yourself (for example with
MCUVScaler) when the variables have different units.References
Smilde, A. K., Kiers, H. A. L., Bijlsma, S., Rubingh, C. M. and van Erk, M. J. (2009). Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics, 25(3), 401-405.
See also
rv_coefficientThe original RV coefficient.
Examples
>>> rv2_coefficient(X, Y)
Preprocessing#
- class process_improve.multivariate.methods.MCUVScaler[source]#
Bases:
TransformerMixin,BaseEstimatorMean-centre, unit-variance (MCUV) scaler.
Unlike
sklearn.preprocessing.StandardScalerthis uses the sample standard deviation (ddof=1), the convention for chemometric data analysis where the population is the training set itself rather than a sampled super-population.The estimator follows the standard sklearn contract:
n_features_in_andfeature_names_in_are populated byfit; sparse / complex / object dtype / empty input are rejected with sklearn-style errors; NaN values pass through (the chemometric preprocessing pipeline expects to thread missing-data through to the downstream NIPALS estimator).- get_feature_names_out(input_features=None)[source]#
Return the output column names of
transform().MCUVScaleris column-preserving (centring + scaling leave the X column layout unchanged), so the returned names mirror those captured duringfit()(or theinput_featuresargument when nofeature_names_in_was captured - the standard sklearn fallback for ndarray-fit estimators).Used by
set_output()(sklearn 1.2+) to label theDataFrameview of the output whenset_output(transform="pandas")is on, and by Pipeline introspection.- Return type:
- fit(X, y=None)[source]#
Compute the column means and sample standard deviations.
yis accepted (and ignored) so the scaler plugs intosklearn.pipeline.Pipeline, which threadsythrough every step’sfiteven when (as for a transformer) it is unused.- Parameters:
- Return type:
- process_improve.multivariate.methods.center(X, func=<function mean>, axis=0, extra_output=False)[source]#
Perform centering of data, using a function, func (default: np.mean). The function, if supplied, must return a vector with as many columns as the matrix X.
axis [optional; default=0] {integer or None}
This specifies the axis along which the centering vector will be calculated if not provided. The function is applied along the axis: 0=down the columns; 1 = across the rows.
Missing values: The sample mean is computed by taking the sum along the axis, skipping any missing data, and dividing by N = number of values which are present. Values which were missing before, are left as missing after.
- process_improve.multivariate.methods.scale(X, func=<function std>, axis=0, extra_output=False, ddof=0, **kwargs)[source]#
Scales the data (does NOT do any centering); scales to unit variance by default.
- func [optional; default=np.std] {a function}
The default (np.std) uses NumPy to calculate the standard deviation of the data along the required axis, skipping over any missing data, and uses that as scale.
- axis [optional; default=0] {integer}
Transformations are applied on slices of data. This specifies the axis along which the transformation will be applied.
- ddof [optional; default=0] {integer}
Delta degrees of freedom, forwarded to np.std when func is the default np.std. The standard deviation is computed by dividing by
N - ddof, where N is the number of values which are present. The default (ddof=0) divides by N (the population standard deviation); passddof=1for the sample standard deviation (dividing by N-1).Note:
MCUVScalerusesddof=1and is the preferred scaler for fitting PCA / PLS models. Usescale(center(X), ddof=1)here to match it. Theddofargument is ignored when a custom func is supplied (forward your own keyword arguments via**kwargsinstead).
Constant (zero-variance) columns are left unchanged: a zero entry in the computed scaling vector is replaced by 1.0 before inversion, mirroring
MCUVScaler, so noinf/NaNis introduced.Usage#
X = … # data matrix X = scale(center(X)) X = scale(center(X), ddof=1) # sample standard deviation, matches MCUVScaler my_scale = np.mad X = scale(center(X), func=my_scale)
- returns:
scaled (DataMatrix) – The scaled data, returned when
extra_output=False(the default).(scaled, scale_vector) (tuple[DataMatrix, np.ndarray]) – When
extra_output=True, a tuple of the scaled data and the per-column scaling vector (the reciprocal of func applied along axis, with zero entries replaced by 1.0 to leave constant columns unchanged) is returned instead.
Diagnostics#
These functions work with fitted PCA and PLS models. Each is
also bound as a convenience method on the model after fit().
Note
Two different “contributions” diagnostics. The library has two methods whose names both contain “contributions”; they are not interchangeable and answer different questions about the same fitted score matrix.
PCA.score_contributions()(andPLS.score_contributions()) is per-variable and signed. It decomposes a single observation’s movement in score space back onto the original variables, answering “which variables explain why this observation sits where it does?”. It returns one signed value per variable, and it takes an observation’s score vector as input.observation_contributions()is per-observation and non-negative. It reports each observation’s share of a component’s total inertia (\(t_{ia}^2 / \sum_i t_{ia}^2\)), answering “which observations most strongly shape this component?”. It returns a sample-by-component table whose columns each sum to 1, and it takes no input beyond the fitted model.
In short, score_contributions decomposes across variables while
observation_contributions decomposes across observations.
- process_improve.multivariate.methods.vip(model, n_components=None)[source]#
Calculate Variable Importance in Projection (VIP) scores.
Works with fitted
PCAandPLSmodels. For PCA the principal-component loadingsloadings_are used as the weight matrix; for PLS the X-block weightsx_weights_are used.The formula is:
\[\begin{split}\\text{VIP}_j = \\sqrt{K \\cdot \\frac{\\sum_{a=1}^{A} r2_a \\cdot w_{ja}^2}{\\sum_{a=1}^{A} r2_a}}\end{split}\]where \(K\) is the number of features, \(A\) the number of components, \(r2_a\) the fraction of variance explained by component \(a\), and \(w_{ja}\) the weight for feature \(j\) in component \(a\).
- Parameters:
- Returns:
VIP scores indexed by feature names, named
"VIP".- Return type:
pd.Series
- Raises:
ValueError – If the model is not fitted, if neither
x_weights_norloadings_is found, or if n_components is out of range.
Examples
>>> pls = PLS(n_components=3).fit(X_scaled, Y_scaled) >>> pls.vip() # bound convenience method after fit() >>> vip(pls) # or call the standalone function directly >>> pca = PCA(n_components=3).fit(X_scaled) >>> pca.vip(n_components=2)
- process_improve.multivariate.methods.squared_cosine(model, n_components=None)[source]#
Calculate the squared cosine (cos2): quality of representation of observations.
Works with fitted
PCAandPLSmodels. The squared cosine of observation \(i\) on component \(a\) is the squared score divided by that observation’s total variation budget:\[\begin{split}\\cos^2_{ia} = \\frac{t_{ia}^2} {\\sum_{a=1}^{A} t_{ia}^2 + \\text{SPE}_i^2}\end{split}\]where \(t_{ia}\) is the score and \(\\text{SPE}_i\) the residual (squared prediction error) of the observation. Across all components the cos2 values plus the residual fraction sum to 1. A value close to 1 means the observation is well represented on that component. For
PCA, whose loadings are orthonormal, the denominator equals the squared distance of the observation from the origin, matching the classical definition.cos2 complements the existing diagnostics: Hotelling’s T² measures distance within the model plane, SPE measures distance to it, and cos2 reports how much of an observation’s total variation a given component captures.
- Parameters:
- Returns:
cos2 values of shape (n_samples, n_components), indexed by sample.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the model is not fitted, or if n_components is out of range.
Examples
>>> pca = PCA(n_components=3).fit(X_scaled) >>> pca.squared_cosine() # bound convenience method after fit() >>> squared_cosine(pca, n_components=2) # or call the function directly
- process_improve.multivariate.methods.observation_contributions(model, n_components=None)[source]#
Calculate the contribution of each observation to each component.
Works with fitted
PCAandPLSmodels. The contribution of observation \(i\) to component \(a\) is its squared score divided by the sum of squared scores of all observations on that component:\[\begin{split}\\text{contribution}_{ia} = \\frac{t_{ia}^2}{\\sum_{i=1}^{N} t_{ia}^2}\end{split}\]Values lie between 0 and 1 and each column sums to 1, so a contribution well above the average \(1/N\) flags an observation that strongly shapes that component.
Note that this is not the same diagnostic as the
score_contributionsmethod, despite the similar name.score_contributionsis per-variable and signed: it decomposes one observation’s position in score space back onto the original variables (“which variables explain why this observation sits where it does?”).observation_contributionsis per-observation and non-negative: it reports each observation’s share of a component’s total inertia (“which observations most strongly shape this component?”). The two are orthogonal views of the same score matrix and are not interchangeable.- Parameters:
- Returns:
Contributions of shape (n_samples, n_components), indexed by sample. Each column sums to 1.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the model is not fitted, or if n_components is out of range.
Examples
>>> pca = PCA(n_components=3).fit(X_scaled) >>> pca.observation_contributions() >>> observation_contributions(pca, n_components=2)
See also
PCA.score_contributionsThe per-variable counterpart - decomposes one observation’s score-space position back onto the original variables.
- process_improve.multivariate.methods.eigenvalue_summary(model)[source]#
Summarize the variance captured by each component as a tidy table.
Works with fitted
PCAandPLSmodels. Returns one row per component, collectingexplained_variance_,r2_per_component_andr2_cumulative_into a single table.- Parameters:
- Returns:
Indexed by component, with columns
eigenvalue(the variance of the component scores),percent_varianceandcumulative_percent. For PCA the percentages refer to variance in X; for PLS they refer to the variance in Y explained by each component.- Return type:
pd.DataFrame
- Raises:
ValueError – If the model is not fitted.
Examples
>>> pca = PCA(n_components=3).fit(X_scaled) >>> pca.eigenvalue_summary() >>> eigenvalue_summary(pca)
- process_improve.multivariate.methods.project_variables(model, supplementary_data)[source]#
Project supplementary (passive) variables onto a fitted model.
Works with fitted
PCAandPLSmodels. Supplementary variables are extra columns that did not take part in fitting the model but were measured on the same observations. Each supplementary variable is represented by its correlation with each component’s scores, the standard representation for passive quantitative variables. This is the column-wise counterpart oftransform, which projects supplementary rows (new observations).- Parameters:
- Returns:
Correlations of shape (n_supplementary, n_components): the coordinate of each supplementary variable on each component.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the model is not fitted, or if supplementary_data does not have the same number of rows as the training data.
Examples
>>> pca = PCA(n_components=3).fit(X_scaled) >>> pca.project_variables(passive_columns) >>> project_variables(pca, passive_columns)
Plots#
- process_improve.multivariate.plots.score_plot(model, pc_horiz=1, pc_vert=2, pc_depth=-1, items_to_highlight=None, settings=None, fig=None)[source]#
Generate a 2D or 3D score plot for the given latent variable model.
A 2D scatter on (
pc_horiz,pc_vert) is produced by default. Supplyingpc_depth >= 1adds a third score axis and switches the underlying trace toScatter3d.- Parameters:
model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.
pc_horiz (int, optional) – Which component to plot on the horizontal axis, by default 1 (the first component)
pc_vert (int, optional) – Which component to plot on the vertical axis, by default 2 (the second component)
pc_depth (int, optional) – If pc_depth >= 1, then a 3D score plot is generated, with this component on the 3rd axis
items_to_highlight (dict, optional) –
Keys are JSON strings parseable by
json.loadsinto a Plotly line specifier; values are lists of index names to highlight. For example:items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
will highlight the items in
items_in_redwith the given colour and shape.settings (dict) –
Default settings:
{ "show_ellipse": True, # bool: show the Hotelling's T2 ellipse "ellipse_conf_level": 0.95, # float: ellipse confidence level (< 1.00) "title": "Score plot of ...", # str: overall plot title "show_labels": False, # bool: add a label for each observation "show_legend": True, # bool: show clickable legend "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 16/9, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (Figure | None)
- Return type:
Figure
Examples
>>> pca = PCA(n_components=3).fit(X_scaled) >>> pca.score_plot() # PC1 vs PC2 >>> pca.score_plot(pc_horiz=1, pc_vert=3) # PC1 vs PC3 >>> pca.score_plot(pc_horiz=1, pc_vert=2, pc_depth=3) # 3D
- process_improve.multivariate.plots.loading_plot(model, loadings_type='p', pc_horiz=1, pc_vert=2, settings=None, fig=None)[source]#
Generate a 2-dimensional loadings for the given latent variable model.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.
loadings_type (str, optional) –
- A choice of the following:
’p’ : (default for PCA) : the P (projection) loadings: only option possible for PCA ‘w’ : the W loadings: Suitable for PLS ‘w*’ : (default for PLS) the W* (or R) loadings: Suitable for PLS ‘w*c’ : the W* (from X-space) with C loadings from the Y-space: Suitable for PLS ‘c’ : the C loadings from the Y-space: Suitable for PLS
For PCA model any other choice besides ‘p’ will be ignored.
pc_horiz (int, optional) – Which component to plot on the horizontal axis, by default 1 (the first component)
pc_vert (int, optional) – Which component to plot on the vertical axis, by default 2 (the second component)
settings (dict) –
Default settings:
{ "title": "Loadings plot ...", # str: overall plot title "show_labels": True, # bool: add a label for each variable "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 16/9, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (Figure | None)
- Return type:
Figure
Examples
>>> pca.loading_plot() # P loadings, PC1 vs PC2 >>> pls.loading_plot(loadings_type="w*c") # W* and C loadings >>> pls.loading_plot(loadings_type="w", pc_vert=3) # W loadings, PC1 vs PC3
- process_improve.multivariate.plots.spe_plot(model, with_a=-1, items_to_highlight=None, settings=None, fig=None)[source]#
Generate a squared-prediction error (SPE) plot for the given latent variable model using with_a number of latent variables. The default will use the total number of latent variables which have already been fitted.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.
with_a (int, optional) – Uses this many number of latent variables, and therefore shows the SPE after this number of model components. By default the total number of components fitted will be used.
items_to_highlight (dict, optional) –
Keys are JSON strings parseable by
json.loadsinto a Plotly line specifier; values are lists of index names to highlight. For example:items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
will highlight the items in
items_in_redwith the given colour and shape.settings (dict) –
Default settings:
{ "show_limit": True, # bool: show the SPE confidence limit line "conf_level": 0.95, # float: confidence level for limit (< 1.00) "title": "SPE plot ...", # str: overall plot title "default_marker": {...}, # dict: e.g. dict(symbol="circle", size=7) "show_labels": False, # bool: add a label for each observation "show_legend": False, # bool: show clickable legend "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 16/9, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (Figure | None)
- Return type:
Figure
Examples
>>> pca.spe_plot() >>> pca.spe_plot(settings={"conf_level": 0.99, "show_labels": True})
- process_improve.multivariate.plots.t2_plot(model, with_a=-1, items_to_highlight=None, settings=None, fig=None)[source]#
Generate a Hotelling’s T2 (T^2) plot for the given latent variable model using with_a number of latent variables. The default will use the total number of latent variables which have already been fitted.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A latent variable model generated by this library.
with_a (int, optional) – Uses this many number of latent variables, and therefore shows the SPE after this number of model components. By default the total number of components fitted will be used.
items_to_highlight (dict, optional) –
Keys are JSON strings parseable by
json.loadsinto a Plotly line specifier; values are lists of index names to highlight. For example:items_to_highlight = {'{"color": "red", "symbol": "cross"}': items_in_red}
will highlight the items in
items_in_redwith the given colour and shape.settings (dict) –
Default settings:
{ "show_limit": True, # bool: show the T2 confidence limit line "conf_level": 0.95, # float: confidence level for limit (< 1.00) "title": "T2 plot ...", # str: overall plot title "default_marker": {...}, # dict: e.g. dict(symbol="circle", size=7) "show_labels": False, # bool: add a label for each observation "show_legend": False, # bool: show clickable legend "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 16/9, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (Figure | None)
- Return type:
Figure
Examples
>>> pca.t2_plot() >>> pca.t2_plot(settings={"conf_level": 0.99, "show_labels": True})
- process_improve.multivariate.plots.explained_variance_plot(model, settings=None, fig=None)[source]#
Generate an explained-variance plot for a fitted latent variable model.
Shows the variance explained by each component as bars, with the cumulative variance explained overlaid as a line. For PCA the variance refers to the X-block; for PLS it refers to the Y-block.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A fitted latent variable model generated by this library.
settings (dict) –
Default settings:
{ "as_percentage": True, # bool: y-axis as a percentage, else a fraction "title": "Variance explained ...", # str: overall plot title "bar_color": None, # str|None: bar colour; None uses the theme "line_color": None, # str|None: line colour; None uses the theme "show_legend": True, # bool: show clickable legend "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 16/9, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.
- Return type:
Figure
Examples
>>> pca.explained_variance_plot() >>> pls.explained_variance_plot(settings={"as_percentage": False})
- process_improve.multivariate.plots.correlation_loadings_plot(model, pc_horiz=1, pc_vert=2, variance_ellipses=(0.5, 1.0), settings=None, fig=None)[source]#
Generate a correlation loadings plot for a fitted latent variable model.
Each variable is placed by its correlation with the scores of two components. A variable’s squared distance from the origin is the fraction of its variance explained by those two components, so every variable lies inside the unit circle. Concentric ellipses mark variance-explained thresholds: a variable beyond the 50% ellipse has at least half of its variance captured by the two components shown.
For PCA the X-variables are shown. For PLS both the X-variables and the Y-variables are overlaid against the X-scores, which reveals how process variables relate to quality variables.
- Parameters:
model (MVmodel object (PCA, or PLS)) – A fitted latent variable model generated by this library.
pc_horiz (int, default 1) – Component shown on the horizontal axis (1-based).
pc_vert (int, default 2) – Component shown on the vertical axis (1-based).
variance_ellipses (sequence of float, default (0.5, 1.0)) – Variance-explained thresholds, each a fraction in (0, 1], at which to draw a concentric ellipse. The conventional choice is the 50% and 100% ellipses; any other thresholds (for example 0.75 and 0.95) are equally valid.
settings (dict) –
Default settings:
{ "title": "Correlation loadings ...", # str: overall plot title "x_marker_color": None, # str|None: X-variable marker colour; None uses the theme "y_marker_color": None, # str|None: Y-variable marker colour (PLS); None uses the theme "ellipse_color": "grey", # str: colour of the variance ellipses "show_labels": True, # bool: label each variable "show_legend": True, # bool: show clickable legend (PLS only) "html_image_height": 600, # int: image height in pixels "html_aspect_ratio_w_over_h": 1.0, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.
- Return type:
Figure
Examples
>>> pca.correlation_loadings_plot() >>> pls.correlation_loadings_plot(pc_horiz=1, pc_vert=3) >>> pca.correlation_loadings_plot(variance_ellipses=(0.75, 0.95))
- process_improve.multivariate.plots.predictions_vs_observed_plot(model, *, y_observed, variable=None, settings=None, fig=None)[source]#
Generate an observed-vs-predicted (parity) plot for a fitted PLS model.
Plots the calibration predictions against the observed Y values, with a
y = xreference line and an RMSE annotation. Points lying close to the reference line indicate good predictions.- Parameters:
model (PLS object) – A fitted PLS model generated by this library.
y_observed (array-like of shape (n_samples, n_targets)) – The observed Y values, on the same scale as the data used to fit the model (for example the scaled Y from
MCUVScaler).variable (str, optional) – Which Y-variable to plot. Defaults to the first Y-variable.
settings (dict) –
Default settings:
{ "title": "Observed vs predicted ...", # str: overall plot title "marker_color": None, # str|None: data-marker colour; None uses the theme "reference_color": "#9CA3AF", # str: colour of the y = x line "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 1.0, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.
- Return type:
Figure
Examples
>>> pls.predictions_vs_observed_plot(y_observed=Y_scaled) >>> pls.predictions_vs_observed_plot(y_observed=Y_scaled, variable="quality")
- process_improve.multivariate.plots.coefficient_plot(model, variable=None, settings=None, fig=None)[source]#
Generate a bar plot of the PLS regression coefficients.
Shows
beta_coefficients_for one Y-variable: one bar per X-variable, mapping the (preprocessed) X onto the predicted Y. Tall bars mark the X-variables that most strongly drive the prediction.- Parameters:
model (PLS object) – A fitted PLS model generated by this library.
variable (str, optional) – Which Y-variable’s coefficients to plot. Defaults to the first one.
settings (dict) –
Default settings:
{ "title": "Regression coefficients ...", # str: overall plot title "bar_color": None, # str|None: bar colour; None uses the theme "html_image_height": 500, # int: image height in pixels "html_aspect_ratio_w_over_h": 16/9, # float: width as ratio of height "template": "pi_journal", # str: registered Plotly theme name }
fig (go.Figure, optional) – An existing figure to draw onto. A new figure is created if omitted.
- Return type:
Figure
Examples
>>> pls.coefficient_plot() >>> pls.coefficient_plot(variable="quality")