Multi-block PCA and PLS (MBPCA / MBPLS)#

Generic multi-block latent-variable models for the case where your X data is naturally organized into several semantically distinct blocks - for example, one block per processing zone, plant unit, or sensor group.

Multi-block models give you per-block diagnostics (which block is driving a fault? which block is most predictive of Y?) that get lost when every variable is dumped into a single big-X model.

When to Use#

You have multiple X-blocks (typically 2-10) sharing the same row observations but different column variables.
Each block has a clear semantic meaning (e.g. zone1, zone2, feed, utilities, quality_lab).
You want per-block bookkeeping: per-block R²X, per-block VIPs, per-block SPE / Hotelling’s T², per-block contribution plots.
For MBPLS, you also have a single Y-block and want to know how each X-block contributes to predicting Y.

If you only have one big X (no semantic block structure), use PCA or PLS instead. If your data follows the fixed D / F / Z / Y structure (database of properties, formulations, process conditions, quality), use TPLS.

How It Works#

The multi-block / hierarchical / consensus formulation of Westerhuis, Kourti & MacGregor (1998):

Each X-block is preprocessed independently (mean-centred and unit-variance scaled with MCUVScaler).
Each block is divided by sqrt(K_b) (where K_b is the number of variables in block b) so blocks of unequal width contribute fairly to the consensus super-score.
Hierarchical NIPALS alternates between (i) computing per-block scores and weights / loadings against the current super-score, and (ii) collecting block scores into a super-block and refining the super-score / super-loading.
After convergence, each block is deflated using the super-score and the corresponding block loading.

The block-weighting in step 2 is fundamental: without it, blocks with many variables would dominate the super-score simply by virtue of their size.

API at a glance#

Both classes use the same dict[str, pd.DataFrame] API for X-blocks:

x_blocks = {
    "zone1": df_with_zone1_columns,    # all blocks share the row index
    "zone2": df_with_zone2_columns,
}

from process_improve.multivariate.methods import MBPCA, MBPLS

pca = MBPCA(n_components=3).fit(x_blocks)
pls = MBPLS(n_components=3).fit(x_blocks, y_df)

After fitting, every model exposes:

super_scores_, super_loadings_ (or super_weights_ for MBPLS)
block_scores_, block_loadings_ - both dict[str, DataFrame]
r2_x_per_block_cumulative_, r2_x_per_block_per_component_
block_vip_ - per-block VIPs as dict[str, Series]
block_spe_, block_hotellings_t2_, super_hotellings_t2_
predict(X_new) - returns super-scores, block-scores, per-block SPE, super Hotelling’s T² (and Y predictions for MBPLS)
spe_contributions(X) - per-variable squared residuals for fault diagnosis
block_spe_limit(name, conf_level), super_spe_limit(conf_level), hotellings_t2_limit(conf_level)

Worked example: LDPE tubular reactor (MBPLS)#

The LDPE dataset shipped with this package is a tubular polymer reactor with two zones; the Y-block is five quality variables. Splitting the X-block by reactor zone is a natural multi-block setup.

import pathlib
import pandas as pd
from process_improve.multivariate.methods import MBPLS, randomization_test_mbpls

folder = pathlib.Path("process_improve/datasets/multivariate/LDPE")
values = pd.read_csv(folder / "LDPE.csv", index_col=0)

# Reactor-zone split (1-based MATLAB indexes -> 0-based Python)
zone_1_idx = [0, 1, 2, 5, 7, 9, 11, 13]
zone_2_idx = [3, 4, 6, 8, 10, 12]
x_blocks = {
    "zone1": values.iloc[:, zone_1_idx],
    "zone2": values.iloc[:, zone_2_idx],
}
y_df = values.iloc[:, 14:]

model = MBPLS(n_components=3).fit(x_blocks, y_df)
print(model.display_results())

# How predictive is each component? Lower risk_pct = more significant.
sig = randomization_test_mbpls(model, x_blocks, y_df, n_permutations=200, seed=0)
print(sig)

# Top 5 variables contributing to a high-SPE observation
contribs = model.spe_contributions(x_blocks)
for name, df in contribs.items():
    worst_row = df.sum(axis=1).idxmax()
    print(name, df.loc[worst_row].nlargest(5))

# Quick visual: super-score plot, RMSEE plot
model.super_score_plot(pc_horiz=1, pc_vert=2).show()
model.predictions_vs_observed_plot(y_df, variable=str(y_df.columns[0])).show()

References#

Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12 (1998), 301-321.
Westerhuis, J. A. & Smilde, A. K. Deflation in multiblock PLS. Journal of Chemometrics, 15 (2001), 485-493.
Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S. & Faber, K. A randomization test for PLS component selection. Journal of Chemometrics, 21 (2007), 427-439.