Multi-block PCA and PLS (MBPCA / MBPLS)
Generic multi-block latent-variable models for the case where your X data is naturally organized into several semantically distinct blocks - for example, one block per processing zone, plant unit, or sensor group.
Multi-block models give you per-block diagnostics (which block is driving a fault? which block is most predictive of Y?) that get lost when every variable is dumped into a single big-X model.
When to Use
You have multiple X-blocks (typically 2-10) sharing the same row observations but different column variables.
Each block has a clear semantic meaning (e.g. zone1, zone2, feed, utilities, quality_lab).
You want per-block bookkeeping: per-block R²X, per-block VIPs, per-block SPE / Hotelling’s T², per-block contribution plots.
For MBPLS, you also have a single Y-block and want to know how each X-block contributes to predicting Y.
If you only have one big X (no semantic block structure), use PCA or PLS instead. If your data follows the fixed D / F / Z / Y structure (database of properties, formulations, process conditions, quality), use TPLS.
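As a minimal sketch of the data layout described above, the snippet below splits one wide table into named blocks that share the same row index. The helper split_into_blocks, the column names, and the values are hypothetical and purely illustrative; they are not part of the package.

```python
import pandas as pd


def split_into_blocks(wide, block_columns):
    """Split one wide DataFrame into named X-blocks that share its row index.

    Hypothetical helper for illustration; not part of process_improve.
    """
    blocks, used = {}, set()
    for name, columns in block_columns.items():
        columns = list(columns)
        reused = used.intersection(columns)
        if reused:
            # A variable should belong to exactly one block.
            raise ValueError(f"Columns assigned to more than one block: {sorted(reused)}")
        used.update(columns)
        blocks[name] = wide.loc[:, columns]
    return blocks


# Made-up example data: three observations, four variables.
wide = pd.DataFrame(
    {
        "zone1_temp": [210.0, 212.5, 209.8],
        "zone1_pressure": [3.1, 3.0, 3.2],
        "zone2_temp": [250.2, 251.0, 249.5],
        "feed_rate": [12.0, 11.8, 12.1],
    },
    index=["obs_a", "obs_b", "obs_c"],
)
x_blocks = split_into_blocks(
    wide,
    {
        "zone1": ["zone1_temp", "zone1_pressure"],
        "zone2": ["zone2_temp"],
        "feed": ["feed_rate"],
    },
)
```

Every block keeps the full row index of the wide table, which is the precondition the multi-block models rely on.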
How It Works
The multi-block / hierarchical / consensus formulation of Westerhuis, Kourti & MacGregor (1998):
1. Each X-block is preprocessed independently (mean-centred and unit-variance scaled with MCUVScaler).
2. Each block is divided by sqrt(K_b) (where K_b is the number of variables in block b) so blocks of unequal width contribute fairly to the consensus super-score.
3. Hierarchical NIPALS alternates between (i) computing per-block scores and weights / loadings against the current super-score, and (ii) collecting block scores into a super-block and refining the super-score / super-loading.
4. After convergence, each block is deflated using the super-score and the corresponding block loading.
The block-weighting in step 2 is fundamental: without it, blocks with many variables would dominate the super-score simply by virtue of their size.
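The steps above can be sketched in plain NumPy for a single consensus-PCA component. This is an illustrative sketch, not the package's implementation: the function names scale_blocks and consensus_pca_component are hypothetical, the sample standard deviation (ddof=1) is an assumption about the scaling, and the synthetic data exists only to make the example runnable.

```python
import numpy as np


def scale_blocks(blocks):
    """Autoscale each block, then divide it by sqrt(K_b)."""
    scaled = {}
    for name, X in blocks.items():
        X = np.asarray(X, dtype=float)
        # Mean-centre and scale to unit variance (ddof=1 is an assumption).
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        # Block weighting: K_b is the number of columns in this block.
        scaled[name] = Z / np.sqrt(Z.shape[1])
    return scaled


def consensus_pca_component(blocks, tol=1e-12, max_iter=500):
    """Extract one component with hierarchical NIPALS (consensus PCA)."""
    names = list(blocks)
    t_super = blocks[names[0]][:, 0].copy()  # arbitrary starting vector
    for _ in range(max_iter):
        # (i) unit-length block loadings and block scores vs. the super-score
        block_scores = []
        for name in names:
            p_b = blocks[name].T @ t_super / (t_super @ t_super)
            p_b /= np.linalg.norm(p_b)
            block_scores.append(blocks[name] @ p_b)
        # (ii) collect block scores into a super-block, refine the super-score
        T = np.column_stack(block_scores)
        w_super = T.T @ t_super / (t_super @ t_super)
        w_super /= np.linalg.norm(w_super)
        t_new = T @ w_super
        converged = np.linalg.norm(t_new - t_super) <= tol * np.linalg.norm(t_new)
        t_super = t_new
        if converged:
            break
    # Deflate every block with the super-score and its block loading.
    loadings, deflated = {}, {}
    for name in names:
        p_b = blocks[name].T @ t_super / (t_super @ t_super)
        loadings[name] = p_b
        deflated[name] = blocks[name] - np.outer(t_super, p_b)
    return t_super, w_super, loadings, deflated


# Synthetic two-block data driven by one shared latent variable.
rng = np.random.default_rng(0)
n = 50
latent = rng.normal(size=n)
raw = {
    "zone1": np.outer(latent, [1.0, 0.8, -0.9]) + 0.1 * rng.normal(size=(n, 3)),
    "zone2": np.outer(latent, [0.7, -1.0, 0.9, 0.8, -0.6]) + 0.1 * rng.normal(size=(n, 5)),
}
scaled = scale_blocks(raw)
t_super, w_super, loadings, deflated = consensus_pca_component(scaled)
```

After step 2 every block carries the same total sum of squares (n - 1 with this scaling), regardless of how many columns it has. With super-score deflation as written here, the converged super-score matches the first PCA score of the concatenated, block-scaled matrix up to sign.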
API at a glance
Both classes use the same dict[str, pd.DataFrame] API for X-blocks:
x_blocks = {
    "zone1": df_with_zone1_columns,  # all blocks share the row index
    "zone2": df_with_zone2_columns,
}
from process_improve.multivariate.methods import MBPCA, MBPLS
pca = MBPCA(n_components=3).fit(x_blocks)
pls = MBPLS(n_components=3).fit(x_blocks, y_df)
After fitting, every model exposes:
super_scores_, super_loadings_ (or super_weights_ for MBPLS)
block_scores_, block_loadings_ - both dict[str, DataFrame]
r2_x_per_block_cumulative_, r2_x_per_block_per_component_
block_vip_ - per-block VIPs as dict[str, Series]
block_spe_, block_hotellings_t2_, super_hotellings_t2_
predict(X_new) - returns super-scores, block-scores, per-block SPE, super Hotelling’s T² (and Y predictions for MBPLS)
spe_contributions(X) - per-variable squared residuals for fault diagnosis
block_spe_limit(name, conf_level), super_spe_limit(conf_level), hotellings_t2_limit(conf_level)
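To make two of these diagnostics concrete, here is a plain-NumPy sketch of what per-block R²X and per-variable squared residuals mean for a one-component model. It does not call the package: the score comes from an SVD of the concatenated blocks instead of NIPALS, the data is synthetic and only mean-centred, and the exact definitions used by the library (for example how SPE is scaled) may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
latent = rng.normal(size=n)

# Synthetic blocks: "zone1" follows the latent variable, "zone2" is weak noise.
blocks = {
    "zone1": np.outer(latent, [1.0, 0.9, -0.8]) + 0.2 * rng.normal(size=(n, 3)),
    "zone2": 0.5 * rng.normal(size=(n, 2)),
}
blocks = {name: X - X.mean(axis=0) for name, X in blocks.items()}

# One shared score vector: first PCA score of the concatenated blocks.
X_all = np.hstack(list(blocks.values()))
U, S, _ = np.linalg.svd(X_all, full_matrices=False)
t = U[:, 0] * S[0]

r2_x_per_block, squared_residuals = {}, {}
for name, X_b in blocks.items():
    p_b = X_b.T @ t / (t @ t)            # block loading for this score
    E_b = X_b - np.outer(t, p_b)         # block residuals after one component
    squared_residuals[name] = E_b ** 2   # per-variable contributions, one row per observation
    r2_x_per_block[name] = 1.0 - (E_b ** 2).sum() / (X_b ** 2).sum()

# Row sums of the squared residuals give a per-block, per-observation residual size.
block_residual_size = {name: sq.sum(axis=1) for name, sq in squared_residuals.items()}
```

In this sketch the block that follows the shared latent variable ends up with a much higher R²X than the noise block, which is the kind of per-block bookkeeping a single big-X model hides.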
Worked example: LDPE tubular reactor (MBPLS)
The LDPE dataset shipped with this package is a tubular polymer reactor with two zones; the Y-block is five quality variables. Splitting the X-block by reactor zone is a natural multi-block setup.
import pathlib
import pandas as pd
from process_improve.multivariate.methods import MBPLS, randomization_test_mbpls
folder = pathlib.Path("process_improve/datasets/multivariate/LDPE")
values = pd.read_csv(folder / "LDPE.csv", index_col=0)
# Reactor-zone split (1-based MATLAB indexes -> 0-based Python)
zone_1_idx = [0, 1, 2, 5, 7, 9, 11, 13]
zone_2_idx = [3, 4, 6, 8, 10, 12]
x_blocks = {
    "zone1": values.iloc[:, zone_1_idx],
    "zone2": values.iloc[:, zone_2_idx],
}
y_df = values.iloc[:, 14:]
model = MBPLS(n_components=3).fit(x_blocks, y_df)
print(model.display_results())
# How predictive is each component? Lower risk_pct = more significant.
sig = randomization_test_mbpls(model, x_blocks, y_df, n_permutations=200, seed=0)
print(sig)
# Top 5 variables contributing to a high-SPE observation
contribs = model.spe_contributions(x_blocks)
for name, df in contribs.items():
    # Row with the largest total squared residual in this block
    worst_row = df.sum(axis=1).idxmax()
    print(name, df.loc[worst_row].nlargest(5))
# Quick visual: super-score plot and predicted-vs-observed plot for the first Y variable
model.super_score_plot(pc_horiz=1, pc_vert=2).show()
model.predictions_vs_observed_plot(y_df, variable=str(y_df.columns[0])).show()
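The idea behind the randomization test used above can be sketched without the package: shuffle Y to break its link with X, recompute a fit statistic many times, and report how often the shuffled statistic reaches the observed one. The function permutation_risk_pct and the choice of absolute correlation as the statistic are assumptions made for illustration; the statistic actually used by randomization_test_mbpls may differ.

```python
import numpy as np


def permutation_risk_pct(x_score, y, n_permutations=200, seed=0):
    """Percentage of permutations whose statistic reaches the observed one.

    Hypothetical helper; the statistic here is |corr(x_score, y)|.
    """
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x_score, y)[0, 1])
    exceed = 0
    for _ in range(n_permutations):
        # Shuffling y destroys any real relationship with the score.
        permuted = abs(np.corrcoef(x_score, rng.permutation(y))[0, 1])
        if permuted >= observed:
            exceed += 1
    return 100.0 * exceed / n_permutations


rng = np.random.default_rng(42)
score = rng.normal(size=40)
y_related = 2.0 * score + 0.1 * rng.normal(size=40)  # strongly tied to the score
y_unrelated = rng.normal(size=40)                     # independent of the score

risk_related = permutation_risk_pct(score, y_related)
risk_unrelated = permutation_risk_pct(score, y_unrelated)
```

A low percentage means shuffled data almost never looks as good as the real fit, which is why a lower risk_pct indicates a more significant component.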
References
Westerhuis, J. A., Kourti, T. & MacGregor, J. F. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12 (1998), 301-321.
Westerhuis, J. A. & Smilde, A. K. Deflation in multiblock PLS. Journal of Chemometrics, 15 (2001), 485-493.
Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S. & Faber, K. A randomization test for PLS component selection. Journal of Chemometrics, 21 (2007), 427-439.