MBPCA with missing data: LDPE tubular reactor#
Companion to the MBPLS LDPE case study. Same dataset, same natural multi-block split, but cast as an unsupervised MBPCA problem: there is no Y target, so the quality block joins the process blocks as a fourth X-block and the model is asked to find a consensus super-score across all four. The mask-aware NIPALS path is enabled the same way as in MBPLS: leave algorithm="auto" (the default) and any NaN flips it to the iterative solver.
Data. LDPE.csv shipped under process_improve/datasets/multivariate/LDPE/.
Blocks (all treated as X for MBPCA).
Block |
Columns |
|---|---|
|
Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1 |
|
Tmax2, Tout2, Tcin2, z2, Fi2, Fs2 |
|
Press |
|
Conv, Mn, Mw, LCB, SCB |
What we do. Fit a complete-data MBPCA, read the per-block R²X and the super-score; inject ~10% MCAR NaN into the multi-column blocks; refit and verify the super-score and the per-block R²X are essentially unchanged.
[1]:
import pathlib
import numpy as np
import pandas as pd
import plotly.io as pio
import process_improve
from process_improve.multivariate.methods import MBPCA
pio.renderers.default = "notebook_connected"
rng = np.random.default_rng(7)
Load the dataset and build the 4 X-blocks#
[2]:
ldpe_csv = pathlib.Path(process_improve.__file__).parent / "datasets" / "multivariate" / "LDPE" / "LDPE.csv"
values = pd.read_csv(ldpe_csv, index_col=0)
x_blocks = {
"zone1": values.iloc[:, [0, 1, 2, 5, 7, 9, 11]],
"zone2": values.iloc[:, [3, 4, 6, 8, 10, 12]],
"pressure": values.iloc[:, [13]],
"quality": values.iloc[:, 14:],
}
print(f"observations: {len(values)}")
for name, df in x_blocks.items():
print(f" {name:8s}: {df.shape[1]:>2d} vars -> {list(df.columns)}")
observations: 54
zone1 : 7 vars -> ['Tin', 'Tmax1', 'Tout1', 'Tcin1', 'z1', 'Fi1', 'Fs1']
zone2 : 6 vars -> ['Tmax2', 'Tout2', 'Tcin2', 'z2', 'Fi2', 'Fs2']
pressure: 1 vars -> ['Press']
quality : 5 vars -> ['Conv', 'Mn', 'Mw', 'LCB', 'SCB']
Complete-data MBPCA baseline#
Three latent variables already explain most of the per-block variance. The single-column pressure block is dominated by the second and third super-components (its R²X reaches ~0.99 by LV2). The quality block tracks LV1 strongly (R²X ~0.55 from the first component alone), which is exactly what an MBPLS analysis treating quality as Y would also conclude.
[3]:
complete = MBPCA(n_components=3).fit(x_blocks)
print(f"algorithm_ : {complete.algorithm_}")
print(f"has_missing_data_: {complete.has_missing_data_}")
complete.r2_x_per_block_cumulative_.round(3)
algorithm_ : dense
has_missing_data_: False
[3]:
| 1 | 2 | 3 | |
|---|---|---|---|
| zone1 | 0.136 | 0.175 | 0.457 |
| zone2 | 0.285 | 0.405 | 0.487 |
| pressure | 0.489 | 0.978 | 0.994 |
| quality | 0.547 | 0.776 | 0.928 |
Super-score plot#
The unsupervised super-score is a consensus across all four blocks. Two clusters in the LV1-LV2 plane usually correspond to distinct operating regimes; outliers along LV3 are samples where one block disagrees with the others about which mode the reactor is in.
[4]:
complete.super_score_plot(pc_horiz=1, pc_vert=2)
Inject ~10% missing-completely-at-random NaN#
Same convention as the MBPLS notebook: MCAR in the multi-column blocks only, with at least one observed value per row, leaving the one-column pressure block complete (univariate row-NaN trips the degeneracy guard and is not a meaningful skip-NaN test).
[5]:
def inject_mcar(df: pd.DataFrame, ratio: float, rng: np.random.Generator) -> pd.DataFrame:
mask = rng.random(df.shape) < ratio
for r in range(df.shape[0]):
if mask[r].all():
mask[r, 0] = False
return df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))
x_nan = {
name: inject_mcar(df, 0.10, rng) if df.shape[1] > 1 else df
for name, df in x_blocks.items()
}
print("NaN injected (~10% MCAR):")
for name, df in x_nan.items():
print(f" {name:8s}: {int(df.isna().sum().sum()):>3d} / {df.size}")
NaN injected (~10% MCAR):
zone1 : 36 / 378
zone2 : 31 / 324
pressure: 0 / 54
quality : 33 / 270
Fit MBPCA on the incomplete data#
algorithm="auto" is the default; the solver detects the NaN in x_nan and resolves to "nipals". fitting_info_["iterations"] is generally larger than in the analogous MBPLS fit because MBPCA’s outer loop has no Y target to anchor the consensus and so converges the per-block scores more carefully.
[6]:
with_nan = MBPCA(n_components=3).fit(x_nan)
print(f"algorithm_ : {with_nan.algorithm_}")
print(f"has_missing_data_: {with_nan.has_missing_data_}")
print(f"iterations / LV : {with_nan.fitting_info_['iterations']}")
with_nan.r2_x_per_block_cumulative_.round(3)
algorithm_ : nipals
has_missing_data_: True
iterations / LV : [66 73 64]
[6]:
| 1 | 2 | 3 | |
|---|---|---|---|
| zone1 | 0.130 | 0.168 | 0.474 |
| zone2 | 0.305 | 0.416 | 0.492 |
| pressure | 0.447 | 0.980 | 0.993 |
| quality | 0.580 | 0.782 | 0.923 |
Does the super-score still tell the same story?#
The headline check for unsupervised multi-block missing-data: with the same components and the same data minus a fraction of the cells, do the super-score directions still match? We sign-align each component and compute per-component Pearson correlation between the complete-data and the with-NaN super-scores. On this dataset at ~10% MCAR every component stays above 0.99.
[7]:
ts_complete = complete.super_scores_.to_numpy()
ts_with_nan = with_nan.super_scores_.to_numpy().copy()
# Sign-align each component to the complete-data fit before comparing.
for j in range(ts_with_nan.shape[1]):
if ts_complete[:, j] @ ts_with_nan[:, j] < 0:
ts_with_nan[:, j] = -ts_with_nan[:, j]
per_lv_corr = pd.Series(
[np.corrcoef(ts_complete[:, j], ts_with_nan[:, j])[0, 1] for j in range(ts_with_nan.shape[1])],
index=[f"LV{j + 1}" for j in range(ts_with_nan.shape[1])],
name="super-score correlation (NaN vs complete)",
)
per_lv_corr.round(3)
[7]:
LV1 0.997
LV2 0.996
LV3 0.992
Name: super-score correlation (NaN vs complete), dtype: float64
Per-block R²X under missing data#
The same R²X table as the baseline, side by side, makes the small per-block residual change visible. Differences are at most a couple of percent across all four blocks at 10% MCAR.
[8]:
r2_complete = complete.r2_x_per_block_cumulative_.iloc[:, -1].rename("complete")
r2_with_nan = with_nan.r2_x_per_block_cumulative_.iloc[:, -1].rename("with NaN")
pd.concat([r2_complete, r2_with_nan, (r2_complete - r2_with_nan).rename("delta")], axis=1).round(3)
[8]:
| complete | with NaN | delta | |
|---|---|---|---|
| zone1 | 0.457 | 0.474 | -0.017 |
| zone2 | 0.487 | 0.492 | -0.005 |
| pressure | 0.994 | 0.993 | 0.001 |
| quality | 0.928 | 0.923 | 0.005 |
What to try next#
Raise the NaN ratio (15% / 25%) and watch the per-LV super-score correlation degrade gracefully rather than collapse.
Inspect the super-score plot under missing data and verify the cluster structure is preserved.
Re-cast the same dataset as MBPLS with
qualityas Y (see the MBPLS LDPE missing-data notebook) and compare which super-score directions the two formulations agree on.