MBPCA with missing data: LDPE tubular reactor#

Companion to the MBPLS LDPE case study. Same dataset, same natural multi-block split, but cast as an unsupervised MBPCA problem: there is no Y target, so the quality block joins the process blocks as a fourth X-block and the model is asked to find a consensus super-score across all four. The mask-aware NIPALS path is enabled the same way as in MBPLS: leave algorithm="auto" (the default) and any NaN flips it to the iterative solver.

Data. LDPE.csv shipped under process_improve/datasets/multivariate/LDPE/.

Blocks (all treated as X for MBPCA).

Block

Columns

zone1

Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1

zone2

Tmax2, Tout2, Tcin2, z2, Fi2, Fs2

pressure

Press

quality

Conv, Mn, Mw, LCB, SCB

What we do. Fit a complete-data MBPCA, read the per-block R²X and the super-score; inject ~10% MCAR NaN into the multi-column blocks; refit and verify the super-score and the per-block R²X are essentially unchanged.

[1]:
import pathlib

import numpy as np
import pandas as pd
import plotly.io as pio

import process_improve
from process_improve.multivariate.methods import MBPCA

pio.renderers.default = "notebook_connected"
rng = np.random.default_rng(7)

Load the dataset and build the 4 X-blocks#

[2]:
ldpe_csv = pathlib.Path(process_improve.__file__).parent / "datasets" / "multivariate" / "LDPE" / "LDPE.csv"
values = pd.read_csv(ldpe_csv, index_col=0)

x_blocks = {
    "zone1":    values.iloc[:, [0, 1, 2, 5, 7, 9, 11]],
    "zone2":    values.iloc[:, [3, 4, 6, 8, 10, 12]],
    "pressure": values.iloc[:, [13]],
    "quality":  values.iloc[:, 14:],
}
print(f"observations: {len(values)}")
for name, df in x_blocks.items():
    print(f"  {name:8s}: {df.shape[1]:>2d} vars  ->  {list(df.columns)}")
observations: 54
  zone1   :  7 vars  ->  ['Tin', 'Tmax1', 'Tout1', 'Tcin1', 'z1', 'Fi1', 'Fs1']
  zone2   :  6 vars  ->  ['Tmax2', 'Tout2', 'Tcin2', 'z2', 'Fi2', 'Fs2']
  pressure:  1 vars  ->  ['Press']
  quality :  5 vars  ->  ['Conv', 'Mn', 'Mw', 'LCB', 'SCB']

Complete-data MBPCA baseline#

Three latent variables already explain most of the per-block variance. The single-column pressure block is dominated by the second and third super-components (its R²X reaches ~0.99 by LV2). The quality block tracks LV1 strongly (R²X ~0.55 from the first component alone), which is exactly what an MBPLS analysis treating quality as Y would also conclude.

[3]:
complete = MBPCA(n_components=3).fit(x_blocks)
print(f"algorithm_       : {complete.algorithm_}")
print(f"has_missing_data_: {complete.has_missing_data_}")
complete.r2_x_per_block_cumulative_.round(3)
algorithm_       : dense
has_missing_data_: False
[3]:
1 2 3
zone1 0.136 0.175 0.457
zone2 0.285 0.405 0.487
pressure 0.489 0.978 0.994
quality 0.547 0.776 0.928

Super-score plot#

The unsupervised super-score is a consensus across all four blocks. Two clusters in the LV1-LV2 plane usually correspond to distinct operating regimes; outliers along LV3 are samples where one block disagrees with the others about which mode the reactor is in.

[4]:
complete.super_score_plot(pc_horiz=1, pc_vert=2)

Inject ~10% missing-completely-at-random NaN#

Same convention as the MBPLS notebook: MCAR in the multi-column blocks only, with at least one observed value per row, leaving the one-column pressure block complete (univariate row-NaN trips the degeneracy guard and is not a meaningful skip-NaN test).

[5]:
def inject_mcar(df: pd.DataFrame, ratio: float, rng: np.random.Generator) -> pd.DataFrame:
    mask = rng.random(df.shape) < ratio
    for r in range(df.shape[0]):
        if mask[r].all():
            mask[r, 0] = False
    return df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))


x_nan = {
    name: inject_mcar(df, 0.10, rng) if df.shape[1] > 1 else df
    for name, df in x_blocks.items()
}

print("NaN injected (~10% MCAR):")
for name, df in x_nan.items():
    print(f"  {name:8s}: {int(df.isna().sum().sum()):>3d} / {df.size}")
NaN injected (~10% MCAR):
  zone1   :  36 / 378
  zone2   :  31 / 324
  pressure:   0 / 54
  quality :  33 / 270

Fit MBPCA on the incomplete data#

algorithm="auto" is the default; the solver detects the NaN in x_nan and resolves to "nipals". fitting_info_["iterations"] is generally larger than in the analogous MBPLS fit because MBPCA’s outer loop has no Y target to anchor the consensus and so converges the per-block scores more carefully.

[6]:
with_nan = MBPCA(n_components=3).fit(x_nan)
print(f"algorithm_       : {with_nan.algorithm_}")
print(f"has_missing_data_: {with_nan.has_missing_data_}")
print(f"iterations / LV  : {with_nan.fitting_info_['iterations']}")
with_nan.r2_x_per_block_cumulative_.round(3)
algorithm_       : nipals
has_missing_data_: True
iterations / LV  : [66 73 64]
[6]:
1 2 3
zone1 0.130 0.168 0.474
zone2 0.305 0.416 0.492
pressure 0.447 0.980 0.993
quality 0.580 0.782 0.923

Does the super-score still tell the same story?#

The headline check for unsupervised multi-block missing-data: with the same components and the same data minus a fraction of the cells, do the super-score directions still match? We sign-align each component and compute per-component Pearson correlation between the complete-data and the with-NaN super-scores. On this dataset at ~10% MCAR every component stays above 0.99.

[7]:
ts_complete = complete.super_scores_.to_numpy()
ts_with_nan = with_nan.super_scores_.to_numpy().copy()

# Sign-align each component to the complete-data fit before comparing.
for j in range(ts_with_nan.shape[1]):
    if ts_complete[:, j] @ ts_with_nan[:, j] < 0:
        ts_with_nan[:, j] = -ts_with_nan[:, j]

per_lv_corr = pd.Series(
    [np.corrcoef(ts_complete[:, j], ts_with_nan[:, j])[0, 1] for j in range(ts_with_nan.shape[1])],
    index=[f"LV{j + 1}" for j in range(ts_with_nan.shape[1])],
    name="super-score correlation (NaN vs complete)",
)
per_lv_corr.round(3)
[7]:
LV1    0.997
LV2    0.996
LV3    0.992
Name: super-score correlation (NaN vs complete), dtype: float64

Per-block R²X under missing data#

The same R²X table as the baseline, side by side, makes the small per-block residual change visible. Differences are at most a couple of percent across all four blocks at 10% MCAR.

[8]:
r2_complete = complete.r2_x_per_block_cumulative_.iloc[:, -1].rename("complete")
r2_with_nan = with_nan.r2_x_per_block_cumulative_.iloc[:, -1].rename("with NaN")
pd.concat([r2_complete, r2_with_nan, (r2_complete - r2_with_nan).rename("delta")], axis=1).round(3)
[8]:
complete with NaN delta
zone1 0.457 0.474 -0.017
zone2 0.487 0.492 -0.005
pressure 0.994 0.993 0.001
quality 0.928 0.923 0.005

What to try next#

  • Raise the NaN ratio (15% / 25%) and watch the per-LV super-score correlation degrade gracefully rather than collapse.

  • Inspect the super-score plot under missing data and verify the cluster structure is preserved.

  • Re-cast the same dataset as MBPLS with quality as Y (see the MBPLS LDPE missing-data notebook) and compare which super-score directions the two formulations agree on.