MBPCA with missing data: LDPE tubular reactor#

Companion to the MBPLS LDPE case study. Same dataset, same natural multi-block split, but cast as an unsupervised MBPCA problem: there is no Y target, so the quality block joins the process blocks as a fourth X-block and the model is asked to find a consensus super-score across all four. The mask-aware NIPALS path is enabled the same way as in MBPLS: leave algorithm="auto" (the default) and any NaN flips it to the iterative solver.

Data. LDPE.csv shipped under process_improve/datasets/multivariate/LDPE/.

Blocks (all treated as X for MBPCA).

Block	Columns
`zone1`	Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1
`zone2`	Tmax2, Tout2, Tcin2, z2, Fi2, Fs2
`pressure`	Press
`quality`	Conv, Mn, Mw, LCB, SCB

What we do. Fit a complete-data MBPCA, read the per-block R²X and the super-score; inject ~10% MCAR NaN into the multi-column blocks; refit and verify the super-score and the per-block R²X are essentially unchanged.

[1]:

import pathlib

import numpy as np
import pandas as pd
import plotly.io as pio

import process_improve
from process_improve.multivariate.methods import MBPCA

pio.renderers.default = "notebook_connected"
rng = np.random.default_rng(7)

Load the dataset and build the 4 X-blocks#

[2]:

ldpe_csv = pathlib.Path(process_improve.__file__).parent / "datasets" / "multivariate" / "LDPE" / "LDPE.csv"
values = pd.read_csv(ldpe_csv, index_col=0)

x_blocks = {
    "zone1":    values.iloc[:, [0, 1, 2, 5, 7, 9, 11]],
    "zone2":    values.iloc[:, [3, 4, 6, 8, 10, 12]],
    "pressure": values.iloc[:, [13]],
    "quality":  values.iloc[:, 14:],
}
print(f"observations: {len(values)}")
for name, df in x_blocks.items():
    print(f"  {name:8s}: {df.shape[1]:>2d} vars  ->  {list(df.columns)}")

observations: 54
  zone1   :  7 vars  ->  ['Tin', 'Tmax1', 'Tout1', 'Tcin1', 'z1', 'Fi1', 'Fs1']
  zone2   :  6 vars  ->  ['Tmax2', 'Tout2', 'Tcin2', 'z2', 'Fi2', 'Fs2']
  pressure:  1 vars  ->  ['Press']
  quality :  5 vars  ->  ['Conv', 'Mn', 'Mw', 'LCB', 'SCB']

Complete-data MBPCA baseline#

Three latent variables already explain most of the per-block variance. The single-column pressure block is dominated by the second and third super-components (its R²X reaches ~0.99 by LV2). The quality block tracks LV1 strongly (R²X ~0.55 from the first component alone), which is exactly what an MBPLS analysis treating quality as Y would also conclude.

[3]:

complete = MBPCA(n_components=3).fit(x_blocks)
print(f"algorithm_       : {complete.algorithm_}")
print(f"has_missing_data_: {complete.has_missing_data_}")
complete.r2_x_per_block_cumulative_.round(3)

algorithm_       : dense
has_missing_data_: False

[3]:

	1	2	3
zone1	0.136	0.175	0.457
zone2	0.285	0.405	0.487
pressure	0.489	0.978	0.994
quality	0.547	0.776	0.928

Super-score plot#

The unsupervised super-score is a consensus across all four blocks. Two clusters in the LV1-LV2 plane usually correspond to distinct operating regimes; outliers along LV3 are samples where one block disagrees with the others about which mode the reactor is in.

[4]:

complete.super_score_plot(pc_horiz=1, pc_vert=2)

Inject ~10% missing-completely-at-random NaN#

Same convention as the MBPLS notebook: MCAR in the multi-column blocks only, with at least one observed value per row, leaving the one-column pressure block complete (univariate row-NaN trips the degeneracy guard and is not a meaningful skip-NaN test).

[5]:

def inject_mcar(df: pd.DataFrame, ratio: float, rng: np.random.Generator) -> pd.DataFrame:
    mask = rng.random(df.shape) < ratio
    for r in range(df.shape[0]):
        if mask[r].all():
            mask[r, 0] = False
    return df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))


x_nan = {
    name: inject_mcar(df, 0.10, rng) if df.shape[1] > 1 else df
    for name, df in x_blocks.items()
}

print("NaN injected (~10% MCAR):")
for name, df in x_nan.items():
    print(f"  {name:8s}: {int(df.isna().sum().sum()):>3d} / {df.size}")

NaN injected (~10% MCAR):
  zone1   :  36 / 378
  zone2   :  31 / 324
  pressure:   0 / 54
  quality :  33 / 270

Fit MBPCA on the incomplete data#

algorithm="auto" is the default; the solver detects the NaN in x_nan and resolves to "nipals". fitting_info_["iterations"] is generally larger than in the analogous MBPLS fit because MBPCA’s outer loop has no Y target to anchor the consensus and so converges the per-block scores more carefully.

[6]:

with_nan = MBPCA(n_components=3).fit(x_nan)
print(f"algorithm_       : {with_nan.algorithm_}")
print(f"has_missing_data_: {with_nan.has_missing_data_}")
print(f"iterations / LV  : {with_nan.fitting_info_['iterations']}")
with_nan.r2_x_per_block_cumulative_.round(3)

algorithm_       : nipals
has_missing_data_: True
iterations / LV  : [66 73 64]

[6]:

	1	2	3
zone1	0.130	0.168	0.474
zone2	0.305	0.416	0.492
pressure	0.447	0.980	0.993
quality	0.580	0.782	0.923

Does the super-score still tell the same story?#

The headline check for unsupervised multi-block missing-data: with the same components and the same data minus a fraction of the cells, do the super-score directions still match? We sign-align each component and compute per-component Pearson correlation between the complete-data and the with-NaN super-scores. On this dataset at ~10% MCAR every component stays above 0.99.

[7]:

ts_complete = complete.super_scores_.to_numpy()
ts_with_nan = with_nan.super_scores_.to_numpy().copy()

# Sign-align each component to the complete-data fit before comparing.
for j in range(ts_with_nan.shape[1]):
    if ts_complete[:, j] @ ts_with_nan[:, j] < 0:
        ts_with_nan[:, j] = -ts_with_nan[:, j]

per_lv_corr = pd.Series(
    [np.corrcoef(ts_complete[:, j], ts_with_nan[:, j])[0, 1] for j in range(ts_with_nan.shape[1])],
    index=[f"LV{j + 1}" for j in range(ts_with_nan.shape[1])],
    name="super-score correlation (NaN vs complete)",
)
per_lv_corr.round(3)

[7]:

LV1    0.997
LV2    0.996
LV3    0.992
Name: super-score correlation (NaN vs complete), dtype: float64

Per-block R²X under missing data#

The same R²X table as the baseline, side by side, makes the small per-block residual change visible. Differences are at most a couple of percent across all four blocks at 10% MCAR.

[8]:

r2_complete = complete.r2_x_per_block_cumulative_.iloc[:, -1].rename("complete")
r2_with_nan = with_nan.r2_x_per_block_cumulative_.iloc[:, -1].rename("with NaN")
pd.concat([r2_complete, r2_with_nan, (r2_complete - r2_with_nan).rename("delta")], axis=1).round(3)

[8]:

	complete	with NaN	delta
zone1	0.457	0.474	-0.017
zone2	0.487	0.492	-0.005
pressure	0.994	0.993	0.001
quality	0.928	0.923	0.005

What to try next#

Raise the NaN ratio (15% / 25%) and watch the per-LV super-score correlation degrade gracefully rather than collapse.
Inspect the super-score plot under missing data and verify the cluster structure is preserved.
Re-cast the same dataset as MBPLS with quality as Y (see the MBPLS LDPE missing-data notebook) and compare which super-score directions the two formulations agree on.