MBPLS with missing data: LDPE tubular reactor#

A small but realistic multi-block PLS case study on the LDPE tubular polymer reactor dataset shipped with this package. We use the natural 4-block split of the LDPE columns, fit MBPLS on the complete data, then inject MCAR noise into the X-blocks and Y to see how the mask-aware NIPALS path degrades. The whole workflow is one parameter change (algorithm="auto") different from the complete-data version.

Data. LDPE.csv shipped under process_improve/datasets/multivariate/LDPE/ (ported from ConnectMV, originally based on the tubular high-pressure polyethylene reactor described in MacGregor et al.).

Blocks.

Block

Columns

Role

zone1

Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1

reactor-zone-1 process variables

zone2

Tmax2, Tout2, Tcin2, z2, Fi2, Fs2

reactor-zone-2 process variables

pressure

Press

common operating pressure

Y

Conv, Mn, Mw, LCB, SCB

quality block (cumulative conversion, Mn, Mw, long- and short-chain branching)

What we do. Fit a complete-data MBPLS; read the per-block R²X, the super VIPs, and the cumulative R²Y. Then inject ~10% MCAR NaN into the multi-column X-blocks and into Y, refit, and compare predictions against the complete-data fit.

[1]:
import pathlib

import numpy as np
import pandas as pd
import plotly.io as pio

import process_improve
from process_improve.multivariate.methods import MBPLS

pio.renderers.default = "notebook_connected"
rng = np.random.default_rng(42)

Load the dataset and build the 4 blocks#

[2]:
ldpe_csv = pathlib.Path(process_improve.__file__).parent / "datasets" / "multivariate" / "LDPE" / "LDPE.csv"
values = pd.read_csv(ldpe_csv, index_col=0)

x_blocks = {
    "zone1":    values.iloc[:, [0, 1, 2, 5, 7, 9, 11]],   # Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1
    "zone2":    values.iloc[:, [3, 4, 6, 8, 10, 12]],     # Tmax2, Tout2, Tcin2, z2, Fi2, Fs2
    "pressure": values.iloc[:, [13]],                     # Press
}
y_df = values.iloc[:, 14:]                                # Conv, Mn, Mw, LCB, SCB

print(f"observations: {len(values)}")
for name, df in x_blocks.items():
    print(f"  {name:8s}: {df.shape[1]:>2d} vars  ->  {list(df.columns)}")
print(f"  {'Y':8s}: {y_df.shape[1]:>2d} vars  ->  {list(y_df.columns)}")
observations: 54
  zone1   :  7 vars  ->  ['Tin', 'Tmax1', 'Tout1', 'Tcin1', 'z1', 'Fi1', 'Fs1']
  zone2   :  6 vars  ->  ['Tmax2', 'Tout2', 'Tcin2', 'z2', 'Fi2', 'Fs2']
  pressure:  1 vars  ->  ['Press']
  Y       :  5 vars  ->  ['Conv', 'Mn', 'Mw', 'LCB', 'SCB']

Complete-data MBPLS baseline#

Three latent variables already account for almost all of the variance in the quality block. Block-weighting (sqrt(K_b) per block) makes the seven-column zone1 block and the one-column pressure block contribute to the super-score on equal terms despite the size difference.

[3]:
complete = MBPLS(n_components=3).fit(x_blocks, y_df)
print(f"algorithm_       : {complete.algorithm_}")
print(f"has_missing_data_: {complete.has_missing_data_}")
print(f"R2Y cumulative   : {complete.r2_y_cumulative_.round(3).to_dict()}")
complete.r2_x_per_block_cumulative_.round(3)
algorithm_       : dense
has_missing_data_: False
R2Y cumulative   : {1: 0.573, 2: 0.771, 3: 0.907}
[3]:
1 2 3
zone1 0.160 0.185 0.448
zone2 0.301 0.423 0.493
pressure 0.408 0.932 0.995

The pressure block sits on its own: a single variable, and almost all of its variance is captured by latent variables 2 and 3. The two zone blocks decompose more slowly because each carries seven and six variables of correlated reactor measurements.

The super VIPs say which X-blocks pull the most weight when predicting Y.

[4]:
complete.super_vip_.round(3).rename("super VIP").to_frame()
[4]:
super VIP
zone1 0.953
zone2 1.158
pressure 0.867

Super-score plot#

[5]:
complete.super_score_plot(pc_horiz=1, pc_vert=2)

Inject ~10% missing-completely-at-random NaN#

MCAR is the simplest missingness story: each cell is independently lost with a fixed probability, with no relationship to the value. Sensor dropouts and intermittent communication failures often look MCAR to first order. The mask-aware NIPALS solver computes each per-block score from only the observed entries of each row, so we do not have to impute or drop anything.

We keep the one-column pressure block complete. With a single variable, any NaN there is a full-row NaN in the block, which is a degeneracy (the row carries no information about that block), and the solver rejects it explicitly rather than silently filling in zero.

We also keep one observed entry per row in every multi-column block, so we never accidentally hit the row-all-NaN guard during this synthetic injection.

[6]:
def inject_mcar(df: pd.DataFrame, ratio: float, rng: np.random.Generator) -> pd.DataFrame:
    mask = rng.random(df.shape) < ratio
    for r in range(df.shape[0]):
        if mask[r].all():
            mask[r, 0] = False
    return df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))


x_nan = {
    name: inject_mcar(df, 0.10, rng) if df.shape[1] > 1 else df
    for name, df in x_blocks.items()
}
y_nan = inject_mcar(y_df, 0.10, rng)

print("NaN injected (~10% MCAR):")
for name, df in x_nan.items():
    print(f"  {name:8s}: {int(df.isna().sum().sum()):>3d} / {df.size}")
print(f"  {'Y':8s}: {int(y_nan.isna().sum().sum()):>3d} / {y_nan.size}")
NaN injected (~10% MCAR):
  zone1   :  33 / 378
  zone2   :  32 / 324
  pressure:   0 / 54
  Y       :  28 / 270

Fit MBPLS on the incomplete data#

Nothing changes in the call site: MBPLS(n_components=3, algorithm="auto") is the default, and auto resolves to "nipals" as soon as it sees any NaN in any block.

fitting_info_["iterations"] records the inner-loop iteration count per component. Components that touch many missing cells take a few more iterations to converge; the default cap is md_max_iter=1000 and the tolerance is md_tol=sqrt(eps), both tunable via the missing_data_settings argument.

[7]:
with_nan = MBPLS(n_components=3).fit(x_nan, y_nan)
print(f"algorithm_       : {with_nan.algorithm_}")
print(f"has_missing_data_: {with_nan.has_missing_data_}")
print(f"iterations / LV  : {with_nan.fitting_info_['iterations']}")
print(f"R2Y cumulative   : {with_nan.r2_y_cumulative_.round(3).to_dict()}")
algorithm_       : nipals
has_missing_data_: True
iterations / LV  : [19 41 15]
R2Y cumulative   : {1: 0.569, 2: 0.757, 3: 0.885}

How close are the predictions to the complete-data fit?#

The headline check for any missing-data method is: if you run it on the complete data, then knock 10% of the cells out and re-fit, do the predictions move? At ~10% MCAR on this dataset the per-target Pearson correlation between the two prediction series stays above 0.98 for every Y variable; the cumulative R²Y drops by a couple of percentage points.

[8]:
comparison = pd.DataFrame(
    {
        "complete R2Y": complete.r2_y_cumulative_,
        "with-NaN R2Y": with_nan.r2_y_cumulative_,
    }
).round(3)

pred_corr = pd.Series(
    {
        col: np.corrcoef(complete.predictions_[col], with_nan.predictions_[col])[0, 1]
        for col in y_df.columns
    },
    name="prediction correlation",
).round(3)

comparison, pred_corr
[8]:
(   complete R2Y  with-NaN R2Y
 1         0.573         0.569
 2         0.771         0.757
 3         0.907         0.885,
 Conv    0.994
 Mn      0.991
 Mw      0.984
 LCB     0.992
 SCB     0.994
 Name: prediction correlation, dtype: float64)

Predicted-vs-observed under missing data#

predictions_vs_observed_plot draws the y = x diagonal and reports the in-sample RMSE so you can eyeball whether the fit is biased or just noisier under the injected missingness.

[9]:
with_nan.predictions_vs_observed_plot(y_df, variable="Conv")

What to try next#

  • Raise the injection ratio (15% / 25%) and watch the prediction correlation degrade gracefully rather than fail.

  • Switch to algorithm="dense" explicitly and try to fit the incomplete data; the model raises a clear error rather than silently imputing zeros.

  • Tighten missing_data_settings={"md_tol": 1e-12, "md_max_iter": 5000} and read off the new iteration counts. On well-conditioned data the tighter tolerance is essentially free; on rank-deficient data it stalls and the iteration cap fires.

  • Try dropping the pressure block entirely. It is highly informative about X-variance (its single column tracks LV2 / LV3 almost perfectly) but its super VIP for Y is comparatively low; the predictive story is largely in the two zone blocks.