MBPLS with missing data: LDPE tubular reactor#
A small but realistic multi-block PLS case study on the LDPE tubular polymer reactor dataset shipped with this package. We use the natural 4-block split of the LDPE columns, fit MBPLS on the complete data, then inject MCAR noise into the X-blocks and Y to see how the mask-aware NIPALS path degrades. The whole workflow is one parameter change (algorithm="auto") different from the complete-data version.
Data. LDPE.csv shipped under process_improve/datasets/multivariate/LDPE/ (ported from ConnectMV, originally based on the tubular high-pressure polyethylene reactor described in MacGregor et al.).
Blocks.
Block |
Columns |
Role |
|---|---|---|
|
Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1 |
reactor-zone-1 process variables |
|
Tmax2, Tout2, Tcin2, z2, Fi2, Fs2 |
reactor-zone-2 process variables |
|
Press |
common operating pressure |
Y |
Conv, Mn, Mw, LCB, SCB |
quality block (cumulative conversion, Mn, Mw, long- and short-chain branching) |
What we do. Fit a complete-data MBPLS; read the per-block R²X, the super VIPs, and the cumulative R²Y. Then inject ~10% MCAR NaN into the multi-column X-blocks and into Y, refit, and compare predictions against the complete-data fit.
[1]:
import pathlib
import numpy as np
import pandas as pd
import plotly.io as pio
import process_improve
from process_improve.multivariate.methods import MBPLS
pio.renderers.default = "notebook_connected"
rng = np.random.default_rng(42)
Load the dataset and build the 4 blocks#
[2]:
ldpe_csv = pathlib.Path(process_improve.__file__).parent / "datasets" / "multivariate" / "LDPE" / "LDPE.csv"
values = pd.read_csv(ldpe_csv, index_col=0)
x_blocks = {
"zone1": values.iloc[:, [0, 1, 2, 5, 7, 9, 11]], # Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1
"zone2": values.iloc[:, [3, 4, 6, 8, 10, 12]], # Tmax2, Tout2, Tcin2, z2, Fi2, Fs2
"pressure": values.iloc[:, [13]], # Press
}
y_df = values.iloc[:, 14:] # Conv, Mn, Mw, LCB, SCB
print(f"observations: {len(values)}")
for name, df in x_blocks.items():
print(f" {name:8s}: {df.shape[1]:>2d} vars -> {list(df.columns)}")
print(f" {'Y':8s}: {y_df.shape[1]:>2d} vars -> {list(y_df.columns)}")
observations: 54
zone1 : 7 vars -> ['Tin', 'Tmax1', 'Tout1', 'Tcin1', 'z1', 'Fi1', 'Fs1']
zone2 : 6 vars -> ['Tmax2', 'Tout2', 'Tcin2', 'z2', 'Fi2', 'Fs2']
pressure: 1 vars -> ['Press']
Y : 5 vars -> ['Conv', 'Mn', 'Mw', 'LCB', 'SCB']
Complete-data MBPLS baseline#
Three latent variables already account for almost all of the variance in the quality block. Block-weighting (sqrt(K_b) per block) makes the seven-column zone1 block and the one-column pressure block contribute to the super-score on equal terms despite the size difference.
[3]:
complete = MBPLS(n_components=3).fit(x_blocks, y_df)
print(f"algorithm_ : {complete.algorithm_}")
print(f"has_missing_data_: {complete.has_missing_data_}")
print(f"R2Y cumulative : {complete.r2_y_cumulative_.round(3).to_dict()}")
complete.r2_x_per_block_cumulative_.round(3)
algorithm_ : dense
has_missing_data_: False
R2Y cumulative : {1: 0.573, 2: 0.771, 3: 0.907}
[3]:
| 1 | 2 | 3 | |
|---|---|---|---|
| zone1 | 0.160 | 0.185 | 0.448 |
| zone2 | 0.301 | 0.423 | 0.493 |
| pressure | 0.408 | 0.932 | 0.995 |
The pressure block sits on its own: a single variable, and almost all of its variance is captured by latent variables 2 and 3. The two zone blocks decompose more slowly because each carries seven and six variables of correlated reactor measurements.
The super VIPs say which X-blocks pull the most weight when predicting Y.
[4]:
complete.super_vip_.round(3).rename("super VIP").to_frame()
[4]:
| super VIP | |
|---|---|
| zone1 | 0.953 |
| zone2 | 1.158 |
| pressure | 0.867 |
Super-score plot#
[5]:
complete.super_score_plot(pc_horiz=1, pc_vert=2)
Inject ~10% missing-completely-at-random NaN#
MCAR is the simplest missingness story: each cell is independently lost with a fixed probability, with no relationship to the value. Sensor dropouts and intermittent communication failures often look MCAR to first order. The mask-aware NIPALS solver computes each per-block score from only the observed entries of each row, so we do not have to impute or drop anything.
We keep the one-column pressure block complete. With a single variable, any NaN there is a full-row NaN in the block, which is a degeneracy (the row carries no information about that block), and the solver rejects it explicitly rather than silently filling in zero.
We also keep one observed entry per row in every multi-column block, so we never accidentally hit the row-all-NaN guard during this synthetic injection.
[6]:
def inject_mcar(df: pd.DataFrame, ratio: float, rng: np.random.Generator) -> pd.DataFrame:
mask = rng.random(df.shape) < ratio
for r in range(df.shape[0]):
if mask[r].all():
mask[r, 0] = False
return df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))
x_nan = {
name: inject_mcar(df, 0.10, rng) if df.shape[1] > 1 else df
for name, df in x_blocks.items()
}
y_nan = inject_mcar(y_df, 0.10, rng)
print("NaN injected (~10% MCAR):")
for name, df in x_nan.items():
print(f" {name:8s}: {int(df.isna().sum().sum()):>3d} / {df.size}")
print(f" {'Y':8s}: {int(y_nan.isna().sum().sum()):>3d} / {y_nan.size}")
NaN injected (~10% MCAR):
zone1 : 33 / 378
zone2 : 32 / 324
pressure: 0 / 54
Y : 28 / 270
Fit MBPLS on the incomplete data#
Nothing changes in the call site: MBPLS(n_components=3, algorithm="auto") is the default, and auto resolves to "nipals" as soon as it sees any NaN in any block.
fitting_info_["iterations"] records the inner-loop iteration count per component. Components that touch many missing cells take a few more iterations to converge; the default cap is md_max_iter=1000 and the tolerance is md_tol=sqrt(eps), both tunable via the missing_data_settings argument.
[7]:
with_nan = MBPLS(n_components=3).fit(x_nan, y_nan)
print(f"algorithm_ : {with_nan.algorithm_}")
print(f"has_missing_data_: {with_nan.has_missing_data_}")
print(f"iterations / LV : {with_nan.fitting_info_['iterations']}")
print(f"R2Y cumulative : {with_nan.r2_y_cumulative_.round(3).to_dict()}")
algorithm_ : nipals
has_missing_data_: True
iterations / LV : [19 41 15]
R2Y cumulative : {1: 0.569, 2: 0.757, 3: 0.885}
How close are the predictions to the complete-data fit?#
The headline check for any missing-data method is: if you run it on the complete data, then knock 10% of the cells out and re-fit, do the predictions move? At ~10% MCAR on this dataset the per-target Pearson correlation between the two prediction series stays above 0.98 for every Y variable; the cumulative R²Y drops by a couple of percentage points.
[8]:
comparison = pd.DataFrame(
{
"complete R2Y": complete.r2_y_cumulative_,
"with-NaN R2Y": with_nan.r2_y_cumulative_,
}
).round(3)
pred_corr = pd.Series(
{
col: np.corrcoef(complete.predictions_[col], with_nan.predictions_[col])[0, 1]
for col in y_df.columns
},
name="prediction correlation",
).round(3)
comparison, pred_corr
[8]:
( complete R2Y with-NaN R2Y
1 0.573 0.569
2 0.771 0.757
3 0.907 0.885,
Conv 0.994
Mn 0.991
Mw 0.984
LCB 0.992
SCB 0.994
Name: prediction correlation, dtype: float64)
Predicted-vs-observed under missing data#
predictions_vs_observed_plot draws the y = x diagonal and reports the in-sample RMSE so you can eyeball whether the fit is biased or just noisier under the injected missingness.
[9]:
with_nan.predictions_vs_observed_plot(y_df, variable="Conv")
What to try next#
Raise the injection ratio (15% / 25%) and watch the prediction correlation degrade gracefully rather than fail.
Switch to
algorithm="dense"explicitly and try to fit the incomplete data; the model raises a clear error rather than silently imputing zeros.Tighten
missing_data_settings={"md_tol": 1e-12, "md_max_iter": 5000}and read off the new iteration counts. On well-conditioned data the tighter tolerance is essentially free; on rank-deficient data it stalls and the iteration cap fires.Try dropping the
pressureblock entirely. It is highly informative about X-variance (its single column tracks LV2 / LV3 almost perfectly) but its super VIP for Y is comparatively low; the predictive story is largely in the two zone blocks.