{ "cells": [ { "cell_type": "markdown", "id": "c86ab28b", "metadata": {}, "source": [ "# MBPCA with missing data: LDPE tubular reactor\n", "\n", "Companion to the MBPLS LDPE case study. Same dataset, same natural multi-block split, but cast as an unsupervised MBPCA problem: there is no Y target, so the quality block joins the process blocks as a fourth X-block and the model is asked to find a consensus super-score across all four. The mask-aware NIPALS path is enabled the same way as in MBPLS: leave ``algorithm=\"auto\"`` (the default) and any NaN flips it to the iterative solver.\n", "\n", "**Data.** `LDPE.csv` shipped under `process_improve/datasets/multivariate/LDPE/`.\n", "\n", "**Blocks (all treated as X for MBPCA).**\n", "\n", "| Block | Columns |\n", "|---|---|\n", "| `zone1` | Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1 |\n", "| `zone2` | Tmax2, Tout2, Tcin2, z2, Fi2, Fs2 |\n", "| `pressure` | Press |\n", "| `quality` | Conv, Mn, Mw, LCB, SCB |\n", "\n", "**What we do.** Fit a complete-data MBPCA, read the per-block R²X and the super-score; inject ~10% MCAR NaN into the multi-column blocks; refit and verify the super-score and the per-block R²X are essentially unchanged." ] }, { "cell_type": "code", "execution_count": null, "id": "6895c217", "metadata": {}, "outputs": [], "source": [ "import pathlib\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import plotly.io as pio\n", "\n", "import process_improve\n", "from process_improve.multivariate.methods import MBPCA\n", "\n", "pio.renderers.default = \"notebook_connected\"\n", "rng = np.random.default_rng(7)" ] }, { "cell_type": "markdown", "id": "623f0882", "metadata": {}, "source": [ "## Load the dataset and build the 4 X-blocks" ] }, { "cell_type": "code", "execution_count": null, "id": "a935eacb", "metadata": {}, "outputs": [], "source": [ "ldpe_csv = pathlib.Path(process_improve.__file__).parent / \"datasets\" / \"multivariate\" / \"LDPE\" / \"LDPE.csv\"\n", "values = pd.read_csv(ldpe_csv, index_col=0)\n", "\n", "x_blocks = {\n", " \"zone1\": values.iloc[:, [0, 1, 2, 5, 7, 9, 11]],\n", " \"zone2\": values.iloc[:, [3, 4, 6, 8, 10, 12]],\n", " \"pressure\": values.iloc[:, [13]],\n", " \"quality\": values.iloc[:, 14:],\n", "}\n", "print(f\"observations: {len(values)}\")\n", "for name, df in x_blocks.items():\n", " print(f\" {name:8s}: {df.shape[1]:>2d} vars -> {list(df.columns)}\")" ] }, { "cell_type": "markdown", "id": "161e04d3", "metadata": {}, "source": [ "## Complete-data MBPCA baseline\n", "\n", "Three latent variables already explain most of the per-block variance. The single-column `pressure` block is dominated by the second and third super-components (its R²X reaches ~0.99 by LV2). The quality block tracks LV1 strongly (R²X ~0.55 from the first component alone), which is exactly what an MBPLS analysis treating `quality` as Y would also conclude." ] }, { "cell_type": "code", "execution_count": null, "id": "e9c6a523", "metadata": {}, "outputs": [], "source": [ "complete = MBPCA(n_components=3).fit(x_blocks)\n", "print(f\"algorithm_ : {complete.algorithm_}\")\n", "print(f\"has_missing_data_: {complete.has_missing_data_}\")\n", "complete.r2_x_per_block_cumulative_.round(3)" ] }, { "cell_type": "markdown", "id": "fa091e79", "metadata": {}, "source": [ "## Super-score plot\n", "\n", "The unsupervised super-score is a consensus across all four blocks. Two clusters in the LV1-LV2 plane usually correspond to distinct operating regimes; outliers along LV3 are samples where one block disagrees with the others about which mode the reactor is in." ] }, { "cell_type": "code", "execution_count": null, "id": "c917a2a8", "metadata": {}, "outputs": [], "source": [ "complete.super_score_plot(pc_horiz=1, pc_vert=2)" ] }, { "cell_type": "markdown", "id": "8e6e71db", "metadata": {}, "source": [ "## Inject ~10% missing-completely-at-random NaN\n", "\n", "Same convention as the MBPLS notebook: MCAR in the multi-column blocks only, with at least one observed value per row, leaving the one-column `pressure` block complete (univariate row-NaN trips the degeneracy guard and is not a meaningful skip-NaN test)." ] }, { "cell_type": "code", "execution_count": null, "id": "bb359253", "metadata": {}, "outputs": [], "source": [ "def inject_mcar(df: pd.DataFrame, ratio: float, rng: np.random.Generator) -> pd.DataFrame:\n", " mask = rng.random(df.shape) < ratio\n", " for r in range(df.shape[0]):\n", " if mask[r].all():\n", " mask[r, 0] = False\n", " return df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))\n", "\n", "\n", "x_nan = {\n", " name: inject_mcar(df, 0.10, rng) if df.shape[1] > 1 else df\n", " for name, df in x_blocks.items()\n", "}\n", "\n", "print(\"NaN injected (~10% MCAR):\")\n", "for name, df in x_nan.items():\n", " print(f\" {name:8s}: {int(df.isna().sum().sum()):>3d} / {df.size}\")" ] }, { "cell_type": "markdown", "id": "5cbf96ea", "metadata": {}, "source": [ "## Fit MBPCA on the incomplete data\n", "\n", "`algorithm=\"auto\"` is the default; the solver detects the NaN in `x_nan` and resolves to ``\"nipals\"``. `fitting_info_[\"iterations\"]` is generally larger than in the analogous MBPLS fit because MBPCA's outer loop has no Y target to anchor the consensus and so converges the per-block scores more carefully." ] }, { "cell_type": "code", "execution_count": null, "id": "1ac07a6b", "metadata": {}, "outputs": [], "source": [ "with_nan = MBPCA(n_components=3).fit(x_nan)\n", "print(f\"algorithm_ : {with_nan.algorithm_}\")\n", "print(f\"has_missing_data_: {with_nan.has_missing_data_}\")\n", "print(f\"iterations / LV : {with_nan.fitting_info_['iterations']}\")\n", "with_nan.r2_x_per_block_cumulative_.round(3)" ] }, { "cell_type": "markdown", "id": "7974312a", "metadata": {}, "source": [ "## Does the super-score still tell the same story?\n", "\n", "The headline check for unsupervised multi-block missing-data: with the same components and the same data minus a fraction of the cells, do the super-score directions still match? We sign-align each component and compute per-component Pearson correlation between the complete-data and the with-NaN super-scores. On this dataset at ~10% MCAR every component stays above 0.99." ] }, { "cell_type": "code", "execution_count": null, "id": "469ac980", "metadata": {}, "outputs": [], "source": [ "ts_complete = complete.super_scores_.to_numpy()\n", "ts_with_nan = with_nan.super_scores_.to_numpy().copy()\n", "\n", "# Sign-align each component to the complete-data fit before comparing.\n", "for j in range(ts_with_nan.shape[1]):\n", " if ts_complete[:, j] @ ts_with_nan[:, j] < 0:\n", " ts_with_nan[:, j] = -ts_with_nan[:, j]\n", "\n", "per_lv_corr = pd.Series(\n", " [np.corrcoef(ts_complete[:, j], ts_with_nan[:, j])[0, 1] for j in range(ts_with_nan.shape[1])],\n", " index=[f\"LV{j + 1}\" for j in range(ts_with_nan.shape[1])],\n", " name=\"super-score correlation (NaN vs complete)\",\n", ")\n", "per_lv_corr.round(3)" ] }, { "cell_type": "markdown", "id": "19830a12", "metadata": {}, "source": [ "## Per-block R²X under missing data\n", "\n", "The same R²X table as the baseline, side by side, makes the small per-block residual change visible. Differences are at most a couple of percent across all four blocks at 10% MCAR." ] }, { "cell_type": "code", "execution_count": null, "id": "e82171ea", "metadata": {}, "outputs": [], "source": [ "r2_complete = complete.r2_x_per_block_cumulative_.iloc[:, -1].rename(\"complete\")\n", "r2_with_nan = with_nan.r2_x_per_block_cumulative_.iloc[:, -1].rename(\"with NaN\")\n", "pd.concat([r2_complete, r2_with_nan, (r2_complete - r2_with_nan).rename(\"delta\")], axis=1).round(3)" ] }, { "cell_type": "markdown", "id": "980508e3", "metadata": {}, "source": [ "## What to try next\n", "\n", "- Raise the NaN ratio (15% / 25%) and watch the per-LV super-score correlation degrade gracefully rather than collapse.\n", "- Inspect the super-score plot under missing data and verify the cluster structure is preserved.\n", "- Re-cast the same dataset as MBPLS with `quality` as Y (see the [MBPLS LDPE missing-data notebook](mbpls-ldpe-missing-data.ipynb)) and compare which super-score directions the two formulations agree on." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }