{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "96414d6f",
   "metadata": {},
   "source": [
    "# MBPLS with missing data: LDPE tubular reactor\n",
    "\n",
    "A small but realistic multi-block PLS case study on the LDPE tubular polymer reactor dataset shipped with this package. We use the **natural 4-block split** of the LDPE columns, fit MBPLS on the complete data, then inject MCAR noise into the X-blocks and Y to see how the mask-aware NIPALS path degrades. The whole workflow is one parameter change (`algorithm=\"auto\"`) different from the complete-data version.\n",
    "\n",
    "**Data.** `LDPE.csv` shipped under `process_improve/datasets/multivariate/LDPE/` (ported from ConnectMV, originally based on the tubular high-pressure polyethylene reactor described in MacGregor *et al.*).\n",
    "\n",
    "**Blocks.**\n",
    "\n",
    "| Block | Columns | Role |\n",
    "|---|---|---|\n",
    "| `zone1` | Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1 | reactor-zone-1 process variables |\n",
    "| `zone2` | Tmax2, Tout2, Tcin2, z2, Fi2, Fs2 | reactor-zone-2 process variables |\n",
    "| `pressure` | Press | common operating pressure |\n",
    "| Y | Conv, Mn, Mw, LCB, SCB | quality block (cumulative conversion, Mn, Mw, long- and short-chain branching) |\n",
    "\n",
    "**What we do.** Fit a complete-data MBPLS; read the per-block R²X, the super VIPs, and the cumulative R²Y. Then inject ~10% MCAR NaN into the multi-column X-blocks and into Y, refit, and compare predictions against the complete-data fit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95c0a700",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:16.769593Z",
     "iopub.status.busy": "2026-05-11T05:37:16.769344Z",
     "iopub.status.idle": "2026-05-11T05:37:18.732338Z",
     "shell.execute_reply": "2026-05-11T05:37:18.730408Z"
    }
   },
   "outputs": [],
   "source": [
    "import pathlib\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import plotly.io as pio\n",
    "\n",
    "import process_improve\n",
    "from process_improve.multivariate.methods import MBPLS\n",
    "\n",
    "pio.renderers.default = \"notebook_connected\"\n",
    "rng = np.random.default_rng(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f00e0c70",
   "metadata": {},
   "source": [
    "## Load the dataset and build the 4 blocks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4a0f48b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:18.735445Z",
     "iopub.status.busy": "2026-05-11T05:37:18.735151Z",
     "iopub.status.idle": "2026-05-11T05:37:18.748227Z",
     "shell.execute_reply": "2026-05-11T05:37:18.747013Z"
    }
   },
   "outputs": [],
   "source": [
    "ldpe_csv = pathlib.Path(process_improve.__file__).parent / \"datasets\" / \"multivariate\" / \"LDPE\" / \"LDPE.csv\"\n",
    "values = pd.read_csv(ldpe_csv, index_col=0)\n",
    "\n",
    "x_blocks = {\n",
    "    \"zone1\":    values.iloc[:, [0, 1, 2, 5, 7, 9, 11]],   # Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1\n",
    "    \"zone2\":    values.iloc[:, [3, 4, 6, 8, 10, 12]],     # Tmax2, Tout2, Tcin2, z2, Fi2, Fs2\n",
    "    \"pressure\": values.iloc[:, [13]],                     # Press\n",
    "}\n",
    "y_df = values.iloc[:, 14:]                                # Conv, Mn, Mw, LCB, SCB\n",
    "\n",
    "print(f\"observations: {len(values)}\")\n",
    "for name, df in x_blocks.items():\n",
    "    print(f\"  {name:8s}: {df.shape[1]:>2d} vars  ->  {list(df.columns)}\")\n",
    "print(f\"  {'Y':8s}: {y_df.shape[1]:>2d} vars  ->  {list(y_df.columns)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "248e43c6",
   "metadata": {},
   "source": [
    "## Complete-data MBPLS baseline\n",
    "\n",
    "Three latent variables already account for almost all of the variance in the quality block. Block-weighting (`sqrt(K_b)` per block) makes the seven-column `zone1` block and the one-column `pressure` block contribute to the super-score on equal terms despite the size difference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4b60d8f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:18.750753Z",
     "iopub.status.busy": "2026-05-11T05:37:18.750530Z",
     "iopub.status.idle": "2026-05-11T05:37:18.812051Z",
     "shell.execute_reply": "2026-05-11T05:37:18.810715Z"
    }
   },
   "outputs": [],
   "source": [
    "complete = MBPLS(n_components=3).fit(x_blocks, y_df)\n",
    "print(f\"algorithm_       : {complete.algorithm_}\")\n",
    "print(f\"has_missing_data_: {complete.has_missing_data_}\")\n",
    "print(f\"R2Y cumulative   : {complete.r2_y_cumulative_.round(3).to_dict()}\")\n",
    "complete.r2_x_per_block_cumulative_.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e146af8",
   "metadata": {},
   "source": [
    "The `pressure` block sits on its own: a single variable, and almost all of its variance is captured by latent variables 2 and 3. The two zone blocks decompose more slowly because each carries seven and six variables of correlated reactor measurements.\n",
    "\n",
    "The super VIPs say which X-blocks pull the most weight when predicting Y."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "068e9376",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:18.814851Z",
     "iopub.status.busy": "2026-05-11T05:37:18.814604Z",
     "iopub.status.idle": "2026-05-11T05:37:18.823492Z",
     "shell.execute_reply": "2026-05-11T05:37:18.822127Z"
    }
   },
   "outputs": [],
   "source": [
    "complete.super_vip_.round(3).rename(\"super VIP\").to_frame()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a895ecc1",
   "metadata": {},
   "source": [
    "## Super-score plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fdadc4f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:18.826205Z",
     "iopub.status.busy": "2026-05-11T05:37:18.825955Z",
     "iopub.status.idle": "2026-05-11T05:37:19.538885Z",
     "shell.execute_reply": "2026-05-11T05:37:19.537593Z"
    }
   },
   "outputs": [],
   "source": [
    "complete.super_score_plot(pc_horiz=1, pc_vert=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa2c6217",
   "metadata": {},
   "source": [
    "## Inject ~10% missing-completely-at-random NaN\n",
    "\n",
    "MCAR is the simplest missingness story: each cell is independently lost with a fixed probability, with no relationship to the value. Sensor dropouts and intermittent communication failures often look MCAR to first order. The mask-aware NIPALS solver computes each per-block score from only the observed entries of each row, so we do not have to impute or drop anything.\n",
    "\n",
    "We keep the one-column `pressure` block complete. With a single variable, any NaN there is a full-row NaN in the block, which is a degeneracy (the row carries no information about that block), and the solver rejects it explicitly rather than silently filling in zero.\n",
    "\n",
    "We also keep one observed entry per row in every multi-column block, so we never accidentally hit the row-all-NaN guard during this synthetic injection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed9d6e7b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:19.541516Z",
     "iopub.status.busy": "2026-05-11T05:37:19.541284Z",
     "iopub.status.idle": "2026-05-11T05:37:19.556220Z",
     "shell.execute_reply": "2026-05-11T05:37:19.554784Z"
    }
   },
   "outputs": [],
   "source": [
    "def inject_mcar(df: pd.DataFrame, ratio: float, rng: np.random.Generator) -> pd.DataFrame:\n",
    "    mask = rng.random(df.shape) < ratio\n",
    "    for r in range(df.shape[0]):\n",
    "        if mask[r].all():\n",
    "            mask[r, 0] = False\n",
    "    return df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))\n",
    "\n",
    "\n",
    "x_nan = {\n",
    "    name: inject_mcar(df, 0.10, rng) if df.shape[1] > 1 else df\n",
    "    for name, df in x_blocks.items()\n",
    "}\n",
    "y_nan = inject_mcar(y_df, 0.10, rng)\n",
    "\n",
    "print(\"NaN injected (~10% MCAR):\")\n",
    "for name, df in x_nan.items():\n",
    "    print(f\"  {name:8s}: {int(df.isna().sum().sum()):>3d} / {df.size}\")\n",
    "print(f\"  {'Y':8s}: {int(y_nan.isna().sum().sum()):>3d} / {y_nan.size}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e5a6053",
   "metadata": {},
   "source": [
    "## Fit MBPLS on the incomplete data\n",
    "\n",
    "Nothing changes in the call site: ``MBPLS(n_components=3, algorithm=\"auto\")`` is the default, and `auto` resolves to ``\"nipals\"`` as soon as it sees any NaN in any block.\n",
    "\n",
    "`fitting_info_[\"iterations\"]` records the inner-loop iteration count per component. Components that touch many missing cells take a few more iterations to converge; the default cap is `md_max_iter=1000` and the tolerance is `md_tol=sqrt(eps)`, both tunable via the ``missing_data_settings`` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d93ca9d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:19.558706Z",
     "iopub.status.busy": "2026-05-11T05:37:19.558468Z",
     "iopub.status.idle": "2026-05-11T05:37:20.207088Z",
     "shell.execute_reply": "2026-05-11T05:37:20.205673Z"
    }
   },
   "outputs": [],
   "source": [
    "with_nan = MBPLS(n_components=3).fit(x_nan, y_nan)\n",
    "print(f\"algorithm_       : {with_nan.algorithm_}\")\n",
    "print(f\"has_missing_data_: {with_nan.has_missing_data_}\")\n",
    "print(f\"iterations / LV  : {with_nan.fitting_info_['iterations']}\")\n",
    "print(f\"R2Y cumulative   : {with_nan.r2_y_cumulative_.round(3).to_dict()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6439b75f",
   "metadata": {},
   "source": [
    "## How close are the predictions to the complete-data fit?\n",
    "\n",
    "The headline check for any missing-data method is: if you run it on the complete data, then knock 10% of the cells out and re-fit, do the predictions move? At ~10% MCAR on this dataset the per-target Pearson correlation between the two prediction series stays above 0.98 for every Y variable; the cumulative R²Y drops by a couple of percentage points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e134005e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:20.209708Z",
     "iopub.status.busy": "2026-05-11T05:37:20.209470Z",
     "iopub.status.idle": "2026-05-11T05:37:20.222117Z",
     "shell.execute_reply": "2026-05-11T05:37:20.220914Z"
    }
   },
   "outputs": [],
   "source": [
    "comparison = pd.DataFrame(\n",
    "    {\n",
    "        \"complete R2Y\": complete.r2_y_cumulative_,\n",
    "        \"with-NaN R2Y\": with_nan.r2_y_cumulative_,\n",
    "    }\n",
    ").round(3)\n",
    "\n",
    "pred_corr = pd.Series(\n",
    "    {\n",
    "        col: np.corrcoef(complete.predictions_[col], with_nan.predictions_[col])[0, 1]\n",
    "        for col in y_df.columns\n",
    "    },\n",
    "    name=\"prediction correlation\",\n",
    ").round(3)\n",
    "\n",
    "comparison, pred_corr"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0981c594",
   "metadata": {},
   "source": [
    "## Predicted-vs-observed under missing data\n",
    "\n",
    "`predictions_vs_observed_plot` draws the y = x diagonal and reports the in-sample RMSE so you can eyeball whether the fit is biased or just noisier under the injected missingness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a3969d8",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-11T05:37:20.225189Z",
     "iopub.status.busy": "2026-05-11T05:37:20.224869Z",
     "iopub.status.idle": "2026-05-11T05:37:20.429141Z",
     "shell.execute_reply": "2026-05-11T05:37:20.427892Z"
    }
   },
   "outputs": [],
   "source": [
    "with_nan.predictions_vs_observed_plot(y_df, variable=\"Conv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efc528cb",
   "metadata": {},
   "source": [
    "## What to try next\n",
    "\n",
    "- Raise the injection ratio (15% / 25%) and watch the prediction correlation degrade gracefully rather than fail.\n",
    "- Switch to ``algorithm=\"dense\"`` explicitly and try to fit the incomplete data; the model raises a clear error rather than silently imputing zeros.\n",
    "- Tighten ``missing_data_settings={\"md_tol\": 1e-12, \"md_max_iter\": 5000}`` and read off the new iteration counts. On well-conditioned data the tighter tolerance is essentially free; on rank-deficient data it stalls and the iteration cap fires.\n",
    "- Try dropping the ``pressure`` block entirely. It is highly informative about *X-variance* (its single column tracks LV2 / LV3 almost perfectly) but its super VIP for Y is comparatively low; the predictive story is largely in the two zone blocks."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}