{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c86ab28b",
   "metadata": {},
   "source": [
    "# MBPCA with missing data: LDPE tubular reactor\n",
    "\n",
    "Companion to the MBPLS LDPE case study. Same dataset, same natural multi-block split, but cast as an unsupervised MBPCA problem: there is no Y target, so the quality block joins the process blocks as a fourth X-block and the model is asked to find a consensus super-score across all four. The mask-aware NIPALS path is enabled the same way as in MBPLS: leave ``algorithm=\"auto\"`` (the default) and any NaN flips it to the iterative solver.\n",
    "\n",
    "**Data.** `LDPE.csv` shipped under `process_improve/datasets/multivariate/LDPE/`.\n",
    "\n",
    "**Blocks (all treated as X for MBPCA).**\n",
    "\n",
    "| Block | Columns |\n",
    "|---|---|\n",
    "| `zone1` | Tin, Tmax1, Tout1, Tcin1, z1, Fi1, Fs1 |\n",
    "| `zone2` | Tmax2, Tout2, Tcin2, z2, Fi2, Fs2 |\n",
    "| `pressure` | Press |\n",
    "| `quality` | Conv, Mn, Mw, LCB, SCB |\n",
    "\n",
    "**What we do.** Fit a complete-data MBPCA, read the per-block R²X and the super-score; inject ~10% MCAR NaN into the multi-column blocks; refit and verify the super-score and the per-block R²X are essentially unchanged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6895c217",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pathlib\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import plotly.io as pio\n",
    "\n",
    "import process_improve\n",
    "from process_improve.multivariate.methods import MBPCA\n",
    "\n",
    "pio.renderers.default = \"notebook_connected\"\n",
    "rng = np.random.default_rng(7)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "623f0882",
   "metadata": {},
   "source": [
    "## Load the dataset and build the 4 X-blocks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a935eacb",
   "metadata": {},
   "outputs": [],
   "source": [
    "ldpe_csv = pathlib.Path(process_improve.__file__).parent / \"datasets\" / \"multivariate\" / \"LDPE\" / \"LDPE.csv\"\n",
    "values = pd.read_csv(ldpe_csv, index_col=0)\n",
    "\n",
    "x_blocks = {\n",
    "    \"zone1\":    values.iloc[:, [0, 1, 2, 5, 7, 9, 11]],\n",
    "    \"zone2\":    values.iloc[:, [3, 4, 6, 8, 10, 12]],\n",
    "    \"pressure\": values.iloc[:, [13]],\n",
    "    \"quality\":  values.iloc[:, 14:],\n",
    "}\n",
    "print(f\"observations: {len(values)}\")\n",
    "for name, df in x_blocks.items():\n",
    "    print(f\"  {name:8s}: {df.shape[1]:>2d} vars  ->  {list(df.columns)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "161e04d3",
   "metadata": {},
   "source": [
    "## Complete-data MBPCA baseline\n",
    "\n",
    "Three latent variables already explain most of the per-block variance. The single-column `pressure` block is dominated by the second and third super-components (its R²X reaches ~0.99 by LV2). The quality block tracks LV1 strongly (R²X ~0.55 from the first component alone), which is exactly what an MBPLS analysis treating `quality` as Y would also conclude."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9c6a523",
   "metadata": {},
   "outputs": [],
   "source": [
    "complete = MBPCA(n_components=3).fit(x_blocks)\n",
    "print(f\"algorithm_       : {complete.algorithm_}\")\n",
    "print(f\"has_missing_data_: {complete.has_missing_data_}\")\n",
    "complete.r2_x_per_block_cumulative_.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa091e79",
   "metadata": {},
   "source": [
    "## Super-score plot\n",
    "\n",
    "The unsupervised super-score is a consensus across all four blocks. Two clusters in the LV1-LV2 plane usually correspond to distinct operating regimes; outliers along LV3 are samples where one block disagrees with the others about which mode the reactor is in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c917a2a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "complete.super_score_plot(pc_horiz=1, pc_vert=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e6e71db",
   "metadata": {},
   "source": [
    "## Inject ~10% missing-completely-at-random NaN\n",
    "\n",
    "Same convention as the MBPLS notebook: MCAR in the multi-column blocks only, with at least one observed value per row, leaving the one-column `pressure` block complete (univariate row-NaN trips the degeneracy guard and is not a meaningful skip-NaN test)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb359253",
   "metadata": {},
   "outputs": [],
   "source": [
    "def inject_mcar(df: pd.DataFrame, ratio: float, rng: np.random.Generator) -> pd.DataFrame:\n",
    "    mask = rng.random(df.shape) < ratio\n",
    "    for r in range(df.shape[0]):\n",
    "        if mask[r].all():\n",
    "            mask[r, 0] = False\n",
    "    return df.mask(pd.DataFrame(mask, index=df.index, columns=df.columns))\n",
    "\n",
    "\n",
    "x_nan = {\n",
    "    name: inject_mcar(df, 0.10, rng) if df.shape[1] > 1 else df\n",
    "    for name, df in x_blocks.items()\n",
    "}\n",
    "\n",
    "print(\"NaN injected (~10% MCAR):\")\n",
    "for name, df in x_nan.items():\n",
    "    print(f\"  {name:8s}: {int(df.isna().sum().sum()):>3d} / {df.size}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cbf96ea",
   "metadata": {},
   "source": [
    "## Fit MBPCA on the incomplete data\n",
    "\n",
    "`algorithm=\"auto\"` is the default; the solver detects the NaN in `x_nan` and resolves to ``\"nipals\"``. `fitting_info_[\"iterations\"]` is generally larger than in the analogous MBPLS fit because MBPCA's outer loop has no Y target to anchor the consensus and so converges the per-block scores more carefully."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ac07a6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "with_nan = MBPCA(n_components=3).fit(x_nan)\n",
    "print(f\"algorithm_       : {with_nan.algorithm_}\")\n",
    "print(f\"has_missing_data_: {with_nan.has_missing_data_}\")\n",
    "print(f\"iterations / LV  : {with_nan.fitting_info_['iterations']}\")\n",
    "with_nan.r2_x_per_block_cumulative_.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7974312a",
   "metadata": {},
   "source": [
    "## Does the super-score still tell the same story?\n",
    "\n",
    "The headline check for unsupervised multi-block missing-data: with the same components and the same data minus a fraction of the cells, do the super-score directions still match? We sign-align each component and compute per-component Pearson correlation between the complete-data and the with-NaN super-scores. On this dataset at ~10% MCAR every component stays above 0.99."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "469ac980",
   "metadata": {},
   "outputs": [],
   "source": [
    "ts_complete = complete.super_scores_.to_numpy()\n",
    "ts_with_nan = with_nan.super_scores_.to_numpy().copy()\n",
    "\n",
    "# Sign-align each component to the complete-data fit before comparing.\n",
    "for j in range(ts_with_nan.shape[1]):\n",
    "    if ts_complete[:, j] @ ts_with_nan[:, j] < 0:\n",
    "        ts_with_nan[:, j] = -ts_with_nan[:, j]\n",
    "\n",
    "per_lv_corr = pd.Series(\n",
    "    [np.corrcoef(ts_complete[:, j], ts_with_nan[:, j])[0, 1] for j in range(ts_with_nan.shape[1])],\n",
    "    index=[f\"LV{j + 1}\" for j in range(ts_with_nan.shape[1])],\n",
    "    name=\"super-score correlation (NaN vs complete)\",\n",
    ")\n",
    "per_lv_corr.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19830a12",
   "metadata": {},
   "source": [
    "## Per-block R²X under missing data\n",
    "\n",
    "The same R²X table as the baseline, side by side, makes the small per-block residual change visible. Differences are at most a couple of percent across all four blocks at 10% MCAR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e82171ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "r2_complete = complete.r2_x_per_block_cumulative_.iloc[:, -1].rename(\"complete\")\n",
    "r2_with_nan = with_nan.r2_x_per_block_cumulative_.iloc[:, -1].rename(\"with NaN\")\n",
    "pd.concat([r2_complete, r2_with_nan, (r2_complete - r2_with_nan).rename(\"delta\")], axis=1).round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "980508e3",
   "metadata": {},
   "source": [
    "## What to try next\n",
    "\n",
    "- Raise the NaN ratio (15% / 25%) and watch the per-LV super-score correlation degrade gracefully rather than collapse.\n",
    "- Inspect the super-score plot under missing data and verify the cluster structure is preserved.\n",
    "- Re-cast the same dataset as MBPLS with `quality` as Y (see the [MBPLS LDPE missing-data notebook](mbpls-ldpe-missing-data.ipynb)) and compare which super-score directions the two formulations agree on."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}