{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PCA on food texture\n", "\n", "A small dataset where five sensory and instrumental measurements are taken on each of 50 pastry samples: oil percentage, density, crispiness, fracture force, and hardness. The variables are correlated and measured on different scales, so PCA on the autoscaled data shows the dominant patterns in two components and lets us identify samples that do not look like the rest of the batch.\n", "\n", "**Data.** `food-texture.csv` from [openmv.net](https://openmv.net/info/food-texture). The first column is a sample identifier; the remaining five are the measurements.\n", "\n", "**Adapted from** the *Principal Component Analysis* chapter of the [Process Improvement using Data](https://learnche.org/pid) book (CC BY-SA 4.0)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import plotly.io as pio\n", "\n", "from process_improve.multivariate.methods import PCA, MCUVScaler\n", "\n", "pio.renderers.default = \"notebook_connected\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "foods = pd.read_csv(\"https://openmv.net/file/food-texture.csv\", index_col=0)\n", "print(f\"Shape: {foods.shape[0]} samples x {foods.shape[1]} variables\")\n", "foods.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "foods.describe().round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The five variables span very different ranges (oil percentage is in the teens, density is in the thousands, and the other three sit on their own scales in between). 
Without scaling, density would dominate the model purely through its variance.\n", "\n", "## Preprocess and fit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = MCUVScaler().fit_transform(foods)\n", "\n", "selection = PCA.select_n_components(X, max_components=4, cv=5)\n", "print(f\"Recommended A = {selection.n_components}\")\n", "selection.press_ratio.round(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = PCA(n_components=max(2, selection.n_components)).fit(X)\n", "model.r2_cumulative_.round(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score and loading plots\n", "\n", "With only five variables, the loading plot is more useful than for high-dimensional spectra: every loading carries a recognisable label, and we can read directly which sensory and instrumental measurements drive each component." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.score_plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.loading_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SPE and Hotelling's $T^2$\n", "\n", "SPE (squared prediction error) measures how far each sample lies off the model plane; Hotelling's $T^2$ measures how far the sample's projection lies from the centre of the scores, within the plane. The two statistics flag different kinds of unusual sample, so both plots are worth checking." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.spe_plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.t2_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outlier explanation\n", "\n", "Each flagged sample can be explained back in terms of the original five variables with `score_contributions`." 
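] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check on what a score contribution means: for autoscaled data the contribution of variable $k$ to score $t_{i,a}$ is $x_{i,k} p_{k,a}$, and the contributions sum to the score itself. A minimal NumPy sketch on synthetic data illustrates this (hypothetical; the exact weighting `score_contributions` applies in `process_improve` may differ):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Synthetic stand-in for an autoscaled data matrix (50 samples x 5 variables)\n", "rng = np.random.default_rng(0)\n", "Xs = rng.normal(size=(50, 5))\n", "Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0, ddof=1)\n", "\n", "# Loadings from the SVD of the scaled data; scores follow as T = X P\n", "U, S, Vt = np.linalg.svd(Xs, full_matrices=False)\n", "P = Vt.T\n", "T = Xs @ P\n", "\n", "# Per-variable contributions to sample 0's first score sum to that score\n", "contrib = Xs[0] * P[:, 0]\n", "assert np.isclose(contrib.sum(), T[0, 0])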
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "outliers = model.detect_outliers(conf_level=0.95)\n", "print(f\"{len(outliers)} samples flagged\")\n", "for o in outliers[:3]:\n", "    print(f\"  {o['observation']}: {o['outlier_types']} (severity={o['severity']:.2f})\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if outliers:\n", "    worst = outliers[0][\"observation\"]\n", "    contrib = model.score_contributions(model.scores_.loc[worst].values)\n", "\n", "    fig, ax = plt.subplots(figsize=(6, 3))\n", "    contrib.plot.bar(ax=ax, color=\"steelblue\")\n", "    ax.axhline(0, color=\"k\", linewidth=0.5)\n", "    ax.set_ylabel(\"Contribution\")\n", "    ax.set_title(f\"Score contributions for sample {worst}\")\n", "    plt.tight_layout()\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What to try next\n", "\n", "- The food-texture dataset is small enough that you can recompute everything by hand and check intuition. Try changing the number of components and watch what happens to the score plot, the SPE limit, and the list of flagged samples.\n", "- Compare the loading plot against domain knowledge: do the variables that cluster together in the loading plot also correlate strongly in `foods.corr()`?\n", "- Once a property like consumer acceptance is added, switch to PLS to relate the texture variables to it." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }