{ "cells": [ { "cell_type": "markdown", "id": "7fb27b941602401d91542211134fc71a", "metadata": {}, "source": "# Regressing cheddar taste on chemical composition\n\nThe cheddar dataset records a panel-tasting score for 30 samples of mature cheddar alongside three chemical measurements that vary with maturation: free acetic acid concentration, hydrogen sulfide, and lactic acid. The question is the standard regression question: how much of the variation in taste can the chemistry explain, and which of the three predictors carries the most weight?\n\n**Data.** `cheddar-cheese.csv` from [openmv.net](https://openmv.net/info/cheddar-cheese). 30 rows, five columns: `Case` (a sample identifier) and the four numeric variables `Acetic`, `H2S`, `Lactic`, `Taste`.\n\n**What we do.** Fit simple linear regression of taste on hydrogen sulfide on its own, then a multiple linear regression on all three predictors, read the coefficient table, and check the residual diagnostics that come with the `OLS` estimator.\n\n**Adapted from** the *Least squares modelling* chapter of the [Process Improvement using Data](https://learnche.org/pid) book (CC BY-SA 4.0)." }, { "cell_type": "code", "execution_count": null, "id": "acae54e37e7d407bbb7b55eff062a284", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "\n", "from process_improve.regression.methods import OLS" ] }, { "cell_type": "markdown", "id": "9a63283cbaf04dbcab1f6479b197f3a8", "metadata": {}, "source": "## Load the data\n\nThe first column is a sample identifier (`Case`); the next four are the numeric variables we will model. We strip any whitespace from column names so case-insensitive lookups in the next cell are robust to small naming differences between revisions of the dataset." }, { "cell_type": "code", "execution_count": null, "id": "8dd0d8092fe74a7c96281538738b07e2", "metadata": {}, "outputs": [], "source": "cheddar = pd.read_csv(\"https://openmv.net/file/cheddar-cheese.csv\")\ncheddar.columns = [c.strip() for c in cheddar.columns]\nprint(f\"Shape: {cheddar.shape}\")\nprint(f\"Columns: {list(cheddar.columns)}\")\ncheddar.head()" }, { "cell_type": "code", "execution_count": null, "id": "72eea5119410473aa328ad9291626812", "metadata": {}, "outputs": [], "source": [ "# Identify columns case-insensitively so the notebook is robust to small\n", "# naming differences between revisions of the dataset.\n", "cols = {c.lower(): c for c in cheddar.columns}\n", "y_col = cols[\"taste\"]\n", "h2s_col = cols[\"h2s\"]\n", "predictor_cols = [cols[\"acetic\"], cols[\"h2s\"], cols[\"lactic\"]]\n", "y = cheddar[y_col].astype(float)\n", "print(f\"Response: {y_col}; Predictors: {predictor_cols}\")" ] }, { "cell_type": "markdown", "id": "8edb47106e1a46a883d545849b8ab81b", "metadata": {}, "source": [ "## Simple linear regression\n", "\n", "Hydrogen sulfide is the predictor that pid-book singles out as the strongest individual driver. Fitting taste on `H2S` alone is a useful sanity check; we expect a positive slope (more mature cheese smells stronger and tastes more) and a comfortable signal-to-noise ratio." 
] }, { "cell_type": "code", "execution_count": null, "id": "10185d26023b46108eb7d9f57d49d2b3", "metadata": {}, "outputs": [], "source": [ "model_simple = OLS().fit(cheddar[[h2s_col]], y)\n", "print(\n", " f\"intercept = {model_simple.intercept_:.2f}, \"\n", " f\"slope = {model_simple.coefficients_[0]:.2f}, \"\n", " f\"R2 = {model_simple.r2_:.3f}\"\n", ")\n", "print(f\"t-value on slope = {model_simple.t_values_[0]:.2f}, p-value = {model_simple.p_values_[0]:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "8763a12b2bbd4a93a75aff182afb95dc", "metadata": {}, "outputs": [], "source": [ "# Scatter plot with fitted line and 95 percent prediction interval band.\n", "x_grid = model_simple.pi_range_[:, 0]\n", "lo = model_simple.pi_range_[:, 1]\n", "hi = model_simple.pi_range_[:, 2]\n", "y_hat = model_simple.intercept_ + model_simple.coefficients_[0] * x_grid\n", "\n", "_fig, ax = plt.subplots(figsize=(7, 4))\n", "ax.scatter(cheddar[h2s_col], y, s=30, color=\"#1f77b4\", label=\"observed\")\n", "ax.plot(x_grid, y_hat, color=\"k\", linewidth=1.2, label=\"fit\")\n", "ax.fill_between(x_grid, lo, hi, color=\"orange\", alpha=0.25, label=\"95% prediction band\")\n", "ax.set_xlabel(h2s_col)\n", "ax.set_ylabel(y_col)\n", "ax.set_title(\"Cheddar taste vs hydrogen sulfide\")\n", "ax.legend(loc=\"best\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "7623eae2785240b9bd12b16a66d81610", "metadata": {}, "source": [ "## Multiple linear regression\n", "\n", "Adding the other two predictors gives the model a chance to explain more of the variation, but tells us less about which predictor matters most: collinearity between the three chemistry variables means the individual coefficient estimates trade against each other." ] }, { "cell_type": "code", "execution_count": null, "id": "7cdc8c89c7104fffa095e18ddfef8986", "metadata": {}, "outputs": [], "source": [ "model_full = OLS().fit(cheddar[predictor_cols], y)\n", "summary = pd.DataFrame(\n", " {\n", " \"coefficient\": model_full.coefficients_,\n", " \"std_error\": model_full.standard_errors_,\n", " \"t_value\": model_full.t_values_,\n", " \"p_value\": model_full.p_values_,\n", " },\n", " index=predictor_cols,\n", ")\n", "summary.loc[\"(intercept)\"] = [\n", " model_full.intercept_,\n", " model_full.standard_error_intercept_,\n", " model_full.t_value_intercept_,\n", " model_full.p_value_intercept_,\n", "]\n", "print(\n", " f\"R2 = {model_full.r2_:.3f}, adjusted R2 = {model_full.adj_r2_:.3f}, \"\n", " f\"F = {model_full.f_statistic_:.2f} (p = {model_full.f_pvalue_:.4f})\"\n", ")\n", "summary.round(3)" ] }, { "cell_type": "markdown", "id": "b118ea5561624da68c537baed56e602f", "metadata": {}, "source": [ "## Residual diagnostics\n", "\n", "Three quick plots tell you most of what you need to know about whether the linear model is appropriate:\n", "\n", "- **Residuals vs fitted**: should be a horizontal cloud around zero; a U-shape means a curvature term is missing, a fan shape means the variance grows with the fitted value.\n", "- **Residual histogram or Q-Q plot**: should look approximately normal if you want to trust the t-statistics.\n", "- **Residuals vs sample order** (if applicable): non-random patterns suggest the rows are not independent." 
] }, { "cell_type": "code", "execution_count": null, "id": "938c804e27f84196a10c8828c723f798", "metadata": {}, "outputs": [], "source": [ "fitted = model_full.fitted_values_\n", "resid = model_full.residuals_\n", "resid_clean = resid[np.isfinite(resid)]\n", "\n", "_fig, axes = plt.subplots(1, 3, figsize=(11, 3.2))\n", "axes[0].scatter(fitted, resid_clean, s=25, color=\"#1f77b4\")\n", "axes[0].axhline(0, color=\"k\", linewidth=0.8)\n", "axes[0].set_xlabel(\"Fitted\")\n", "axes[0].set_ylabel(\"Residual\")\n", "axes[0].set_title(\"Residuals vs fitted\")\n", "\n", "axes[1].hist(resid_clean, bins=10, color=\"#1f77b4\", edgecolor=\"k\")\n", "axes[1].set_xlabel(\"Residual\")\n", "axes[1].set_title(\"Residual distribution\")\n", "\n", "stats.probplot(resid_clean, dist=\"norm\", plot=axes[2])\n", "axes[2].set_title(\"Q-Q plot\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "504fb2a444614c0babb325280ed9130a", "metadata": {}, "source": [ "## What to try next\n", "\n", "- Drop the predictor with the largest p-value and re-fit; compare the coefficient estimates of the two retained predictors. They should change because of the collinearity between the three chemistry variables.\n", "- Try `OLS(fit_intercept=False)` to see what an origin-passing fit looks like; the diagnostics shift even though the data does not.\n", "- For correlated predictors, switch to PLS to avoid the variance-inflation problem altogether. See the latent-variable case studies for examples." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }