{ "cells": [ { "cell_type": "markdown", "id": "7fb27b941602401d91542211134fc71a", "metadata": {}, "source": "# Regressing cheddar taste on chemical composition\n\nThe cheddar dataset records a panel-tasting score for 30 samples of mature cheddar alongside three chemical measurements that vary with maturation: free acetic acid concentration, hydrogen sulfide, and lactic acid. The question is the standard regression question: how much of the variation in taste can the chemistry explain, and which of the three predictors carries the most weight?\n\n**Data.** `cheddar-cheese.csv` from [openmv.net](https://openmv.net/info/cheddar-cheese). 30 rows, five columns: `Case` (a sample identifier) and the four numeric variables `Acetic`, `H2S`, `Lactic`, `Taste`.\n\n**What we do.** Fit simple linear regression of taste on hydrogen sulfide on its own, then a multiple linear regression on all three predictors, read the coefficient table, and check the residual diagnostics that come with the `OLS` estimator.\n\n**Adapted from** the *Least squares modelling* chapter of the [Process Improvement using Data](https://learnche.org/pid) book (CC BY-SA 4.0)." }, { "cell_type": "code", "execution_count": null, "id": "acae54e37e7d407bbb7b55eff062a284", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "\n", "from process_improve.regression.methods import OLS" ] }, { "cell_type": "markdown", "id": "9a63283cbaf04dbcab1f6479b197f3a8", "metadata": {}, "source": "## Load the data\n\nThe first column is a sample identifier (`Case`); the next four are the numeric variables we will model. We strip any whitespace from column names so case-insensitive lookups in the next cell are robust to small naming differences between revisions of the dataset." }, { "cell_type": "code", "execution_count": null, "id": "8dd0d8092fe74a7c96281538738b07e2", "metadata": {}, "outputs": [], "source": "cheddar = pd.read_csv(\"https://openmv.net/file/cheddar-cheese.csv\")\ncheddar.columns = [c.strip() for c in cheddar.columns]\nprint(f\"Shape: {cheddar.shape}\")\nprint(f\"Columns: {list(cheddar.columns)}\")\ncheddar.head()" }, { "cell_type": "code", "execution_count": null, "id": "72eea5119410473aa328ad9291626812", "metadata": {}, "outputs": [], "source": [ "# Identify columns case-insensitively so the notebook is robust to small\n", "# naming differences between revisions of the dataset.\n", "cols = {c.lower(): c for c in cheddar.columns}\n", "y_col = cols[\"taste\"]\n", "h2s_col = cols[\"h2s\"]\n", "predictor_cols = [cols[\"acetic\"], cols[\"h2s\"], cols[\"lactic\"]]\n", "y = cheddar[y_col].astype(float)\n", "print(f\"Response: {y_col}; Predictors: {predictor_cols}\")" ] }, { "cell_type": "markdown", "id": "8edb47106e1a46a883d545849b8ab81b", "metadata": {}, "source": [ "## Simple linear regression\n", "\n", "Hydrogen sulfide is the predictor that pid-book singles out as the strongest individual driver. Fitting taste on `H2S` alone is a useful sanity check; we expect a positive slope (more mature cheese smells stronger and tastes more) and a comfortable signal-to-noise ratio." 
] }, { "cell_type": "code", "execution_count": null, "id": "10185d26023b46108eb7d9f57d49d2b3", "metadata": {}, "outputs": [], "source": [ "model_simple = OLS().fit(cheddar[[h2s_col]], y)\n", "print(\n", " f\"intercept = {model_simple.intercept_:.2f}, \"\n", " f\"slope = {model_simple.coefficients_[0]:.2f}, \"\n", " f\"R2 = {model_simple.r2_:.3f}\"\n", ")\n", "print(f\"t-value on slope = {model_simple.t_values_[0]:.2f}, p-value = {model_simple.p_values_[0]:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "8763a12b2bbd4a93a75aff182afb95dc", "metadata": {}, "outputs": [], "source": [ "# Scatter plot with fitted line and 95 percent prediction interval band.\n", "x_grid = model_simple.pi_range_[:, 0]\n", "lo = model_simple.pi_range_[:, 1]\n", "hi = model_simple.pi_range_[:, 2]\n", "y_hat = model_simple.intercept_ + model_simple.coefficients_[0] * x_grid\n", "\n", "_fig, ax = plt.subplots(figsize=(7, 4))\n", "ax.scatter(cheddar[h2s_col], y, s=30, color=\"#1f77b4\", label=\"observed\")\n", "ax.plot(x_grid, y_hat, color=\"k\", linewidth=1.2, label=\"fit\")\n", "ax.fill_between(x_grid, lo, hi, color=\"orange\", alpha=0.25, label=\"95% prediction band\")\n", "ax.set_xlabel(h2s_col)\n", "ax.set_ylabel(y_col)\n", "ax.set_title(\"Cheddar taste vs hydrogen sulfide\")\n", "ax.legend(loc=\"best\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "7623eae2785240b9bd12b16a66d81610", "metadata": {}, "source": [ "## Multiple linear regression\n", "\n", "Adding the other two predictors gives the model a chance to explain more of the variation, but tells us less about which predictor matters most: collinearity between the three chemistry variables means the individual coefficient estimates trade against each other." ] }, { "cell_type": "code", "execution_count": null, "id": "7cdc8c89c7104fffa095e18ddfef8986", "metadata": {}, "outputs": [], "source": [ "model_full = OLS().fit(cheddar[predictor_cols], y)\n", "summary = pd.DataFrame(\n", " {\n", " \"coefficient\": model_full.coefficients_,\n", " \"std_error\": model_full.standard_errors_,\n", " \"t_value\": model_full.t_values_,\n", " \"p_value\": model_full.p_values_,\n", " },\n", " index=predictor_cols,\n", ")\n", "summary.loc[\"(intercept)\"] = [\n", " model_full.intercept_,\n", " model_full.standard_error_intercept_,\n", " model_full.t_value_intercept_,\n", " model_full.p_value_intercept_,\n", "]\n", "print(\n", " f\"R2 = {model_full.r2_:.3f}, adjusted R2 = {model_full.adj_r2_:.3f}, \"\n", " f\"F = {model_full.f_statistic_:.2f} (p = {model_full.f_pvalue_:.4f})\"\n", ")\n", "summary.round(3)" ] }, { "cell_type": "markdown", "id": "b118ea5561624da68c537baed56e602f", "metadata": {}, "source": [ "## Residual diagnostics\n", "\n", "Three quick plots tell you most of what you need to know about whether the linear model is appropriate:\n", "\n", "- **Residuals vs fitted**: should be a horizontal cloud around zero; a U-shape means a curvature term is missing, a fan shape means the variance grows with the fitted value.\n", "- **Residual histogram or Q-Q plot**: should look approximately normal if you want to trust the t-statistics.\n", "- **Residuals vs sample order** (if applicable): non-random patterns suggest the rows are not independent." 
] }, { "cell_type": "code", "execution_count": null, "id": "938c804e27f84196a10c8828c723f798", "metadata": {}, "outputs": [], "source": [ "fitted = model_full.fitted_values_\n", "resid = model_full.residuals_\n", "resid_clean = resid[np.isfinite(resid)]\n", "\n", "_fig, axes = plt.subplots(1, 3, figsize=(11, 3.2))\n", "axes[0].scatter(fitted, resid_clean, s=25, color=\"#1f77b4\")\n", "axes[0].axhline(0, color=\"k\", linewidth=0.8)\n", "axes[0].set_xlabel(\"Fitted\")\n", "axes[0].set_ylabel(\"Residual\")\n", "axes[0].set_title(\"Residuals vs fitted\")\n", "\n", "axes[1].hist(resid_clean, bins=10, color=\"#1f77b4\", edgecolor=\"k\")\n", "axes[1].set_xlabel(\"Residual\")\n", "axes[1].set_title(\"Residual distribution\")\n", "\n", "stats.probplot(resid_clean, dist=\"norm\", plot=axes[2])\n", "axes[2].set_title(\"Q-Q plot\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "504fb2a444614c0babb325280ed9130a", "metadata": {}, "source": [ "## What to try next\n", "\n", "- Drop the predictor with the largest p-value and re-fit; compare the coefficient estimates of the two retained predictors. They should change because of the collinearity between the three chemistry variables.\n", "- Try `OLS(fit_intercept=False)` to see what an origin-passing fit looks like; the diagnostics shift even though the data does not.\n", "- For correlated predictors, switch to PLS to avoid the variance-inflation problem altogether. See the latent-variable case studies for examples." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }