{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A bigger factorial: four factors and model reduction\n", "\n", "**Source worksheet:** [yint.org/w5](https://yint.org/w5) - week 5 of the applied DoE short course.\n", "\n", "Module 3 finished with a 2^3 cube. This module steps up to four\n", "factors and 16 runs. The mathematics is unchanged - it is still\n", "``y = b_0 + sum(b_i * x_i) + sum(b_ij * x_i * x_j) + ...`` - but two\n", "new habits matter:\n", "\n", "1. **Count the terms before you fit.** With ``k`` factors you have\n", " ``2^k`` runs and ``2^k`` terms in the full polynomial: 1 intercept,\n", " ``k`` main effects, ``C(k,2)`` two-factor interactions, and so on.\n", " This sets the ``R^2`` budget: a full model on a saturated design\n", " *must* hit ``R^2 = 1`` exactly because there is one coefficient\n", " per observation.\n", "2. **Reduce the model.** Real responses depend on a handful of\n", " factors and interactions; the rest of the polynomial estimates\n", " noise. Reading the *Pareto* of effect magnitudes and dropping\n", " the small ones recovers degrees of freedom and produces a model\n", " you can trust." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. tip::\n", "\n", " The general approach in this module is to **fit the full model\n", " first, then reduce it**. A 16-run full factorial fits 16\n", " coefficients exactly, so the saturated model leaves no residual\n", " degrees of freedom for testing significance. The Pareto plot is\n", " used to identify the smallest terms, which are dropped to recover\n", " those degrees of freedom." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q1-Q2 - counting interactions\n", "\n", "A 2^4 design has:\n", "\n", "- ``4`` main effects,\n", "- ``C(4, 2) = 6`` two-factor interactions,\n", "- ``C(4, 3) = 4`` three-factor interactions,\n", "- ``C(4, 4) = 1`` four-factor interaction.\n", "\n", "Plus one intercept, that is 16 coefficients in the full model. With\n", "16 runs and 16 coefficients there are zero residual degrees of freedom;\n", "``R^2`` is exactly 1 and the residual standard error is exactly 0.\n", "This is not a sign of a great model - it is a sign that the model has\n", "no room left to disagree with the data." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "from math import comb\n", "\n", "k = 4\n", "print(f\"Main effects : {k}\")\n", "print(f\"2-factor interactions: {comb(k, 2)}\")\n", "print(f\"3-factor interactions: {comb(k, 3)}\")\n", "print(f\"4-factor interactions: {comb(k, 4)}\")\n", "print(f\"Total terms (incl intercept): {sum(comb(k, j) for j in range(k + 1))}\")\n", "print(f\"Runs available : {2 ** k}\")\n", "print(f\"Residual DoF : {2 ** k - sum(comb(k, j) for j in range(k + 1))}\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " ``R^2 = 1.0`` exactly; residual standard error is exactly ``0.0``.\n", " You cannot estimate uncertainty from a saturated design. Either\n", " replicate runs (Module 3) or prune terms (this module)." 
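The "``R^2 = 1`` by construction" claim can be checked without any DoE library at all. Below is a minimal sketch in plain numpy (not the course's `process_improve` package; all names are illustrative): the full 2^4 model matrix is square and orthogonal, so a saturated fit reproduces *any* 16-run response exactly.

```python
import itertools

import numpy as np

# Build the full 2^4 model matrix: one column per subset of the 4 factors
# (the empty subset is the intercept column of ones).
levels = np.array(list(itertools.product([-1, 1], repeat=4)))  # 16 runs x 4 factors
columns = [
    np.prod(levels[:, list(subset)], axis=1)
    for r in range(5)
    for subset in itertools.combinations(range(4), r)
]  # 1 intercept + 4 mains + 6 two-way + 4 three-way + 1 four-way = 16 columns
X = np.column_stack(columns)

y_any = np.arange(16.0)           # an arbitrary stand-in response
beta = np.linalg.solve(X, y_any)  # 16 equations, 16 unknowns: unique solution
print(np.allclose(X @ beta, y_any))  # True -> zero residuals, R^2 = 1 exactly
```

Because `X` is invertible, this holds for any response vector whatsoever — which is exactly why a perfect fit here carries no information about noise.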
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q3 - the bioreactor system\n", "\n", "Four factors are varied in 16 random-order runs:\n", "\n", "- **A = feed rate** [5 or 8 g/min]\n", "- **B = initial inoculant** [300 or 400 g]\n", "- **C = feed substrate concentration** [40 or 60 g/L]\n", "- **D = dissolved oxygen set-point** [4 or 5 mg/L]\n", "\n", "In standard order (A varies fastest, then B, then C, then D), the\n", "yields (in g/L) are:\n", "\n", "```\n", "y = [60, 59, 63, 61, 69, 61, 94, 93,\n", " 56, 63, 70, 65, 44, 45, 78, 77]\n", "```" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "\n", "from process_improve.experiments import c, gather, lm, predict\n", "from process_improve.experiments.visualization import visualize_doe\n", "\n", "pio.renderers.default = \"notebook_connected\"\n", "\n", "# Standard order: A flips every row, B every 2, C every 4, D every 8.\n", "\n", "A = c(-1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, name=\"A\", lo=5, hi=8)\n", "B = c(-1, -1, +1, +1, -1, -1, +1, +1, -1, -1, +1, +1, -1, -1, +1, +1, name=\"B\", lo=300, hi=400)\n", "C = c(-1, -1, -1, -1, +1, +1, +1, +1, -1, -1, -1, -1, +1, +1, +1, +1, name=\"C\", lo=40, hi=60)\n", "D = c(-1, -1, -1, -1, -1, -1, -1, -1, +1, +1, +1, +1, +1, +1, +1, +1, name=\"D\", lo=4, hi=5)\n", "y = c(60, 59, 63, 61, 69, 61, 94, 93,\n", " 56, 63, 70, 65, 44, 45, 78, 77, name=\"y\")\n", "bio = gather(A=A, B=B, C=C, D=D, y=y)\n", "print(f\"Design has {len(bio)} runs and {bio.shape[1] - 1} factors\")" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Fit the full 4-way model. 
Every interaction term is included; the\n", "# design is saturated so R^2 = 1 by construction.\n", "\n", "full = lm(\"y ~ A * B * C * D\", bio)\n", "params = full.get_parameters(drop_intercept=False)\n", "\n", "# Sort by absolute magnitude to spot the dominant terms.\n", "\n", "ordered = params.reindex(params.abs().sort_values(ascending=False).index)\n", "print(ordered.to_string())" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Pareto-style bar chart of the effects (= 2 * coefficient).\n", "\n", "effects = {\n", " term: 2 * coef\n", " for term, coef in full.get_parameters(drop_intercept=True).items()\n", "}\n", "pareto = visualize_doe(\n", " plot_type=\"pareto\",\n", " analysis_results={\"effects\": effects},\n", " backend=\"plotly\",\n", ")\n", "fig = go.Figure(pareto[\"plotly\"])\n", "fig.update_layout(width=720, height=460, title=\"Pareto plot of effects on bioreactor yield\")\n", "fig" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " In magnitude order the **regression coefficients** are roughly:\n", "\n", " ::\n", "\n", " B (+9.0) B:C (+6.4) C:D (-5.25) C (+4.0) D (-3.9)\n", " B:D (+1.25) A:B:D (-1.25) A:B:C (+1.1) ... all the rest <= 1\n", "\n", " These numbers are **half** the bar heights in the Pareto plot\n", " above: the Pareto plot draws *effects*, and an effect is defined\n", " as ``effect = 2 * coefficient`` (the response change when a factor\n", " moves from its ``-1`` level all the way to its ``+1`` level, a\n", " span of two coded units). Either view tells the same story; the\n", " factor of two is just a convention.\n", "\n", " The system is dominated by **B**, **C**, and **D**; the two-factor\n", " interactions **B:C** and **C:D** are real and roughly the same size\n", " as the main effects. 
**A** (feed rate) and every interaction\n", "    involving A are tiny - feed rate is essentially insensitive in this\n", "    range." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q6 - reducing the model\n", "\n", "Dropping A buys back eight degrees of freedom: the main effect of A\n", "plus the seven interactions that involve A (three two-factor, three\n", "three-factor, and the four-factor term). The *orthogonal* design\n", "guarantees the coefficients on the remaining\n", "terms are unchanged - only the standard errors and the residual\n", "degrees of freedom move." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "reduced = lm(\"y ~ B * C * D\", bio)\n", "print(\"Reduced model (A and all its interactions dropped):\")\n", "print(reduced.get_parameters(drop_intercept=False).to_string())\n", "print()\n", "print(f\"Reduced R^2: {reduced._OLS.rsquared:.4f}\")\n", "print(f\"Reduced residual std error: {(reduced._OLS.scale ** 0.5):.2f} g/L\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", "    The coefficients for B, C, D, B:C, B:D, C:D, B:C:D are *exactly*\n", "    the same as in the full model. Only the standard errors change.\n", "    With 8 residual degrees of freedom you can now talk about\n", "    confidence intervals; the residual std error of about 3.0 g/L is\n", "    the implicit noise estimate.\n", "\n", ".. admonition:: Guidance\n", "\n", "    Dropping a factor does not mean \"feed rate does not matter.\" It\n", "    means \"feed rate does not matter *over the range tested* (5 to\n", "    8 g/min)\". Stretch the range and the conclusion may change.\n", "    Set A to whichever level is cheapest, safest, or most operationally\n", "    convenient."
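The "coefficients are exactly the same" claim is worth seeing once with bare linear algebra. A sketch in plain numpy (independent of `process_improve`; everything is recomputed from the run data listed above, using fresh array names so nothing shadows the notebook's `c(...)` columns): the B:C coefficient from the full model's orthogonal contrast equals the one a least-squares fit of `y ~ B*C*D` returns, and the leftover scatter gives the residual standard error.

```python
import numpy as np

# The 16 yields in standard order, plus the coded B, C, D columns
# (B flips every 2 runs, C every 4, D every 8).
y_obs = np.array([60, 59, 63, 61, 69, 61, 94, 93,
                  56, 63, 70, 65, 44, 45, 78, 77], dtype=float)
Bc = np.tile([-1, -1, 1, 1], 4)
Cc = np.tile([-1] * 4 + [1] * 4, 2)
Dc = np.repeat([-1, 1], 8)

# Full model: every coefficient is an orthogonal contrast, (x . y) / 16.
b_BC_full = (Bc * Cc) @ y_obs / 16

# Reduced model y ~ B*C*D, fitted by ordinary least squares.
Xr = np.column_stack([np.ones(16), Bc, Cc, Dc,
                      Bc * Cc, Bc * Dc, Cc * Dc, Bc * Cc * Dc])
beta_r, *_ = np.linalg.lstsq(Xr, y_obs, rcond=None)
b_BC_reduced = beta_r[4]
print(b_BC_full, b_BC_reduced)  # identical: orthogonality at work

resid = y_obs - Xr @ beta_r
print((resid @ resid / (16 - 8)) ** 0.5)  # residual std error on 8 DoF
```

The equality holds for every retained term, not just B:C, because in this design each model-matrix column is uncorrelated with all the others.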
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q7-Q8 - where to go next\n", "\n", "To **maximize** yield:\n", "\n", "- ``B`` is positive -> set B high (+1).\n", "- ``C`` is positive **and** ``B:C`` is positive -> set C high (+1) too;\n", " the interaction amplifies B.\n", "- ``D`` is negative **and** ``C:D`` is negative -> set D low (-1);\n", " high D actively hurts when C is high.\n", "\n", "That points to the corner ``B = +1, C = +1, D = -1`` (with A wherever\n", "is cheapest). Several extrapolation candidates are explored below." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "candidates = [\n", " (\"B=+1, C=+1, D=-1 (corner)\", dict(A=0, B=+1, C=+1, D=-1)),\n", " (\"B=+1.5, C=+1.5, D=-1\", dict(A=0, B=+1.5, C=+1.5, D=-1)),\n", " (\"B=+2, C=+1, D=-1.5\", dict(A=0, B=+2, C=+1, D=-1.5)),\n", " (\"B=+2, C=+2, D=-1\", dict(A=0, B=+2, C=+2, D=-1)),\n", "]\n", "for label, point in candidates:\n", " pred = float(predict(reduced, **point).iloc[0])\n", " print(f\"{label:40s} -> predicted yield {pred:5.1f} g/L\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " - **Corner ``B = +1, C = +1, D = -1``** is predicted at about\n", " ``93.5 g/L``, matching the observed run 7 (yield 94).\n", " - **``B = +1.5, C = +1.5, D = -1``** pushes the prediction past\n", " ``100 g/L`` - the model thinks more is better, but you should\n", " confirm with one experiment before betting on it.\n", " - **``B = +2, C = +1, D = -1.5``** explores trading harder on B\n", " and on suppressing D; same neighbourhood.\n", "\n", " The strategy is steepest ascent: each candidate moves further\n", " along the gradient direction. Run one, compare predicted to\n", " observed, and use the residual to decide whether to keep going\n", " or to call this the operational sweet spot." 
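The candidate predictions can also be reproduced by hand, which makes the steepest-ascent logic transparent: in coded units the reduced model's intercept is just `mean(y)`, and a prediction is the intercept plus each retained coefficient times the corresponding coded coordinate. A plain-numpy sketch (independent of `process_improve`; coefficients recomputed from the run data above, not hard-coded from the text, and `predict_at` is an illustrative helper):

```python
import numpy as np

# Yields in standard order and the coded B, C, D columns.
y_obs = np.array([60, 59, 63, 61, 69, 61, 94, 93,
                  56, 63, 70, 65, 44, 45, 78, 77], dtype=float)
Bc = np.tile([-1, -1, 1, 1], 4)      # B flips every 2 runs
Cc = np.tile([-1] * 4 + [1] * 4, 2)  # C every 4
Dc = np.repeat([-1, 1], 8)           # D every 8
terms = {"B": Bc, "C": Cc, "D": Dc,
         "BC": Bc * Cc, "BD": Bc * Dc, "CD": Cc * Dc, "BCD": Bc * Cc * Dc}
coef = {name: (x @ y_obs) / 16 for name, x in terms.items()}

def predict_at(b, c, d):
    """Reduced-model prediction at coded coordinates (b, c, d)."""
    point = {"B": b, "C": c, "D": d,
             "BC": b * c, "BD": b * d, "CD": c * d, "BCD": b * c * d}
    return y_obs.mean() + sum(coef[k] * point[k] for k in coef)

print(predict_at(+1, +1, -1))      # corner: 93.5 g/L, close to observed 94
print(predict_at(+1.5, +1.5, -1))  # extrapolation: well past 100 g/L
```

Swapping in other coded coordinates reproduces the rest of the candidate table; the extrapolated points lean on the model holding outside the tested cube, which is exactly what the confirmation run is for.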
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrap-up\n", "\n", "Two transferable lessons that scale to any number of factors:\n", "\n", "- A **saturated full factorial** spends every observation on a\n", "  coefficient. The model fits perfectly and tells you nothing about\n", "  uncertainty. Replicates (Module 3) or model reduction (this\n", "  module) give you that uncertainty back.\n", "- **Orthogonality** is what makes \"just drop the small terms\" safe.\n", "  In a designed full factorial the columns of the model matrix are\n", "  uncorrelated, so removing a factor does not bias the remaining\n", "  coefficients.\n", "\n", "**Next:** Module 5 attacks the same problem from a different direction.\n", "What if you cannot afford 16 runs to begin with? Fractional\n", "factorials trade *resolution* for *budget*: half a factorial costs\n", "half as much, and (usually) sacrifices only the highest-order\n", "interaction you did not believe in anyway." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }