{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Coding, scaling, and the linear model\n", "\n", "**Source worksheet:** [yint.org/w2](https://yint.org/w2) - week 2 of the applied DoE short course.\n", "\n", "Module 1 read effects straight off corner averages of a 2x2. That was\n", "appropriate for two factors and four runs, but the same approach\n", "becomes cumbersome\n", "once you scale beyond it. Module 2 introduces the *coded linear model*\n", "- the workhorse that scales to any number of factors, copes with\n", "interactions, and gives you predictions outside the corners you ran.\n", "\n", "The math is unchanged from Module 1; what we add is **coded units**\n", "(-1 and +1 instead of grams and minutes), a **linear regression** that\n", "recovers the same effects you computed by hand, and a sober look at\n", "**extrapolation** - what the model promises, and what it cannot promise.\n", "\n", "Everything below uses Python with `process_improve`; the original\n", "worksheet's R snippet at ``yint.org/2w1`` is translated to its Python\n", "equivalent." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. tip::\n", "\n", " The first habit to build: in design space, every factor lives on the\n", " same scale, with ``-1`` at its low extreme and ``+1`` at its high\n", " extreme. A 30 mL setting becomes ``+1``; 200 grams becomes ``-1``.\n", " This is *coding*. It makes coefficients directly comparable and\n", " makes predictions outside the corner runs straightforward to write\n", " down. In ``process_improve`` you state the real-world range with\n", " ``c(..., lo=..., hi=...)`` and the package handles the coding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions 1 and 2 - how many runs does a full factorial need?\n", "\n", "A full factorial runs every combination of every factor level. For *k*\n", "factors at *L* levels each, that is **L**^**k** experiments.\n", "\n", "1. 
Four factors, each at two levels: how many runs?\n", "2. Four factors, each at three levels: how many runs?" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "def full_factorial_size(num_factors: int, levels: int) -> int:\n", " return levels ** num_factors\n", "\n", "\n", "two_level = full_factorial_size(num_factors=4, levels=2)\n", "three_level = full_factorial_size(num_factors=4, levels=3)\n", "print(f\"Full factorial, 4 factors x 2 levels: {two_level} runs\")\n", "print(f\"Full factorial, 4 factors x 3 levels: {three_level} runs\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " 1. ``2**4 = 16`` runs.\n", " 2. ``3**4 = 81`` runs.\n", "\n", " The rapid growth in run count motivates *fractional* factorials\n", " (Module 5):\n", " even modest screens become unaffordable at three or four levels.\n", "\n", ".. admonition:: Guidance\n", "\n", " Two levels per factor is the default starting point because four\n", " factors fit in 16 runs, eight factors fit in 256 - but two levels\n", " already test the linear effect of every factor. A third level\n", " (often a center point at ``0``) is added when you suspect *curvature*;\n", " we revisit this in Module 4." 
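] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sketch of why run counts explode, note that a full factorial\n", "is just the Cartesian product of every factor's level set, which the\n", "standard library's `itertools.product` enumerates directly. (This\n", "stdlib-only snippet is an illustration added alongside the worksheet.)\n", "\n", "```python\n", "from itertools import product\n", "\n", "# Every run is one element of the Cartesian product of the level sets,\n", "# so the run count is len(levels) ** num_factors.\n", "runs = list(product((-1, +1), repeat=4))      # 4 factors at 2 levels\n", "print(len(runs))   # 16\n", "\n", "runs3 = list(product((-1, 0, +1), repeat=4))  # 4 factors at 3 levels\n", "print(len(runs3))  # 81\n", "```"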
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions 3 to 9 - greenhouse plant heights\n", "\n", "A greenhouse experiment varies two factors to maximize plant height:\n", "\n", "- **A = soil amount** [200 g or 400 g]\n", "- **B = water amount** [50 mL or 100 mL]\n", "\n", "The runs come out of the lab notebook *in experiment order*, not in\n", "the textbook's standard order:\n", "\n", "| Run | A (g) | B (mL) | Height (mm) |\n", "|----:|------:|-------:|------------:|\n", "| 1 | 200 | 100 | 61 |\n", "| 2 | 200 | 50 | 39 |\n", "| 3 | 400 | 100 | 82 |\n", "| 4 | 400 | 50 | 50 |\n", "\n", "The worksheet asks for a rewrite in standard order, the main effects\n", "of A and B, the prediction equation, and three predicted heights at\n", "new conditions." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Check yourself\n", "\n", " Before reading on, write the four rows in *standard order*\n", " (``A = -, +, -, +`` and ``B = -, -, +, +``) on paper and copy the\n", " four heights across. Standard order does not change the data, only\n", " the row order; it is the layout the model fitter expects and the\n", " easiest layout to spot patterns in."
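] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing the four rows out on paper is quickest, but the reordering can\n", "also be sketched in plain Python. (The `runs` list and the small `coded`\n", "helper below are illustrations added here; the `c(...)`/`gather(...)`\n", "cell that follows builds the same table directly in standard order.)\n", "\n", "```python\n", "# Notebook (experiment) order: (soil g, water mL, height mm).\n", "runs = [(200, 100, 61), (200, 50, 39), (400, 100, 82), (400, 50, 50)]\n", "\n", "def coded(value, lo, hi):\n", "    # Map a real-world setting onto the [-1, +1] coded scale.\n", "    center, half_range = (lo + hi) / 2, (hi - lo) / 2\n", "    return (value - center) / half_range\n", "\n", "# Standard order: A changes fastest, so sort by (coded B, coded A).\n", "standard = sorted(runs, key=lambda r: (coded(r[1], 50, 100), coded(r[0], 200, 400)))\n", "print([h for _, _, h in standard])  # [39, 50, 61, 82]\n", "```"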
] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "\n", "from process_improve.experiments import c, gather, lm, main_effects_plot, predict\n", "from process_improve.experiments.visualization import visualize_doe\n", "\n", "pio.renderers.default = \"notebook_connected\"\n", "\n", "# Build the design directly in standard order: A varies fastest, then B.\n", "\n", "A = c(-1, +1, -1, +1, lo=200, hi=400, name=\"A\", units=\"grams\")\n", "B = c(-1, -1, +1, +1, lo=50, hi=100, name=\"B\", units=\"mL\")\n", "height = c(39, 50, 61, 82, name=\"height\")\n", "plant = gather(A=A, B=B, height=height)\n", "plant" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Fit a main-effects model in coded units.\n", "\n", "model = lm(\"height ~ A + B\", plant)\n", "params = model.get_parameters(drop_intercept=False)\n", "print(params.to_string())\n", "\n", "effect_A = 2 * params[\"A\"]\n", "effect_B = 2 * params[\"B\"]\n", "print(f\"\\nMain effect of soil amount (A): {effect_A:+.1f} mm\")\n", "print(f\"Main effect of water amount (B): {effect_B:+.1f} mm\")" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Verify the prediction equation by predicting at three test points.\n", "\n", "points = [\n", " (\"300 g soil, 75 mL water\", dict(A=0, B=0)),\n", " (\"200 g soil, 75 mL water\", dict(A=-1, B=0)),\n", " (\"500 g soil, 125 mL water (extrapolation!)\", dict(A=2, B=2)),\n", "]\n", "for label, coded in points:\n", " y_hat = float(predict(model, **coded).iloc[0])\n", " print(f\"{label:50s} -> {y_hat:6.1f} mm\")" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "fig = main_effects_plot(model, factor_labels={\"A\": \"Soil [g]\", \"B\": \"Water [mL]\"})\n", "fig.update_layout(width=620, height=380, title=\"Main-effects plot - 
greenhouse\")\n", "fig" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " **Standard order and effects.** Sorted (``A``, ``B``) =\n", " ``(-,-)=39, (+,-)=50, (-,+)=61, (+,+)=82``.\n", "\n", " - Main effect of soil: ``(50 + 82)/2 - (39 + 61)/2 = 66 - 50 =\n", " +16 mm``.\n", " - Main effect of water: ``(61 + 82)/2 - (39 + 50)/2 = 71.5 - 44.5 =\n", " +27 mm``.\n", "\n", " **Coefficients.** The intercept is the grand mean,\n", " ``b_0 = (39 + 50 + 61 + 82)/4 = 58``. Each coefficient is half its\n", " main effect, so ``b_A = +8`` and ``b_B = +13.5``.\n", "\n", " **Prediction equation.** In coded units,\n", " ``y = 58 + 8 * x_A + 13.5 * x_B``.\n", "\n", " **Three predictions.**\n", "\n", " - 300 g soil and 75 mL water sit at the center of the design\n", " (``x_A = 0``, ``x_B = 0``); the model returns the intercept,\n", " ``y = 58 mm``.\n", " - 200 g soil and 75 mL water moves ``A`` to ``-1`` but keeps ``B`` at\n", " ``0``; predicted height ``y = 58 - 8 = 50 mm``.\n", " - 500 g soil and 125 mL water sits at ``x_A = +2``, ``x_B = +2`` -\n", " one whole step outside the design on both axes. The model\n", " extrapolates to ``y = 58 + 16 + 27 = 101 mm``; in real life\n", " plants do not grow proportionally forever." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions 10 to 12 - extrapolation, one-factor-at-a-time, and interactions\n", "\n", "The worksheet now steps back from the maths:\n", "\n", "10. Can you extrapolate outside the design region? How far? And what are\n", " contour lines useful for?\n", "11. A colleague wants to run their next study by changing **one factor\n", " at a time** against a baseline. Why is that a bad idea?\n", "12. Give an everyday example of an interaction where increasing factor B:\n", "\n", " - *amplifies* the effect of factor A;\n", " - *cancels* or *reverses* the effect of factor A." 
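] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before reading the solutions, the OFAT trap in question 11 can be\n", "demonstrated with a toy response surface. (The coefficients below are\n", "invented purely for illustration - they are not from any worksheet\n", "data.)\n", "\n", "```python\n", "# Hypothetical response in coded units, with a strong interaction.\n", "def y(a, b):\n", "    return 10 + 2 * a + 2 * b - 5 * a * b\n", "\n", "# OFAT from the baseline (-1, -1): vary A alone, then B alone.\n", "best_a = max((-1, +1), key=lambda a: y(a, -1))  # picks a = +1 (y = 15)\n", "best_b = max((-1, +1), key=lambda b: y(-1, b))  # picks b = +1 (y = 15)\n", "print(y(best_a, best_b))  # 9 - the OFAT winner is worse than either single move\n", "\n", "# The 2x2 factorial sees all four corners and finds the true best.\n", "corners = [(a, b) for a in (-1, +1) for b in (-1, +1)]\n", "print(max(y(a, b) for a, b in corners))  # 15\n", "```"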
] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " **Q10 - Extrapolation.** A linear model lets you compute a\n", " prediction at any ``(x_A, x_B)``; it does not promise that prediction\n", " is right. Inside the four corners you ran, the model interpolates\n", " between observed responses and is usually trustworthy. Just\n", " outside, predictions are still useful as a *direction*: they tell\n", " you which way to push for the next experiment. Far outside, expect\n", " the world to push back - biological limits, melting points, saturated\n", " reactions.\n", "\n", " Contour plots are the most useful summary view: they show you\n", " how to *trade off* one factor against another while holding the\n", " response constant. A flat contour is a robust operating window;\n", " crowded contours are where the process is sensitive.\n", "\n", " **Q11 - One factor at a time (OFAT).** Three failures of OFAT\n", " compared with a factorial:\n", "\n", " 1. It misses interactions completely. With OFAT you tweak ``A``\n", " while holding ``B`` at its baseline, then tweak ``B`` while\n", " holding ``A`` at its baseline; you never observe the\n", " ``A = +, B = +`` corner where the interaction lives.\n", " 2. It uses runs less efficiently. A 2x2 spends every run estimating\n", " *both* main effects; OFAT spends each run estimating just one.\n", " 3. The \"winning\" combination depends on which factor you chose to\n", " vary first. That conclusion is fragile, and it is hard to\n", " defend when a colleague asks \"what if we had started with B?\".\n", "\n", " **Q12 - Everyday interactions.**\n", "\n", " - *Amplification.* Coffee and sleep deprivation: the same dose of\n", " caffeine wakes you up much more after a short night than after a\n", " long one. 
A and B amplify each other.\n", " - *Cancellation.* Salt and soup volume: a teaspoon of salt is\n", " perfect in one mug of soup and unnoticeable in a stock-pot. The\n", " effect of \"adding salt\" depends on (and is cancelled by) the\n", " dilution from extra water." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 13 - a baked food with a negative interaction\n", "\n", "A product is baked for time *T* using a certain amount of fat *F*. The\n", "outcome is *crispiness*. Both factors have plausible main effects, but\n", "the data hide a negative interaction that turns up only when you fit\n", "the model.\n", "\n", "| Run | T (min) | F (g) | Crispiness |\n", "|----:|--------:|------:|-----------:|\n", "| 1 | 80 | 20 | 37 |\n", "| 2 | 100 | 20 | 57 |\n", "| 3 | 80 | 30 | 49 |\n", "| 4 | 100 | 30 | 53 |\n", "\n", "The worksheet asks for the square plot with contours, an interaction plot\n", "in *both* orientations, a verification of the prediction equation, and\n", "two extrapolated crispiness predictions."
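] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before fitting, it is worth recovering all four coefficients by hand\n", "from the corner values, Module-1 style; the `lm()` fit below should\n", "reproduce them exactly. (This stdlib-only check is an addition to the\n", "worksheet.)\n", "\n", "```python\n", "# Corners in standard order: (x_T, x_F, crispiness).\n", "corners = [(-1, -1, 37), (+1, -1, 57), (-1, +1, 49), (+1, +1, 53)]\n", "\n", "# Each coefficient is the average of response times sign pattern:\n", "# b0 uses all +1, b_T uses x_T, b_F uses x_F, b_TF uses x_T * x_F.\n", "b0 = sum(y for _, _, y in corners) / 4\n", "b_T = sum(t * y for t, _, y in corners) / 4\n", "b_F = sum(f * y for _, f, y in corners) / 4\n", "b_TF = sum(t * f * y for t, f, y in corners) / 4\n", "print(b0, b_T, b_F, b_TF)  # 49.0 6.0 2.0 -4.0\n", "```"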
] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "T = c(-1, +1, -1, +1, lo=80, hi=100, name=\"T\", units=\"min\")\n", "F = c(-1, -1, +1, +1, lo=20, hi=30, name=\"F\", units=\"g\")\n", "crisp = c(37, 57, 49, 53, name=\"crisp\")\n", "bake = gather(T=T, F=F, crisp=crisp)\n", "bake" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "square = visualize_doe(\n", " plot_type=\"square_plot\",\n", " design_data=bake.to_dict(orient=\"records\"),\n", " response_column=\"crisp\",\n", " factors_to_plot=[\"T\", \"F\"],\n", " factor_labels={\"T\": \"Time [min]\", \"F\": \"Fat [g]\"},\n", " backend=\"plotly\",\n", ")\n", "fig = go.Figure(square[\"plotly\"])\n", "fig.update_layout(width=520, height=440, title=\"Baked-food 2x2: crispiness\")\n", "fig" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "model = lm(\"crisp ~ T + F + T:F\", bake)\n", "params = model.get_parameters(drop_intercept=False)\n", "print(params.to_string())\n", "print()\n", "print(\"Verification - the equation is:\")\n", "print(f\" crisp = {params['Intercept']:.0f} \"\n", " f\"+ {params['T']:+.0f} * x_T \"\n", " f\"+ {params['F']:+.0f} * x_F \"\n", " f\"+ {params['T:F']:+.0f} * x_T * x_F\")" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Interaction plot: lines with very different slopes signal interaction.\n", "# visualize_doe(plot_type=\"interaction\") draws both orientations from\n", "# the same call (Plotly subplots) so you can read it from either factor's\n", "# point of view.\n", "\n", "interaction = visualize_doe(\n", " plot_type=\"interaction\",\n", " design_data=bake.to_dict(orient=\"records\"),\n", " response_column=\"crisp\",\n", " factors_to_plot=[\"T\", \"F\"],\n", " factor_labels={\"T\": \"Time [min]\", \"F\": \"Fat [g]\"},\n", " backend=\"plotly\",\n", ")\n", "fig = go.Figure(interaction[\"plotly\"])\n", 
"fig.update_layout(width=720, height=380, title=\"Interaction T x F (both orientations)\")\n", "fig" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Predictions at the two extrapolation points the worksheet asks for.\n", "# In coded units: T-center = 90 min, half-range = 10 min;\n", "# F-center = 25 g, half-range = 5 g.\n", "\n", "points = [\n", " (\"110 min, 20 g fat\", dict(T=2, F=-1)),\n", " (\"110 min, 15 g fat\", dict(T=2, F=-2)),\n", "]\n", "for label, coded in points:\n", " y_hat = float(predict(model, **coded).iloc[0])\n", " print(f\"{label:25s} -> crispiness {y_hat:.1f}\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Guidance\n", "\n", " An *interaction plot* in two orientations is the most reliable way\n", " to read a negative interaction. Plot mean crispiness against ``T``\n", " for the two levels of ``F``: at low fat the line is steep and rises\n", " fast; at high fat the line is much flatter. Same plot with the\n", " axes swapped: at short time the line rises with fat; at long time\n", " it actually slopes downward (57 down to 53). Two lines with very\n", " different slopes - or crossing - mean an interaction is present." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " **Prediction equation.** ``crisp = 49 + 6 * x_T + 2 * x_F - 4 * x_T * x_F``,\n", " which is exactly what the code recovers. The negative ``T:F``\n", " coefficient is the interaction: fat helps at short bake times,\n", " but penalizes crispiness at long bake times.\n", "\n", " **Advice to the manager.** The factor with the biggest pull is\n", " time; the interaction means more fat *hurts* once you push time to\n", " the high level. 
The maximum-crispiness corner is\n", " ``T = +1, F = -1`` (100 min, 20 g) at 57 - confirmed by the data.\n", "\n", " **Physical sense.** At long bake times, extra fat seals moisture\n", " in and softens the crust; at short bake times the crust has not\n", " set yet, so additional fat makes the result richer and crisper.\n", " The fitted interaction term is this mechanism expressed in numbers.\n", "\n", " **Two predictions.**\n", "\n", " - 110 min, 20 g fat: ``x_T = +2, x_F = -1`` gives\n", " ``49 + 12 - 2 + 8 = 67``.\n", " - 110 min, 15 g fat: ``x_T = +2, x_F = -2`` gives\n", " ``49 + 12 - 4 + 16 = 73``.\n", "\n", " Both are well above any of the four corners, but both are\n", " extrapolations - the model assumes the linear trend keeps going,\n", " which a real oven will not." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 14 - the code from the source worksheet\n", "\n", "The week-2 worksheet links to an R snippet at\n", "[yint.org/2w1](https://yint.org/2w1) that fits the baked-food\n", "interaction model. The same job in `process_improve` looks like this:\n", "\n", "```python\n", "from process_improve.experiments import c, gather, lm\n", "\n", "T = c(-1, +1, -1, +1)\n", "F = c(-1, -1, +1, +1)\n", "y = c(37, 57, 49, 53)\n", "bake = gather(T=T, F=F, y=y)\n", "model = lm(\"y ~ T + F + T:F\", bake)\n", "print(model.get_parameters(drop_intercept=False))\n", "```\n", "\n", "No R required. The library exposes `lm()`\n", "with the same formula syntax (patsy under the hood) and a\n", "`Model.get_parameters()`\n", "helper that returns the coefficients as a pandas Series." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrap-up\n", "\n", "You have now moved from \"four numbers and a manual calculation\" to a\n", "**coded linear model** that predicts anywhere in the design space, and\n", "beyond it with appropriate caution. 
Two notable habits:\n", "\n", "- Coding factors to ``[-1, +1]`` makes coefficients directly comparable\n", " and turns the prediction equation into something you can read off in\n", " your head.\n", "- An interaction term costs nothing extra to fit in a 2x2 and tells\n", " you when \"more is better\" stops being true.\n", "\n", "**Next:** Module 3 scales the same toolkit to **three factors** (the\n", "cube plot is back) and replicates the worksheet from week 3, where a\n", "fifth experiment at the center is added to gain a degree of freedom\n", "and start estimating curvature." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }