{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Coding, scaling, and the linear model\n", "\n", "**Source worksheet:** [yint.org/w2](https://yint.org/w2) - week 2 of the applied DoE short course.\n", "\n", "Module 1 read effects straight off corner averages of a 2x2. That was\n", "appropriate for two factors and four runs, but the same approach\n", "becomes cumbersome\n", "once you scale beyond it. Module 2 introduces the *coded linear model*\n", "- the workhorse that scales to any number of factors, copes with\n", "interactions, and gives you predictions outside the corners you ran.\n", "\n", "The math is unchanged from Module 1; what we add is **coded units**\n", "(-1 and +1 instead of grams and minutes), a **linear regression** that\n", "recovers the same effects you computed by hand, and a sober look at\n", "**extrapolation** - what the model promises, and what it cannot promise.\n", "\n", "Everything below uses Python with `process_improve`; the original\n", "worksheet's R snippet at ``yint.org/2w1`` is translated to its Python\n", "equivalent." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. tip::\n", "\n", " The first habit to build: in design space, every factor lives on the\n", " same scale, with ``-1`` at its low extreme and ``+1`` at its high\n", " extreme. A 30 mL setting becomes ``+1``; 200 grams becomes ``-1``.\n", " This is *coding*. It makes coefficients directly comparable and\n", " makes predictions outside the corner runs straightforward to write\n", " down. In ``process_improve`` you state the real-world range with\n", " ``c(..., lo=..., hi=...)`` and the package handles the coding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions 1 and 2 - how many runs does a full factorial need?\n", "\n", "A full factorial runs every combination of every factor level. For *k*\n", "factors at *L* levels each, that is **L**^**k** experiments.\n", "\n", "1. 
Four factors, each at two levels: how many runs?\n", "2. Four factors, each at three levels: how many runs?" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "def full_factorial_size(num_factors: int, levels: int) -> int:\n", " return levels ** num_factors\n", "\n", "\n", "two_level = full_factorial_size(num_factors=4, levels=2)\n", "three_level = full_factorial_size(num_factors=4, levels=3)\n", "print(f\"Full factorial, 4 factors x 2 levels: {two_level} runs\")\n", "print(f\"Full factorial, 4 factors x 3 levels: {three_level} runs\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " 1. ``2**4 = 16`` runs.\n", " 2. ``3**4 = 81`` runs.\n", "\n", " The rapid growth in run count motivates *fractional* factorials\n", " (Module 5):\n", " even modest screens become unaffordable at three or four levels.\n", "\n", ".. admonition:: Guidance\n", "\n", " Two levels per factor is the default starting point because four\n", " factors fit in 16 runs, eight factors fit in 256 - but two levels\n", " already test the linear effect of every factor. A third level\n", " (often a center point at ``0``) is added when you suspect *curvature*;\n", " we revisit this in Module 4." 
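] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sketch of why run counts explode, note that a full factorial\n", "is just the Cartesian product of every factor's level set, which the\n", "standard library's `itertools.product` enumerates directly. (This\n", "stdlib-only snippet is an illustration added alongside the worksheet.)\n", "\n", "```python\n", "from itertools import product\n", "\n", "# Every run is one element of the Cartesian product of the level sets,\n", "# so the run count is len(levels) ** num_factors.\n", "runs = list(product((-1, +1), repeat=4))      # 4 factors at 2 levels\n", "print(len(runs))   # 16\n", "\n", "runs3 = list(product((-1, 0, +1), repeat=4))  # 4 factors at 3 levels\n", "print(len(runs3))  # 81\n", "```"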
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions 3 to 9 - greenhouse plant heights\n", "\n", "A greenhouse experiment varies two factors to maximize plant height:\n", "\n", "- **A = soil amount** [200 g or 400 g]\n", "- **B = water amount** [50 mL or 100 mL]\n", "\n", "The runs come out of the lab notebook *in experiment order*, not in\n", "the textbook's standard order:\n", "\n", "| Run | A (g) | B (mL) | Height (mm) |\n", "|----:|------:|-------:|------------:|\n", "| 1 | 200 | 100 | 61 |\n", "| 2 | 200 | 50 | 39 |\n", "| 3 | 400 | 100 | 82 |\n", "| 4 | 400 | 50 | 50 |\n", "\n", "The worksheet asks for a rewrite in standard order, the main effects\n", "of A and B, the prediction equation, and three predicted heights at\n", "new conditions." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Check yourself\n", "\n", " Before reading on, write the four rows in *standard order*\n", " (``A = -, +, -, +`` and ``B = -, -, +, +``) on paper and copy the\n", " four heights across. Standard order does not change the data, only\n", " the row order; it is the layout the model fitter expects and the\n", " easiest layout to spot patterns in."
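] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing the four rows out on paper is quickest, but the reordering can\n", "also be sketched in plain Python. (The `runs` list and the small `coded`\n", "helper below are illustrations added here; the `c(...)`/`gather(...)`\n", "cell that follows builds the same table directly in standard order.)\n", "\n", "```python\n", "# Notebook (experiment) order: (soil g, water mL, height mm).\n", "runs = [(200, 100, 61), (200, 50, 39), (400, 100, 82), (400, 50, 50)]\n", "\n", "def coded(value, lo, hi):\n", "    # Map a real-world setting onto the [-1, +1] coded scale.\n", "    center, half_range = (lo + hi) / 2, (hi - lo) / 2\n", "    return (value - center) / half_range\n", "\n", "# Standard order: A changes fastest, so sort by (coded B, coded A).\n", "standard = sorted(runs, key=lambda r: (coded(r[1], 50, 100), coded(r[0], 200, 400)))\n", "print([h for _, _, h in standard])  # [39, 50, 61, 82]\n", "```"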
] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "\n", "from process_improve.experiments import c, gather, lm, main_effects_plot, predict\n", "from process_improve.experiments.visualization import visualize_doe\n", "\n", "pio.renderers.default = \"notebook_connected\"\n", "\n", "# Build the design directly in standard order: A varies fastest, then B.\n", "\n", "A = c(-1, +1, -1, +1, lo=200, hi=400, name=\"A\", units=\"grams\")\n", "B = c(-1, -1, +1, +1, lo=50, hi=100, name=\"B\", units=\"mL\")\n", "height = c(39, 50, 61, 82, name=\"height\")\n", "plant = gather(A=A, B=B, height=height)\n", "plant" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Fit a main-effects model in coded units.\n", "\n", "model = lm(\"height ~ A + B\", plant)\n", "params = model.get_parameters(drop_intercept=False)\n", "print(params.to_string())\n", "\n", "effect_A = 2 * params[\"A\"]\n", "effect_B = 2 * params[\"B\"]\n", "print(f\"\\nMain effect of soil amount (A): {effect_A:+.1f} mm\")\n", "print(f\"Main effect of water amount (B): {effect_B:+.1f} mm\")" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Verify the prediction equation by predicting at three test points.\n", "\n", "points = [\n", " (\"300 g soil, 75 mL water\", dict(A=0, B=0)),\n", " (\"200 g soil, 75 mL water\", dict(A=-1, B=0)),\n", " (\"500 g soil, 125 mL water (extrapolation!)\", dict(A=2, B=2)),\n", "]\n", "for label, coded in points:\n", " y_hat = float(predict(model, **coded).iloc[0])\n", " print(f\"{label:50s} -> {y_hat:6.1f} mm\")" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "fig = main_effects_plot(model, factor_labels={\"A\": \"Soil [g]\", \"B\": \"Water [mL]\"})\n", "fig.update_layout(width=620, height=380, title=\"Main-effects plot - 
greenhouse\")\n", "fig" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " **Standard order and effects.** Sorted (``A``, ``B``) =\n", " ``(-,-)=39, (+,-)=50, (-,+)=61, (+,+)=82``.\n", "\n", " - Main effect of soil: ``(50 + 82)/2 - (39 + 61)/2 = 66 - 50 =\n", " +16 mm``.\n", " - Main effect of water: ``(61 + 82)/2 - (39 + 50)/2 = 71.5 - 44.5 =\n", " +27 mm``.\n", "\n", " **Coefficients.** The intercept is the grand mean,\n", " ``b_0 = (39 + 50 + 61 + 82)/4 = 58``. Each coefficient is half its\n", " main effect, so ``b_A = +8`` and ``b_B = +13.5``.\n", "\n", " **Prediction equation.** In coded units,\n", " ``y = 58 + 8 * x_A + 13.5 * x_B``.\n", "\n", " **Three predictions.**\n", "\n", " - 300 g soil and 75 mL water sit at the center of the design\n", " (``x_A = 0``, ``x_B = 0``); the model returns the intercept,\n", " ``y = 58 mm``.\n", " - 200 g soil and 75 mL water moves ``A`` to ``-1`` but keeps ``B`` at\n", " ``0``; predicted height ``y = 58 - 8 = 50 mm``.\n", " - 500 g soil and 125 mL water sits at ``x_A = +2``, ``x_B = +2`` -\n", " one whole step outside the design on both axes. The model\n", " extrapolates to ``y = 58 + 16 + 27 = 101 mm``; in real life\n", " plants do not grow proportionally forever." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions 10 to 12 - extrapolation, one-factor-at-a-time, and interactions\n", "\n", "The worksheet now steps back from the maths:\n", "\n", "10. Can you extrapolate outside the design region? How far? And what are\n", " contour lines useful for?\n", "11. A colleague wants to run their next study by changing **one factor\n", " at a time** against a baseline. Why is that a bad idea?\n", "12. Give an everyday example of an interaction where increasing factor B:\n", "\n", " - *amplifies* the effect of factor A;\n", " - *cancels* or *reverses* the effect of factor A." 
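] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before reading the solutions, the OFAT trap in question 11 can be\n", "demonstrated with a toy response surface. (The coefficients below are\n", "invented purely for illustration - they are not from any worksheet\n", "data.)\n", "\n", "```python\n", "# Hypothetical response in coded units, with a strong interaction.\n", "def y(a, b):\n", "    return 10 + 2 * a + 2 * b - 5 * a * b\n", "\n", "# OFAT from the baseline (-1, -1): vary A alone, then B alone.\n", "best_a = max((-1, +1), key=lambda a: y(a, -1))  # picks a = +1 (y = 15)\n", "best_b = max((-1, +1), key=lambda b: y(-1, b))  # picks b = +1 (y = 15)\n", "print(y(best_a, best_b))  # 9 - the OFAT winner is worse than either single move\n", "\n", "# The 2x2 factorial sees all four corners and finds the true best.\n", "corners = [(a, b) for a in (-1, +1) for b in (-1, +1)]\n", "print(max(y(a, b) for a, b in corners))  # 15\n", "```"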
] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " **Q10 - Extrapolation.** A linear model lets you compute a\n", " prediction at any ``(x_A, x_B)``; it does not promise that prediction\n", " is right. Inside the four corners you ran, the model interpolates\n", " between observed responses and is usually trustworthy. Just\n", " outside, predictions are still useful as a *direction*: they tell\n", " you which way to push for the next experiment. Far outside, expect\n", " the world to push back - biological limits, melting points, saturated\n", " reactions.\n", "\n", " Contour plots are the most useful summary view: they show you\n", " how to *trade off* one factor against another while holding the\n", " response constant. A flat contour is a robust operating window;\n", " crowded contours are where the process is sensitive.\n", "\n", " **Q11 - One factor at a time (OFAT).** Three failures of OFAT\n", " compared with a factorial:\n", "\n", " 1. It misses interactions completely. With OFAT you tweak ``A``\n", " while holding ``B`` at its baseline, then tweak ``B`` while\n", " holding ``A`` at its baseline; you never observe the\n", " ``A = +, B = +`` corner where the interaction lives.\n", " 2. It uses runs less efficiently. A 2x2 spends every run estimating\n", " *both* main effects; OFAT spends each run estimating just one.\n", " 3. The \"winning\" combination depends on which factor you chose to\n", " vary first. That conclusion is fragile, and it is hard to\n", " defend when a colleague asks \"what if we had started with B?\".\n", "\n", " **Q12 - Everyday interactions.**\n", "\n", " - *Amplification.* Coffee and sleep deprivation: the same dose of\n", " caffeine wakes you up much more after a short night than after a\n", " long one. 
A and B amplify each other.\n", " - *Cancellation.* Salt and soup volume: a teaspoon of salt is\n", " perfect in one mug of soup and unnoticeable in a stock-pot. The\n", " effect of \"adding salt\" depends on (and is cancelled by) the\n", " dilution from extra water." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 13 - a baked food with a negative interaction\n", "\n", "A product is baked for time *T* using a certain amount of fat *F*. The\n", "outcome is *crispiness*. Both factors have plausible main effects, but\n", "the data hide a negative interaction that turns up only when you fit\n", "the model.\n", "\n", "| Run | T (min) | F (g) | Crispiness |\n", "|----:|--------:|------:|-----------:|\n", "| 1 | 80 | 20 | 37 |\n", "| 2 | 100 | 20 | 57 |\n", "| 3 | 80 | 30 | 49 |\n", "| 4 | 100 | 30 | 53 |\n", "\n", "The worksheet asks for the square plot with contours, an interaction plot\n", "in *both* orientations, a verification of the prediction equation, and\n", "two extrapolated crispiness predictions."
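] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before fitting, it is worth recovering all four coefficients by hand\n", "from the corner values, Module-1 style; the `lm()` fit below should\n", "reproduce them exactly. (This stdlib-only check is an addition to the\n", "worksheet.)\n", "\n", "```python\n", "# Corners in standard order: (x_T, x_F, crispiness).\n", "corners = [(-1, -1, 37), (+1, -1, 57), (-1, +1, 49), (+1, +1, 53)]\n", "\n", "# Each coefficient is the average of response times sign pattern:\n", "# b0 uses all +1, b_T uses x_T, b_F uses x_F, b_TF uses x_T * x_F.\n", "b0 = sum(y for _, _, y in corners) / 4\n", "b_T = sum(t * y for t, _, y in corners) / 4\n", "b_F = sum(f * y for _, f, y in corners) / 4\n", "b_TF = sum(t * f * y for t, f, y in corners) / 4\n", "print(b0, b_T, b_F, b_TF)  # 49.0 6.0 2.0 -4.0\n", "```"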
] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "T = c(-1, +1, -1, +1, lo=80, hi=100, name=\"T\", units=\"min\")\n", "F = c(-1, -1, +1, +1, lo=20, hi=30, name=\"F\", units=\"g\")\n", "crisp = c(37, 57, 49, 53, name=\"crisp\")\n", "bake = gather(T=T, F=F, crisp=crisp)\n", "bake" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "square = visualize_doe(\n", " plot_type=\"square_plot\",\n", " design_data=bake.to_dict(orient=\"records\"),\n", " response_column=\"crisp\",\n", " factors_to_plot=[\"T\", \"F\"],\n", " factor_labels={\"T\": \"Time [min]\", \"F\": \"Fat [g]\"},\n", " backend=\"plotly\",\n", ")\n", "fig = go.Figure(square[\"plotly\"])\n", "fig.update_layout(width=520, height=440, title=\"Baked-food 2x2: crispiness\")\n", "fig" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "model = lm(\"crisp ~ T + F + T:F\", bake)\n", "params = model.get_parameters(drop_intercept=False)\n", "print(params.to_string())\n", "print()\n", "print(\"Verification - the equation is:\")\n", "print(f\" crisp = {params['Intercept']:.0f} \"\n", " f\"+ {params['T']:+.0f} * x_T \"\n", " f\"+ {params['F']:+.0f} * x_F \"\n", " f\"+ {params['T:F']:+.0f} * x_T * x_F\")" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Interaction plot: lines with very different slopes signal interaction.\n", "# visualize_doe(plot_type=\"interaction\") draws both orientations from\n", "# the same call (Plotly subplots) so you can read it from either factor's\n", "# point of view.\n", "\n", "interaction = visualize_doe(\n", " plot_type=\"interaction\",\n", " design_data=bake.to_dict(orient=\"records\"),\n", " response_column=\"crisp\",\n", " factors_to_plot=[\"T\", \"F\"],\n", " factor_labels={\"T\": \"Time [min]\", \"F\": \"Fat [g]\"},\n", " backend=\"plotly\",\n", ")\n", "fig = go.Figure(interaction[\"plotly\"])\n", 
"fig.update_layout(width=720, height=380, title=\"Interaction T x F (both orientations)\")\n", "fig" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Predictions at the two extrapolation points the worksheet asks for.\n", "# In coded units: T-center = 90 min, half-range = 10 min;\n", "# F-center = 25 g, half-range = 5 g.\n", "\n", "points = [\n", " (\"110 min, 20 g fat\", dict(T=2, F=-1)),\n", " (\"110 min, 15 g fat\", dict(T=2, F=-2)),\n", "]\n", "for label, coded in points:\n", " y_hat = float(predict(model, **coded).iloc[0])\n", " print(f\"{label:25s} -> crispiness {y_hat:.1f}\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Guidance\n", "\n", " An *interaction plot* in two orientations is the most reliable way\n", " to read a negative interaction. Plot mean crispiness against ``T``\n", " for the two levels of ``F``: at low fat the line is steep and rises\n", " fast; at high fat the line is much flatter. Same plot with the\n", " axes swapped: at short time the line rises with fat; at long time\n", " it actually slopes downward (57 down to 53). Two lines with very\n", " different slopes - or crossing - mean an interaction is present." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " **Prediction equation.** ``crisp = 49 + 6 * x_T + 2 * x_F - 4 * x_T * x_F``,\n", " which is exactly what the code recovers. The negative ``T:F``\n", " coefficient is the interaction: fat helps at short bake times,\n", " but penalizes crispiness at long bake times.\n", "\n", " **Advice to the manager.** The factor with the biggest pull is\n", " time; the interaction means more fat *hurts* once you push time to\n", " the high level. 
The maximum-crispiness corner is\n", " ``T = +1, F = -1`` (100 min, 20 g) at 57 - confirmed by the data.\n", "\n", " **Physical sense.** At long bake times, extra fat seals moisture\n", " in and softens the crust; at short bake times the crust has not\n", " set yet, so additional fat makes the result richer and crisper.\n", " The fitted interaction term is this mechanism expressed in numbers.\n", "\n", " **Two predictions.**\n", "\n", " - 110 min, 20 g fat: ``x_T = +2, x_F = -1`` gives\n", " ``49 + 12 - 2 + 8 = 67``.\n", " - 110 min, 15 g fat: ``x_T = +2, x_F = -2`` gives\n", " ``49 + 12 - 4 + 16 = 73``.\n", "\n", " Both are well above any of the four corners, but both are\n", " extrapolations - the model assumes the linear trend keeps going,\n", " which a real oven will not." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 14 - the code from the source worksheet\n", "\n", "The week-2 worksheet links to an R snippet at\n", "[yint.org/2w1](https://yint.org/2w1) that fits the baked-food\n", "interaction model. The same job in `process_improve` looks like this:\n", "\n", "```python\n", "from process_improve.experiments import c, gather, lm\n", "\n", "T = c(-1, +1, -1, +1)\n", "F = c(-1, -1, +1, +1)\n", "y = c(37, 57, 49, 53)\n", "bake = gather(T=T, F=F, y=y)\n", "model = lm(\"y ~ T + F + T:F\", bake)\n", "print(model.get_parameters(drop_intercept=False))\n", "```\n", "\n", "No R required. The library exposes `lm()`\n", "with the same formula syntax (patsy under the hood) and a\n", "`Model.get_parameters()`\n", "helper that returns the coefficients as a pandas Series." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrap-up\n", "\n", "You have now moved from \"four numbers and a manual calculation\" to a\n", "**coded linear model** that predicts anywhere in the design space, and\n", "beyond it with appropriate caution. 
Two notable habits:\n", "\n", "- Coding factors to ``[-1, +1]`` makes coefficients directly comparable\n", " and turns the prediction equation into something you can read off in\n", " your head.\n", "- An interaction term costs nothing extra to fit in a 2x2 and tells\n", " you when \"more is better\" stops being true.\n", "\n", "**Next:** Module 3 scales the same toolkit to **three factors** (the\n", "cube plot is back) and replicates the worksheet from week 3, where a\n", "fifth experiment at the center is added to gain a degree of freedom\n", "and start estimating curvature." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }