{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A bigger factorial: four factors and model reduction\n", "\n", "**Source worksheet:** [yint.org/w5](https://yint.org/w5) - week 5 of the applied DoE short course.\n", "\n", "Module 3 finished with a 2^3 cube. This module steps up to four\n", "factors and 16 runs. The mathematics is unchanged - it is still\n", "``y = b_0 + sum(b_i * x_i) + sum(b_ij * x_i * x_j) + ...`` - but two\n", "new habits matter:\n", "\n", "1. **Count the terms before you fit.** With ``k`` factors you have\n", " ``2^k`` runs and ``2^k`` terms in the full polynomial: 1 intercept,\n", " ``k`` main effects, ``C(k,2)`` two-factor interactions, and so on.\n", " This sets the ``R^2`` budget: a full model on a saturated design\n", " *must* hit ``R^2 = 1`` exactly because there is one coefficient\n", " per observation.\n", "2. **Reduce the model.** Real responses depend on a handful of\n", " factors and interactions; the rest of the polynomial estimates\n", " noise. Reading the *Pareto* of effect magnitudes and dropping\n", " the small ones recovers degrees of freedom and produces a model\n", " you can trust." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. tip::\n", "\n", " The general approach in this module is to **fit the full model\n", " first, then reduce it**. A 16-run full factorial fits 16\n", " coefficients exactly, so the saturated model leaves no residual\n", " degrees of freedom for testing significance. The Pareto plot is\n", " used to identify the smallest terms, which are dropped to recover\n", " those degrees of freedom." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q1-Q2 - counting interactions\n", "\n", "A 2^4 design has:\n", "\n", "- ``4`` main effects,\n", "- ``C(4, 2) = 6`` two-factor interactions,\n", "- ``C(4, 3) = 4`` three-factor interactions,\n", "- ``C(4, 4) = 1`` four-factor interaction.\n", "\n", "Plus one intercept, that is 16 coefficients in the full model. With\n", "16 runs and 16 coefficients there are zero residual degrees of freedom;\n", "``R^2`` is exactly 1 and the residual standard error is exactly 0.\n", "This is not a sign of a great model - it is a sign that the model has\n", "no room left to disagree with the data." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "from math import comb\n", "\n", "k = 4\n", "print(f\"Main effects : {k}\")\n", "print(f\"2-factor interactions: {comb(k, 2)}\")\n", "print(f\"3-factor interactions: {comb(k, 3)}\")\n", "print(f\"4-factor interactions: {comb(k, 4)}\")\n", "print(f\"Total terms (incl intercept): {sum(comb(k, j) for j in range(k + 1))}\")\n", "print(f\"Runs available : {2 ** k}\")\n", "print(f\"Residual DoF : {2 ** k - sum(comb(k, j) for j in range(k + 1))}\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " ``R^2 = 1.0`` exactly; residual standard error is exactly ``0.0``.\n", " You cannot estimate uncertainty from a saturated design. Either\n", " replicate runs (Module 3) or prune terms (this module)." 
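The "``R^2 = 1`` by construction" claim can be checked without any DoE library at all. Below is a minimal sketch in plain numpy (not the course's `process_improve` package; all names are illustrative): the full 2^4 model matrix is square and orthogonal, so a saturated fit reproduces *any* 16-run response exactly.

```python
import itertools

import numpy as np

# Build the full 2^4 model matrix: one column per subset of the 4 factors
# (the empty subset is the intercept column of ones).
levels = np.array(list(itertools.product([-1, 1], repeat=4)))  # 16 runs x 4 factors
columns = [
    np.prod(levels[:, list(subset)], axis=1)
    for r in range(5)
    for subset in itertools.combinations(range(4), r)
]  # 1 intercept + 4 mains + 6 two-way + 4 three-way + 1 four-way = 16 columns
X = np.column_stack(columns)

y_any = np.arange(16.0)           # an arbitrary stand-in response
beta = np.linalg.solve(X, y_any)  # 16 equations, 16 unknowns: unique solution
print(np.allclose(X @ beta, y_any))  # True -> zero residuals, R^2 = 1 exactly
```

Because `X` is invertible, this holds for any response vector whatsoever — which is exactly why a perfect fit here carries no information about noise.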
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q3 - the bioreactor system\n", "\n", "Four factors are varied in 16 random-order runs:\n", "\n", "- **A = feed rate** [5 or 8 g/min]\n", "- **B = initial inoculant** [300 or 400 g]\n", "- **C = feed substrate concentration** [40 or 60 g/L]\n", "- **D = dissolved oxygen set-point** [4 or 5 mg/L]\n", "\n", "In standard order (A varies fastest, then B, then C, then D), the\n", "yields (in g/L) are:\n", "\n", "```\n", "y = [60, 59, 63, 61, 69, 61, 94, 93,\n", " 56, 63, 70, 65, 44, 45, 78, 77]\n", "```" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "\n", "from process_improve.experiments import c, gather, lm, predict\n", "from process_improve.experiments.visualization import visualize_doe\n", "\n", "pio.renderers.default = \"notebook_connected\"\n", "\n", "# Standard order: A flips every row, B every 2, C every 4, D every 8.\n", "\n", "A = c(-1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, name=\"A\", lo=5, hi=8)\n", "B = c(-1, -1, +1, +1, -1, -1, +1, +1, -1, -1, +1, +1, -1, -1, +1, +1, name=\"B\", lo=300, hi=400)\n", "C = c(-1, -1, -1, -1, +1, +1, +1, +1, -1, -1, -1, -1, +1, +1, +1, +1, name=\"C\", lo=40, hi=60)\n", "D = c(-1, -1, -1, -1, -1, -1, -1, -1, +1, +1, +1, +1, +1, +1, +1, +1, name=\"D\", lo=4, hi=5)\n", "y = c(60, 59, 63, 61, 69, 61, 94, 93,\n", " 56, 63, 70, 65, 44, 45, 78, 77, name=\"y\")\n", "bio = gather(A=A, B=B, C=C, D=D, y=y)\n", "print(f\"Design has {len(bio)} runs and {bio.shape[1] - 1} factors\")" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Fit the full 4-way model. 
Every interaction term is included; the\n", "# design is saturated so R^2 = 1 by construction.\n", "\n", "full = lm(\"y ~ A * B * C * D\", bio)\n", "params = full.get_parameters(drop_intercept=False)\n", "\n", "# Sort by absolute magnitude to spot the dominant terms.\n", "\n", "ordered = params.reindex(params.abs().sort_values(ascending=False).index)\n", "print(ordered.to_string())" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# Pareto-style bar chart of the effects (= 2 * coefficient).\n", "\n", "effects = {\n", " term: 2 * coef\n", " for term, coef in full.get_parameters(drop_intercept=True).items()\n", "}\n", "pareto = visualize_doe(\n", " plot_type=\"pareto\",\n", " analysis_results={\"effects\": effects},\n", " backend=\"plotly\",\n", ")\n", "fig = go.Figure(pareto[\"plotly\"])\n", "fig.update_layout(width=720, height=460, title=\"Pareto plot of effects on bioreactor yield\")\n", "fig" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " In magnitude order the **regression coefficients** are roughly:\n", "\n", " ::\n", "\n", " B (+9.0) B:C (+6.4) C:D (-5.25) C (+4.0) D (-3.9)\n", " B:D (+1.25) A:B:D (-1.25) A:B:C (+1.1) ... all the rest <= 1\n", "\n", " These numbers are **half** the bar heights in the Pareto plot\n", " above: the Pareto plot draws *effects*, and an effect is defined\n", " as ``effect = 2 * coefficient`` (the response change when a factor\n", " moves from its ``-1`` level all the way to its ``+1`` level, a\n", " span of two coded units). Either view tells the same story; the\n", " factor of two is just a convention.\n", "\n", " The system is dominated by **B**, **C**, and **D**; the two-factor\n", " interactions **B:C** and **C:D** are real and roughly the same size\n", " as the main effects. 
**A** (feed rate) and every interaction\n", "    involving A are tiny - feed rate is essentially insensitive in this\n", "    range." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q6 - reducing the model\n", "\n", "Dropping A buys back eight degrees of freedom: the main effect of A\n", "plus the seven interactions that involve A (three two-factor, three\n", "three-factor, and the four-factor term). The *orthogonal* design\n", "guarantees the coefficients on the remaining\n", "terms are unchanged - only the standard errors and the residual\n", "degrees of freedom move." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "reduced = lm(\"y ~ B * C * D\", bio)\n", "print(\"Reduced model (A and all its interactions dropped):\")\n", "print(reduced.get_parameters(drop_intercept=False).to_string())\n", "print()\n", "print(f\"Reduced R^2: {reduced._OLS.rsquared:.4f}\")\n", "print(f\"Reduced residual std error: {(reduced._OLS.scale ** 0.5):.2f} g/L\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", "    The coefficients for B, C, D, B:C, B:D, C:D, B:C:D are *exactly*\n", "    the same as in the full model. Only the standard errors change.\n", "    With 8 residual degrees of freedom you can now talk about\n", "    confidence intervals; the residual std error of about 3.0 g/L is\n", "    the implicit noise estimate.\n", "\n", ".. admonition:: Guidance\n", "\n", "    Dropping a factor does not mean \"feed rate does not matter.\" It\n", "    means \"feed rate does not matter *over the range tested* (5 to\n", "    8 g/min)\". Stretch the range and the conclusion may change.\n", "    Set A to whichever level is cheapest, safest, or most operationally\n", "    convenient."
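The "coefficients are exactly the same" claim is worth seeing once with bare linear algebra. A sketch in plain numpy (independent of `process_improve`; everything is recomputed from the run data listed above, using fresh array names so nothing shadows the notebook's `c(...)` columns): the B:C coefficient from the full model's orthogonal contrast equals the one a least-squares fit of `y ~ B*C*D` returns, and the leftover scatter gives the residual standard error.

```python
import numpy as np

# The 16 yields in standard order, plus the coded B, C, D columns
# (B flips every 2 runs, C every 4, D every 8).
y_obs = np.array([60, 59, 63, 61, 69, 61, 94, 93,
                  56, 63, 70, 65, 44, 45, 78, 77], dtype=float)
Bc = np.tile([-1, -1, 1, 1], 4)
Cc = np.tile([-1] * 4 + [1] * 4, 2)
Dc = np.repeat([-1, 1], 8)

# Full model: every coefficient is an orthogonal contrast, (x . y) / 16.
b_BC_full = (Bc * Cc) @ y_obs / 16

# Reduced model y ~ B*C*D, fitted by ordinary least squares.
Xr = np.column_stack([np.ones(16), Bc, Cc, Dc,
                      Bc * Cc, Bc * Dc, Cc * Dc, Bc * Cc * Dc])
beta_r, *_ = np.linalg.lstsq(Xr, y_obs, rcond=None)
b_BC_reduced = beta_r[4]
print(b_BC_full, b_BC_reduced)  # identical: orthogonality at work

resid = y_obs - Xr @ beta_r
print((resid @ resid / (16 - 8)) ** 0.5)  # residual std error on 8 DoF
```

The equality holds for every retained term, not just B:C, because in this design each model-matrix column is uncorrelated with all the others.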
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q7-Q8 - where to go next\n", "\n", "To **maximize** yield:\n", "\n", "- ``B`` is positive -> set B high (+1).\n", "- ``C`` is positive **and** ``B:C`` is positive -> set C high (+1) too;\n", " the interaction amplifies B.\n", "- ``D`` is negative **and** ``C:D`` is negative -> set D low (-1);\n", " high D actively hurts when C is high.\n", "\n", "That points to the corner ``B = +1, C = +1, D = -1`` (with A wherever\n", "is cheapest). Several extrapolation candidates are explored below." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "candidates = [\n", " (\"B=+1, C=+1, D=-1 (corner)\", dict(A=0, B=+1, C=+1, D=-1)),\n", " (\"B=+1.5, C=+1.5, D=-1\", dict(A=0, B=+1.5, C=+1.5, D=-1)),\n", " (\"B=+2, C=+1, D=-1.5\", dict(A=0, B=+2, C=+1, D=-1.5)),\n", " (\"B=+2, C=+2, D=-1\", dict(A=0, B=+2, C=+2, D=-1)),\n", "]\n", "for label, point in candidates:\n", " pred = float(predict(reduced, **point).iloc[0])\n", " print(f\"{label:40s} -> predicted yield {pred:5.1f} g/L\")" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. admonition:: Solution\n", "\n", " - **Corner ``B = +1, C = +1, D = -1``** is predicted at about\n", " ``93.5 g/L``, matching the observed run 7 (yield 94).\n", " - **``B = +1.5, C = +1.5, D = -1``** pushes the prediction past\n", " ``100 g/L`` - the model thinks more is better, but you should\n", " confirm with one experiment before betting on it.\n", " - **``B = +2, C = +1, D = -1.5``** explores trading harder on B\n", " and on suppressing D; same neighbourhood.\n", "\n", " The strategy is steepest ascent: each candidate moves further\n", " along the gradient direction. Run one, compare predicted to\n", " observed, and use the residual to decide whether to keep going\n", " or to call this the operational sweet spot." 
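The candidate predictions can also be reproduced by hand, which makes the steepest-ascent logic transparent: in coded units the reduced model's intercept is just `mean(y)`, and a prediction is the intercept plus each retained coefficient times the corresponding coded coordinate. A plain-numpy sketch (independent of `process_improve`; coefficients recomputed from the run data above, not hard-coded from the text, and `predict_at` is an illustrative helper):

```python
import numpy as np

# Yields in standard order and the coded B, C, D columns.
y_obs = np.array([60, 59, 63, 61, 69, 61, 94, 93,
                  56, 63, 70, 65, 44, 45, 78, 77], dtype=float)
Bc = np.tile([-1, -1, 1, 1], 4)      # B flips every 2 runs
Cc = np.tile([-1] * 4 + [1] * 4, 2)  # C every 4
Dc = np.repeat([-1, 1], 8)           # D every 8
terms = {"B": Bc, "C": Cc, "D": Dc,
         "BC": Bc * Cc, "BD": Bc * Dc, "CD": Cc * Dc, "BCD": Bc * Cc * Dc}
coef = {name: (x @ y_obs) / 16 for name, x in terms.items()}

def predict_at(b, c, d):
    """Reduced-model prediction at coded coordinates (b, c, d)."""
    point = {"B": b, "C": c, "D": d,
             "BC": b * c, "BD": b * d, "CD": c * d, "BCD": b * c * d}
    return y_obs.mean() + sum(coef[k] * point[k] for k in coef)

print(predict_at(+1, +1, -1))      # corner: 93.5 g/L, close to observed 94
print(predict_at(+1.5, +1.5, -1))  # extrapolation: well past 100 g/L
```

Swapping in other coded coordinates reproduces the rest of the candidate table; the extrapolated points lean on the model holding outside the tested cube, which is exactly what the confirmation run is for.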
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrap-up\n", "\n", "Two transferable lessons that scale to any number of factors:\n", "\n", "- A **saturated full factorial** spends every observation on a\n", "  coefficient. The model fits perfectly and tells you nothing about\n", "  uncertainty. Replicates (Module 3) or model reduction (this\n", "  module) give you that uncertainty back.\n", "- **Orthogonality** is what makes \"just drop the small terms\" safe.\n", "  In a designed full factorial the columns of the model matrix are\n", "  uncorrelated, so removing a factor does not bias the remaining\n", "  coefficients.\n", "\n", "**Next:** Module 5 attacks the same problem from a different direction.\n", "What if you cannot afford 16 runs to begin with? Fractional\n", "factorials trade *resolution* for *budget*: half a factorial costs\n", "half as much, and (usually) sacrifices only the highest-order\n", "interaction you did not believe in anyway." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }