A bigger factorial: four factors and model reduction#

Source worksheet: yint.org/w5 - week 5 of the applied DoE short course.

Module 3 finished with a 2^3 cube. This module steps up to four factors and 16 runs. The mathematics is unchanged - it is still y = b_0 + sum(b_i * x_i) + sum(b_ij * x_i * x_j) + ... - but two new habits matter:

  1. Count the terms before you fit. With k factors you have 2^k runs and 2^k terms in the full polynomial: 1 intercept, k main effects, C(k,2) two-factor interactions, and so on. This sets the R^2 budget: a full model on a saturated design must hit R^2 = 1 exactly because there is one coefficient per observation.

  2. Reduce the model. Real responses depend on a handful of factors and interactions; the rest of the polynomial estimates noise. Reading the Pareto of effect magnitudes and dropping the small ones recovers degrees of freedom and produces a model you can trust.

Tip

The general approach in this module is to fit the full model first, then reduce it. A 16-run full factorial fits 16 coefficients exactly, so the saturated model leaves no residual degrees of freedom for testing significance. The Pareto plot is used to identify the smallest terms, which are dropped to recover those degrees of freedom.

Q1-Q2 - counting interactions#

A 2^4 design has:

  • 4 main effects,

  • C(4, 2) = 6 two-factor interactions,

  • C(4, 3) = 4 three-factor interactions,

  • C(4, 4) = 1 four-factor interaction.

Add the intercept and that makes 16 coefficients in the full model. With 16 runs and 16 coefficients there are zero residual degrees of freedom; R^2 is exactly 1 and the residual standard error is exactly 0. This is not a sign of a great model - it is a sign that the model has no room left to disagree with the data.

[1]:
from math import comb

k = 4
print(f"Main effects     : {k}")
print(f"2-factor interactions: {comb(k, 2)}")
print(f"3-factor interactions: {comb(k, 3)}")
print(f"4-factor interactions: {comb(k, 4)}")
print(f"Total terms (incl intercept): {sum(comb(k, j) for j in range(k + 1))}")
print(f"Runs available  : {2 ** k}")
print(f"Residual DoF    : {2 ** k - sum(comb(k, j) for j in range(k + 1))}")
Main effects     : 4
2-factor interactions: 6
3-factor interactions: 4
4-factor interactions: 1
Total terms (incl intercept): 16
Runs available  : 16
Residual DoF    : 0

Solution

R^2 = 1.0 exactly; residual standard error is exactly 0.0. You cannot estimate uncertainty from a saturated design. Either replicate runs (Module 3) or prune terms (this module).
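
To make the "no room to disagree" point concrete, here is a minimal sketch in plain numpy (not part of the worksheet): a saturated 2^2 model reproduces any response vector exactly, even one that is pure noise, so its residuals can never reveal anything about measurement error.

import numpy as np

# Minimal sketch: a saturated 2^2 design -- 4 runs, 4 model terms.
a = np.array([-1, +1, -1, +1])
b = np.array([-1, -1, +1, +1])
X = np.column_stack([np.ones(4), a, b, a * b])      # intercept, a, b, a:b

y_noise = np.random.default_rng(0).normal(size=4)   # response is pure noise
coef, *_ = np.linalg.lstsq(X, y_noise, rcond=None)
print(y_noise - X @ coef)                           # residuals: all (numerically) zero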

Q3 - the bioreactor system#

Four factors are varied in 16 random-order runs:

  • A = feed rate [5 or 8 g/min]

  • B = initial inoculant [300 or 400 g]

  • C = feed substrate concentration [40 or 60 g/L]

  • D = dissolved oxygen set-point [4 or 5 mg/L]

In standard order (A varies fastest, then B, then C, then D), the yields (in g/L) are:

y = [60, 59, 63, 61, 69, 61, 94, 93,
     56, 63, 70, 65, 44, 45, 78, 77]
[2]:
import plotly.graph_objects as go
import plotly.io as pio

from process_improve.experiments import c, gather, lm, predict
from process_improve.experiments.visualization import visualize_doe

pio.renderers.default = "notebook_connected"

# Standard order: A flips every row, B every 2, C every 4, D every 8.

A = c(-1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, -1, +1, name="A", lo=5,   hi=8)
B = c(-1, -1, +1, +1, -1, -1, +1, +1, -1, -1, +1, +1, -1, -1, +1, +1, name="B", lo=300, hi=400)
C = c(-1, -1, -1, -1, +1, +1, +1, +1, -1, -1, -1, -1, +1, +1, +1, +1, name="C", lo=40,  hi=60)
D = c(-1, -1, -1, -1, -1, -1, -1, -1, +1, +1, +1, +1, +1, +1, +1, +1, name="D", lo=4,   hi=5)
y = c(60, 59, 63, 61, 69, 61, 94, 93,
      56, 63, 70, 65, 44, 45, 78, 77, name="y")
bio = gather(A=A, B=B, C=C, D=D, y=y)
print(f"Design has {len(bio)} runs and {bio.shape[1] - 1} factors")
Design has 16 runs and 4 factors
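
The +/-1 columns above were typed out by hand. As a cross-check, here is a minimal sketch using only the standard library (not the worksheet's code) that generates the same standard-order pattern:

from itertools import product

# product() varies its last argument fastest, so unpack in reverse order to get
# A flipping every run and D flipping every eight runs.
runs = [(a, b, c_, d) for d, c_, b, a in product([-1, +1], repeat=4)]
print([r[0] for r in runs])   # A: -1, +1, -1, +1, ... (flips every run)
print([r[3] for r in runs])   # D: eight -1s followed by eight +1s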
[3]:
# Fit the full 4-way model.  Every interaction term is included; the
# design is saturated so R^2 = 1 by construction.

full = lm("y ~ A * B * C * D", bio)
params = full.get_parameters(drop_intercept=False)

# Sort by absolute magnitude to spot the dominant terms.

ordered = params.reindex(params.abs().sort_values(ascending=False).index)
print(ordered.to_string())
Intercept    66.125
B             9.000
B:C           6.375
C:D          -5.250
C             4.000
D            -3.875
A:B:D        -1.250
B:D           1.250
A:B:C         1.125
A:D           0.875
A            -0.625
A:C          -0.500
A:B          -0.500
A:C:D         0.250
A:B:C:D       0.125
B:C:D        -0.125
[4]:
# Pareto-style bar chart of the effects (= 2 * coefficient).

effects = {
    term: 2 * coef
    for term, coef in full.get_parameters(drop_intercept=True).items()
}
pareto = visualize_doe(
    plot_type="pareto",
    analysis_results={"effects": effects},
    backend="plotly",
)
fig = go.Figure(pareto["plotly"])
fig.update_layout(width=720, height=460, title="Pareto plot of effects on bioreactor yield")
fig

Solution

In magnitude order the regression coefficients are roughly:

B (+9.0)  B:C (+6.4)  C:D (-5.25)  C (+4.0)  D (-3.9)
B:D (+1.25)  A:B:D (-1.25)  A:B:C (+1.1)   ... all the rest <= 1

These numbers are half the bar heights in the Pareto plot above: the Pareto plot draws effects, and an effect is defined as effect = 2 * coefficient (the response change when a factor moves from its -1 level all the way to its +1 level, a span of two coded units). Either view tells the same story; the factor of two is just a convention.

The system is dominated by B, C, and D; the two-factor interactions B:C and C:D are real and roughly the same size as the main effects. A (feed rate) and every interaction involving A are tiny - the yield is essentially insensitive to feed rate over this range.
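
As a quick cross-check on the factor-of-two convention, a minimal sketch in plain Python (independent of process_improve) computes the classical contrast for B directly: the average yield at B = +1 minus the average yield at B = -1 comes out at 18 g/L, exactly twice the regression coefficient of 9.

y_raw = [60, 59, 63, 61, 69, 61, 94, 93, 56, 63, 70, 65, 44, 45, 78, 77]
B_raw = [-1, -1, +1, +1, -1, -1, +1, +1, -1, -1, +1, +1, -1, -1, +1, +1]

hi = [yi for yi, bi in zip(y_raw, B_raw) if bi > 0]
lo = [yi for yi, bi in zip(y_raw, B_raw) if bi < 0]
print(sum(hi) / len(hi) - sum(lo) / len(lo))   # 18.0 = 2 * 9.0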

Q6 - reducing the model#

Dropping A buys back eight degrees of freedom (the main effect of A plus the seven interactions that involve A: three two-factor, three three-factor, and the four-factor term). The orthogonal design guarantees the coefficients on the remaining terms are unchanged - only the standard errors and the residual degrees of freedom move.

[5]:
reduced = lm("y ~ B * C * D", bio)
print("Reduced model (A and all its interactions dropped):")
print(reduced.get_parameters(drop_intercept=False).to_string())
print()
print(f"Reduced R^2: {reduced._OLS.rsquared:.4f}")
print(f"Reduced residual std error: {(reduced._OLS.scale ** 0.5):.2f} g/L")
Reduced model (A and all its interactions dropped):
Intercept    66.125
B             9.000
C             4.000
B:C           6.375
D            -3.875
B:D           1.250
C:D          -5.250
B:C:D        -0.125

Reduced R^2: 0.9755
Reduced residual std error: 3.02 g/L

Solution

The coefficients for B, C, D, B:C, B:D, C:D, B:C:D are exactly the same as in the full model. Only the standard errors change. With 8 residual degrees of freedom you can now talk about confidence intervals; the residual standard error of about 3 g/L is the implicit noise estimate.
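
As an illustration of what those 8 degrees of freedom buy, here is a minimal sketch using scipy (not process_improve's own summary output). It assumes the standard result for a coded orthogonal design, where every coefficient has variance sigma^2 / 16:

from scipy import stats

s = 3.02                          # residual std error of the reduced model (above)
se = s / 16 ** 0.5                # standard error of each coded coefficient
t = stats.t.ppf(0.975, df=8)      # 8 residual degrees of freedom
print(f"b_B = 9.00 +/- {t * se:.2f} g/L")   # roughly 9.0 +/- 1.7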

Guidance

Dropping a factor does not mean "feed rate does not matter." It means "feed rate does not matter over the range tested (5 to 8 g/min)." Stretch the range and the conclusion may change. Set A to whichever level is cheapest, safest, or most operationally convenient.

Q7-Q8 - where to go next#

To maximize yield:

  • B is positive -> set B high (+1).

  • C is positive and B:C is positive -> set C high (+1) too; the interaction amplifies B.

  • D is negative and C:D is negative -> set D low (-1); high D actively hurts when C is high.

That points to the corner B = +1, C = +1, D = -1 (with A wherever is cheapest). Several extrapolation candidates are explored below.
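
As a hand check on the first candidate (a sketch only; the predict() call below does the same arithmetic), evaluate every reduced-model term at B = +1, C = +1, D = -1:

B_, C_, D_ = +1, +1, -1
y_hat = (66.125 + 9.0 * B_ + 4.0 * C_ - 3.875 * D_
         + 6.375 * B_ * C_ + 1.25 * B_ * D_ - 5.25 * C_ * D_
         - 0.125 * B_ * C_ * D_)
print(y_hat)   # 93.5, matching the first row below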

[6]:
candidates = [
    ("B=+1, C=+1, D=-1 (corner)",       dict(A=0, B=+1,   C=+1,   D=-1)),
    ("B=+1.5, C=+1.5, D=-1",             dict(A=0, B=+1.5, C=+1.5, D=-1)),
    ("B=+2,   C=+1,   D=-1.5",           dict(A=0, B=+2,   C=+1,   D=-1.5)),
    ("B=+2,   C=+2,   D=-1",             dict(A=0, B=+2,   C=+2,   D=-1)),
]
for label, point in candidates:
    pred = float(predict(reduced, **point).iloc[0])
    print(f"{label:40s} -> predicted yield {pred:5.1f} g/L")
B=+1, C=+1, D=-1 (corner)                -> predicted yield  93.5 g/L
B=+1.5, C=+1.5, D=-1                     -> predicted yield 110.1 g/L
B=+2,   C=+1,   D=-1.5                   -> predicted yield 111.2 g/L
B=+2,   C=+2,   D=-1                     -> predicted yield 130.0 g/L

Solution

  • Corner ``B = +1, C = +1, D = -1`` is predicted at about 93.5 g/L, matching the observed run 7 (yield 94).

  • ``B = +1.5, C = +1.5, D = -1`` pushes the prediction past 100 g/L - the model thinks more is better, but you should confirm with one experiment before betting on it.

  • ``B = +2, C = +1, D = -1.5`` pushes harder on B and on suppressing D; the prediction lands in the same neighbourhood.

The strategy is steepest ascent: each candidate moves further along the gradient direction. Run one, compare predicted to observed, and use the residual to decide whether to keep going or to call this the operational sweet spot.
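
A minimal sketch of what "the gradient direction" means here (plain Python, using the reduced-model coefficients printed above): differentiate the fitted polynomial with respect to each coded factor and evaluate at the best corner.

B_, C_, D_ = +1, +1, -1
dB = 9.0 + 6.375 * C_ + 1.25 * D_ - 0.125 * C_ * D_     # d(y)/d(B)
dC = 4.0 + 6.375 * B_ - 5.25 * D_ - 0.125 * B_ * D_     # d(y)/d(C)
dD = -3.875 + 1.25 * B_ - 5.25 * C_ - 0.125 * B_ * C_   # d(y)/d(D)
print(dB, dC, dD)   # 14.25, 15.75, -8.0: push B and C up, pull D down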

Wrap-up#

Two transferable lessons that scale to any number of factors:

  • A saturated full factorial spends every observation on a coefficient. The model fits perfectly and tells you nothing about uncertainty. Replicates (Module 3) or model reduction (this module) give you that uncertainty back.

  • Orthogonality is what makes “just drop the small terms” safe. In a designed full factorial the columns of the model matrix are uncorrelated, so removing a factor does not bias the remaining coefficients.
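
A minimal numpy sketch of that orthogonality (independent of process_improve): every pair of distinct model-matrix columns has a zero dot product, so X'X = 16 * I and each coefficient is simply (column . y) / 16, no matter which other terms stay in the model.

import numpy as np

base = np.array([-1, +1])
A_ = np.tile(base, 8)                 # flips every run
B_ = np.tile(np.repeat(base, 2), 4)   # flips every 2 runs
C_ = np.tile(np.repeat(base, 4), 2)   # flips every 4 runs
D_ = np.repeat(base, 8)               # flips every 8 runs
y_ = np.array([60, 59, 63, 61, 69, 61, 94, 93, 56, 63, 70, 65, 44, 45, 78, 77])

print(A_ @ B_, B_ @ (C_ * D_), (B_ * C_) @ (C_ * D_))   # all zero
print(B_ @ y_ / 16, (B_ * C_) @ y_ / 16)                # 9.0 and 6.375, as before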

Next: Module 5 attacks the same problem from a different direction. What if you cannot afford 16 runs to begin with? Fractional factorials trade resolution for budget: half a factorial costs half as much, and (usually) sacrifices only the highest-order interaction you did not believe in anyway.