Doing less: fractional factorials#
Source worksheets: yint.org/w6 and yint.org/w7 - weeks 6 and 7 of the applied DoE short course.
A 2^5 factorial costs 32 runs. A 2^6 costs 64. Real budgets rarely stretch that far - and most of the high-order interactions you would spend the runs on are noise anyway. Fractional factorials let you buy back the budget by trading resolution: you replace a high-order interaction you do not believe in (say, the four-factor ABCD interaction) with a new factor (say, E). Half the runs, almost all the answers.
The cost is aliasing - some effects can no longer be separated from each other. This module shows the trade in action with two worked examples, then introduces the vocabulary that DesignExpert, Minitab, and the DoE literature all use: generators, defining relation, words, resolution.
Tip
The central trade is "I will not estimate the high-order interaction (say ABCD) separately from the new factor it generates (say E)"; in exchange you get a 32 -> 16 (or 16 -> 8) drop in runs. In practice the high-order interaction was always going to be tiny, so the trade is almost free.
Q1 - counting runs in full and half fractions#
A full factorial in five factors costs \(2^5 = 32\) runs and fits 32 coefficients: 1 intercept, 5 main effects, 10 two-factor interactions, 10 three-factor interactions, 5 four-factor interactions, and 1 five-factor interaction. A half fraction uses one generator (\(p = 1\)), so the design has \(2^{5-1} = 16\) runs.
[1]:
from math import comb
k = 5
print(f"Full factorial 2^{k} = {2 ** k} runs and {2 ** k} coefficients")
for j in range(k + 1):
    label = f"{j}-factor interactions" if j else "intercept"
    print(f"  {label}: {comb(k, j)}")
print()
print(f"Half-fraction 2^({k}-1) = {2 ** (k - 1)} runs")
print(f"Quarter-fraction 2^({k}-2) = {2 ** (k - 2)} runs")
Full factorial 2^5 = 32 runs and 32 coefficients
intercept: 1
1-factor interactions: 5
2-factor interactions: 10
3-factor interactions: 10
4-factor interactions: 5
5-factor interactions: 1
Half-fraction 2^(5-1) = 16 runs
Quarter-fraction 2^(5-2) = 8 runs
Q2-Q4 - the stability system, both halves#
We return to the three-factor stability study from Module 3. The full \(2^3\) design has 8 runs; a half-fraction has 4 runs. With only 4 runs to estimate the intercept and 3 main effects, the design is saturated, and we get the alias relationship \(C = \pm A \cdot B\).
The two halves are obtained by choosing the rows where \(C = +A \cdot B\) and where \(C = -A \cdot B\). They come from the same 8-run table:
Full table for the \(2^3\) design, columns \((A, B, C, y)\):
(-,-,-,40) (+,-,-,27) (-,+,-,35) (+,+,-,21)
(-,-,+,41) (+,-,+,27) (-,+,+,31) (+,+,+,20)
Half-fraction with \(C = +A \cdot B\) (rows where \(C = A \cdot B\)):
(-,-,+,41) (+,-,-,27) (-,+,-,35) (+,+,+,20)
Half-fraction with \(C = -A \cdot B\) (rows where \(C = -A \cdot B\)):
(-,-,-,40) (+,-,+,27) (-,+,+,31) (+,+,-,21)
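The row selection above can be reproduced mechanically. A minimal sketch in plain Python (not `process_improve`), filtering the full 8-run table by the sign of \(C\) versus \(A \cdot B\):

```python
# Full 2^3 table from the text, as (A, B, C, y) tuples.
full_runs = [(-1, -1, -1, 40), (+1, -1, -1, 27), (-1, +1, -1, 35), (+1, +1, -1, 21),
             (-1, -1, +1, 41), (+1, -1, +1, 27), (-1, +1, +1, 31), (+1, +1, +1, 20)]

half_pos = [r for r in full_runs if r[2] == r[0] * r[1]]   # rows where C = +A*B
half_neg = [r for r in full_runs if r[2] == -r[0] * r[1]]  # rows where C = -A*B
print(len(half_pos), len(half_neg))  # 4 4
```

Each run of the full table lands in exactly one half, which is why the two halves are called complementary.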
[2]:
from process_improve.experiments import c, gather, lm
# Half-fraction with C = +A*B
A1 = c(-1, +1, -1, +1, name="A")
B1 = c(-1, -1, +1, +1, name="B")
C1 = c(+1, -1, -1, +1, name="C")
y1 = c(41, 27, 35, 20, name="y")
half_pos = gather(A=A1, B=B1, C=C1, y=y1)
m_pos = lm("y ~ A + B + C", half_pos)
print("Half-fraction with C = +A*B:")
print(m_pos.get_parameters(drop_intercept=False).to_string())
Half-fraction with C = +A*B:
Intercept 30.75
A -7.25
B -3.25
C -0.25
[3]:
# Half-fraction with C = -A*B
A2 = c(-1, +1, -1, +1, name="A")
B2 = c(-1, -1, +1, +1, name="B")
C2 = c(-1, +1, +1, -1, name="C")
y2 = c(40, 27, 31, 21, name="y")
half_neg = gather(A=A2, B=B2, C=C2, y=y2)
m_neg = lm("y ~ A + B + C", half_neg)
print("Half-fraction with C = -A*B:")
print(m_neg.get_parameters(drop_intercept=False).to_string())
Half-fraction with C = -A*B:
Intercept 29.75
A -5.75
B -3.75
C -0.75
[4]:
# Full 2^3 model (Module 3) gave these coefficients:
# Intercept = 30.25, A = -6.5, B = -3.5, C = -0.5
# The two half-fractions average back to the full coefficients.
import pandas as pd
full = pd.Series({"Intercept": 30.25, "A": -6.5, "B": -3.5, "C": -0.5})
pos = m_pos.get_parameters(drop_intercept=False)
neg = m_neg.get_parameters(drop_intercept=False)
avg = (pos + neg) / 2
out = pd.DataFrame({"Half C=+AB": pos, "Half C=-AB": neg, "Average of halves": avg, "Full 2^3": full})
print(out.to_string())
Half C=+AB Half C=-AB Average of halves Full 2^3
Intercept 30.75 29.75 30.25 30.25
A -7.25 -5.75 -6.50 -6.50
B -3.25 -3.75 -3.50 -3.50
C -0.25 -0.75 -0.50 -0.50
Solution
What you read off the halves:

Half C = +A*B: A = -7.25, B = -3.25, C = -0.25.
Half C = -A*B: A = -5.75, B = -3.75, C = -0.75.

The averages, A = -6.5, B = -3.5, C = -0.5, match the full 2^3 model exactly. This is the beautiful property of complementary half-fractions: each is biased by the confounded interaction, but the bias has opposite sign, so averaging cancels it.
The aliasing pattern. In the C = +A*B half:
b_A_hat = b_A + b_BC, b_B_hat = b_B + b_AC, b_C_hat = b_C + b_AB.
Plugging in the full-model values b_BC = -0.75, b_AC = +0.25, b_AB = +0.25 reproduces the half's coefficients exactly. In the C = -A*B half the aliased terms enter with opposite sign.
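The alias arithmetic above can be checked with nothing but the full-model numbers. A quick sketch in plain Python (values taken from the Module 3 full \(2^3\) model):

```python
# Full 2^3 coefficients from Module 3.
full = {"A": -6.5, "B": -3.5, "C": -0.5, "AB": +0.25, "AC": +0.25, "BC": -0.75}

# In the C = +A*B half each main effect absorbs its aliased two-factor term:
half_pos = {"A": full["A"] + full["BC"],   # -7.25
            "B": full["B"] + full["AC"],   # -3.25
            "C": full["C"] + full["AB"]}   # -0.25

# In the C = -A*B half the aliased term enters with the opposite sign:
half_neg = {"A": full["A"] - full["BC"],   # -5.75
            "B": full["B"] - full["AC"],   # -3.75
            "C": full["C"] - full["AB"]}   # -0.75

for key in "ABC":
    # Averaging the two halves cancels the bias exactly.
    assert (half_pos[key] + half_neg[key]) / 2 == full[key]
```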
Q5-Q8 - half-fraction of the bioreactor (D = ABC)#
Back to the 16-run bioreactor from Module 4, but now imagine the budget only stretched to 8 runs. Using the generator D = A*B*C keeps the design balanced and gives a resolution-IV fraction: main effects are clear of two-factor interactions, but two-factor interactions are aliased with each other.
The defining relation is I = ABCD, so the alias pairs are AB = CD, AC = BD, AD = BC, and every main effect is aliased with the three-factor interaction obtained by multiplying through by ABCD.
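Those alias pairs fall out of simple word algebra: multiplying through by ABCD, with the rule \(X \cdot X = I\), amounts to a symmetric difference of letters. A minimal sketch in plain Python (a hypothetical helper, not a `process_improve` function):

```python
def alias(effect, word="ABCD"):
    """Alias partner of `effect` under the defining relation I = `word`.

    Multiplying words with the rule X*X = I is a symmetric difference
    of their letter sets; an empty result is the identity I.
    """
    return "".join(sorted(set(effect) ^ set(word))) or "I"

print(alias("AB"))    # CD   -> the pair AB = CD
print(alias("A"))     # BCD  -> each main effect aliases a 3-factor term
print(alias("ABCD"))  # I
```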
[5]:
# Pick the 8 rows from the standard-order 16 where D == A*B*C.
A = c(-1, +1, +1, -1, +1, -1, -1, +1, name="A")
B = c(-1, +1, -1, +1, -1, +1, -1, +1, name="B")
C = c(-1, -1, +1, +1, -1, -1, +1, +1, name="C")
D = c(-1, -1, -1, -1, +1, +1, +1, +1, name="D")
y = c(60, 61, 61, 94, 63, 70, 44, 77, name="y")
half_bio = gather(A=A, B=B, C=C, D=D, y=y)
half_bio
[5]:
| A | B | C | D | y | |
|---|---|---|---|---|---|
| 1 | -1.0 | -1.0 | -1.0 | -1.0 | 60.0 |
| 2 | 1.0 | 1.0 | -1.0 | -1.0 | 61.0 |
| 3 | 1.0 | -1.0 | 1.0 | -1.0 | 61.0 |
| 4 | -1.0 | 1.0 | 1.0 | -1.0 | 94.0 |
| 5 | 1.0 | -1.0 | -1.0 | 1.0 | 63.0 |
| 6 | -1.0 | 1.0 | -1.0 | 1.0 | 70.0 |
| 7 | -1.0 | -1.0 | 1.0 | 1.0 | 44.0 |
| 8 | 1.0 | 1.0 | 1.0 | 1.0 | 77.0 |
[6]:
# Fit main effects and the three independent two-factor groups.
# (AB and CD share a column; the design only resolves the sum of their
# coefficients. Same for AC = BD and AD = BC.)
m_half = lm("y ~ A + B + C + D + A:B + A:C + A:D", half_bio)
print(m_half.get_parameters(drop_intercept=False).to_string())
Intercept 66.25
A -0.75
B 9.25
C 2.75
D -2.75
A:B -5.75
A:C 0.75
A:D 7.25
Solution
The half-fraction’s coefficients line up with the full-model coefficients from Module 4 plus their aliased partners:
half b_B  = +9.25 = full b_B  (+9.0)   + b_ACD (+0.25)
half b_C  = +2.75 = full b_C  (+4.0)   + b_ABD (-1.25)
half b_D  = -2.75 = full b_D  (-3.875) + b_ABC (+1.125)
half b_AB = -5.75 = full b_AB (-0.5)   + b_CD  (-5.25)
half b_AC = +0.75 = full b_AC (-0.5)   + b_BD  (+1.25)
half b_AD = +7.25 = full b_AD (+0.875) + b_BC  (+6.375)
Same qualitative conclusions as the full design: B dominates, D hurts, and the (AB = CD) and (AD = BC) alias groups carry large effects. The half-fraction cannot tell you which member of each pair is responsible - you would resolve that with a fold-over: run the complementary half and combine the data.
Guidance
Half-fractions are the cheapest way to screen many factors: spend the first 8 runs on a half, see which main effects and two-factor groups light up, then decide whether to spend the second 8 to resolve the aliases or to move on to a new study focused on the surviving factors.
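The fold-over idea is easy to verify mechanically: the D = +ABC half and the D = -ABC half partition the full \(2^4\) design, so running both reconstructs it and de-aliases the two-factor pairs. A sketch in plain Python:

```python
from itertools import product

# All 16 runs of the full 2^4 design as (A, B, C, D) tuples.
full_16 = list(product([-1, +1], repeat=4))

half1 = [r for r in full_16 if r[3] == +r[0] * r[1] * r[2]]  # D = +ABC (run first)
half2 = [r for r in full_16 if r[3] == -r[0] * r[1] * r[2]]  # D = -ABC (fold-over)

print(len(half1), len(half2))                    # 8 8
print(sorted(half1 + half2) == sorted(full_16))  # True: together they are the full 2^4
```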
Vocabulary you will meet (w7)#
These terms appear in Minitab, DesignExpert, JMP, and the DoE literature. None of them are unique to process_improve, but you will hit them every time you read a screening study.
| Term | Plain English |
|---|---|
| Factor | Something we deliberately change. Measured and controlled. |
| Disturbance | A real-world influence we cannot control and usually cannot measure. |
| Covariate | A real-world influence we can measure but not control. Worth recording so we can model it. |
| Nuisance factor | A controlled factor we do not care about scientifically (operator, batch, day). Handle with blocking. |
| Generator | An equation like D = ABC that creates a new factor's column from an interaction column. |
| Defining relation | The product of all generators with the identity, e.g. I = ABCD. |
| Word | Each term in the defining relation, e.g. ABCD in I = ABCD. |
| Resolution | Length of the shortest word in the defining relation. Res IV = main effects clear of two-factor interactions; Res V = also clear of two-factor x two-factor confounding. |
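The resolution definition in the table is one line of code. A sketch in plain Python (a hypothetical helper, not from any of the packages named above), applied to the two generators used in this module:

```python
def resolution(words):
    """Resolution = length of the shortest word in the defining relation.

    For a single-generator half-fraction the defining relation has one word.
    """
    return min(len(w) for w in words)

print(resolution(["ABCD"]))   # 4 -> Res IV: the 2^(4-1) design with D = ABC
print(resolution(["ABCDE"]))  # 5 -> Res V:  the 2^(5-1) design with E = ABCD
```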
Check yourself
Q7.2 - A variable that cannot be measured or controlled is a disturbance.
Q7.3 - A variable measured but not controlled is a covariate.
Q7.4 - Yes, you can have something controlled but not measured - a held-constant condition. In practice you measure it anyway, because constants drift.
Q7.5 - Refusing to randomize a hard-to-change factor means you are confounding it with time and any drift in the equipment. If the experimenter is the same person and the equipment is the same, you are still confounding with operator fatigue, batch of reagent, ambient temperature, and the order itself.
A worked example - the CalApp screening study#
The source worksheet (w7 Q6) ends with a small case study, CalApp, to make the vocabulary concrete. A team is screening drivers of 60-day app retention. Three of the inputs are deliberately manipulated and become the factors of the design:
A = promotional offer (yes / no),
B = marketing message (variant 1 / variant 2),
C = in-app purchase price (low / high).
Six other variables describe each user or device but are not set by the experimenter:
E = the user’s age,
N = the user’s gender,
S = the user’s connection type (cellular or wifi),
R = the device’s free memory (RAM),
F = which advertising network served the install (G or H),
D = whether the device is Apple or Android.
For each of those six, decide: is it a factor, a covariate, a disturbance, or a nuisance variable? The solution below walks through them.
Solution
For the CalApp screening example (Q7.6):
E (user age): covariate - measured, not controlled.
N (gender): covariate - measured, not controlled.
S (cell vs wifi): covariate that could also be a nuisance factor if it correlates with engagement.
R (free RAM): covariate - measured, not controlled.
F (ad network G vs H): could be a factor (you choose it) or a nuisance factor depending on whether it is part of the study’s question.
D (Apple vs Android): could be a factor, a nuisance factor (block on it), or a covariate depending on the hypothesis.
Wrap-up#
Two transferable habits:
Fractional design first, full design only if needed. A half-fraction usually answers 80% of the question for 50% of the budget.
Read the defining relation before fitting. Knowing the alias pattern upfront tells you which conclusions are robust and which need a fold-over to resolve.
Next: Module 6 returns to the trade-off table from w8 and starts the move into optimization with a 1-D response surface study, introducing path-of-steepest-ascent thinking that Module 7 then generalizes to two dimensions.