PCA on food texture#

A small dataset where five sensory and instrumental measurements are taken on each of 50 pastry samples: oil percentage, density, crispiness, fracture force, and hardness. The variables are correlated and measured on different scales, so PCA on the autoscaled data shows the dominant patterns in two components and lets us identify samples that do not look like the rest of the batch.

Data. food-texture.csv from openmv.net. The first column is a sample identifier; the remaining five are the measurements.

Adapted from the Principal Component Analysis chapter of the Process Improvement using Data book (CC BY-SA 4.0).

[1]:

import matplotlib.pyplot as plt
import pandas as pd
import plotly.io as pio

from process_improve.multivariate.methods import PCA, MCUVScaler

pio.renderers.default = "notebook_connected"

Load the data#

[2]:

foods = pd.read_csv("https://openmv.net/file/food-texture.csv", index_col=0)
print(f"Shape: {foods.shape[0]} samples x {foods.shape[1]} variables")
foods.head()

Shape: 50 samples x 5 variables

[2]:

	Oil	Density	Crispy	Fracture	Hardness
B110	16.5	2955	10	23	97
B136	17.7	2660	14	9	139
B171	16.2	2870	12	17	143
B192	16.7	2920	10	31	95
B225	16.3	2975	11	26	143

[3]:

foods.describe().round(2)

[3]:

	Oil	Density	Crispy	Fracture	Hardness
count	50.00	50.0	50.00	50.00	50.00
mean	17.20	2857.6	11.52	20.86	128.18
std	1.59	124.5	1.78	5.47	31.13
min	13.70	2570.0	7.00	9.00	63.00
25%	16.30	2772.5	10.00	17.00	107.25
50%	16.90	2867.5	12.00	21.00	126.00
75%	18.10	2945.0	13.00	25.00	143.75
max	21.20	3125.0	15.00	33.00	192.00

The five variables span very different ranges (oil percentage is in the teens, density is in the thousands, the rest are unitless sensory scores). Without scaling, density would dominate the model purely through its variance.

Preprocess and fit#

[4]:

X = MCUVScaler().fit_transform(foods)

selection = PCA.select_n_components(X, max_components=4, cv=5)
print(f"Recommended A = {selection.n_components}")
selection.press_ratio.round(3)

Recommended A = 4

[4]:

n_components
2    0.338
3    0.649
4    0.362
Name: PRESS ratio, dtype: float64

[5]:

model = PCA(n_components=max(2, selection.n_components)).fit(X)
model.r2_cumulative_.round(3)

[5]:

1    0.606
2    0.865
3    0.927
4    0.976
Name: Cumulative R², dtype: float64

Score and loading plots#

With only five variables, the loading plot is more useful than for high-dimensional spectra: every loading carries a recognisable label, and we can read directly which sensory and instrumental measurements drive each component.

[6]:

model.score_plot()

[7]:

model.loading_plot()

SPE and Hotelling’s T squared#

[8]:

model.spe_plot()

[9]:

model.t2_plot()

Outlier explanation#

Each flagged sample can be explained back in terms of the original five variables with score_contributions.

[10]:

outliers = model.detect_outliers(conf_level=0.95)
print(f"{len(outliers)} samples flagged")
for o in outliers[:3]:
    print(f"  {o['observation']}: {o['outlier_types']} (severity={o['severity']:.2f})")

1 samples flagged
  B575: ['spe'] (severity=1.77)

[11]:

if outliers:
    worst = outliers[0]["observation"]
    contrib = model.score_contributions(model.scores_.loc[worst].values)

    fig, ax = plt.subplots(figsize=(6, 3))
    contrib.plot.bar(ax=ax, color="steelblue")
    ax.axhline(0, color="k", linewidth=0.5)
    ax.set_ylabel("Contribution")
    ax.set_title(f"Score contributions for sample {worst}")
    plt.tight_layout()
    plt.show()

../../../_images/user_guide_case_studies_latent-variable-modelling_pca-food-texture_16_0.png

What to try next#

The food-texture dataset is small enough that you can recompute everything by hand and check intuition. Try changing the number of components and watch what happens to the score plot, the SPE limit, and the list of flagged samples.
Compare the loading plot against domain knowledge: do the variables that cluster together in the loading plot also correlate strongly in foods.corr()?
Once a property like consumer acceptance is added, switch to PLS to relate the texture variables to it.