PCA with missing data: Kamyr digester#

Process data from a continuous Kamyr pulp digester: ten variables sampled over time, including chip feed rates, white liquor charges, blow flows, and Kappa number (the standard pulp quality metric). Some sensors fail or are not available for every observation, so roughly half the rows carry at least one missing value. Most off-the-shelf PCA implementations would force you to drop those rows; this package fits the model directly on the incomplete matrix using the iterative NIPALS or TSR algorithms.

Data. kamyr-digester.csv from openmv.net. 96 rows by 10 numeric columns.

What we do. Quantify how much data is missing, scale the matrix (mean and standard deviation are computed column-wise ignoring NaN), fit PCA with algorithm="auto" so it selects the missing-data-aware NIPALS solver, and read the convergence diagnostics from fitting_info_.

Adapted from the Principal Component Analysis chapter of the Process Improvement using Data book (CC BY-SA 4.0).

[1]:

import pandas as pd
import plotly.io as pio

from process_improve.multivariate.methods import PCA, MCUVScaler

pio.renderers.default = "notebook_connected"

Load and inspect missingness#

We read with header=None first; if the resulting frame has only numeric dtypes the file really is headerless and we attach generic tag_K column names. Otherwise the first row was a header, so we reload with the default header-inference behaviour and keep the original column labels.

Either way we then coerce every column to numeric (cells that cannot parse become NaN, which the missing-data PCA solver handles natively) and drop any column that becomes entirely NaN, which removes any non-numeric identifier or date column the dataset might carry alongside the process variables.

[2]:

url = "https://openmv.net/file/kamyr-digester.csv"
kamyr = pd.read_csv(url, header=None)
if all(pd.api.types.is_numeric_dtype(kamyr[c]) for c in kamyr.columns):
    kamyr.columns = [f"tag_{i + 1}" for i in range(kamyr.shape[1])]
else:
    kamyr = pd.read_csv(url)
kamyr = kamyr.apply(pd.to_numeric, errors="coerce").dropna(axis=1, how="all")
print(f"Shape: {kamyr.shape[0]} observations x {kamyr.shape[1]} variables")
print(f"Total missing values: {int(kamyr.isna().sum().sum())}")
print(f"Rows with at least one missing value: {int(kamyr.isna().any(axis=1).sum())}")
kamyr.isna().sum().to_frame("missing_per_column")

Shape: 301 observations x 22 variables
Total missing values: 352
Rows with at least one missing value: 170

[2]:

	missing_per_column
Y-Kappa	0
ChipRate	4
BF-CMratio	14
BlowFlow	13
ChipLevel4	1
T-upperExt-2	1
T-lowerExt-2	1
UCZAA	24
WhiteFlow-4	1
AAWhiteSt-4	141
AA-Wood-4	1
ChipMoisture-4	1
SteamFlow-4	1
Lower-HeatT-3	1
Upper-HeatT-3	1
ChipMass-4	1
WeakLiquorF	1
BlackFlow-2	1
WeakWashF	1
SteamHeatF-3	1
T-Top-Chips-4	1
SulphidityL-4	141

About half the rows have at least one missing value, and one column is missing for almost half of all observations. Listwise deletion would throw away most of the data; we keep all rows and let PCA estimate the scores from the variables that are observed.

Scale and fit#

MCUVScaler computes per-column mean and standard deviation ignoring NaN, so scaling a matrix that has missing values is well-defined. We then fit PCA with algorithm="auto": it detects the missing values in X and switches from SVD to the iterative NIPALS solver.

[3]:

X = MCUVScaler().fit_transform(kamyr)
model = PCA(n_components=3).fit(X)
print(f"algorithm     : {model.algorithm_}")
print(f"missing data  : {model.has_missing_data_}")
print(f"iterations    : {model.fitting_info_['iterations']}")
model.r2_cumulative_.round(3)

algorithm     : nipals
missing data  : True
iterations    : [29. 57. 77.]

[3]:

1    0.333
2    0.488
3    0.593
Name: Cumulative R², dtype: float64

fitting_info_['iterations'] lists how many NIPALS iterations each component required. Components with many missing entries typically take longer to converge; the default tolerance and iteration cap can be tightened or relaxed via the missing_data_settings={"md_tol": ..., "md_max_iter": ...} argument to PCA.

Score plot#

[4]:

model.score_plot()

Loading plot#

Variables that load together describe the same phenomenon. On the Kamyr digester the upper-vessel and lower-vessel measurements typically separate cleanly because the digester behaves like two stacked compartments.

[5]:

model.loading_plot()

SPE and Hotelling’s T squared#

[6]:

model.spe_plot()

[7]:

model.t2_plot()

[8]:

outliers = model.detect_outliers(conf_level=0.95)
print(f"{len(outliers)} observations flagged")
for o in outliers[:5]:
    print(f"  obs {o['observation']}: {o['outlier_types']} (severity={o['severity']:.2f})")

17 observations flagged
  obs 181: ['spe', 'hotellings_t2'] (severity=2.31)
  obs 283: ['hotellings_t2'] (severity=1.49)
  obs 285: ['hotellings_t2'] (severity=1.44)
  obs 63: ['hotellings_t2'] (severity=1.41)
  obs 286: ['hotellings_t2'] (severity=1.34)

TSR as an alternative#

Trimmed Score Regression is generally a more reliable choice when the missingness pattern is structured (a whole shift’s worth of a variable is gone, for example). Switching is a one-line change:

model_tsr = PCA(n_components=3, algorithm="tsr").fit(X)

and the rest of the workflow above is identical.

[9]:

model_tsr = PCA(n_components=3, algorithm="tsr").fit(X)
comparison = pd.DataFrame(
    {
        "NIPALS R2": model.r2_cumulative_,
        "TSR R2": model_tsr.r2_cumulative_,
    }
).round(3)
comparison

[9]:

	NIPALS R2	TSR R2
1	0.333	0.333
2	0.488	0.488
3	0.593	0.593

What to try next#

Tighten md_tol and increase md_max_iter to see how far the iterative solver can push the fit when more iterations are budgeted.
Re-run with the rows that have any missing values dropped (kamyr.dropna()) and compare the score plots; this is the listwise-deletion baseline that the missing-data algorithms are designed to avoid.
If a quality variable like Kappa number is broken out from the predictor block, switch from PCA to PLS to relate the digester process variables to product quality.