Architecture overview

This page is the map of the codebase: how process-improve is laid out, the conventions every subpackage follows, and the two cross-cutting systems (the estimator stack and the MCP tool layer) that most changes touch. Read it before your first contribution; the per-topic policy pages under Development go deeper.

Package layout

process_improve/
    multivariate/    # PCA, PLS, TPLS, and multi-block (MBPCA / MBPLS)
    experiments/     # designed experiments: designs, analysis, optimisation
    monitoring/      # control charts (Shewhart / CUSUM / EWMA-style)
    batch/           # batch data alignment (DTW), features, preprocessing
    regression/      # robust regression (repeated median, Theil-Sen)
    bivariate/       # elbow / peak detection, area-under-curve
    univariate/      # robust summary statistics, outlier detection
    visualization/   # shared plotting themes and helpers
    simulation/      # process simulators
    datasets/        # sample datasets used by examples and tests
    tool_spec.py     # the MCP @tool_spec decorator + global tool registry
    config.py        # runtime settings (caps, limits)
    _linalg.py, _random.py, _extras.py   # small shared utilities

Each domain subpackage exposes its public API through a thin re-export module (methods.py for multivariate, package __init__ elsewhere) so the import path callers use is stable even when the implementation files move.

Conventions every subpackage follows

sklearn-compatible estimators. Estimators inherit BaseEstimator plus the relevant mixin (TransformerMixin / RegressorMixin). They do not inherit a concrete sklearn estimator - the mixins give get_params / set_params / clone / Pipeline support without coupling to sklearn’s private attribute layout (ENG-07). fit() returns self; fitted attributes use the trailing-underscore convention (scores_, spe_, hotellings_t2_) and are set only in fit(), never in __init__.
Optional dependencies live in extras. Plotting (plotly / ridgeplot), the designed-experiment generators in experiments (pyDOE3 / pyoptex), and the batch and MCP layers are installed via the [plotting] / [expt] / [batch] / [mcp] extras (ENG-13). Modules import them through a _MissingExtra stand-in, so a missing optional dependency fails only when the feature is actually used, not at import time.
Diagnostic logging. Modules that do real work define logger = logging.getLogger(__name__) and emit debug records at major algorithm steps; nothing configures handlers. See Logging.
Error handling. Tool wrappers narrow their except clauses to a canonical set of exception types, so unexpected errors propagate to the server and get redacted there. See Error-Handling Style Guide.
Reproducibility. Randomised paths take an explicit random_state. See Reproducibility Contract (RNG Handling).

The multivariate estimator stack

The latent-variable estimators are split into single-responsibility modules under multivariate/ and aggregated by methods.py (the stable public import path; _pca_pls.py remains as a backward-compatibility shim):

_common.py        # DataMatrix alias, epsqrt, _nz, SpecificationWarning, _model_method
_preprocessing.py # MCUVScaler, center, scale
_nipals.py        # NIPALS / least-squares kernels (missing-data aware)
_limits.py        # Hotelling's T2 / SPE / score limits, ellipse geometry
_diagnostics.py   # VIP, squared cosine, contributions, RV coefficients
_base.py          # _LatentVariableModel base + mixins (see below)
_pca.py  _pls.py  _tpls.py  _mbpls.py  _mbpca.py   # the estimators
_resampling.py    # jackknife / bootstrap resampling
plots.py          # score / loading / SPE / T2 / coefficient plots + Plot accessor

Two base-class ideas tie the estimators together (_base.py):

_LatentVariableModel owns the scaffolding PCA and PLS share: the convenience methods (score_plot, vip, spe_limit, …) that forward to the standalone functions, ellipse_coordinates, and the attribute-rename __getattr__ (driven by a per-class _ATTRIBUTE_RENAMES map). The convenience methods are real methods built by the _model_method factory, so help / inspect.signature report the underlying function and the fitted model pickles and subclasses cleanly (ENG-05, ENG-17). MBPLS / MBPCA share only _HotellingsT2LimitMixin.
Ndarray-backed fitted attributes. Hot-path attributes (scores_, loadings_, spe_, …) are stored as private numpy ndarrays; the public pd.DataFrame is a lazily-built, cached view via the _LazyFrame descriptor. Internal math reads the ndarray (no per-call .values conversion); the cache is excluded from pickling (ENG-18).

A typical fit therefore validates input, runs the algorithm-specific _fit_* (SVD / NIPALS / TSR for PCA; NIPALS for PLS; hierarchical NIPALS for the multi-block models), stores the ndarrays plus index/column metadata, and computes limits and R-squared bookkeeping. The numerical kernels live in _nipals.py and are shared, so a fix there benefits every estimator.

The MCP tool layer

Agent-callable tools are declared with the @tool_spec decorator (process_improve/tool_spec.py). Each tool pairs a pydantic BaseModel input contract (ConfigDict(extra="forbid")) with a wrapper function; the decorator registers the function in a global _TOOL_REGISTRY and attaches the JSON-schema spec. get_tool_specs() returns the specs in registry (decorator-execution) order; discover_tools() imports each subpackage’s tools module so the decorators run.
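The registration mechanics can be sketched with the standard library alone. This is an assumption-laden illustration, not the real decorator: the tool_spec parameters shown here (name, description, input_schema), the _tool_spec attribute, and both example tools are hypothetical, and a hand-written JSON-schema dict with "additionalProperties": False stands in for the pydantic model with extra="forbid".

```python
import statistics
from typing import Any, Callable

_TOOL_REGISTRY: dict[str, Callable[..., Any]] = {}


def tool_spec(name: str, description: str, input_schema: dict[str, Any]):
    """Register a wrapper function and attach its spec (hypothetical signature)."""

    def decorator(func):
        func._tool_spec = {
            "name": name,
            "description": description,
            "input_schema": input_schema,
        }
        _TOOL_REGISTRY[name] = func  # dicts preserve insertion order
        return func

    return decorator


def get_tool_specs() -> list[dict[str, Any]]:
    """Return specs in registry (decorator-execution) order."""
    return [func._tool_spec for func in _TOOL_REGISTRY.values()]


_VALUES_SCHEMA = {
    "type": "object",
    "properties": {"values": {"type": "array", "items": {"type": "number"}}},
    "required": ["values"],
    "additionalProperties": False,  # plays the role of extra="forbid"
}


@tool_spec("median_tool", "Median of a list of numbers.", _VALUES_SCHEMA)
def median_tool(values):
    return {"median": statistics.median(values)}


@tool_spec("range_tool", "Range (max minus min) of a list of numbers.", _VALUES_SCHEMA)
def range_tool(values):
    return {"range": max(values) - min(values)}


spec_names = [spec["name"] for spec in get_tool_specs()]
result = _TOOL_REGISTRY["median_tool"](values=[1, 5, 3])
```

Because registration happens as a side effect of the decorator running, nothing appears in the registry until the defining module is imported, which is why a discovery step that imports each tools module is needed.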

In experiments/ the tools and analyses are split one-per-module (_tools/<tool>.py, _analyses/<analysis>.py) with tools.py acting as the ordered aggregator (ENG-02). To add a tool, see Authoring an MCP tool.

Where to make a change

A numerical fix to a latent-variable algorithm: the shared kernels in multivariate/_nipals.py / _limits.py / _common.py, or the estimator’s _fit_* method.
A new estimator convenience method shared by PCA and PLS: multivariate/_base.py.
A new agent-callable tool: a new module under the subpackage’s _tools/ (or tools.py) - see Authoring an MCP tool.
A new designed-experiment analysis: a module under experiments/_analyses/ dispatched from experiments/analysis.py.
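For the second case above (a convenience method shared by PCA and PLS), the forwarding pattern can be sketched as follows. This is a hedged illustration of how a factory such as _model_method might work using functools.wraps; the function body, LatentModel, ChildModel, and base_limit are hypothetical, and the real factory in multivariate/_common.py may differ.

```python
import functools
import inspect


def spe_limit(model, conf_level: float = 0.95) -> float:
    """Standalone function (illustrative body only): scale a stored limit."""
    return model.base_limit * conf_level


def _model_method(func):
    """Build a real method that passes `self` as the first argument of `func`."""

    @functools.wraps(func)  # copies __name__ / __doc__ and sets __wrapped__
    def method(self, *args, **kwargs):
        return func(self, *args, **kwargs)

    return method


class LatentModel:  # hypothetical stand-in for _LatentVariableModel
    base_limit = 10.0
    # The right-hand side resolves to the module-level function above.
    spe_limit = _model_method(spe_limit)


class ChildModel(LatentModel):  # subclasses inherit the method as usual
    base_limit = 20.0


model = ChildModel()
limit = model.spe_limit(conf_level=0.5)
# inspect follows __wrapped__, so the bound method reports the standalone
# function's parameters with the model argument removed.
signature = str(inspect.signature(model.spe_limit))
doc = LatentModel.spe_limit.__doc__
```

Because the method is an ordinary class attribute rather than something synthesised per instance, subclassing and introspection behave as they would for a hand-written method.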