Architecture overview#
This page is the map of the codebase: how process-improve is laid out, the
conventions every subpackage follows, and the two cross-cutting systems (the
estimator stack and the MCP tool layer) that most changes touch. Read it before
your first contribution; the per-topic policy pages under
Development go deeper.
Package layout#
process_improve/
multivariate/ # PCA, PLS, TPLS, and multi-block (MBPCA / MBPLS)
experiments/ # designed experiments: designs, analysis, optimisation
monitoring/ # control charts (Shewhart / CUSUM / EWMA-style)
batch/ # batch data alignment (DTW), features, preprocessing
regression/ # robust regression (repeated median, Theil-Sen)
bivariate/ # elbow / peak detection, area-under-curve
univariate/ # robust summary statistics, outlier detection
visualization/ # shared plotting themes and helpers
simulation/ # process simulators
datasets/ # sample datasets used by examples and tests
tool_spec.py # the MCP @tool_spec decorator + global tool registry
config.py # runtime settings (caps, limits)
_linalg.py, _random.py, _extras.py # small shared utilities
Each domain subpackage exposes its public API through a thin re-export module
(methods.py for multivariate, package __init__ elsewhere) so the import
path callers use is stable even when the implementation files move.
Conventions every subpackage follows#
sklearn-compatible estimators. Estimators inherit
BaseEstimatorplus the relevant mixin (TransformerMixin/RegressorMixin). They do not inherit a concrete sklearn estimator - the mixins giveget_params/set_params/clone/ Pipeline support without coupling to sklearn’s private attribute layout (ENG-07).fit()returnsself; fitted attributes use the trailing-underscore convention (scores_,spe_,hotellings_t2_) and are set only infit(), never in__init__.Optional dependencies live in extras. Plotting (plotly / ridgeplot), the experiments designed-experiment generators (pyDOE3 / pyoptex), the batch and MCP layers are installed via
[plotting]/[expt]/[batch]/[mcp]extras (ENG-13). Modules import them through a_MissingExtrastand-in so a missing optional dependency only fails when the feature is actually used, not at import time.Diagnostic logging. Modules that do real work define
logger = logging.getLogger(__name__)and emitdebugrecords at major algorithm steps; nothing configures handlers. See Logging.Error handling. Tool wrappers narrow their
exceptto a canonical set so unexpected errors propagate to the server and get redacted. See Error-Handling Style Guide.Reproducibility. Randomised paths take an explicit
random_state. See Reproducibility Contract (RNG Handling).
The multivariate estimator stack#
The latent-variable estimators are split into single-responsibility modules
under multivariate/ and aggregated by methods.py (the stable public
import path; _pca_pls.py remains as a backward-compatibility shim):
_common.py # DataMatrix alias, epsqrt, _nz, SpecificationWarning, _model_method
_preprocessing.py # MCUVScaler, center, scale
_nipals.py # NIPALS / least-squares kernels (missing-data aware)
_limits.py # Hotelling's T2 / SPE / score limits, ellipse geometry
_diagnostics.py # VIP, squared cosine, contributions, RV coefficients
_base.py # _LatentVariableModel base + mixins (see below)
_pca.py _pls.py _tpls.py _mbpls.py _mbpca.py # the estimators
_resampling.py # jackknife / bootstrap resampling
plots.py # score / loading / SPE / T2 / coefficient plots + Plot accessor
Two base-class ideas tie the estimators together (_base.py):
``_LatentVariableModel`` owns the scaffolding PCA and PLS share: the convenience methods (
score_plot,vip,spe_limit, …) that forward to the standalone functions,ellipse_coordinates, and the attribute-rename__getattr__(driven by a per-class_ATTRIBUTE_RENAMESmap). The convenience methods are real methods built by the_model_methodfactory, sohelp/inspect.signaturereport the underlying function and the fitted model pickles and subclasses cleanly (ENG-05, ENG-17). MBPLS / MBPCA share only_HotellingsT2LimitMixin.Ndarray-backed fitted attributes. Hot-path attributes (
scores_,loadings_,spe_, …) are stored as private numpy ndarrays; the publicpd.DataFrameis a lazily-built, cached view via the_LazyFramedescriptor. Internal math reads the ndarray (no per-call.valuesconversion); the cache is excluded from pickling (ENG-18).
A typical fit therefore: validates input, runs the algorithm-specific
_fit_* (SVD / NIPALS / TSR for PCA; NIPALS for PLS; hierarchical NIPALS for
the multi-block models), stores the ndarrays + index/column metadata, and
computes limits and R-squared bookkeeping. The numerical kernels live in
_nipals.py and are shared, so a fix there benefits every estimator.
The MCP tool layer#
Agent-callable tools are declared with the @tool_spec decorator
(process_improve/tool_spec.py). Each tool pairs a pydantic
BaseModel input contract (ConfigDict(extra="forbid")) with a wrapper
function; the decorator registers the function in a global _TOOL_REGISTRY
and attaches the JSON-schema spec. get_tool_specs() returns the specs in
registry (decorator-execution) order; discover_tools() imports each
subpackage’s tools module so the decorators run.
In experiments/ the tools and analyses are split one-per-module
(_tools/<tool>.py, _analyses/<analysis>.py) with tools.py acting as
the ordered aggregator (ENG-02). To add a tool, see Authoring an MCP tool.
Where to make a change#
A numerical fix to a latent-variable algorithm: the shared kernels in
multivariate/_nipals.py/_limits.py/_common.py, or the estimator’s_fit_*method.A new estimator convenience method shared by PCA and PLS:
multivariate/_base.py.A new agent-callable tool: a new module under the subpackage’s
_tools/(ortools.py) - see Authoring an MCP tool.A new designed-experiment analysis: a module under
experiments/_analyses/dispatched fromexperiments/analysis.py.