Regression#

Backwards-compatible re-exporter for process_improve.regression.

The implementation now lives in process_improve.regression._robust_regression (ENG-23 / #305): the renamed file makes filename-ranked tooling (Jump-to-File, fuzzy search, codecov reports) less ambiguous about which methods.py is being shown.

Every public name remains importable as before:

from process_improve.regression.methods import OLS, robust_regression, repeated_median_slope

class process_improve.regression.methods.OLS(fit_intercept=True, na_rm=True, conflevel=0.95, pi_resolution=50)[source]#

Bases: RegressorMixin, BaseEstimator

Ordinary Least Squares regression with statistical diagnostics.

A scikit-learn-compatible estimator that fits an OLS model and exposes inferential statistics (standard errors, t-values, p-values, confidence intervals, F-statistic) and influence diagnostics (leverage, Cook’s distance). Calling print(model) after fitting renders a summary similar to R’s summary(lm(...)).

Parameters:
  • fit_intercept (bool, default=True) – If True, fits an intercept term. If False, the regression is forced through the origin.

  • na_rm (bool, default=True) – If True, drops rows with one or more missing values before fitting.

  • conflevel (float, default=0.95) – Confidence level for confidence and prediction intervals.

  • pi_resolution (int, default=50) – Number of grid points at which to compute prediction intervals over the range of x. Only used when X has a single column and an intercept is fitted.
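As a rough illustration of the row-wise removal that na_rm describes, the sketch below drops any row with a missing value and then scatters a per-row result back to the original length. It assumes that NaN marks a missing value and that a missing value in either X or y removes the row; the handling inside OLS may differ, and the arrays are made up for the example.

```python
import numpy as np

# Hypothetical data: 5 rows, 2 features, with one missing value in X and one in y.
X = np.array([[1.0, 2.0], [np.nan, 1.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# Keep a row only if every feature and the target are present.
keep = ~(np.isnan(X).any(axis=1) | np.isnan(y))
X_used, y_used = X[keep], y[keep]

# A per-row result computed on the retained rows can be written back into an
# array of the original length, leaving NaN where a row was dropped.
full_length = np.full(y.shape, np.nan)
full_length[keep] = y_used - y_used.mean()  # stand-in for real residuals
```

Here three rows survive, which mirrors the distinction the attribute list draws between shapes (N,) and (N_original,).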

coefficients_#

Fitted slope coefficients (excludes the intercept).

Type:

np.ndarray of shape (K,)

intercept_#

Fitted intercept (np.nan if fit_intercept is False).

Type:

float

standard_errors_#

Standard errors of coefficients_.

Type:

np.ndarray of shape (K,)

standard_error_intercept_#

Standard error of the intercept.

Type:

float

t_values_#

t-statistics for each coefficient.

Type:

np.ndarray of shape (K,)

t_value_intercept_#

t-statistic for the intercept.

Type:

float

p_values_#

Two-sided p-values for each coefficient.

Type:

np.ndarray of shape (K,)

p_value_intercept_#

p-value for the intercept.

Type:

float

conf_intervals_#

Lower and upper bounds of the coefficient confidence intervals.

Type:

np.ndarray of shape (K, 2)

conf_interval_intercept_#

Lower and upper bounds of the intercept confidence interval.

Type:

np.ndarray of shape (2,)

r2_#

Coefficient of determination.

Type:

float

adj_r2_#

Adjusted R-squared.

Type:

float

se_#

Residual standard error (sqrt of residual variance).

Type:

float

df_resid_#

Residual degrees of freedom.

Type:

int

df_model_#

Model degrees of freedom (number of slope coefficients).

Type:

int

f_statistic_#

F-statistic for the overall regression.

Type:

float

f_pvalue_#

p-value associated with the F-statistic.

Type:

float

fitted_values_#

In-sample predictions.

Type:

np.ndarray of shape (N,)

residuals_#

In-sample residuals (NaN at rows removed by na_rm).

Type:

np.ndarray of shape (N_original,)

leverage_#

Hat-matrix diagonal (only computed for single-feature X).

Type:

np.ndarray of shape (N,)

influence_#

Cook’s distance (only computed for single-feature X with intercept).

Type:

np.ndarray of shape (N,)

pi_range_#

Columns are x-grid, lower bound, upper bound of the prediction interval. np.nan if not applicable.

Type:

np.ndarray of shape (pi_resolution, 3) or float

feature_names_in_#

Column names of the feature matrix.

Type:

list[str]

target_name_#

Name of the target variable.

Type:

str

n_samples_#

Number of samples used in the fit (after na_rm).

Type:

int

n_features_in_#

Number of input features.

Type:

int

is_fitted_#

Whether fit() has been called successfully.

Type:

bool

Examples

>>> import numpy as np
>>> from process_improve.regression.methods import OLS
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((50, 2))
>>> y = X @ [1.5, -2.0] + 0.5 + 0.1 * rng.standard_normal(50)
>>> model = OLS().fit(X, y)
>>> print(model)
Call:
OLS(fit_intercept=True, na_rm=True, conflevel=0.95)
...
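For readers who want to see where the inferential attributes come from, the following is a minimal NumPy sketch of the textbook formulas on the same kind of simulated data. It assumes OLS uses the standard definitions, which this page does not confirm; the variable names are illustrative and only loosely mirror coefficients_, standard_errors_, t_values_, se_, r2_, adj_r2_, df_resid_ and f_statistic_.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = X @ np.array([1.5, -2.0]) + 0.5 + 0.1 * rng.standard_normal(50)

N, K = X.shape
A = np.column_stack([np.ones(N), X])          # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # [intercept, slope_1, slope_2]

residuals = y - A @ beta
rss = residuals @ residuals                   # residual sum of squares
df_resid = N - K - 1                          # one extra parameter for the intercept
se = np.sqrt(rss / df_resid)                  # residual standard error

# Standard errors from the diagonal of s^2 (A'A)^-1, then t-values.
std_errors = se * np.sqrt(np.diag(np.linalg.inv(A.T @ A)))
t_values = beta / std_errors

ss_total = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - rss / ss_total
adj_r2 = 1.0 - (1.0 - r2) * (N - 1) / df_resid
f_statistic = (r2 / K) / ((1.0 - r2) / df_resid)
```

The first element of each vector refers to the intercept and the remaining K elements to the slopes, whereas the estimator reports the intercept quantities in separate attributes.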

See also

multiple_linear_regression

Backwards-compatible function returning a dict.

robust_regression

Robust regression via repeated-median slope.

fit(X, y)[source]#

Fit the OLS model.

Parameters:
  • X (array-like of shape (N, K)) – Feature matrix. Pandas and NumPy inputs are both accepted.

  • y (array-like of shape (N,) or (N, 1)) – Target vector.

Returns:

self – Fitted estimator.

Return type:

OLS

predict(X)[source]#

Predict target values for X.

Parameters:

X (array-like of shape (N, K))

Returns:

y_pred

Return type:

np.ndarray of shape (N,)

prediction_interval(X, conflevel=None)[source]#

Prediction interval for new observations at arbitrary X.

Unlike the pi_range_ attribute, which is evaluated on a fixed grid spanning the training data, this method evaluates the prediction interval at any predictor value(s) supplied by the caller, including points outside the training range.

Parameters:
  • X (array-like of shape (M, K), (K,) or scalar) – New predictor value(s). A scalar or 1-D array is interpreted as a list of points when the model has a single feature, or as a single multi-feature point otherwise.

  • conflevel (float or None, default=None) – Confidence level for the interval. Defaults to the model’s own conflevel.

Returns:

A bunch with three length-M arrays: predicted (the point prediction), and lower / upper (the prediction-interval bounds).

Return type:

sklearn.utils.Bunch
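For orientation, the textbook prediction interval for a new observation at \(x_0\), in the single-feature case with an intercept, is shown below. This is the standard form and is assumed rather than confirmed to be what the method computes; \(\alpha = 1 - \text{conflevel}\), \(s\) is the residual standard error, and \(t_{1-\alpha/2,\,N-2}\) is the corresponding quantile of the t-distribution with \(N - 2\) degrees of freedom.

```latex
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\,N-2}\; s\,
\sqrt{1 + \frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}}
```

The last term under the square root grows with the distance of \(x_0\) from \(\bar{x}\), which is why intervals widen for points outside the training range.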

summary()[source]#

Return an R-style summary(lm(...)) string for the fitted model.

Return type:

str

to_dict()[source]#

Return the legacy dictionary representation used by multiple_linear_regression.

Return type:

dict

set_score_request(*, sample_weight='$UNCHANGED$')#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

process_improve.regression.methods.fit_robust_lm(x, y)[source]#

Fit a robust linear model, with an intercept, between NumPy vectors x and y. Returns a length-2 array [intercept, slope] (the params attribute returned by statsmodels.RLM); no additional data-consistency checks are performed.

See also: regression.repeated_median_slope

Return type:

ndarray

process_improve.regression.methods.multiple_linear_regression(X, y, fit_intercept=True, na_rm=True, conflevel=0.95, pi_resolution=50)[source]#

Linear regression of the N rows and K columns of matrix X onto the single column ‘y’.

Backwards-compatible wrapper around OLS. New code should use the OLS estimator directly, which exposes the same statistics as sklearn-style attributes and prints an R-like summary(lm(...)).

Notes and limitations:
  • does not handle weighting

  • requires N >= K, i.e. at least as many rows as columns in X

Returns a dictionary of outputs. Keys always present:

N:                        number of observations actually used to fit
coefficients:             a vector of K coefficients, one for each column in X
intercept:                returned if fit_intercept==True
standard_errors:          a vector of K standard errors, one per column in X
standard_error_intercept: standard error for the intercept
R2:                       the R^2 value
SE:                       the model's standard error
fitted_values:            the N predicted values, one per row in y
residuals:                the N residuals
t_value:                  the t-values for the standard errors
conf_intervals:           K rows x 2 columns (lower, upper) confidence intervals

Keys present only for single-feature X (and only when fit_intercept is True and there is enough non-degenerate data):

x_ssq:                    sum of squares of the centred predictor
leverage:                 hat-matrix diagonal
influence:                Cook-style influence values
pi_range:                 prediction interval above and below, over the
                          range of the predictor (``pi_resolution`` points)
Return type:

dict
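The single-feature diagnostics listed above (x_ssq, leverage, influence) can be sketched with the standard straight-line formulas. This is an illustration only: it assumes the usual hat-matrix diagonal and Cook's distance definitions, which may differ in detail from what the function returns, and the data are invented.

```python
from statistics import fmean

# Hypothetical single-feature data; the last point sits far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 10.5]
n = len(x)

x_bar, y_bar = fmean(x), fmean(y)
x_ssq = sum((xi - x_bar) ** 2 for xi in x)  # sum of squares of the centred predictor

# Least-squares slope and intercept for y = intercept + slope * x.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / x_ssq
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hat-matrix diagonal for a straight-line fit with an intercept.
leverage = [1.0 / n + (xi - x_bar) ** 2 / x_ssq for xi in x]

# Cook's distance with p = 2 fitted parameters (intercept and slope).
p = 2
s2 = sum(e ** 2 for e in residuals) / (n - p)
influence = [e ** 2 / (p * s2) * h / (1.0 - h) ** 2 for e, h in zip(residuals, leverage)]
```

The leverages sum to the number of fitted parameters (2 here), and the isolated point at x = 10 carries by far the largest leverage.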

process_improve.regression.methods.repeated_median_slope(x, y, nowarn=False)[source]#

Robust slope calculation via Siegel’s repeated-median estimator.

https://en.wikipedia.org/wiki/Repeated_median_regression

An elegant (simple) method to compute the robust slope between a vector x and y. For each point i the median of the pairwise slopes (y[j] - y[i]) / (x[j] - x[i]) over all j != i is computed; the returned slope is the median of those per-point medians.

Parameters:
  • x (np.ndarray or sequence) – Independent variable. Coerced to a 1-D numpy array. Must have at least 3 elements (unless nowarn=True).

  • y (np.ndarray or sequence) – Dependent variable. Must have the same length as x (unless nowarn=True).

  • nowarn (bool, optional) – If True, skip the length and equal-length input assertions. Default False.

Returns:

The repeated-median estimate of the slope. Returns np.nan if all inner medians are undefined (e.g. all x values are equal).

Return type:

float
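The description above translates almost directly into code. The sketch below is a plain-Python illustration of the repeated-median idea, not the library's implementation; the helper name is hypothetical, and skipping pairs with equal x values is an assumption about how undefined slopes are treated.

```python
from statistics import median


def repeated_median_slope_sketch(x, y):
    """Median over i of the median over j != i of the pairwise slopes (illustrative)."""
    inner_medians = []
    for i in range(len(x)):
        slopes = [
            (y[j] - y[i]) / (x[j] - x[i])
            for j in range(len(x))
            if j != i and x[j] != x[i]  # assumption: skip undefined slopes
        ]
        if slopes:
            inner_medians.append(median(slopes))
    return median(inner_medians) if inner_medians else float("nan")


# Points on the line y = 2x + 1, with one gross outlier at x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [3.0, 5.0, 7.0, 9.0, 100.0, 13.0, 15.0]
slope = repeated_median_slope_sketch(x, y)  # 2.0: the outlier does not move the estimate
```

The double loop makes this quadratic in the number of points, which is fine for a sketch but is the cost that faster algorithms aim to reduce.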

Notes

INVESTIGATE: algorithm speed-ups via these articles: https://link.springer.com/article/10.1007/PL00009190 http://www.sciencedirect.com/science/article/pii/S0020019003003508

process_improve.regression.methods.robust_regression(x, y, fit_intercept=True, na_rm=True, conflevel=0.95, nowarn=False, pi_resolution=50)[source]#

Perform a simple robust regression analysis between the x and y variables.

Parameters:
  • x, y – Sequences of numerical values.

  • fit_intercept – If True, fits an intercept term. If False, forces the regression through the origin.

  • na_rm – If True, removes all observations with one or more missing values.

  • conflevel – Confidence level for confidence intervals. Default is 0.95.

  • nowarn – If True, suppresses warnings. Users should ensure data validity beforehand.

  • pi_resolution – The resolution of prediction intervals. Default is 50.

Simple robust regression between an x and a y using the repeated_median_slope method to calculate the slope. The intercept is the median intercept, when using that slope and the provided x and y values, or forced to zero if fit_intercept=False.
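A small sketch of the median-intercept step described here, assuming a slope estimate is already available. It is illustrative only: the data are invented, the slope is simply set to the true value, and the actual function may compute the intercept differently in detail.

```python
from statistics import median

# Illustrative data on the line y = 2x + 1 with one gross outlier at x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [3.0, 5.0, 7.0, 9.0, 100.0, 13.0, 15.0]
slope = 2.0  # stand-in for a robust slope estimate

fit_intercept = True
# Each point implies an intercept y_i - slope * x_i; take the median of those.
intercept = median(yi - slope * xi for xi, yi in zip(x, y)) if fit_intercept else 0.0

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

Because the median ignores the single aberrant value, the outlier ends up with a large residual instead of pulling the line towards itself.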

Returns a dictionary of outputs with these keys:

N:                        the number of observations used to fit the model
coefficients:             a length-1 list containing the regression slope
intercept:                returned if fit_intercept==True, otherwise 0
standard_errors:          a length-1 list containing the standard error of the slope
standard_error_intercept: standard error for the intercept (np.nan if fit_intercept=False)
R2:                       the R^2 value
SE:                       the model's standard error
x_ssq:                    the sum of squares of (x - mean(x))
k:                        the number of model parameters (2 if fit_intercept else 1)
fitted_values:            the N predicted values, one per row in y
residuals:                the N residuals
t_value:                  the t-values for the standard errors
conf_intervals:           K rows x 2 columns (lower, upper) confidence intervals
conf_interval_intercept:  (lower, upper) confidence interval for the intercept
pi_range:                 prediction intervals above and below, over the range of data
leverage:                 the hat-matrix diagonal (leverage) for each observation
influence:                Cook-style influence values for each observation
Return type:

dict

process_improve.regression.methods.t_value(p, v)[source]#

Return the t-value, i.e. the value on the x-axis of the cumulative t-distribution, below which a fractional area p lies. Here p is a fraction between 0 and 1 (the y-axis of the cumulative distribution) and v is the degrees of freedom.

Examples

The t-distribution is symmetric about 0.0 for any number of degrees of freedom, so half of the area always lies below zero:

>>> t_value(0.5, v)
0.0

Zero fractional area under the curve is always at \(-\infty\):

>>> t_value(0.0, v)
-Inf

100% fractional area is always at \(+\infty\):

>>> t_value(1.0, v)
+Inf

See also

t_value_cdf

does the inverse of this function.

Return type:

float
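The behaviour described above matches the inverse cumulative distribution function (percent-point function) of Student's t-distribution. The sketch below reproduces the three documented cases with scipy.stats.t.ppf; whether t_value is implemented with this call is an assumption, and the choice of v = 10 is arbitrary.

```python
from scipy import stats

v = 10  # degrees of freedom, chosen arbitrarily for illustration

# Inverse CDF (percent-point function) of Student's t-distribution.
median_point = stats.t.ppf(0.5, v)  # approximately 0.0, by symmetry
lower_tail = stats.t.ppf(0.0, v)    # negative infinity
upper_tail = stats.t.ppf(1.0, v)    # positive infinity

# Critical value for a two-sided 95% interval, and the round trip through the CDF.
t_crit = stats.t.ppf(0.975, v)
area = stats.t.cdf(t_crit, v)
```

The round trip at the end shows the relationship between a quantile function and its cumulative distribution: applying the CDF to the critical value recovers the area 0.975.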