Regression
Backwards-compatible re-exporter for process_improve.regression.
The implementation now lives in
process_improve.regression._robust_regression (ENG-23 / #305): the
renamed file makes filename-ranked tooling (Jump-to-File, fuzzy search,
codecov reports) less ambiguous about which methods.py is being shown.
Every public name remains importable as before:
from process_improve.regression.methods import OLS, robust_regression, repeated_median_slope
- class process_improve.regression.methods.OLS(fit_intercept=True, na_rm=True, conflevel=0.95, pi_resolution=50)
Bases: RegressorMixin, BaseEstimator
Ordinary Least Squares regression with statistical diagnostics.
A scikit-learn-compatible estimator that fits an OLS model and exposes inferential statistics (standard errors, t-values, p-values, confidence intervals, F-statistic) and influence diagnostics (leverage, Cook’s distance). Calling print(model) after fitting renders a summary similar to R’s summary(lm(...)).
- Parameters:
fit_intercept (bool, default=True) – If True, fits an intercept term. If False, the regression is forced through the origin.
na_rm (bool, default=True) – If True, drops rows with one or more missing values before fitting.
conflevel (float, default=0.95) – Confidence level for confidence and prediction intervals.
pi_resolution (int, default=50) – Number of grid points at which to compute prediction intervals over the range of x. Only used when X has a single column and an intercept is fitted.
- coefficients_
Fitted slope coefficients (excludes the intercept).
- Type:
np.ndarray of shape (K,)
- standard_errors_
Standard errors of coefficients_.
- Type:
np.ndarray of shape (K,)
- t_values_
t-statistics for each coefficient.
- Type:
np.ndarray of shape (K,)
- p_values_
Two-sided p-values for each coefficient.
- Type:
np.ndarray of shape (K,)
- conf_intervals_
Lower and upper bounds of the coefficient confidence intervals.
- Type:
np.ndarray of shape (K, 2)
- conf_interval_intercept_
Lower and upper bounds of the intercept confidence interval.
- Type:
np.ndarray of shape (2,)
- fitted_values_
In-sample predictions.
- Type:
np.ndarray of shape (N,)
- residuals_
In-sample residuals (NaN at rows removed by na_rm).
- Type:
np.ndarray of shape (N_original,)
- leverage_
Hat-matrix diagonal (only computed for single-feature X).
- Type:
np.ndarray of shape (N,)
- influence_
Cook’s distance (only computed for single-feature X with intercept).
- Type:
np.ndarray of shape (N,)
- pi_range_
Columns are x-grid, lower bound, upper bound of the prediction interval. np.nan if not applicable.
- Type:
np.ndarray of shape (pi_resolution, 3) or float
Examples
>>> import numpy as np
>>> from process_improve.regression.methods import OLS
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((50, 2))
>>> y = X @ [1.5, -2.0] + 0.5 + 0.1 * rng.standard_normal(50)
>>> model = OLS().fit(X, y)
>>> print(model)
Call: OLS(fit_intercept=True, na_rm=True, conflevel=0.95)
...
See also
multiple_linear_regression: Backwards-compatible function returning a dict.
robust_regression: Robust regression via repeated-median slope.
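As a rough illustration of the statistics this estimator reports, the following dependency-free sketch computes the slope, intercept, model standard error, and coefficient standard errors for a single-feature fit with an intercept. It uses the standard textbook formulas and is not the library's implementation; the function name simple_ols and the dictionary keys are hypothetical.

```python
import math


def simple_ols(x, y):
    """Single-feature OLS with an intercept, standard library only (illustrative)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squares of the centred predictor
    x_ssq = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / x_ssq
    intercept = y_mean - slope * x_mean
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # Model standard error with N - 2 degrees of freedom (slope and intercept)
    se = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    se_slope = se / math.sqrt(x_ssq)
    se_intercept = se * math.sqrt(1 / n + x_mean ** 2 / x_ssq)
    return {
        "slope": slope,
        "intercept": intercept,
        "SE": se,
        "standard_error_slope": se_slope,
        "standard_error_intercept": se_intercept,
        "t_value_slope": slope / se_slope,
    }


result = simple_ols([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1])
print(result["slope"], result["intercept"])
```

Dividing each coefficient by its standard error gives the t-statistic; the p-values and confidence intervals listed above then follow from the t-distribution with N - 2 degrees of freedom.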
- fit(X, y)
Fit the OLS model.
- Parameters:
X (array-like of shape (N, K)) – Feature matrix. Pandas and NumPy inputs are both accepted.
y (array-like of shape (N,) or (N, 1)) – Target vector.
- Returns:
self – Fitted estimator.
- Return type:
- predict(X)
Predict target values for X.
- Parameters:
X (array-like of shape (N, K))
- Returns:
y_pred
- Return type:
np.ndarray of shape (N,)
- prediction_interval(X, conflevel=None)
Prediction interval for new observations at arbitrary X.
Unlike the pi_range_ attribute, which is evaluated on a fixed grid spanning the training data, this method evaluates the prediction interval at any predictor value(s) supplied by the caller, including points outside the training range.
- Parameters:
X (array-like of shape (M, K), (K,) or scalar) – New predictor value(s). A scalar or 1-D array is interpreted as a list of points when the model has a single feature, or as a single multi-feature point otherwise.
conflevel (float or None, default=None) – Confidence level for the interval. Defaults to the model’s own conflevel.
- Returns:
A bunch with three length-M arrays: predicted (the point prediction), and lower / upper (the prediction-interval bounds).
- Return type:
- to_dict()
Return the legacy dictionary representation used by multiple_linear_regression.
- Return type:
- set_score_request(*, sample_weight='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- process_improve.regression.methods.fit_robust_lm(x, y)
Fits a robust linear model between NumPy vectors x and y, with an intercept. Returns a length-2 array [intercept, slope] (the params attribute returned by statsmodels.RLM); no extra checks on data consistency are performed.
See also: regression.repeated_median_slope
- process_improve.regression.methods.multiple_linear_regression(X, y, fit_intercept=True, na_rm=True, conflevel=0.95, pi_resolution=50)
Linear regression of the N rows and K columns of matrix X onto the single column y.
Backwards-compatible wrapper around OLS. New code should use the OLS estimator directly, which exposes the same statistics as sklearn-style attributes and prints an R-like summary(lm(...)).
- Notes and limitations:
does not handle weighting
requires N >= K: at least as many rows as columns in X
Returns a dictionary of outputs. Keys always present:
N: number of observations actually used to fit
coefficients: a vector of K coefficients, one for each column in X
intercept: returned if fit_intercept==True
standard_errors: a vector of K standard errors, one per column in X
standard_error_intercept: standard error for the intercept
R2: the R^2 value
SE: the model's standard error
fitted_values: the N predicted values, one per row in y
residuals: the N residuals
t_value: the t-values for the standard errors
conf_intervals: K rows x 2 columns (lower, upper) confidence intervals
Keys present only for single-feature X (and only when fit_intercept is True and there is enough non-degenerate data):
x_ssq: sum of squares of the centred predictor
leverage: hat-matrix diagonal
influence: Cook-style influence values
pi_range: prediction interval above and below, over the range of the predictor (pi_resolution points)
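To make the single-feature diagnostics concrete, here is a minimal sketch of how x_ssq, the hat-matrix diagonal, and Cook's distance relate to each other under the usual textbook definitions. The exact scaling the library uses for its "Cook-style" values is an assumption here, and the helper name leverage_and_cooks is hypothetical.

```python
def leverage_and_cooks(x, y):
    """Hat-matrix diagonal and Cook's distance for a one-predictor fit (illustrative)."""
    n = len(x)
    k = 2  # number of fitted parameters: slope and intercept
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    x_ssq = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / x_ssq
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual mean square (square of the model's standard error)
    mse = sum(e ** 2 for e in residuals) / (n - k)
    # Leverage grows with the squared distance from the predictor mean
    leverage = [1 / n + (xi - x_mean) ** 2 / x_ssq for xi in x]
    # Textbook Cook's distance; assumes no observation has leverage exactly 1
    cooks = [
        (e ** 2 / (k * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
    return leverage, cooks


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.0, 20.0]  # the last observation is an outlier
leverage, cooks = leverage_and_cooks(x, y)
print(leverage)
print(cooks.index(max(cooks)))
```

The leverage values sum to the number of fitted parameters, and the outlying last observation receives the largest Cook's distance because it combines a large residual with high leverage.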
- process_improve.regression.methods.repeated_median_slope(x, y, nowarn=False)
Robust slope calculation via Siegel’s repeated-median estimator.
https://en.wikipedia.org/wiki/Repeated_median_regression
A simple, elegant method to compute the robust slope between two vectors x and y. For each point i, the median of the pairwise slopes (y[j] - y[i]) / (x[j] - x[i]) over all j != i is computed; the returned slope is the median of those per-point medians.
- Parameters:
x (np.ndarray or sequence) – Independent variable. Coerced to a 1-D numpy array. Must have at least 3 elements (unless nowarn=True).
y (np.ndarray or sequence) – Dependent variable. Must have the same length as x (unless nowarn=True).
nowarn (bool, optional) – If True, skip the minimum-length and equal-length input assertions. Default False.
- Returns:
The repeated-median estimate of the slope. Returns np.nan if all inner medians are undefined (e.g. all x values are equal).
- Return type:
Notes
INVESTIGATE: algorithm speed-ups via these articles: https://link.springer.com/article/10.1007/PL00009190 http://www.sciencedirect.com/science/article/pii/S0020019003003508
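The estimator described above can be sketched in a few lines of plain Python. This is a straightforward O(N^2) illustration, not the library's code: the name repeated_median_slope_sketch is hypothetical, and skipping pairs with equal x values is an assumption about how undefined slopes are handled.

```python
import math
from statistics import median


def repeated_median_slope_sketch(x, y):
    """Median of the per-point medians of pairwise slopes (illustrative only)."""
    n = len(x)
    per_point_medians = []
    for i in range(n):
        # Pairwise slopes from point i to every other point with a different x
        slopes = [
            (y[j] - y[i]) / (x[j] - x[i])
            for j in range(n)
            if j != i and x[j] != x[i]
        ]
        if slopes:
            per_point_medians.append(median(slopes))
    # No defined inner median at all, e.g. every x value is identical
    if not per_point_medians:
        return math.nan
    return median(per_point_medians)


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 3.0, 5.0, 100.0, 9.0, 11.0, 13.0]  # y = 2x + 1 with one gross outlier
slope = repeated_median_slope_sketch(x, y)
print(slope)
```

Because each inner median ignores the single contaminated pair, the outlier at x = 3 does not move the estimate away from the underlying slope of 2.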
- process_improve.regression.methods.robust_regression(x, y, fit_intercept=True, na_rm=True, conflevel=0.95, nowarn=False, pi_resolution=50)
Perform a simple robust regression between the x and y variables.
The slope is calculated with the repeated_median_slope method. The intercept is the median of the intercepts implied by that slope and the provided x and y values, or is forced to zero if fit_intercept=False.
- Parameters:
x, y – Sequences of numerical values.
fit_intercept – If True, fits an intercept term. If False, forces the regression through the origin.
na_rm – If True, removes all observations with one or more missing values.
conflevel – Confidence level for confidence intervals, default is 0.95.
nowarn – If True, suppresses warnings. Users should ensure data validity beforehand.
pi_resolution – The resolution of prediction intervals, default is 50.
Returns a dictionary of outputs with these keys:
N: the number of observations used to fit the model
coefficients: a length-1 list containing the regression slope
intercept: returned if fit_intercept==True, otherwise 0
standard_errors: a length-1 list containing the standard error of the slope
standard_error_intercept: standard error for the intercept (np.nan if fit_intercept=False)
R2: the R^2 value
SE: the model's standard error
x_ssq: the sum of squares of (x - mean(x))
k: the number of model parameters (2 if fit_intercept else 1)
fitted_values: the N predicted values, one per row in y
residuals: the N residuals
t_value: the t-values for the standard errors
conf_intervals: K rows x 2 columns (lower, upper) confidence intervals
conf_interval_intercept: (lower, upper) confidence interval for the intercept
pi_range: prediction intervals above and below, over the range of data
leverage: the hat-matrix diagonal (leverage) for each observation
influence: Cook-style influence values for each observation
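The slope-then-intercept procedure can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: the helper name robust_line is hypothetical, the intercept is taken as the median of y[i] - slope * x[i], and the slope is left unchanged when fit_intercept=False, which may differ from what robust_regression does.

```python
from statistics import median


def robust_line(x, y, fit_intercept=True):
    """Repeated-median slope with a median-based intercept (illustrative only)."""
    n = len(x)
    per_point_medians = []
    for i in range(n):
        slopes = [
            (y[j] - y[i]) / (x[j] - x[i])
            for j in range(n)
            if j != i and x[j] != x[i]
        ]
        if slopes:
            per_point_medians.append(median(slopes))
    if not per_point_medians:
        # Slope undefined, e.g. all x values are equal
        return float("nan"), float("nan")
    slope = median(per_point_medians)
    if fit_intercept:
        # Each observation implies one intercept for the given slope
        intercept = median([yi - slope * xi for xi, yi in zip(x, y)])
    else:
        intercept = 0.0
    return intercept, slope


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 3.0, 5.0, 100.0, 9.0, 11.0, 13.0]  # y = 2x + 1 with one gross outlier
intercept, slope = robust_line(x, y)
print(intercept, slope)
```

Taking medians at both stages keeps the fitted line on the bulk of the data; a least-squares fit to the same points would be pulled strongly toward the outlier.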
- process_improve.regression.methods.t_value(p, v)
Return the x-axis value at which the cumulative t-distribution with v degrees of freedom reaches a fractional area of p (p is therefore a fractional value between 0 and 1 on the y-axis).
Examples
The cumulative distribution is symmetric about 0.0 for any number of degrees of freedom, so a fractional area of 50% is always at zero:
>>> t_value(0.5, v)
0.0
Zero fractional area under the curve is always at \(-\infty\):
>>> t_value(0.0, v)
-Inf
100% fractional area is always at \(+\infty\):
>>> t_value(1.0, v)
+Inf
See also
t_value_cdf: does the inverse of this function.
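To show what "the x-axis value for a fractional area p" means numerically, the sketch below inverts the cumulative t-distribution using only the standard library: the density is integrated with Simpson's rule and the quantile is found by bisection. This is not how t_value is implemented (that is not stated here); the names t_pdf, t_cdf and t_quantile are hypothetical, and the approach is only intended for moderate p and v.

```python
import math


def t_pdf(t, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)


def t_cdf(t, v, steps=2000):
    """Cumulative area up to t: Simpson's rule on [0, |t|] plus symmetry."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def t_quantile(p, v):
    """x-axis value at which the cumulative t-distribution equals p."""
    if p <= 0:
        return -math.inf
    if p >= 1:
        return math.inf
    lo, hi = -1.0, 1.0
    # Widen the bracket until it contains the requested area
    while t_cdf(lo, v) > p:
        lo *= 2
    while t_cdf(hi, v) < p:
        hi *= 2
    for _ in range(60):  # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, v) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


q975 = t_quantile(0.975, 10)
print(round(q975, 3))
```

With p = 0.975 and 10 degrees of freedom this reproduces the familiar critical value of about 2.228 used for two-sided 95% confidence intervals.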