Univariate Statistics

process_improve.univariate.metrics.t_value(p, v)

Return the value on the x-axis if you plot the cumulative t-distribution with a fractional area of p (p is therefore a fractional value between 0 and 1 on the y-axis) and v is the degrees of freedom.

Examples

The t-distribution is symmetric about 0.0 for any number of degrees of freedom, so half of the area lies below zero:

>>> t_value(0.5, v)
0.0

Zero fractional area under the curve is always at \(-\infty\):

>>> t_value(0.0, v)
-inf

100% fractional area is always at \(+\infty\):

>>> t_value(1.0, v)
inf

See also

t_value_cdf

does the inverse of this function.

Parameters:
Return type:

float

process_improve.univariate.metrics.t_value_cdf(z, v)

Return the fractional area under the cumulative t-distribution (y-axis value) at the t-value z on the x-axis, with v degrees of freedom.

Examples

The cumulative distribution is symmetric through the x-axis at 0.0 for any number of degrees of freedom, so half of the area lies below zero:

>>> t_value_cdf(0.0, v)
0.5

Zero fractional area under the curve is at \(-\infty\):

>>> t_value_cdf(-np.inf, v)
0.0

100% fractional area is at \(+\infty\):

>>> t_value_cdf(np.inf, v)
1.0

See also

t_value

does the inverse of this function.

Parameters:
Return type:

float

process_improve.univariate.metrics.test_normality(x)

Check the p-value of the hypothesis that the data are from a normal distribution.

If the p-value is less than the chosen alpha level (e.g. 0.05 or 0.025), then there is evidence that the data tested are NOT normally distributed.

On the other hand, if the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected. NOTE: this does not mean that the data are normally distributed, just that we have nothing better to say about it. See the Shapiro-Wilk test.

Implementation: uses the Shapiro-Wilk test, taken directly from scipy.stats.shapiro.

Parameters:

x (ndarray | Series)

Return type:

float

process_improve.univariate.metrics.Sn(x, constant=1.1926)

Compute a robust scale estimator. The Sn metric is an efficient alternative to MAD.

Parameters:
  • x (np.ndarray or pd.Series) – A vector of values. NaN entries are ignored.

  • constant (float, optional) – Multiplicative constant that makes the estimator consistent with iid values from a Gaussian distribution with no outliers. Default is 1.1926.

Returns:

A scalar value, the Sn estimate of spread. Returns NaN if all entries are missing, and 0.0 when only a single non-missing value is supplied.

Return type:

np.floating

Notes

Tested against one of the most reliable open-source packages, written by some of the most respected names in the area of robust methods: [1] and [2].

Disadvantages of MAD:

  • It does not have particularly high efficiency for data that is in fact normal (37%). In comparison, the median has 64% efficiency for normal data.

  • The MAD statistic also has an implicit assumption of symmetry. That is, it measures the distance from a measure of central location (the median).
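To make the idea behind Sn concrete, here is a naive O(n²) sketch that uses only the standard library. The function name sn_naive is hypothetical, and the sketch uses plain medians in both the inner and outer step; the published Sn estimator uses low and high medians and a finite-sample correction, so results for small or even-sized samples may differ slightly from the library's Sn.

```python
from statistics import median


def sn_naive(values, constant=1.1926):
    """Naive Sn: constant * median over i of (median over j of |x_i - x_j|)."""
    data = [v for v in values if v == v]  # drop NaN entries (NaN != NaN)
    if not data:
        return float("nan")
    # For every point, the typical distance to all points in the sample
    inner = [median(abs(xi - xj) for xj in data) for xi in data]
    return constant * median(inner)


print(sn_naive([1, 2, 3, 4, 100]))  # 2.3852, barely affected by the outlier
```

Because every point is compared with every other point, no central location is needed, which is how Sn avoids the symmetry assumption listed above for MAD.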

References

process_improve.univariate.metrics.ttest_independent(sample_A, sample_B, conflevel=0.995)

Core calculation for a test of differences between the average of A and the average of B. No checking of inputs.

Parameters:
  • sample_A (iterable) – Vector of n_a measurements.

  • sample_B (iterable) – Vector of n_b measurements.

  • conflevel (float) – Value between 0 and 1 (closer to 1.0), that gives the level of confidence required for the 2-sided test.

Returns:

Outcomes from the statistical test.

Return type:

dict

process_improve.univariate.metrics.ttest_independent_from_df(df, grouper_column, values_column, conflevel=0.995)

Calculate the t-test for differences between two or more groups and return a confidence interval for the difference. The test is for UNPAIRED differences.

The dataframe df contains a grouper_column with 2 or more unique values (e.g. ‘A’ and ‘B’). All unique values of the grouper_column are used, and t-tests are done between the values in the values_column.

Args:
  • df (pd.DataFrame) – Dataframe of the values and grouping variable.

  • grouper_column (str) – Indicates which column will be grouped on.

  • values_column (str) – Which column contains the numeric values to calculate the test on.

  • conflevel (float, optional) – Value between 0 and 1 (closer to 1.0) that gives the level of confidence required for the 2-sided test. Defaults to 0.995.

Output: Dataframe with columns containing the statistical outputs of the t-test, including:
  1. Group “A” name

  2. Group “B” name

  3. Group “A” mean

  4. Group “B” mean

  5. z-value for the difference between group “B” minus group “A”

  6. p-value for this z-value

  7. Confidence interval low value for difference between group “B” minus group “A”

  8. Confidence interval high value for difference between group “B” minus group “A”

Example: df has 3 levels in the grouper variable: [‘Marco’, ‘Pete’, ‘Sam’].

The output will have 3 rows:

Group A name    Group B name
Marco           Pete
Marco           Sam
Pete            Sam
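The three rows in this example are the unordered pairs of the unique grouper values. A short sketch, assuming the pairs are enumerated in this order (the variable names are illustrative only):

```python
from itertools import combinations

groups = ["Marco", "Pete", "Sam"]

# Every unordered pair of groups is compared exactly once
pairs = list(combinations(groups, 2))
for group_a, group_b in pairs:
    print(group_a, group_b)
```

With k unique groups this gives k * (k - 1) / 2 comparisons, so the number of output rows grows quickly with the number of groups.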

Parameters:
Return type:

DataFrame

process_improve.univariate.metrics.ttest_paired(differences, conflevel=0.995)

Core calculation for a test of paired differences.

Parameters:
  • differences (pd.Series) – The paired differences (e.g. sample_A - sample_B) for which the test is run.

  • conflevel (float, optional) – Value between 0 and 1 (closer to 1.0), that gives the level of confidence required for the 2-sided test. Default is 0.995.

Returns:

Outcomes from the statistical test, with keys:

  • "Differences mean": the mean of the input differences.

  • "z value": the test statistic (mean divided by its standard error).

  • "ConfInt: Lo", "ConfInt: Hi": lower and upper bounds of the two-sided confidence interval around the mean difference at the requested conflevel.

  • "p value": two-sided p-value from the t-distribution.

  • "Degrees of freedom": n - 1 for n paired observations.

  • "Standard deviation": the standard error of the mean difference, i.e. sample_std / sqrt(n) (the scale used to build the confidence interval), NOT the sample standard deviation of differences.

Return type:

dict
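The following standard-library sketch shows how the non-distributional entries of the returned dictionary relate to each other. The function name paired_summary is hypothetical; the p-value and confidence interval are left out because they need the t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(differences):
    """Mean, standard error, test statistic and degrees of freedom."""
    n = len(differences)
    diff_mean = mean(differences)
    # Standard error of the mean difference: sample_std / sqrt(n)
    std_error = stdev(differences) / sqrt(n)
    return {
        "Differences mean": diff_mean,
        "z value": diff_mean / std_error,
        "Degrees of freedom": n - 1,
        "Standard deviation": std_error,
    }


print(paired_summary([1.0, 2.0, 3.0]))
```

Note that the "Standard deviation" entry here follows the description above: it holds the standard error, not the sample standard deviation of the differences.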

process_improve.univariate.metrics.ttest_paired_from_df(df, grouper_column, values_column, conflevel=0.995)

Calculate the t-test for paired differences between two or more groups and return a confidence interval for the difference. The test is for PAIRED differences. The difference is always defined as the A values minus the B values: after - before, or A - B.

The dataframe df contains a grouper_column with 2 or more unique values (e.g. ‘A’ and ‘B’). All unique values of the grouper_column are used, and t-tests are done between the values in the values_column.

When selecting the columns, the number of values per column must be the same.

Args:
  • df (pd.DataFrame) – Dataframe of the values and grouping variable.

  • grouper_column (str) – Indicates which column will be grouped on.

  • values_column (str) – Which column contains the numeric values to calculate the test on.

  • conflevel (float, optional) – Value between 0 and 1 (closer to 1.0) that gives the level of confidence required for the 2-sided test. Defaults to 0.995.

Output: Dataframe with columns containing the statistical outputs of the t-test, including:

  1. Group A name

  2. Group B name

  3. Group A mean

  4. Group B mean

  5. Differences mean: average difference between the groups (not the same as the difference of the averages from items 3 and 4 above)

  6. z-value for the difference between group “B” minus group “A”

  7. p-value for this z-value

  8. Confidence interval low value for difference between group “B” minus group “A”

  9. Confidence interval high value for difference between group “B” minus group “A”

Parameters:
Return type:

DataFrame

process_improve.univariate.metrics.confidence_interval(df, column_name, conflevel=0.95, style='robust')

Calculate the confidence interval, returned as a tuple, for the column_name (str) in the dataframe df, for a given confidence level conflevel (default: 0.95).

style: [‘robust’; ‘regular’]: indicates which style of estimates to use for the center and spread. Default: ‘robust’

Missing values are ignored.

Parameters:
Return type:

tuple

process_improve.univariate.metrics.median_absolute_deviation(x, axis=0, center=<function median>, scale='normal', nan_policy='omit')

Taken from scipy.stats.stats: we want the same functionality, but with a slightly different default function signature.

  • scale='normal' instead of scale=1.0.

  • nan_policy='omit' instead of nan_policy='propagate'.

Compute the median absolute deviation of the data along the given axis. The median absolute deviation (MAD) computes the median over the absolute deviations from the median. It is a measure of dispersion similar to the standard deviation but more robust to outliers. The MAD of an empty array is np.nan.

Parameters:
  • x (array_like) – Input array or object that can be converted to an array.

  • axis (int or None, optional) – Axis along which the range is computed. Default is 0. Axis == None will not be accepted anymore.

  • center (callable, optional) – A function that will return the central value. The default is to use np.median. Any user defined function used will need to have the function signature func(arr, axis).

  • scale (scalar or str, optional) – The numerical value of scale will be divided out of the final result. The default is "normal", which sets scale to the standard normal quantile function (inverse CDF) evaluated at 0.75, approximately 0.67449, so that the returned value is consistent with the standard deviation for normally distributed data. Any positive float may also be passed. Array-like scale is also allowed, as long as it broadcasts correctly to the output such that out / scale is a valid operation. The output dimensions depend on the input array, x, and the axis argument.

  • nan_policy ({'propagate', 'raise', 'omit'}, optional) – Defines how to handle input that contains nan. The following options are available (default is 'omit'):

      • 'propagate': returns nan

      • 'raise': throws an error

      • 'omit': performs the calculations ignoring nan values

Returns:

mad – If the input contains integers or floats of smaller precision than np.float64, then the output data-type is np.float64. Otherwise, the output data-type is the same as that of the input.

Return type:

scalar or ndarray

See also

numpy.std, numpy.var, numpy.median, scipy.stats.iqr, scipy.stats.tmean, scipy.stats.tstd, scipy.stats.tvar

Notes

The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=np.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.

The input array may contain inf, but if center returns inf, the corresponding MAD for that data will be nan.
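As a rough one-dimensional illustration of the defaults described above (scale="normal" and nan_policy="omit"), the following sketch uses only the standard library. The function name mad_sketch is hypothetical, and the sketch ignores the axis and center arguments entirely.

```python
from statistics import NormalDist, median


def mad_sketch(values, scale="normal"):
    """Median absolute deviation of a 1-D sample, ignoring NaN entries."""
    data = [v for v in values if v == v]  # mimic nan_policy="omit"
    if not data:
        return float("nan")
    center = median(data)
    raw = median(abs(v - center) for v in data)
    # "normal" divides by the standard normal quantile at 0.75 (about 0.67449)
    divisor = NormalDist().inv_cdf(0.75) if scale == "normal" else float(scale)
    return raw / divisor


sample = [1, 2, 3, 4, 100]
print(mad_sketch(sample, scale=1.0))  # unscaled MAD: 1.0
print(mad_sketch(sample))             # scaled to be comparable with a standard deviation
```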

References

Examples

When comparing the behavior of median_abs_deviation with np.std, the latter is affected when we change a single value of an array to have an outlier value while the MAD hardly changes:

>>> from scipy import stats
>>> x = stats.norm.rvs(size=100, scale=1, random_state=123456)
>>> x.std()
0.9973906394005013
>>> stats.median_abs_deviation(x)
0.82832610097857
>>> x[0] = 345.6
>>> x.std()
34.42304872314415
>>> stats.median_abs_deviation(x)
0.8323442311590675

Axis handling example:

>>> import numpy as np
>>> x = np.array([[10, 7, 4], [3, 2, 1]])
>>> x
array([[10,  7,  4],
       [ 3,  2,  1]])
>>> stats.median_abs_deviation(x)
array([3.5, 2.5, 1.5])
>>> stats.median_abs_deviation(x, axis=None)  # the `axis=None` syntax will be deprecated
2.0

Scale normal example:

>>> x = stats.norm.rvs(size=1000000, scale=2, random_state=123456)
>>> stats.median_abs_deviation(x)
1.3487398527041636
>>> stats.median_abs_deviation(x, scale='normal')
1.9996446978061115

process_improve.univariate.metrics.biweight_midvariance(x, nan_policy='omit')

Return the Mosteller-Tukey robust scale (biweight midvariance) of x.

The biweight midvariance is a robust, highly efficient estimator of the variance: it down-weights observations far from the median and ignores gross outliers entirely.

Parameters:
  • x (np.ndarray or pd.Series) – One-dimensional sample of numeric values.

  • nan_policy ({"omit", "propagate"}, optional) – "omit" (default) drops missing values; "propagate" returns nan if any value is missing.

Returns:

The robust variance estimate. Returns 0.0 when the MAD is zero (e.g. a constant sample), and nan for an empty sample.

Return type:

float

References

Mosteller and Tukey, Data Analysis and Regression, pp. 207-208, 1977.
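For orientation, the sketch below implements the commonly quoted textbook form of the biweight midvariance, with deviations scaled by c = 9 times the MAD. Both the function name biweight_midvariance_sketch and the tuning constant c are assumptions made for this illustration; the library's implementation may use different details.

```python
from statistics import median


def biweight_midvariance_sketch(values, c=9.0):
    """Textbook biweight midvariance of a 1-D sample, ignoring NaN entries."""
    data = [v for v in values if v == v]  # drop NaN entries
    if not data:
        return float("nan")
    center = median(data)
    mad = median(abs(v - center) for v in data)
    if mad == 0:
        return 0.0
    numerator = denominator = 0.0
    for v in data:
        u = (v - center) / (c * mad)
        if abs(u) < 1:  # points with |u| >= 1 get zero weight
            numerator += (v - center) ** 2 * (1 - u * u) ** 4
            denominator += (1 - u * u) * (1 - 5 * u * u)
    return len(data) * numerator / denominator ** 2


# The gross outlier (1000) receives zero weight, so the estimate stays small
print(biweight_midvariance_sketch([1, 2, 3, 4, 5, 1000]))
```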

process_improve.univariate.metrics.holm_bonferroni(p_values, alpha=0.05)

Holm-Bonferroni step-down correction for multiple comparisons.

Holm’s method controls the family-wise error rate while being uniformly more powerful than the plain Bonferroni correction. It is the recommended post-hoc correction for a family of pairwise comparisons.

Parameters:
  • p_values (array-like) – The raw (uncorrected) p-values of the individual comparisons.

  • alpha (float, optional) – Family-wise significance level, by default 0.05.

Returns:

A bunch with, in the same order as the input:

  • p_adjusted: the Holm-adjusted p-values.

  • reject: boolean array, True where the null hypothesis is rejected at level alpha.

  • alpha: the family-wise level used.

Return type:

sklearn.utils.Bunch

References

Holm, “A simple sequentially rejective multiple test procedure”, Scandinavian Journal of Statistics, 6, 65-70, 1979.
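The step-down procedure can be sketched in a few lines of plain Python. The function name holm_bonferroni_sketch is hypothetical and it returns a plain tuple rather than a Bunch; it is meant only to show how the adjusted p-values are formed.

```python
def holm_bonferroni_sketch(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


adjusted, reject = holm_bonferroni_sketch([0.01, 0.04, 0.03])
print(adjusted, reject)  # only the first comparison is rejected at alpha = 0.05
```

Plain Bonferroni would multiply every p-value by m; Holm only does so for the smallest one, which is why it rejects at least as often.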

process_improve.univariate.metrics.summary_stats(x, method='robust')

Return summary statistics of the numeric values in vector x.

Parameters:
  • x (numpy.ndarray or pandas.Series) – A vector of univariate values to summarize.

  • method (str, optional) – If "robust" (the default), the reported center is the median and the spread is the Sn robust estimate; otherwise the mean and the sample standard deviation are used.

Returns:

A summary of the univariate vector. The most useful keys are "center" (a measure of the center, e.g. the median for the robust method) and "spread" (a measure of the spread, e.g. the Sn robust estimate for the robust method).

Return type:

dict

process_improve.univariate.metrics.detect_outliers_esd(x, algorithm='esd', max_outliers_detected=1, **kwargs)

Return a list of indexes of points in the vector x which are likely outliers.

A second output (can be ignored) contains the details of the values used to make the decision.

Arguments:

x {list, sequence, NumPy vector/array} – A sequence, list or vector which can be unravelled.

Keyword Arguments:

algorithm – Two algorithms are possible to detect outliers: (default: “esd”)

‘esd’: Generalized ESD Test for Outliers. If max_outliers_detected=1 this is essentially Grubbs’ test. For more details, please see: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm

‘cc-robust’: Build a robust control chart for the sequence x; points that lie outside the +/- 3 sigma limits are considered outliers. Not implemented yet: left here as an idea for the future, but not confirmed.

max_outliers_detected – The maximum number of outliers that should be detected, as required by the algorithms.

kwargs – Algorithm dependent arguments. Defaults are shown here.

‘esd’:

‘robust_variant’ = True. Uses the median and MAD for the center and the standard deviation respectively.

‘alpha’ = 0.05. The significance level of the testing.

Parameters:
Return type:

tuple[list[int], defaultdict[Any, Any]]
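The sketch below computes only the first ESD test statistic, R = max |x_i - center| / spread, to show what the 'robust_variant' option changes. The function name esd_first_statistic is hypothetical, and scaling the MAD by 1.4826 is an assumption of this sketch; the comparison against a critical value from the t-distribution is omitted.

```python
from statistics import mean, median, stdev


def esd_first_statistic(values, robust_variant=False):
    """Return (index, R) for the single most extreme point in values."""
    if robust_variant:
        center = median(values)
        # Assumption: MAD scaled by 1.4826 to be comparable with a std dev
        spread = 1.4826 * median(abs(v - center) for v in values)
    else:
        center, spread = mean(values), stdev(values)
    deviations = [abs(v - center) for v in values]
    index = max(range(len(values)), key=deviations.__getitem__)
    return index, deviations[index] / spread


sample = [1, 2, 3, 4, 100]
print(esd_first_statistic(sample))                       # classical statistic
print(esd_first_statistic(sample, robust_variant=True))  # much larger statistic
```

The classical statistic stays small because the outlier inflates both the mean and the standard deviation; the robust version is not masked in this way.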

process_improve.univariate.metrics.tietjen_moore_test(x, n_outliers, *, two_sided=True, alpha=0.05, n_simulations=10000, random_state=None)

Tietjen-Moore test for a specified number of outliers.

Tests the null hypothesis “there are no outliers” against the alternative “the n_outliers most extreme observations are outliers”. Unlike the generalised ESD test, the number of suspected outliers must be fixed in advance. The test statistic has no closed-form critical value, so it is obtained by simulation under the normal null.

Parameters:
  • x (np.ndarray or pd.Series) – One-dimensional sample. Missing values are dropped.

  • n_outliers (int) – The number of suspected outliers to test for (1 <= n_outliers < N).

  • two_sided (bool, optional) – If True (default) the test looks for outliers on either tail (the observations with the largest absolute deviation from the mean). If False it tests only the n_outliers largest observations.

  • alpha (float, optional) – Significance level, by default 0.05.

  • n_simulations (int, optional) – Number of Monte-Carlo samples used to estimate the critical value.

  • random_state (int or None, optional) – Seed for the simulation, for reproducibility.

Returns:

A bunch with statistic, critical_value, reject (True when the outliers are significant), outlier_indices (positions in the missing-value-removed sample), n_outliers and alpha.

Return type:

sklearn.utils.Bunch

References

Tietjen and Moore, “Some Grubbs-type statistics for the detection of several outliers”, Technometrics, 14, 583-597, 1972. See also the NIST handbook, section 3.5.h.3.
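A compact sketch of the two-sided variant, using only the standard library, is given below. The function names and the dictionary return value are hypothetical, and the way the critical value is read off the simulated statistics (the alpha quantile of the sorted values) is an assumption of this sketch rather than a description of the library's code.

```python
import random
from statistics import fmean


def tietjen_moore_statistic(values, n_outliers):
    """Two-sided statistic: reduced sum of squares / full sum of squares."""
    center = fmean(values)
    # Order by absolute residual so the most extreme points come last
    ordered = sorted(values, key=lambda v: abs(v - center))
    kept = ordered[: len(ordered) - n_outliers]
    kept_mean = fmean(kept)
    numerator = sum((v - kept_mean) ** 2 for v in kept)
    denominator = sum((v - center) ** 2 for v in values)
    return numerator / denominator


def tietjen_moore_sketch(values, n_outliers, alpha=0.05, n_simulations=2000, seed=0):
    """Compare the statistic with a Monte-Carlo critical value."""
    rng = random.Random(seed)
    n = len(values)
    simulated = sorted(
        tietjen_moore_statistic([rng.gauss(0, 1) for _ in range(n)], n_outliers)
        for _ in range(n_simulations)
    )
    critical = simulated[int(alpha * n_simulations)]  # lower-tail quantile
    statistic = tietjen_moore_statistic(values, n_outliers)
    # Small values indicate outliers: removing them shrinks the sum of squares
    return {"statistic": statistic, "critical_value": critical,
            "reject": statistic < critical}


data = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.15, -0.15, 0.05, 10.0, -10.0]
print(tietjen_moore_sketch(data, n_outliers=2))
```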

process_improve.univariate.metrics.distribution_fit(x, distribution='norm', alpha=0.05)

Check how well a sample fits a named distribution.

Fits the parameters of the requested scipy.stats distribution by maximum likelihood and runs a Kolmogorov-Smirnov goodness-of-fit test (NIST handbook, section 3.5.7).

Parameters:
  • x (np.ndarray or pd.Series) – One-dimensional sample. Missing values are dropped.

  • distribution (str, optional) – Name of any continuous scipy.stats distribution, by default "norm".

  • alpha (float, optional) – Significance level for the fits_well verdict, by default 0.05.

Returns:

A bunch with distribution, fitted parameters, ks_statistic, ks_pvalue, fits_well (True when the fit is not rejected at level alpha) and the sample size n.

Return type:

sklearn.utils.Bunch

Notes

Because the distribution parameters are estimated from the same data, the KS p-value is conservative (the true Type-I error is smaller than alpha); it remains a useful screening check.
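For the default "norm" case, the Kolmogorov-Smirnov statistic can be sketched with the standard library alone. The function name ks_statistic_normal is hypothetical; the sketch fits the mean and the maximum-likelihood standard deviation from the data and stops at the statistic, since the p-value needs the Kolmogorov distribution.

```python
from statistics import NormalDist, fmean, pstdev


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    data = sorted(v for v in values if v == v)  # drop NaN entries, then sort
    n = len(data)
    # Maximum-likelihood fit of a normal: sample mean and population std dev
    fitted = NormalDist(mu=fmean(data), sigma=pstdev(data))
    # Check the gap just after and just before each step of the empirical CDF
    d_plus = max((i + 1) / n - fitted.cdf(x) for i, x in enumerate(data))
    d_minus = max(fitted.cdf(x) - i / n for i, x in enumerate(data))
    return max(d_plus, d_minus)


print(ks_statistic_normal([-1.5, -0.5, 0.0, 0.5, 1.5]))  # about 0.133
```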

process_improve.univariate.metrics.variance_decomposition(df, measured, repeat)

Given a DataFrame df of raw data, and an indication of which column is the measured value column, and which is the repeat indicator, it will calculate the within and between replicate standard deviation.

Example

Two measurements on day 1 [101, 102] and two measurements on day 2 [94, 95]. The between-day variation can already be expected to be much greater than the within-day variation.

>>> import pandas as pd
>>> df = pd.DataFrame(data={'Result': [101, 102, 94, 95], 'Repeat': [1, 1, 2, 2]})
>>> df
   Result  Repeat
0     101       1
1     102       1
2      94       2
3      95       2
>>> output = variance_decomposition(df, measured="Result", repeat="Repeat")
>>> output
{'total_ms':       16.666667,
 'total_dof':      3,
 'within_ms':      0.5,
 'within_stddev':  0.70711,
 'within_dof':     2,
 'between_ms':     49.0,
 'between_stddev': 7.0,
 'between_dof':    1}

Note
  • SSQ = sum of squares

  • DOF = degrees of freedom

  • MS = mean square = (sum of squares) / (degrees of freedom) = SSQ / DOF = variance

Parameters:
Return type:

dict
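The mean squares in the example can be reproduced by hand with a short standard-library sketch. The function name mean_squares is hypothetical and the sketch takes plain lists instead of a DataFrame; it covers only the sums of squares, degrees of freedom and mean squares.

```python
from collections import defaultdict
from statistics import fmean


def mean_squares(results, repeats):
    """Total, within-repeat and between-repeat mean squares (MS = SSQ / DOF)."""
    groups = defaultdict(list)
    for value, label in zip(results, repeats):
        groups[label].append(value)
    grand_mean = fmean(results)
    n_total, n_groups = len(results), len(groups)
    # Within: deviations of each value from its own repeat's mean
    within_ssq = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    # Between: deviations of each repeat's mean from the grand mean
    between_ssq = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    total_ssq = sum((v - grand_mean) ** 2 for v in results)
    return {
        "total_ms": total_ssq / (n_total - 1),
        "total_dof": n_total - 1,
        "within_ms": within_ssq / (n_total - n_groups),
        "within_dof": n_total - n_groups,
        "between_ms": between_ssq / (n_groups - 1),
        "between_dof": n_groups - 1,
    }


print(mean_squares([101, 102, 94, 95], [1, 1, 2, 2]))
```

In the example output above, the reported within_stddev and between_stddev are the square roots of the corresponding mean squares (0.70711 and 7.0).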