Univariate Statistics

process_improve.univariate.metrics.t_value(p, v)

Return the value on the x-axis of the cumulative t-distribution at which the fractional area under the curve equals p (p is therefore a fractional value between 0 and 1 on the y-axis), with v degrees of freedom.

Examples

The cumulative distribution is symmetric about zero for any number of degrees of freedom, so the t-value at p = 0.5 is always zero (8 degrees of freedom used here as an example):

>>> t_value(0.5, 8)
0.0

Zero fractional area under the curve is always at \(-\infty\):

>>> t_value(0.0, 8)
-inf

100% fractional area is always at \(+\infty\):

>>> t_value(1.0, 8)
inf
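
These values are consistent with the inverse CDF (percent-point function) of scipy's t-distribution, which can serve as an independent cross-check (a sketch for illustration, not necessarily this library's implementation):

>>> from scipy import stats
>>> x = stats.t.ppf(0.95, 8)   # ~1.8595, the one-sided 95% critical t-value at 8 DOF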

See also

t_value_cdf

does the inverse of this function.

Parameters:
  • p (float) – Fractional area under the cumulative t-distribution, between 0 and 1.

  • v (int) – Degrees of freedom.

Return type:

float

process_improve.univariate.metrics.t_value_cdf(z, v)

Return the fractional area under the cumulative t-distribution (y-axis value) at the t-value z on the x-axis, with v degrees of freedom.

Examples

The cumulative distribution is symmetric about zero for any number of degrees of freedom, so half of the area lies below t = 0 (8 degrees of freedom used here as an example):

>>> t_value_cdf(0.0, 8)
0.5

Zero fractional area under the curve is at \(-\infty\):

>>> import numpy as np
>>> t_value_cdf(-np.inf, 8)
0.0

100% fractional area is at \(+\infty\):

>>> t_value_cdf(np.inf, 8)
1.0
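
These values are consistent with scipy's t-distribution CDF, usable as a cross-check (a sketch for illustration, not necessarily this library's implementation):

>>> from scipy import stats
>>> area = stats.t.cdf(1.8595, 8)   # ~0.95, the inverse of the t_value example above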

See also

t_value

does the inverse of this function.

Parameters:
  • z (float) – The t-value on the x-axis.

  • v (int) – Degrees of freedom.

Return type:

float

process_improve.univariate.metrics.test_normality(x)

Check the p-value of the hypothesis that the data are from a normal distribution.

If the p-value is less than the chosen alpha level (e.g. 0.05 or 0.025), then there is evidence that the data tested are NOT normally distributed.

On the other hand, if the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected. NOTE: it does not mean that the data are normally distributed, just that we have nothing better to say about them. See the Shapiro-Wilk test.

Implementation: Uses the Shapiro-Wilk test, taken directly from scipy.stats.shapiro.
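
Example

A minimal usage sketch, assuming the returned float is the Shapiro-Wilk p-value (assignments only, since the exact value depends on the sample):

>>> import numpy as np
>>> rng = np.random.default_rng(13)
>>> p_value = test_normality(rng.normal(size=500))
>>> # A p_value above the chosen alpha (e.g. 0.05) gives no grounds to reject normality.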

Parameters:

x (ndarray | Series)

Return type:

float

process_improve.univariate.metrics.Sn(x, constant=1.1926)

Compute a robust scale estimator. The Sn metric is an efficient alternative to MAD.

Parameters:
  • x (np.ndarray or pd.Series) – A vector of values. NaN entries are ignored.

  • constant (float, optional) – Multiplicative constant that makes the estimator consistent with iid values from a Gaussian distribution with no outliers. Default is 1.1926.

Returns:

A scalar value, the Sn estimate of spread. Returns NaN if all entries are missing, and 0.0 when only a single non-missing value is supplied.

Return type:

np.floating

Notes

Tested against one of the most reliable open-source packages, written by some of the most respected names in the area of robust methods: [1] and [2].

Disadvantages of MAD:

  • It does not have particularly high efficiency for data that are in fact normal (only 37%). In comparison, the median has 64% efficiency for normal data.

  • The MAD statistic also has an implicit assumption of symmetry. That is, it measures the distance from a measure of central location (the median).
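
For intuition, Sn does not measure distance from a central value at all: up to the constant, it is the median over i of the median over j of the pairwise distances |x_i - x_j|. A naive O(n^2) sketch, assuming the standard Rousseeuw-Croux definition (plain medians are used here in place of the exact low/high medians):

>>> import numpy as np
>>> def sn_naive(x, constant=1.1926):
...     # Hypothetical reference sketch; not necessarily this library's implementation.
...     x = np.asarray(x, dtype=float)
...     x = x[~np.isnan(x)]                         # NaN entries are ignored
...     pairwise = np.abs(x[:, None] - x[None, :])  # all |x_i - x_j| distances
...     return constant * np.median(np.median(pairwise, axis=1))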

References

process_improve.univariate.metrics.ttest_independent(sample_A, sample_B, conflevel=0.995)

Core calculation for a test of differences between the average of A and the average of B. No checking of inputs.

Parameters:
  • sample_A (iterable) – Vector of n_a measurements.

  • sample_B (iterable) – Vector of n_b measurements.

  • conflevel (float) – Value between 0 and 1 (typically close to 1.0) that gives the level of confidence required for the two-sided test.

Returns:

Outcomes from the statistical test.

Return type:

dict
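
Example

A usage sketch with hypothetical data; the returned dictionary is this library's own format, but the underlying unpaired two-sample t-test can be cross-checked against scipy:

>>> from scipy import stats
>>> group_a = [72.1, 74.3, 71.8, 73.0, 75.2]
>>> group_b = [69.9, 70.4, 71.1, 68.7, 70.8]
>>> outcome = ttest_independent(group_a, group_b, conflevel=0.95)
>>> t_stat, p_val = stats.ttest_ind(group_a, group_b)  # scipy's pooled-variance t-test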

process_improve.univariate.metrics.ttest_independent_from_df(df, grouper_column, values_column, conflevel=0.995)

Calculate the t-test for differences between two or more groups and return a confidence interval for the difference. The test is for UNPAIRED differences.

The dataframe df contains a grouper_column with 2 or more unique values (e.g. ‘A’ and ‘B’). All unique values of the grouper_column are used, and t-tests are done between the values in the values_column.

Args:

df (pd.DataFrame): Dataframe of the values and grouping variable.

grouper_column (str): Indicates which column will be grouped on.

values_column (str): Which column contains the numeric values to calculate the test on.

conflevel (float, optional): Level of confidence required for the two-sided test. Defaults to 0.995.

Output: Dataframe with columns containing the statistical outputs of the t-test, including:
  1. Group “A” name

  2. Group “B” name

  3. Group “A” mean

  4. Group “B” mean

  5. z-value for the difference between group “B” minus group “A”

  6. p-value for this z-value

  7. Confidence interval low value for difference between group “B” minus group “A”

  8. Confidence interval high value for difference between group “B” minus group “A”

Example: df has 3 levels in the grouper variable: [‘Marco’, ‘Pete’, ‘Sam’]

Output will have 3 rows:

Group A name    Group B name
Marco           Pete
Marco           Sam
Pete            Sam
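
As a usage sketch (the column names here are hypothetical):

>>> import pandas as pd
>>> df = pd.DataFrame({"operator": ["A"] * 5 + ["B"] * 5,
...                    "yield_pct": [72, 74, 71, 73, 75, 70, 69, 71, 68, 70]})
>>> results = ttest_independent_from_df(df, grouper_column="operator",
...                                     values_column="yield_pct", conflevel=0.95)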

Return type:

DataFrame

process_improve.univariate.metrics.ttest_paired(differences, conflevel=0.995)

Core calculation for a test of paired differences.

Parameters:
  • differences (pd.Series) – The paired differences (e.g. sample_A - sample_B) for which the test is run.

  • conflevel (float, optional) – Value between 0 and 1 (typically close to 1.0) that gives the level of confidence required for the two-sided test. Default is 0.995.

Returns:

Outcomes from the statistical test.

Return type:

dict
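
Example

A minimal usage sketch with hypothetical before/after measurements:

>>> import pandas as pd
>>> before = pd.Series([12.1, 11.8, 12.5, 12.0])
>>> after = pd.Series([11.5, 11.2, 12.1, 11.4])
>>> outcome = ttest_paired(after - before, conflevel=0.95)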

process_improve.univariate.metrics.ttest_paired_from_df(df, grouper_column, values_column, conflevel=0.995)

Calculate the t-test for paired differences between two or more groups and return a confidence interval for the difference. The test is for PAIRED differences. The difference is always defined as the A values minus the B values: after - before, or A - B.

The dataframe df contains a grouper_column with 2 or more unique values (e.g. ‘A’ and ‘B’). All unique values of the grouper_column are used, and t-tests are done between the values in the values_column.

When selecting the columns, the number of values per column must be the same.

Args:

df (pd.DataFrame): Dataframe of the values and grouping variable.

grouper_column (str): Indicates which column will be grouped on.

values_column (str): Which column contains the numeric values to calculate the test on.

conflevel (float, optional): Level of confidence required for the two-sided test. Defaults to 0.995.

Output: Dataframe with columns containing the statistical outputs of the t-test, including:

  1. Group A name

  2. Group B name

  3. Group A mean

  4. Group B mean

  5. Differences mean: average difference between the groups (not the same as the difference of the averages from items 3 and 4 above)

  6. z-value for the difference between group “B” minus group “A”

  7. p-value for this z-value

  8. Confidence interval low value for difference between group “B” minus group “A”

  9. Confidence interval high value for difference between group “B” minus group “A”
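
As a usage sketch (the column names here are hypothetical; note that each group must contain the same number of values):

>>> import pandas as pd
>>> df = pd.DataFrame({"phase": ["after"] * 4 + ["before"] * 4,
...                    "value": [11.5, 11.2, 12.1, 11.4, 12.1, 11.8, 12.5, 12.0]})
>>> results = ttest_paired_from_df(df, grouper_column="phase", values_column="value")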

Return type:

DataFrame

process_improve.univariate.metrics.confidence_interval(df, column_name, conflevel=0.95, style='robust')

Calculate the confidence interval, returned as a tuple, for the column_name (str) in the dataframe df, for a given confidence level conflevel (default: 0.95).

style: [‘robust’, ‘regular’]: indicates which style of estimates to use for the center and spread. Default: ‘robust’.

Missing values are ignored.
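
A minimal usage sketch (the column name is hypothetical; the tuple is assumed to hold the low and high ends of the interval):

>>> import pandas as pd
>>> df = pd.DataFrame({"purity": [94.2, 95.1, 93.8, 94.7, 95.0]})
>>> low, high = confidence_interval(df, "purity", conflevel=0.95, style="robust")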

Return type:

tuple

process_improve.univariate.metrics.median_absolute_deviation(x, axis=0, center=<function median>, scale='normal', nan_policy='omit')

Taken from scipy.stats.stats: we want the same functionality, but with a slightly different default function signature.

  • scale=’normal’ instead of scale=1.0.

  • nan_policy=’omit’ instead of nan_policy=’propagate’

Compute the median absolute deviation of the data along the given axis. The median absolute deviation (MAD) computes the median over the absolute deviations from the median. It is a measure of dispersion similar to the standard deviation but more robust to outliers. The MAD of an empty array is np.nan.

Parameters:
  • x (array_like) – Input array or object that can be converted to an array.

  • axis (int or None, optional) – Axis along which the range is computed. Default is 0. axis=None is deprecated and will no longer be accepted.

  • center (callable, optional) – A function that will return the central value. The default is to use np.median. Any user defined function used will need to have the function signature func(arr, axis).

  • scale (scalar or str, optional) – The numerical value of scale will be divided out of the final result. The default in this variant is “normal” (scipy uses 1.0), which results in scale being the inverse of the standard normal quantile function at 0.75, approximately 0.67449. Array-like scale is also allowed, as long as it broadcasts correctly to the output such that out / scale is a valid operation. The output dimensions depend on the input array, x, and the axis argument.

  • nan_policy ({'propagate', 'raise', 'omit'}, optional) – Defines how to handle when input contains nan. The following options are available (default in this variant is ‘omit’): ‘propagate’ returns nan; ‘raise’ throws an error; ‘omit’ performs the calculations ignoring nan values.

Returns:

mad – If the input contains integers or floats of smaller precision than np.float64, then the output data-type is np.float64. Otherwise, the output data-type is the same as that of the input.

Return type:

scalar or ndarray

See also

numpy.std, numpy.var, numpy.median, scipy.stats.iqr, scipy.stats.tmean, scipy.stats.tstd, scipy.stats.tvar

Notes

The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=np.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.

The input array may contain inf, but if center returns inf, the corresponding MAD for that data will be nan.

Examples

When comparing the behavior of median_abs_deviation with np.std, the latter is affected when we change a single value of an array to have an outlier value while the MAD hardly changes:

>>> from scipy import stats
>>> x = stats.norm.rvs(size=100, scale=1, random_state=123456)
>>> x.std()
0.9973906394005013
>>> stats.median_abs_deviation(x)
0.82832610097857
>>> x[0] = 345.6
>>> x.std()
34.42304872314415
>>> stats.median_abs_deviation(x)
0.8323442311590675

Axis handling example:

>>> import numpy as np
>>> x = np.array([[10, 7, 4], [3, 2, 1]])
>>> x
array([[10,  7,  4],
       [ 3,  2,  1]])
>>> stats.median_abs_deviation(x)
array([3.5, 2.5, 1.5])
>>> stats.median_abs_deviation(x, axis=None)  # the axis=None syntax will be deprecated
2.0

Scale normal example:

>>> x = stats.norm.rvs(size=1000000, scale=2, random_state=123456)
>>> stats.median_abs_deviation(x)
1.3487398527041636
>>> stats.median_abs_deviation(x, scale='normal')
1.9996446978061115
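
The changed defaults in this variant mean that NaN entries are ignored and the result is scaled to be consistent with the standard deviation for normal data (a sketch of the expected behavior, not verified output):

>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, np.nan, 100.0])
>>> mad = median_absolute_deviation(x)   # nan_policy='omit', scale='normal' by default
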
process_improve.univariate.metrics.summary_stats(x, method='robust')

Return summary statistics of the numeric values in vector x.

Arguments:

x (Numpy vector or Pandas series): A vector of univariate values to summarize.

method (str, optional): Style of estimates to use. If ‘robust’ (the default), robust measures of center and spread are reported.

Returns:

dict: a summary of the univariate vector. The following outputs are the most interesting:

“center”: a measure of the center (average). If method is robust, this is the median.

“spread”: a measure of the spread. If method is robust, this is the Sn, a robust spread estimate.
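
Example

A minimal usage sketch (the “center” and “spread” keys are as documented above):

>>> out = summary_stats([4.5, 4.8, 4.7, 5.1, 12.0], method="robust")
>>> center = out["center"]   # the median, 4.8 here: unaffected by the outlying 12.0
>>> spread = out["spread"]   # the Sn estimate of spread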

Return type:

dict

process_improve.univariate.metrics.detect_outliers_esd(x, algorithm='esd', max_outliers_detected=1, **kwargs)

Return a list of indexes of points in the vector x which are likely outliers.

A second output (can be ignored) contains the details of the values used to make the decision.

Arguments:

x (list, sequence, or NumPy vector/array) – A sequence, list or vector which can be unravelled.

Keyword Arguments:

algorithm – Which of two algorithms to use to detect outliers (default: “esd”):

    ‘esd’: Generalized ESD Test for Outliers. If max_outliers_detected=1 this is essentially Grubbs’ test. For more details, please see: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm

    ‘cc-robust’: Build a robust control chart for the sequence x; points that lie outside the +/- 3 sigma limits are considered outliers. Not implemented yet: left here as an idea for the future.

max_outliers_detected – The maximum number of outliers that should be detected, as required by the algorithms.

kwargs – Algorithm-dependent arguments. Defaults are shown here.

    ‘esd’:

        ‘robust_variant’ = True. Uses the median for the center and the MAD in place of the standard deviation.

        ‘alpha’ = 0.05. The significance level of the test.
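
Example

A minimal usage sketch (assignments only; the contents of the details output are algorithm-dependent):

>>> x = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
>>> indexes, details = detect_outliers_esd(x, max_outliers_detected=1)
>>> # `indexes` would be expected to flag position 5, the value 25.0.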

Return type:

tuple[list[int], defaultdict[Any, Any]]

process_improve.univariate.metrics.variance_decomposition(df, measured, repeat)

Given a DataFrame df of raw data, the name of the column holding the measured values, and the name of the column holding the repeat indicator, calculate the within-replicate and between-replicate standard deviations.

Example: Two measurements on day 1 [101, 102] and two measurements on day 2 [94, 95]. The between-day variation can already be expected to be much greater than the within-day variation.

>>> import pandas as pd
>>> df = pd.DataFrame(data={'Result': [101, 102, 94, 95], 'Repeat': [1, 1, 2, 2]})
>>> df
   Result  Repeat
0     101       1
1     102       1
2      94       2
3      95       2
>>> variance_decomposition(df, measured="Result", repeat="Repeat")
{'total_ms':       16.666667,
 'total_dof':      3,
 'within_ms':      0.5,
 'within_stddev':  0.70711,
 'within_dof':     2,
 'between_ms':     49.0,
 'between_stddev': 7.0,
 'between_dof':    1}

Note:

  • SSQ = sum of squares

  • DOF = degrees of freedom

  • MS = mean square = (sum of squares) / (degrees of freedom) = SSQ / DOF = variance
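
A quick hand-check of the within-day mean square from the example above:

>>> ss_within = sum((g - g.mean()).pow(2).sum() for _, g in df.groupby("Repeat")["Result"])
>>> within_ms = ss_within / 2   # SSQ / DOF = 1.0 / 2 = 0.5, matching 'within_ms'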

Return type:

dict