Univariate Statistics#
- process_improve.univariate.metrics.t_value(p, v)[source]#
Return the x-axis value at which the cumulative t-distribution with v degrees of freedom reaches a fractional area of p (p is therefore a fractional value between 0 and 1 on the y-axis).
Examples
Since the cumulative distribution passes symmetrically through 0.0 on the x-axis for any number of degrees of freedom:
>>> t_value(0.5, v)
0.0
Zero fractional area under the curve is always at \(-\infty\):
>>> t_value(0.0, v)
-Inf
100% fractional area is always at \(+\infty\):
>>> t_value(1.0, v)
+Inf
See also
t_value_cdf: does the inverse of this function.
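A minimal sketch of an equivalent computation with SciPy, assuming t_value wraps the inverse CDF (percent-point function) of the t-distribution; `t_value_sketch` is a hypothetical name and the library's internals may differ.

```python
from scipy.stats import t


def t_value_sketch(p: float, v: int) -> float:
    """Return the x-axis value whose cumulative area under the
    t-distribution with v degrees of freedom equals p (a sketch
    using scipy.stats.t.ppf)."""
    return float(t.ppf(p, df=v))


print(t_value_sketch(0.5, 8))                 # 0.0 by symmetry
print(round(t_value_sketch(0.975, 8), 3))     # a familiar two-sided 95% critical value
```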
- process_improve.univariate.metrics.t_value_cdf(z, v)[source]#
Return the fractional area under the cumulative t-distribution (y-axis value) at the t-value z on the x-axis, with v degrees of freedom.
Examples
The cumulative distribution is symmetric through the x-axis at 0.0 for any number of degrees of freedom, so half of the area lies below zero:
>>> t_value_cdf(0.0, v)
0.5
Zero fractional area under the curve is at \(-\infty\):
>>> t_value_cdf(-np.inf, v)
0.0
100% fractional area is at \(+\infty\):
>>> t_value_cdf(np.inf, v)
1.0
See also
t_value: does the inverse of this function.
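A sketch of the forward CDF with SciPy, including a round trip through its inverse; whether the library uses scipy.stats.t internally is an assumption, and `t_value_cdf_sketch` is a hypothetical name.

```python
from scipy.stats import t


def t_value_cdf_sketch(z: float, v: int) -> float:
    """Fractional area under the t-distribution with v degrees of
    freedom, up to the t-value z (a sketch using scipy.stats.t.cdf)."""
    return float(t.cdf(z, df=v))


p = t_value_cdf_sketch(1.7, 10)   # area between 0.5 and 1.0, since z > 0
z_back = float(t.ppf(p, df=10))   # inverting recovers the original t-value
```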
- process_improve.univariate.metrics.test_normality(x)[source]#
Check the p-value of the hypothesis that the data are from a normal distribution.
If the p-value is less than the chosen alpha level (e.g. 0.05 or 0.025), then there is evidence that the data tested are NOT normally distributed.
On the other hand, if the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected. NOTE: this does not mean that the data are normallyistributed, only that there is insufficient evidence to conclude otherwise. See the Shapiro-Wilk test.
Implementation: Uses the Shapiro-Wilk test, taken directly from scipy.stats.shapiro.
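An illustration of the underlying scipy.stats.shapiro call on synthetic data; that test_normality simply returns this p-value is an assumption based on the note above.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
normal_data = rng.normal(size=200)        # drawn from a normal distribution
skewed_data = rng.exponential(size=200)   # clearly non-normal

_, p_normal = shapiro(normal_data)
_, p_skewed = shapiro(skewed_data)
# p_skewed falls far below any usual alpha level, so normality is
# rejected for the exponential sample but not for the normal one.
```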
- process_improve.univariate.metrics.Sn(x, constant=1.1926)[source]#
Compute a robust scale estimator. The Sn metric is an efficient alternative to MAD.
- Parameters:
x (np.ndarray or pd.Series) – A vector of values. NaN entries are ignored.
constant (float, optional) – Multiplicative constant that makes the estimator consistent with iid values from a Gaussian distribution with no outliers. Default is 1.1926.
- Returns:
A scalar value, the Sn estimate of spread. Returns NaN if all entries are missing, and 0.0 when only a single non-missing value is supplied.
- Return type:
np.floating
Notes
Tested against one of the most reliable open-source packages, written by some of the most respected names in the area of robust methods: [1] and [2].
Disadvantages of MAD:
It does not have particularly high efficiency for data that are in fact normal (37%). In comparison, the median has 64% efficiency for normal data.
The MAD statistic also has an implicit assumption of symmetry. That is, it measures the distance from a measure of central location (the median).
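A naive O(n^2) sketch of the Rousseeuw-Croux Sn estimator, for intuition only: the published algorithm uses high/low medians and finite-sample correction factors, so the library's Sn may differ slightly from this simplified version.

```python
import numpy as np


def sn_sketch(x, constant=1.1926):
    """Simplified Sn: for each x_i, take the median of |x_i - x_j|
    over all j, then the median of those n values, scaled by the
    Gaussian-consistency constant."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]                       # NaN entries are ignored
    pairwise = np.abs(x[:, None] - x[None, :])
    return constant * float(np.median(np.median(pairwise, axis=1)))


# Unlike MAD, Sn is built from pairwise distances, so it needs no
# location estimate and makes no implicit symmetry assumption.
```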
References
- process_improve.univariate.metrics.ttest_independent(sample_A, sample_B, conflevel=0.995)[source]#
Core calculation for a test of differences between the average of A and the average of B. No checking of inputs.
- process_improve.univariate.metrics.ttest_independent_from_df(df, grouper_column, values_column, conflevel=0.995)[source]#
Calculate the t-test for differences between two or more groups and return a confidence interval for the difference. The test is for UNPAIRED differences.
The dataframe df contains a grouper_column with 2 or more unique values (e.g. ‘A’ and ‘B’). All unique values of the grouper_column are used, and t-tests are done between the values in the values_column.
- Args:
df (pd.DataFrame): Dataframe of the values and grouping variable.
grouper_column (str): Indicates which column will be grouped on.
values_column (str): Which column contains the numeric values to calculate the test on.
conflevel (float, optional): Confidence level for the two-sided test. Defaults to 0.995.
- Output: Dataframe with columns containing the statistical outputs of the t-test, including:
Group “A” name
Group “B” name
Group “A” mean
Group “B” mean
z-value for the difference between group “B” minus group “A”
p-value for this z-value
Confidence interval low value for difference between group “B” minus group “A”
Confidence interval high value for difference between group “B” minus group “A”
Example: df has 3 levels in the grouper variable: ['Marco', 'Pete', 'Sam']. The output will have 3 rows, one per pairwise comparison:

Group A name    Group B name
Marco           Pete
Marco           Sam
Pete            Sam
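A sketch of how the pairwise comparisons above can be assembled with pandas and scipy.stats.ttest_ind; the data and column names are hypothetical, and the library's own statistics (and its z-value/confidence-interval columns) may be computed differently.

```python
import itertools

import pandas as pd
from scipy.stats import ttest_ind

df = pd.DataFrame({
    "Operator": ["Marco"] * 4 + ["Pete"] * 4 + ["Sam"] * 4,
    "Yield": [78, 80, 79, 81, 85, 84, 86, 83, 79, 78, 80, 77],
})
groups = {name: g.to_numpy() for name, g in df.groupby("Operator")["Yield"]}

rows = []
for a, b in itertools.combinations(sorted(groups), 2):
    stat, p = ttest_ind(groups[b], groups[a])   # direction: group B minus group A
    rows.append({"Group A name": a, "Group B name": b,
                 "Group A mean": groups[a].mean(),
                 "Group B mean": groups[b].mean(),
                 "t-value": stat, "p-value": p})
result = pd.DataFrame(rows)   # 3 rows: Marco-Pete, Marco-Sam, Pete-Sam
```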
- process_improve.univariate.metrics.ttest_paired(differences, conflevel=0.995)[source]#
Core calculation for a test of paired differences.
- Parameters:
differences (pd.Series) – The paired differences (e.g. sample_A - sample_B) for which the test is run.
conflevel (float, optional) – Value between 0 and 1 (closer to 1.0) that gives the level of confidence required for the 2-sided test. Default is 0.995.
- Returns:
Outcomes from the statistical test.
- Return type:
- process_improve.univariate.metrics.ttest_paired_from_df(df, grouper_column, values_column, conflevel=0.995)[source]#
Calculate the t-test for paired differences between two or more groups and return a confidence interval for the difference. The test is for PAIRED differences. The difference is always defined as the A values minus the B values: after - before, or A - B.
The dataframe df contains a grouper_column with 2 or more unique values (e.g. ‘A’ and ‘B’). All unique values of the grouper_column are used, and t-tests are done between the values in the values_column.
When selecting the columns, the number of values per column must be the same.
- Args:
df (pd.DataFrame): Dataframe of the values and grouping variable.
grouper_column (str): Indicates which column will be grouped on.
values_column (str): Which column contains the numeric values to calculate the test on.
conflevel (float, optional): Confidence level for the two-sided test. Defaults to 0.995.
Output: Dataframe with columns containing the statistical outputs of the t-test, including:
Group A name
Group B name
Group A mean
Group B mean
Differences mean: average difference between the groups (not the same as the difference of the averages from items 3 and 4 above)
z-value for the difference between group “B” minus group “A”
p-value for this z-value
Confidence interval low value for difference between group “B” minus group “A”
Confidence interval high value for difference between group “B” minus group “A”
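A sketch of the paired calculation with hypothetical before/after data, using scipy.stats.ttest_rel; the A - B direction (after minus before) matches the convention stated above, but the library's own output columns may be derived differently.

```python
import numpy as np
from scipy.stats import ttest_rel

before = np.array([101.0, 102.0, 99.0, 100.0, 103.0])
after = np.array([103.0, 104.0, 100.0, 103.0, 105.0])

stat, p = ttest_rel(after, before)            # A - B: after minus before
mean_diff = float(np.mean(after - before))    # mean of the paired differences
# mean_diff equals after.mean() - before.mean() only because every
# pair is complete; the paired test gains power by differencing first.
```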
- process_improve.univariate.metrics.confidence_interval(df, column_name, conflevel=0.95, style='robust')[source]#
Calculate the confidence interval, returned as a tuple, for the column_name (str) in the dataframe df, for a given confidence level conflevel (default: 0.95).
- style (['robust', 'regular']): indicates which style of estimates to use for the center and
spread. Default: 'robust'
Missing values are ignored.
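A sketch of the 'regular' style, assuming a t-based interval around the mean with the sample standard deviation; the 'robust' style presumably substitutes a robust center (median) and spread (Sn). The function name is hypothetical.

```python
import numpy as np
from scipy.stats import t


def confidence_interval_sketch(x, conflevel=0.95):
    """Two-sided t-based confidence interval for the mean (regular style)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]                       # missing values are ignored
    n = x.size
    center = x.mean()
    half_width = (t.ppf(0.5 + conflevel / 2, df=n - 1)
                  * x.std(ddof=1) / np.sqrt(n))
    return center - half_width, center + half_width


lo, hi = confidence_interval_sketch([4.8, 5.1, 5.0, 4.9, 5.2, 5.0])
# The interval is symmetric about the sample mean (5.0 here).
```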
- process_improve.univariate.metrics.median_absolute_deviation(x, axis=0, center=<function median>, scale='normal', nan_policy='omit')[source]#
Taken from scipy.stats.stats: we want the same functionality, but with a slightly different default function signature.
scale=’normal’ instead of scale=1.0.
nan_policy='omit' instead of nan_policy='propagate'
Compute the median absolute deviation of the data along the given axis. The median absolute deviation (MAD) computes the median over the absolute deviations from the median. It is a measure of dispersion similar to the standard deviation but more robust to outliers. The MAD of an empty array is np.nan.
- Parameters:
x (array_like) – Input array or object that can be converted to an array.
axis (int or None, optional) – Axis along which the MAD is computed. Default is 0. axis=None will no longer be accepted.
center (callable, optional) – A function that will return the central value. The default is to use np.median. Any user-defined function will need to have the signature func(arr, axis).
scale (scalar or str, optional) – The numerical value of scale will be divided out of the final result. Here the default is 'normal' (scipy's original default is 1.0), which results in scale being the inverse of the standard normal quantile function at 0.75, approximately 0.67449. Array-like scale is also allowed, as long as it broadcasts correctly to the output such that out / scale is a valid operation. The output dimensions depend on the input array, x, and the axis argument.
nan_policy ({'propagate', 'raise', 'omit'}, optional) – Defines how to handle when input contains nan. The following options are available (default here is 'omit'):
- 'propagate': returns nan
- 'raise': throws an error
- 'omit': performs the calculations ignoring nan values
- Returns:
mad – If the input contains integers or floats of smaller precision than np.float64, then the output data-type is np.float64. Otherwise, the output data-type is the same as that of the input.
- Return type:
scalar or ndarray
See also
numpy.std, numpy.var, numpy.median, scipy.stats.iqr, scipy.stats.tmean, scipy.stats.tstd, scipy.stats.tvar
Notes
The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=np.mean will calculate the MAD around the mean; it will not calculate the mean absolute deviation. The input array may contain inf, but if center returns inf, the corresponding MAD for that data will be nan.
Examples
When comparing the behavior of median_abs_deviation with np.std, the latter is strongly affected when we change a single value of an array to an outlier value, while the MAD hardly changes:
>>> from scipy import stats
>>> x = stats.norm.rvs(size=100, scale=1, random_state=123456)
>>> x.std()
0.9973906394005013
>>> stats.median_abs_deviation(x)
0.82832610097857
>>> x[0] = 345.6
>>> x.std()
34.42304872314415
>>> stats.median_abs_deviation(x)
0.8323442311590675
Axis handling example:
>>> x = np.array([[10, 7, 4], [3, 2, 1]])
>>> x
array([[10,  7,  4],
       [ 3,  2,  1]])
>>> stats.median_abs_deviation(x)
array([3.5, 2.5, 1.5])
>>> stats.median_abs_deviation(x, axis=None)  # the `axis=None` syntax will be deprecated
2.0

Scale normal example:
>>> x = stats.norm.rvs(size=1000000, scale=2, random_state=123456)
>>> stats.median_abs_deviation(x)
1.3487398527041636
>>> stats.median_abs_deviation(x, scale='normal')
1.9996446978061115
- process_improve.univariate.metrics.summary_stats(x, method='robust')[source]#
Return summary statistics of the numeric values in vector x.
- Arguments:
x (Numpy vector or Pandas series): A vector of univariate values to summarize.
- Returns:
dict: a summary of the univariate vector. The following outputs are the most interesting:
- “center”: a measure of the center (average). If method is robust, this is the median.
- “spread”: a measure of the spread. If method is robust, this is the Sn, a robust spread estimate.
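A short robust-summary sketch: the median for the center, and a MAD-based spread standing in for Sn to keep the example compact (the library reportedly uses Sn). The function name is hypothetical.

```python
import numpy as np


def summary_stats_sketch(x):
    """Robust center and spread: median, and MAD scaled to be
    consistent with the standard deviation for Gaussian data."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    med = float(np.median(x))
    mad = float(np.median(np.abs(x - med)))
    return {"center": med, "spread": 1.4826 * mad}


summary = summary_stats_sketch([1.0, 2.0, 3.0, 4.0, 100.0])
# The single outlier (100.0) barely moves either robust statistic,
# whereas the mean and standard deviation would be dominated by it.
```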
- process_improve.univariate.metrics.detect_outliers_esd(x, algorithm='esd', max_outliers_detected=1, **kwargs)[source]#
Return a list of indexes of points in the vector x which are likely outliers.
A second output (can be ignored) contains the details of the values used to make the decision.
- Arguments:
x {list, sequence, NumPy vector/array} – A sequence, list, or vector which can be unravelled.
- Keyword Arguments:
algorithm – Two algorithms are possible to detect outliers: (default: “esd”)
‘esd’: Generalized ESD Test for Outliers. If max_outliers_detected=1 this is essentially Grubbs’ test. For more details, please see: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
‘cc-robust’: Build a robust control chart for the sequence x; points lying outside the +/- 3 sigma limits are considered outliers. Not implemented yet: left here as an idea for the future.
- max_outliers_detected – The maximum number of outliers that
should be detected, as required by the algorithms.
kwargs – Algorithm dependent arguments. Defaults are shown here.
‘esd’:
‘robust_variant’ = True. Uses the median in place of the mean and the MAD in place of the standard deviation.
‘alpha’ = 0.05. The significance level of the testing.
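A sketch of the generalized ESD recipe from the NIST handbook linked above, in its classical mean/std form (the robust_variant would swap in the median and MAD). The function name and return convention are hypothetical; the library's implementation may differ.

```python
import numpy as np
from scipy.stats import t


def esd_outliers_sketch(x, max_outliers_detected=1, alpha=0.05):
    """Return original indexes of likely outliers per the generalized ESD test."""
    x = np.asarray(x, dtype=float)
    n = x.size
    work, work_idx = x.copy(), np.arange(n)
    results = []  # (original index, did R_i exceed lambda_i?)
    for i in range(1, max_outliers_detected + 1):
        dev = np.abs(work - work.mean())
        j = int(np.argmax(dev))                 # most extreme remaining point
        r_i = dev[j] / work.std(ddof=1)
        ni = n - i + 1                          # sample size before this removal
        tcrit = t.ppf(1 - alpha / (2 * ni), ni - 2)
        lam_i = (ni - 1) * tcrit / np.sqrt((ni - 2 + tcrit**2) * ni)
        results.append((int(work_idx[j]), r_i > lam_i))
        work = np.delete(work, j)
        work_idx = np.delete(work_idx, j)
    # The number of outliers is the LARGEST i for which R_i > lambda_i.
    for k in range(len(results) - 1, -1, -1):
        if results[k][1]:
            return [idx for idx, _ in results[: k + 1]]
    return []


outliers = esd_outliers_sketch(list(range(1, 10)) + [100], max_outliers_detected=2)
```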
- process_improve.univariate.metrics.variance_decomposition(df, measured, repeat)[source]#
Given a DataFrame df of raw data, the name of the measured-value column, and the name of the repeat-indicator column, calculate the within-replicate and between-replicate standard deviations.
Example: two measurements on day 1, [101, 102], and two measurements on day 2, [94, 95]. The between-day variation can already be expected to be much greater than the within-day variation.

>>> df = pd.DataFrame(data={'Result': [101, 102, 94, 95], 'Repeat': [1, 1, 2, 2]})
>>> df
   Result  Repeat
0     101       1
1     102       1
2      94       2
3      95       2
>>> output = variance_decomposition(df, measured="Result", repeat="Repeat")
>>> output
{'total_ms': 16.666667, 'total_dof': 3, 'within_ms': 0.5, 'within_stddev': 0.70711, 'within_dof': 2, 'between_ms': 49.0, 'between_stddev': 7.0, 'between_dof': 1}
Note:
- SSQ = sum of squares
- DOF = degrees of freedom
- MS = mean square = (sum of squares) / (degrees of freedom) = SSQ / DOF = variance
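The example numbers above can be reproduced with a plain one-way ANOVA decomposition; this is a sketch of the arithmetic, not necessarily how the library computes it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Result": [101, 102, 94, 95], "Repeat": [1, 1, 2, 2]})
grand_mean = df["Result"].mean()                  # 98.0
groups = df.groupby("Repeat")["Result"]

# Within: deviations of each value from its own group's mean
within_ssq = sum(((g - g.mean()) ** 2).sum() for _, g in groups)   # 0.5 + 0.5
within_dof = sum(len(g) - 1 for _, g in groups)                    # 2

# Between: group means around the grand mean, weighted by group size
between_ssq = sum(len(g) * (g.mean() - grand_mean) ** 2 for _, g in groups)
between_dof = groups.ngroups - 1                                   # 1

within_ms = within_ssq / within_dof       # MS = SSQ / DOF
between_ms = between_ssq / between_dof
within_stddev = np.sqrt(within_ms)        # ~0.70711, as in the example
between_stddev = np.sqrt(between_ms)      # 7.0, as in the example
```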