Univariate Statistics#
- process_improve.univariate.metrics.t_value(p, v)[source]#
Return the x-axis value at which the cumulative t-distribution with v degrees of freedom reaches a fractional area of p (p is therefore a value between 0 and 1 on the y-axis).
Examples
The cumulative distribution is symmetric about 0.0 on the x-axis for any number of degrees of freedom, so half of the area lies below zero:
>>> t_value(0.5, v)
0.0
Zero fractional area under the curve is always at \(-\infty\):
>>> t_value(0.0, v)
-inf
100% fractional area is always at \(+\infty\):
>>> t_value(1.0, v)
inf
See also
t_value_cdf does the inverse of this function.
- process_improve.univariate.metrics.t_value_cdf(z, v)[source]#
Return the fractional area under the cumulative t-distribution (y-axis value) at the t-value z on the x-axis, with v degrees of freedom.
Examples
The cumulative distribution is symmetric about 0.0 on the x-axis for any number of degrees of freedom, so half of the area lies below zero:
>>> t_value_cdf(0.0, v)
0.5
Zero fractional area under the curve is at \(-\infty\):
>>> t_value_cdf(-np.inf, v)
0.0
100% fractional area is at \(+\infty\):
>>> t_value_cdf(np.inf, v)
1.0
See also
t_value does the inverse of this function.
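The inverse relationship between the two functions can be illustrated with `scipy.stats.t`. This is only a sketch: it assumes that `t_value` behaves like the t-distribution quantile function (`ppf`) and `t_value_cdf` like its cumulative distribution function (`cdf`), which the descriptions above suggest but do not state. The value `v = 10` is an arbitrary choice.

```python
from scipy import stats

v = 10  # degrees of freedom (arbitrary choice for illustration)

# Quantile function: x-axis value for a given fractional area p.
x = stats.t.ppf(0.975, v)

# Cumulative distribution: fractional area below a given x-axis value.
p = stats.t.cdf(x, v)  # recovers the area passed to ppf

# The three special cases described in the examples above.
x_mid = stats.t.ppf(0.5, v)
x_lo = stats.t.ppf(0.0, v)
x_hi = stats.t.ppf(1.0, v)

print(x, p)
print(x_mid, x_lo, x_hi)
```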
- process_improve.univariate.metrics.test_normality(x)[source]#
Check the p-value of the hypothesis that the data are from a normal distribution.
If the p-value is less than the chosen alpha level (e.g. 0.05 or 0.025), then there is evidence that the data tested are NOT normally distributed.
On the other hand, if the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected. NOTE: it does not mean that the data are normally distributed, just that we have nothing better to say about it. See the Shapiro-Wilk test.
Implementation: Uses the Shapiro-Wilk test from scipy.stats.shapiro.
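The decision rule above can be sketched by calling `scipy.stats.shapiro` directly. The two samples, the seed and the alpha level of 0.05 are illustrative assumptions, not part of this package.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=5.0, scale=2.0, size=200)
skewed_sample = rng.exponential(scale=2.0, size=200)

alpha = 0.05
p_values = {}
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    statistic, p_value = stats.shapiro(sample)
    p_values[name] = p_value
    if p_value < alpha:
        verdict = "evidence that the data are NOT normally distributed"
    else:
        verdict = "normality cannot be rejected"
    print(f"{name}: W = {statistic:.3f}, p = {p_value:.4f} -> {verdict}")
```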
- process_improve.univariate.metrics.Sn(x, constant=1.1926)[source]#
Compute a robust scale estimator. The Sn metric is an efficient alternative to MAD.
- Parameters:
x (np.ndarray or pd.Series) – A vector of values. NaN entries are ignored.
constant (float, optional) – Multiplicative constant that makes the estimator consistent with iid values from a Gaussian distribution with no outliers. Default is 1.1926.
- Returns:
A scalar value, the Sn estimate of spread. Returns NaN if all entries are missing, and 0.0 when only a single non-missing value is supplied.
- Return type:
np.floating
Notes
Tested against one of the most reliable open-source packages, written by some of the most respected names in the area of robust methods: [1] and [2].
Disadvantages of MAD:
It does not have particularly high efficiency for data that is in fact normal (37%). In comparison, the median has 64% efficiency for normal data.
The MAD statistic also has an implicit assumption of symmetry. That is, it measures the distance from a measure of central location (the median).
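For orientation, here is a naive O(n²) sketch of the Sn statistic as defined by Rousseeuw and Croux: for each point take the high median of its absolute distances to all points, then take the low median of those values and multiply by the constant. It is not this package's implementation; the function name `sn_naive` is hypothetical, and any finite-sample correction factors are left out.

```python
import numpy as np


def sn_naive(x, constant=1.1926):
    """Naive Sn scale estimate (hypothetical helper, no finite-sample correction)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values
    n = x.size
    if n == 0:
        return np.nan
    # All pairwise absolute differences |x_i - x_j|, one row per point i.
    diffs = np.abs(x[:, None] - x[None, :])
    # High median of each row: order statistic of rank floor(n/2) + 1.
    inner = np.sort(diffs, axis=1)[:, n // 2]
    # Low median across the rows: order statistic of rank floor((n+1)/2).
    outer = np.sort(inner)[(n + 1) // 2 - 1]
    return constant * outer


print(sn_naive([1, 2, 3, 4, 5]))
print(sn_naive([1, 2, 3, 4, 500]))  # one gross outlier changes the estimate only modestly
```

Unlike MAD, no central location is estimated first: the statistic is built only from distances between pairs of points, which is why it does not rely on symmetry.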
References
- process_improve.univariate.metrics.ttest_independent(sample_A, sample_B, conflevel=0.995)[source]#
Core calculation for a test of differences between the average of A and the average of B. No checking of inputs.
- process_improve.univariate.metrics.ttest_independent_from_df(df, grouper_column, values_column, conflevel=0.995)[source]#
Calculate the t-test for differences between two or more groups and return a confidence interval for the difference. The test is for UNPAIRED differences.
The dataframe df contains a grouper_column with 2 or more unique values (e.g. ‘A’ and ‘B’). All unique values of the grouper_column are used, and t-tests are done between the values in the values_column.
- Args:
df (pd.DataFrame): Dataframe of the values and grouping variable.
grouper_column (str): Indicates which column will be grouped on.
values_column (str): Which column contains the numeric values to calculate the test on.
conflevel (float, optional): Confidence level, between 0 and 1, for the confidence interval. Defaults to 0.995.
- Output: Dataframe with columns containing the statistical outputs of the t-test, including:
Group “A” name
Group “B” name
Group “A” mean
Group “B” mean
z-value for the difference between group “B” minus group “A”
p-value for this z-value
Confidence interval low value for difference between group “B” minus group “A”
Confidence interval high value for difference between group “B” minus group “A”
Example: df has 3 levels in the grouper variable: [‘Marco’, ‘Pete’, ‘Sam’]
Output will have 3 rows:
Group A name    Group B name
Marco           Pete
Marco           Sam
Pete            Sam
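The pairwise layout of the output can be sketched with `itertools.combinations` and `scipy.stats.ttest_ind`. This is an illustration under assumptions rather than the package's code: the scores are made-up data, the column names `Person` and `Score` are hypothetical, and the sketch uses Welch's unequal-variance test because this page does not say whether the library pools the variances.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Made-up data with three levels in the grouper column.
df = pd.DataFrame(
    {
        "Person": ["Marco"] * 4 + ["Pete"] * 4 + ["Sam"] * 4,
        "Score": [5.1, 4.9, 5.3, 5.0, 6.2, 6.0, 6.4, 6.1, 5.2, 5.0, 5.1, 4.8],
    }
)

groups = {name: grp["Score"].to_numpy() for name, grp in df.groupby("Person")}

rows = []
for name_a, name_b in combinations(sorted(groups), 2):
    a, b = groups[name_a], groups[name_b]
    # Passing (b, a) makes the statistic refer to "group B minus group A".
    result = stats.ttest_ind(b, a, equal_var=False)
    rows.append(
        {
            "Group A name": name_a,
            "Group B name": name_b,
            "Group A mean": a.mean(),
            "Group B mean": b.mean(),
            "statistic": result.statistic,
            "p value": result.pvalue,
        }
    )

pairs = pd.DataFrame(rows)
print(pairs)
```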
- process_improve.univariate.metrics.ttest_paired(differences, conflevel=0.995)[source]#
Core calculation for a test of paired differences.
- Parameters:
differences (pd.Series) – The paired differences (e.g. sample_A - sample_B) for which the test is run.
conflevel (float, optional) – Value between 0 and 1 (closer to 1.0), that gives the level of confidence required for the 2-sided test. Default is 0.995.
- Returns:
Outcomes from the statistical test, with keys:
"Differences mean": the mean of the input differences.
"z value": the test statistic (mean divided by its standard error).
"ConfInt: Lo", "ConfInt: Hi": lower and upper bounds of the two-sided confidence interval around the mean difference at the requested conflevel.
"p value": two-sided p-value from the t-distribution.
"Degrees of freedom": n - 1 for n paired observations.
"Standard deviation": the standard error of the mean difference, i.e. sample_std / sqrt(n) (the scale used to build the confidence interval), NOT the sample standard deviation of differences.
- Return type:
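The quantities listed above can be reproduced by hand with NumPy and `scipy.stats.t`. The sketch below follows the definitions given for each key; the sample of differences is made up, and the variable names are illustrative rather than the library's.

```python
import numpy as np
from scipy import stats

differences = np.array([0.8, -0.2, 1.1, 0.5, 0.9, 0.3, 1.4, 0.6])  # made-up data
conflevel = 0.995

n = differences.size
dof = n - 1                                          # "Degrees of freedom"
mean_diff = differences.mean()                       # "Differences mean"
std_error = differences.std(ddof=1) / np.sqrt(n)     # reported as "Standard deviation"

t_stat = mean_diff / std_error                       # "z value"
p_value = 2 * stats.t.sf(abs(t_stat), dof)           # two-sided "p value"

# Two-sided interval: split (1 - conflevel) equally between the two tails.
t_crit = stats.t.ppf(1 - (1 - conflevel) / 2, dof)
ci_lo = mean_diff - t_crit * std_error               # "ConfInt: Lo"
ci_hi = mean_diff + t_crit * std_error               # "ConfInt: Hi"

print(t_stat, p_value, (ci_lo, ci_hi))
```

The same statistic and p-value come out of `scipy.stats.ttest_1samp(differences, 0.0)`, which is a convenient cross-check.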
- process_improve.univariate.metrics.ttest_paired_from_df(df, grouper_column, values_column, conflevel=0.995)[source]#
Calculate the t-test for paired differences between two or more groups and return a confidence interval for the difference. The test is for PAIRED differences. The difference is always defined as the A values minus the B values: after - before, or A - B.
The dataframe df contains a grouper_column with 2 or more unique values (e.g. ‘A’ and ‘B’). All unique values of the grouper_column are used, and t-tests are done between the values in the values_column.
When selecting the columns, the number of values per column must be the same.
- Args:
df (pd.DataFrame): Dataframe of the values and grouping variable.
grouper_column (str): Indicates which column will be grouped on.
values_column (str): Which column contains the numeric values to calculate the test on.
conflevel (float, optional): Confidence level, between 0 and 1, for the confidence interval. Defaults to 0.995.
Output: Dataframe with columns containing the statistical outputs of the t-test, including:
Group A name
Group B name
Group A mean
Group B mean
Differences mean: average difference between the groups (not the same as the difference of the averages from items 3 and 4 above)
z-value for the difference between group “B” minus group “A”
p-value for this z-value
Confidence interval low value for difference between group “B” minus group “A”
Confidence interval high value for difference between group “B” minus group “A”
- process_improve.univariate.metrics.confidence_interval(df, column_name, conflevel=0.95, style='robust')[source]#
Calculate the confidence interval, returned as a tuple, for the column_name (str) in the dataframe df, for a given confidence level conflevel (default: 0.95).
- style: [‘robust’; ‘regular’]: indicates which style of estimates to use for the center and spread. Default: ‘robust’
Missing values are ignored.
- process_improve.univariate.metrics.median_absolute_deviation(x, axis=0, center=<function median>, scale='normal', nan_policy='omit')[source]#
Taken from scipy.stats.stats: we want the same functionality, but with slightly different defaults in the function signature:
scale=’normal’ instead of scale=1.0.
nan_policy=’omit’ instead of nan_policy=’propagate’.
Compute the median absolute deviation of the data along the given axis. The median absolute deviation (MAD) computes the median over the absolute deviations from the median. It is a measure of dispersion similar to the standard deviation but more robust to outliers. The MAD of an empty array is np.nan.
- Parameters:
x (array_like) – Input array or object that can be converted to an array.
axis (int or None, optional) – Axis along which the range is computed. Default is 0. Axis == None will not be accepted anymore.
center (callable, optional) – A function that will return the central value. The default is to use np.median. Any user-defined function must have the function signature func(arr, axis).
scale (scalar or str, optional) – The numerical value of scale will be divided out of the final result. The default is "normal", which sets scale to the standard normal quantile function (inverse CDF) evaluated at 0.75, approximately 0.67449, so that the returned value is consistent with the standard deviation for normally distributed data. Any positive float may also be passed. Array-like scale is also allowed, as long as it broadcasts correctly to the output such that out / scale is a valid operation. The output dimensions depend on the input array, x, and the axis argument.
nan_policy ({'propagate', 'raise', 'omit'}, optional) – Defines how to handle input that contains nan. The following options are available (default is ‘omit’):
‘propagate’: returns nan
‘raise’: throws an error
‘omit’: performs the calculations ignoring nan values
- Returns:
mad – If the input contains integers or floats of smaller precision than np.float64, then the output data-type is np.float64. Otherwise, the output data-type is the same as that of the input.
- Return type:
scalar or ndarray
See also
numpy.std, numpy.var, numpy.median, scipy.stats.iqr, scipy.stats.tmean, scipy.stats.tstd, scipy.stats.tvar
Notes
The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=np.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.
The input array may contain inf, but if center returns inf, the corresponding MAD for that data will be nan.
References
Examples
When comparing the behavior of median_abs_deviation with np.std, the latter is affected when we change a single value of an array to have an outlier value while the MAD hardly changes:
>>> import numpy as np
>>> from scipy import stats
>>> x = stats.norm.rvs(size=100, scale=1, random_state=123456)
>>> x.std()
0.9973906394005013
>>> stats.median_abs_deviation(x)
0.82832610097857
>>> x[0] = 345.6
>>> x.std()
34.42304872314415
>>> stats.median_abs_deviation(x)
0.8323442311590675
Axis handling example:
>>> x = np.array([[10, 7, 4], [3, 2, 1]])
>>> x
array([[10,  7,  4],
       [ 3,  2,  1]])
>>> stats.median_abs_deviation(x)
array([3.5, 2.5, 1.5])
>>> stats.median_abs_deviation(x, axis=None)  # the `axis=None` syntax will be deprecated
2.0
Scale normal example:
>>> x = stats.norm.rvs(size=1000000, scale=2, random_state=123456)
>>> stats.median_abs_deviation(x)
1.3487398527041636
>>> stats.median_abs_deviation(x, scale='normal')
1.9996446978061115
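The effect of `scale="normal"` can be checked against the unscaled result. The sketch below uses `scipy.stats.median_abs_deviation`, as the examples above do; the sample and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=100_000)

raw_mad = stats.median_abs_deviation(x)                      # scale=1.0, the SciPy default
scaled_mad = stats.median_abs_deviation(x, scale="normal")   # the default in this module
factor = stats.norm.ppf(0.75)                                # about 0.67449

# Dividing the raw MAD by the factor reproduces the "normal"-scaled value,
# which estimates the standard deviation (2.0 here) for Gaussian data.
print(raw_mad / factor, scaled_mad)
```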
- process_improve.univariate.metrics.biweight_midvariance(x, nan_policy='omit')[source]#
Return the Mosteller-Tukey robust scale (biweight midvariance) of x.
The biweight midvariance is a robust, highly efficient estimator of the variance: it down-weights observations far from the median and ignores gross outliers entirely.
- Parameters:
x (np.ndarray or pd.Series) – One-dimensional sample of numeric values.
nan_policy ({"omit", "propagate"}, optional) – "omit" (default) drops missing values; "propagate" returns nan if any value is missing.
- Returns:
The robust variance estimate. Returns 0.0 when the MAD is zero (e.g. a constant sample), and nan for an empty sample.
- Return type:
References
Mosteller and Tukey, Data Analysis and Regression, pp. 207-208, 1977.
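As an illustration of the down-weighting, here is a sketch of the commonly quoted biweight midvariance formula. It is not taken from this package: the function name is hypothetical, and the tuning constant `c = 9` is an assumption (a widely used choice) that may differ from what the library uses.

```python
import numpy as np


def biweight_midvariance_sketch(x, c=9.0):
    """Textbook biweight midvariance (hypothetical helper; c=9 is an assumed constant)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # mimic nan_policy="omit"
    if x.size == 0:
        return np.nan
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        return 0.0
    u = (x - median) / (c * mad)
    inside = np.abs(u) < 1  # points with |u| >= 1 get zero weight
    d = x[inside] - median
    u2 = u[inside] ** 2
    numerator = np.sum(d**2 * (1 - u2) ** 4)
    denominator = np.sum((1 - u2) * (1 - 5 * u2)) ** 2
    return x.size * numerator / denominator


rng = np.random.default_rng(1)
clean = rng.standard_normal(5000)
contaminated = np.append(clean, 1000.0)  # one gross outlier

print(biweight_midvariance_sketch(clean), biweight_midvariance_sketch(contaminated))
print(np.var(contaminated, ddof=1))  # the ordinary variance is dominated by the outlier
```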
- process_improve.univariate.metrics.holm_bonferroni(p_values, alpha=0.05)[source]#
Holm-Bonferroni step-down correction for multiple comparisons.
Holm’s method controls the family-wise error rate while being uniformly more powerful than the plain Bonferroni correction. It is the recommended post-hoc correction for a family of pairwise comparisons.
- Parameters:
p_values (array-like) – The raw (uncorrected) p-values of the individual comparisons.
alpha (float, optional) – Family-wise significance level, by default 0.05.
- Returns:
A bunch with, in the same order as the input:
p_adjusted: the Holm-adjusted p-values.
reject: boolean array, True where the null hypothesis is rejected at level alpha.
alpha: the family-wise level used.
- Return type:
References
Holm, “A simple sequentially rejective multiple test procedure”, Scandinavian Journal of Statistics, 6, 65-70, 1979.
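The step-down procedure itself is short enough to sketch in NumPy. This is an independent illustration of Holm's method, not the package's source: the function name `holm_adjust` is hypothetical, and treating `p_adjusted <= alpha` (rather than a strict inequality) as a rejection is an assumption.

```python
import numpy as np


def holm_adjust(p_values, alpha=0.05):
    """Holm step-down adjustment (hypothetical helper), results in input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)  # smallest p-value first
    # The k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k).
    scaled = (m - np.arange(m)) * p[order]
    # A running maximum keeps the adjusted values monotone; cap them at 1.
    adjusted_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))
    p_adjusted = np.empty(m)
    p_adjusted[order] = adjusted_sorted  # restore the input order
    return p_adjusted, p_adjusted <= alpha


p_adjusted, reject = holm_adjust([0.01, 0.04, 0.03, 0.005])
print(p_adjusted)  # approximately [0.03, 0.06, 0.06, 0.02]
print(reject)      # [ True False False  True]
```

Plain Bonferroni would multiply every p-value by 4 here, so 0.01 would become 0.04 instead of 0.03; Holm's adjusted values are never larger than Bonferroni's.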
- process_improve.univariate.metrics.summary_stats(x, method='robust')[source]#
Return summary statistics of the numeric values in vector x.
- Parameters:
x (numpy.ndarray or pandas.Series) – A vector of univariate values to summarize.
method (str, optional) – If "robust" (the default), the reported center is the median and the spread is the Sn robust estimate; otherwise the mean and the sample standard deviation are used.
- Returns:
A summary of the univariate vector. The most useful keys are "center" (a measure of the center, e.g. the median for the robust method) and "spread" (a measure of the spread, e.g. the Sn robust estimate for the robust method).
- Return type:
- process_improve.univariate.metrics.detect_outliers_esd(x, algorithm='esd', max_outliers_detected=1, **kwargs)[source]#
Return a list of indexes of points in the vector x which are likely outliers.
A second output (can be ignored) contains the details of the values used to make the decision.
- Arguments:
x {list, sequence, NumPy vector/array} – A sequence, list or vector which can be unravelled.
- Keyword Arguments:
algorithm – Two algorithms are possible to detect outliers: (default: “esd”)
‘esd’: Generalized ESD Test for Outliers. If max_outliers_detected=1 this is essentially Grubbs’ test. For more details, please see: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
‘cc-robust’: Build a robust control chart for the sequence x; points that lie outside the +/- 3 sigma limits are considered outliers. Not Implemented Yet: left here as an idea for the future, but not confirmed yet.
- max_outliers_detected – The maximum number of outliers that should be detected, as required by the algorithms.
kwargs – Algorithm dependent arguments. Defaults are shown here.
‘esd’:
‘robust_variant’ = True. Uses the median and the MAD as the estimates of the center and the spread respectively, in place of the mean and the standard deviation.
‘alpha’ = 0.05. The significance level of the testing.
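For reference, the generalized ESD procedure from the NIST handbook page linked above can be sketched as follows. This is the classical (mean and standard deviation) variant, not the robust variant this function uses by default, and it is not the package's code; the function name and the data are illustrative.

```python
import numpy as np
from scipy import stats


def generalized_esd(x, max_outliers=1, alpha=0.05):
    """Classical generalized ESD test (hypothetical helper); returns outlier indexes."""
    x = np.asarray(x, dtype=float)
    n = x.size
    remaining = np.arange(n)  # original indexes of the points still in play
    removed, n_outliers = [], 0
    for i in range(1, max_outliers + 1):
        values = x[remaining]
        deviations = np.abs(values - values.mean())
        worst = int(np.argmax(deviations))
        r_i = deviations[worst] / values.std(ddof=1)  # test statistic R_i
        # Critical value lambda_i from the t-distribution.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam_i = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        removed.append(int(remaining[worst]))
        remaining = np.delete(remaining, worst)
        if r_i > lam_i:
            n_outliers = i  # the largest i with R_i > lambda_i wins
    return removed[:n_outliers]


data = [2.1, 2.3, 1.9, 2.0, 2.2, 2.4, 1.8, 2.0, 2.1, 15.0]
print(generalized_esd(data, max_outliers=2))  # [9]
```

With `max_outliers=1` only the first iteration runs, which is the two-sided Grubbs' test mentioned above.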
- process_improve.univariate.metrics.tietjen_moore_test(x, n_outliers, *, two_sided=True, alpha=0.05, n_simulations=10000, random_state=None)[source]#
Tietjen-Moore test for a specified number of outliers.
Tests the null hypothesis “there are no outliers” against the alternative “the n_outliers most extreme observations are outliers”. Unlike the generalised ESD test, the number of suspected outliers must be fixed in advance. The test statistic has no closed-form critical value, so it is obtained by simulation under the normal null.
- Parameters:
x (np.ndarray or pd.Series) – One-dimensional sample. Missing values are dropped.
n_outliers (int) – The number of suspected outliers to test for (1 <= n_outliers < N).
two_sided (bool, optional) – If True (default) the test looks for outliers on either tail (the observations with the largest absolute deviation from the mean). If False it tests only the n_outliers largest observations.
alpha (float, optional) – Significance level, by default 0.05.
n_simulations (int, optional) – Number of Monte-Carlo samples used to estimate the critical value.
random_state (int or None, optional) – Seed for the simulation, for reproducibility.
- Returns:
With statistic, critical_value, reject (True when the outliers are significant), outlier_indices (positions in the missing-value-removed sample), n_outliers and alpha.
- Return type:
References
Tietjen and Moore, “Some Grubbs-type statistics for the detection of several outliers”, Technometrics, 14, 583-597, 1972. See also the NIST handbook, section 3.5.h.3.
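A minimal sketch of the two-sided statistic and its simulated critical value is shown below. It follows the usual textbook description (ratio of the sum of squares without the suspected points to the full sum of squares, rejecting for small values); the helper names, the seed and the reduced number of simulations are assumptions made for illustration, not the library's internals.

```python
import numpy as np


def tm_statistic(x, k):
    """Two-sided Tietjen-Moore statistic E_k (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Order by absolute deviation from the mean, then drop the k most extreme points.
    order = np.argsort(np.abs(x - x.mean()))
    kept = x[order[: x.size - k]]
    return np.sum((kept - kept.mean()) ** 2) / np.sum((x - x.mean()) ** 2)


def tm_test(x, k, alpha=0.05, n_simulations=2000, seed=0):
    """Compare E_k with the alpha quantile of its simulated null distribution."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    statistic = tm_statistic(x, k)
    simulated = [
        tm_statistic(rng.standard_normal(x.size), k) for _ in range(n_simulations)
    ]
    critical_value = np.quantile(simulated, alpha)
    # Small values of E_k indicate outliers, so reject in the lower tail.
    return statistic, critical_value, bool(statistic < critical_value)


data = [2.1, 2.3, 1.9, 2.0, 2.2, 2.4, 1.8, 2.0, 2.1, 15.0]
statistic, critical_value, reject = tm_test(data, k=1)
print(statistic, critical_value, reject)
```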
- process_improve.univariate.metrics.distribution_fit(x, distribution='norm', alpha=0.05)[source]#
Check how well a sample fits a named distribution.
Fits the parameters of the requested scipy.stats distribution by maximum likelihood and runs a Kolmogorov-Smirnov goodness-of-fit test (NIST handbook, section 3.5.7).
- Parameters:
- Returns:
With distribution, fitted parameters, ks_statistic, ks_pvalue, fits_well (True when the fit is not rejected at level alpha) and the sample size n.
- Return type:
Notes
Because the distribution parameters are estimated from the same data, the KS p-value is conservative (the true Type-I error is smaller than alpha); it remains a useful screening check.
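The fit-then-test sequence described above can be sketched directly with `scipy.stats`. The sample, the seed and the two candidate distributions are illustrative assumptions; the `results` dictionary is not the structure this function returns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=500)  # clearly non-normal sample

alpha = 0.05
results = {}
for name in ("norm", "expon"):
    dist = getattr(stats, name)
    params = dist.fit(x)  # maximum-likelihood parameter estimates
    ks_statistic, ks_pvalue = stats.kstest(x, name, args=params)
    results[name] = (ks_statistic, ks_pvalue)
    fits_well = ks_pvalue >= alpha
    print(f"{name}: D = {ks_statistic:.3f}, p = {ks_pvalue:.4f}, fits_well = {fits_well}")
```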
- process_improve.univariate.metrics.variance_decomposition(df, measured, repeat)[source]#
Given a DataFrame df of raw data, and an indication of which column is the measured value column, and which is the repeat indicator, it will calculate the within and between replicate standard deviation.
Example
Two measurements on day 1 [101, 102] and two measurements on day 2 [94, 95]. The between-day variation can already be expected to be much greater than the within-day variation.
>>> df = pd.DataFrame(data={'Result': [101, 102, 94, 95], 'Repeat': [1, 1, 2, 2]})
>>> df
   Result  Repeat
0     101       1
1     102       1
2      94       2
3      95       2
>>> output = variance_decomposition(df, measured="Result", repeat="Repeat")
>>> output
{'total_ms': 16.666667, 'total_dof': 3, 'within_ms': 0.5, 'within_stddev': 0.70711, 'within_dof': 2, 'between_ms': 49.0, 'between_stddev': 7.0, 'between_dof': 1}
Note
* SSQ = sum of squares
* DOF = degrees of freedom
* MS = mean square = (sum of squares) / (degrees of freedom) = SSQ / DOF = variance
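The figures in the example can be reproduced by hand from the definitions in the note. The sketch below is a plain one-way sum-of-squares decomposition in pandas, written for illustration; it assumes, as the example output suggests, that each reported standard deviation is the square root of the corresponding mean square.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Result": [101, 102, 94, 95], "Repeat": [1, 1, 2, 2]})

grand_mean = df["Result"].mean()
groups = df.groupby("Repeat")["Result"]
group_means = groups.transform("mean")  # each row's own repeat mean

# Sums of squares (SSQ).
total_ssq = ((df["Result"] - grand_mean) ** 2).sum()
within_ssq = ((df["Result"] - group_means) ** 2).sum()
between_ssq = ((group_means - grand_mean) ** 2).sum()

# Degrees of freedom (DOF).
total_dof = len(df) - 1
between_dof = groups.ngroups - 1
within_dof = total_dof - between_dof

# Mean squares (MS = SSQ / DOF) and their square roots.
total_ms = total_ssq / total_dof
within_ms = within_ssq / within_dof
between_ms = between_ssq / between_dof

print(total_ms, within_ms, between_ms)
print(np.sqrt(within_ms), np.sqrt(between_ms))
```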