Batch Data Analysis#
- process_improve.batch.features.f_mean(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: mean.
The arithmetic mean for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
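As an orientation, the per-batch mean is conceptually a groupby-and-aggregate. A minimal sketch using plain pandas on a melted frame (the column and variable names here are illustrative, not the library's own code; the library's expected input is the dict format described at the end of this page):

```python
import pandas as pd

# Melted batch data: one row per sample, labelled with its batch identifier.
data = pd.DataFrame({
    "batch_id": [1, 1, 1, 2, 2, 2],
    "temperature": [20.0, 22.0, 24.0, 30.0, 32.0, 34.0],
})

# Per-batch mean of the chosen tag: the same idea f_mean applies per batch
# (and, when phase_col is given, additionally per phase within each batch).
feature_mean = data.groupby("batch_id")["temperature"].mean()
```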
- process_improve.batch.features.f_median(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: median.
The median for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
- process_improve.batch.features.f_std(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: std.
The standard deviation for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
See also: f_mad, f_iqr
- process_improve.batch.features.f_iqr(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: iqr.
The interquartile range (IQR) for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
The IQR is a robust variant of the standard deviation: the difference between the 75th and 25th percentiles of a sample. This is the 25% trimmed range, an example of an L-estimator.
See also: f_std, f_mad
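A quick numeric illustration of the IQR calculation, using plain NumPy rather than the library's code (the sample values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# IQR: difference between the 75th and 25th percentiles of the sample.
q75, q25 = np.percentile(x, [75, 25])
iqr = q75 - q25
```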
- process_improve.batch.features.f_mad(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: mad.
The MEAN (not MEDIAN) absolute deviation for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
The mean absolute deviation (MAD) is a measure of the variability of a univariate sample of quantitative data. For values in a sequence X1, X2, …, Xn, the MAD is the mean of the absolute deviations from the data's mean.
Since the mean can be biased by outliers, the MAD can also be biased. If an outlier-resistant estimate is required, see f_robust_mad.
See also: f_std, f_iqr, f_robust_mad
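A small worked example of the mean absolute deviation in plain NumPy (illustrative values; note how the outlier at 10 pulls both the mean and the MAD):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Mean absolute deviation: mean of |x - mean(x)|.
mean = np.nanmean(x)                  # 4.0
mad = np.nanmean(np.abs(x - mean))    # (3 + 2 + 1 + 0 + 6) / 5
```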
- process_improve.batch.features.f_robust_mad(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: robust mad.
The MEDIAN (not MEAN) absolute deviation for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data.
For a univariate data set X1, X2, …, Xn, the MAD is defined as the median of the absolute deviations from the data’s median.
import numpy as np
from scipy.stats import norm as Gaussian

c_MAD_constant = Gaussian.ppf(3 / 4.0)
median = np.nanmedian(x)
mad = np.nanmedian(np.fabs(x - median) / c_MAD_constant)
The constant correction factor is so that MAD agrees with standard deviation for normally distributed data.
Warning
This function is not yet implemented. Calling it always raises AssertionError; the underlying group-wise calculation below the raise is a placeholder that still needs to be corrected.
See also: f_mad, f_std, f_iqr
- process_improve.batch.features.f_sum(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: sum.
The SUM within each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
If the x-axis (time) data are evenly spaced, then this is directly proportional to the area under the trace (curve/trajectory).
See also: f_cumsum
- process_improve.batch.features.f_area(data, time_tag, tags=None, batch_col=None, phase_col=None)[source]#
Feature: area.
The AREA of each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column, against the time-based curve.
The spacing of the x-axis is taken into account, so this will produce accurate areas even if the data are not evenly spaced in time along the x-axis.
The area is calculated using the trapezoidal rule.
See also: f_sum, f_cumsum
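Since the trapezoidal rule handles uneven spacing naturally, the core of the calculation can be sketched in a few lines of NumPy (illustrative values, not the library's implementation):

```python
import numpy as np

t = np.array([0.0, 1.0, 3.0])   # unevenly spaced time axis
y = np.array([0.0, 2.0, 2.0])   # the tag's trajectory

# Trapezoidal rule: sum of average-height times interval-width,
# so the uneven spacing in t is accounted for directly.
area = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t))
```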
- process_improve.batch.features.f_rupture(data, columns=None, batch_col=None, phase_col=None)[source]#
Feature: rupture.
The breakpoint in a given tag in columns (usually a single tag), for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
- process_improve.batch.features.f_min(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: min.
The minimum value attained by each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
To get the time-point when the minimum occurred, see f_agemin.
See also: f_agemin, f_max
- process_improve.batch.features.f_max(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: max.
The maximum value attained by each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
To get the time-point when the maximum occurred, see f_agemax.
See also: f_min
- process_improve.batch.features.f_agemin(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: age at minimum value. Not yet implemented.
- process_improve.batch.features.f_agemax(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: age at maximum value. Not yet implemented.
- process_improve.batch.features.f_last(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: endpoint.
The final value attained by each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
If you want to know how many rows there are [i.e. the index of the last row], then consider using the f_count feature.
See also: f_sum, f_count
- process_improve.batch.features.f_count(data, tags=None, batch_col=None, phase_col=None)[source]#
Feature: count.
The index number of the final value for each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
Can be useful to get the 1-based index (it is a count!), and then to use that index for other calculation purposes.
See also: f_sum, f_last
- process_improve.batch.features.f_slope(data, x_axis_tag, tags=None, batch_col=None, phase_col=None, age_col=None)[source]#
Feature: slope.
The slope of the given tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
The slope is calculated against whichever variable is given by x_axis_tag. If this is the age_col of the batch (i.e. time duration), ensure that age_col is also specified.
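A least-squares slope against an x-axis variable can be sketched with np.polyfit (this illustrates the concept only; it is not necessarily the library's exact fitting routine):

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0])   # the x_axis_tag values (e.g. batch age)
y = np.array([1.0, 3.0, 5.0, 7.0])   # the tag's trajectory: slope 2, intercept 1

# Degree-1 least-squares fit returns [slope, intercept].
slope, intercept = np.polyfit(t, y, deg=1)
```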
- process_improve.batch.features.cross(series, threshold=0, direction='cross', only_index=False, first_point_only=False)[source]#
Given a Series returns all the index values where the data values equal the ‘threshold’ value. Will first drop all missing values from the series.
direction can be 'rising' (for the rising edge only), 'falling' (for the falling edge only), or 'cross' for both edges.
If only_index is True (default False), the 0-based index just before the crossing is returned. E.g. if the returned index is 135, then the crossing takes place at, or after, index 135, but before index 136.
If first_point_only is set to True, only the first point where a crossing occurs is reported; the rest are ignored. Default: all crossings are reported (first_point_only=False).
https://stackoverflow.com/questions/10475488/calculating-crossing-intercept-points-of-a-series-or-dataframe
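The crossing detection itself can be sketched with a sign-change test on the thresholded values (an illustration only; the function's actual return format may differ):

```python
import numpy as np
import pandas as pd

series = pd.Series([-2.0, -1.0, 1.0, 2.0, 1.0, -1.0]).dropna()
threshold = 0.0

# Where the boolean "above threshold" state flips, a crossing occurs.
# The reported index is the sample just before the crossing.
above = series.values > threshold
crossings = np.flatnonzero(np.diff(above.astype(int)) != 0)
```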
- process_improve.batch.features.f_crossing(data, tag, time_tag, threshold=0, direction='cross', only_index=False, batch_col=None, phase_col=None, suffix=None)[source]#
Feature: cross.
The time (time_tag) value at which tag crosses a certain numeric threshold, with direction='rising' (for the rising edge), direction='falling' (for the falling edge), or direction='cross' for both edges.
The time when the crossing occurs is found by linear interpolation between the indices. If you prefer the index itself, use only_index=True, but the default for that setting is False.
Does this for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
suffix: what to add to the data tag, to name this feature.
Note: NaN is returned for a given batch and phase, if the crossing is not found.
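The linear interpolation between the two samples bracketing the threshold can be sketched as follows (illustrative values; a rising edge is assumed):

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0])   # time axis (the time_tag)
y = np.array([0.0, 4.0, 8.0])   # the tag's trajectory
threshold = 6.0

# First index where the signal reaches the threshold (rising edge);
# the crossing lies between samples i-1 and i.
i = int(np.argmax(y >= threshold))
frac = (threshold - y[i - 1]) / (y[i] - y[i - 1])
t_cross = t[i - 1] + frac * (t[i] - t[i - 1])
```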
- process_improve.batch.features.f_elbow(data, x_axis_tag, tags=None, only_index=False, batch_col=None, phase_col=None)[source]#
Feature: elbow.
The "elbow" of the given tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column. The elbow is calculated against whichever variable is given by x_axis_tag (usually a time-based tag).
The function returns the value on the x-axis where the elbow occurs. Sometimes you might want the index of that value instead, so that you can also find the corresponding y-axis value; use only_index=True for such cases.
- process_improve.batch.preprocessing.determine_scaling(batches, columns_to_align=None, settings=None)[source]#
Scales the batch data according to the variable ranges.
- Parameters:
batches (dict[str, pd.DataFrame]) – Batch data, in the standard format (keyed by batch identifier).
columns_to_align (list, optional) – The column names (tags) to be scaled. If None, the columns of the first batch are used.
settings (dict, optional) – Optional overrides. Currently supports the key "robust" (bool, default True), which switches between a robust range (q98 - q02) and the raw (max - min) range.
- Returns:
range_scalers (DataFrame) – J rows, 2 columns: column 1 = range of each tag (approx. q98 - q02), column 2 = typical minimum of each tag (robustly calculated).
TODO (put this in a scikit-learn style: .fit() and .apply() style)
- Return type:
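The robust range in the default settings is a percentile spread. A minimal sketch of the q98 - q02 scaling for one tag (column and variable names are illustrative, not the library's code):

```python
import numpy as np
import pandas as pd

batch = pd.DataFrame({"temp": np.linspace(20.0, 120.0, 101)})

# Robust range: 98th minus 2nd percentile, far less sensitive to a few
# outliers than the raw (max - min) range.
low = batch["temp"].quantile(0.02)
rng = batch["temp"].quantile(0.98) - low

# Scale the tag by its robust range, anchored at its typical minimum.
scaled = (batch["temp"] - low) / rng
```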
- process_improve.batch.preprocessing.apply_scaling(batches, scale_df, columns_to_align=None)[source]#
Scales the batches according to the information in the scaling dataframe.
- process_improve.batch.preprocessing.reverse_scaling(batches, scale_df, columns_to_align=None)[source]#
Reverse the scaling applied by apply_scaling.
- class process_improve.batch.preprocessing.DTWresult(synced, penalty_matrix, md_path, warping_path, distance, normalized_distance)[source]#
Bases: object
Result class.
- process_improve.batch.preprocessing.align_with_path(md_path, batch, initial_row)[source]#
Align a batch to the reference using the DTW path.
- process_improve.batch.preprocessing.dtw_core(test, ref, weight_matrix)[source]#
Compute DTW alignment of test batch against reference batch.
- process_improve.batch.preprocessing.one_iteration_dtw(batches_scaled, refbatch_sc, weight_matrix, settings=None)[source]#
Perform one iteration of the DTW alignment algorithm.
- process_improve.batch.preprocessing.batch_dtw(batches, columns_to_align, reference_batch, settings=None)[source]#
Synchronize, via iterative DTW, with weighting.
Algorithm: Kassidas et al. (2004): https://doi.org/10.1002/aic.690440412
- Parameters:
batches (dict[str, pd.DataFrame]) – Batch data, in the standard format.
columns_to_align (list) – Which columns to use during the alignment process. The others are aligned, but get no weight, and therefore do not influence the objective function.
reference_batch (str) – Which key in the batches is the reference batch to use.
settings (dict) –
Default settings are:
{
    "maximum_iterations": 25,              # stops here, even if not converged
    "tolerance": 0.1,                      # convergence tolerance
    "robust": True,                        # use robust scaling
    "show_progress": True,                 # show progress
    "subsample": 1,                        # use every sample
    "interpolate_time_axis_maximum": 100,  # resample time axis to this scale
    "interpolate_time_axis_delta": 1,      # resolution of resampled axis
    "interpolate_method": "cubic",         # any scipy.interpolate.interp1d method
}
The default settings resample the time axis to 100 data points, starting at 0 and ending at 99. Adjust the delta for more points, or change the maximum.
- Returns:
dict – Various outputs relevant to the alignment. TODO: Document completely later.
Notation:
I = number of batches (index i)
i = index for the batches
J = number of tags (columns in each batch)
j = index for the tags
k = index into the rows of each batch, the samples (0 ≤ k ≤ K_i)
- Return type:
- process_improve.batch.preprocessing.resample_to_reference(batches, columns_to_align, reference_batch, settings=None)[source]#
Resamples all batches (only the columns_to_align) to the duration of the batch with identifier reference_batch.
- Parameters:
- Returns:
Batch data, in the standard format.
- Return type:
- process_improve.batch.preprocessing.find_average_length(batches, settings=None)[source]#
Find the batch in batches with the average length.
- process_improve.batch.preprocessing.find_reference_batch(batches, columns_to_align, settings=None)[source]#
Find a reference batch. Assumes NO missing data.
Starts with the average-duration batch; resamples (simple interpolation) all batches to that duration. Unfolds the resampled data and does PCA on the wide, unfolded data, fitting, by default, 4 components. Excludes all batches with Hotelling's T2 above the 90% limit, then refits the PCA with 4 components. Finds the batch whose multivariate combination of scores is smallest (i.e. closest to the model center) and ensures this batch has SPE < 50% of the model limit.
- Parameters:
batches (dict[str, pd.DataFrame]) – Batch data, in the standard format.
columns_to_align (list) – Which columns to use. Others are ignored.
settings (dict, optional) –
Default settings are:
{
    "robust": True,                    # use robust scaling
    "subsample": 1,                    # use every sample
    "method": "pca_most_average",      # most average batch from a crude PCA
    "n_components": 4,
    "number_of_reference_batches": 1,  # only a single batch returned
}
- Returns:
When settings["number_of_reference_batches"] == 1 (the default), a single dictionary key from batches is returned. When more than one reference batch is requested, a list of that many keys is returned, ordered from most to least central in the PCA model.
- Return type:
- process_improve.batch.plotting.get_rgba_from_triplet(incolour, alpha=1, as_string=False)[source]#
Convert the input colour triplet (list) to a Plotly rgba(r,g,b,a) string if as_string is True. If False, the list of 3 integer RGB values is returned.
E.g. [0.9677975592919913, 0.44127456009157356, 0.5358103155058701] -> ‘rgba(246,112,136,1)’
- process_improve.batch.plotting.plot_to_HTML(filename, fig)[source]#
Export a Plotly figure to an HTML file.
- process_improve.batch.plotting.plot_all_batches_per_tag(df_dict, tag, tag_y2=None, time_column=None, extra_info='', batches_to_highlight=None, x_axis_label='Time [sequence order]', highlight_width=5, html_image_height=900, html_aspect_ratio_w_over_h=1.7777777777777777, y1_limits=(None, None), y2_limits=(None, None))[source]#
Plot a particular tag over all batches in the given df_dict.
- Parameters:
df_dict (dict) – Standard data format for batches.
tag (str) – Which tag to plot? [on the y1 (left) axis]
tag_y2 (str, optional) – Which tag to plot? [on the y2 (right) axis] Tag will be plotted with different scaling on the secondary axis, to allow time-series comparisons to be easier.
time_column (str, optional) – Which tag on the x-axis. If not specified, creates sequential integers, starting from 0 if left as the default, None.
extra_info (str, optional) – Used in the plot title to add any extra details, by default “”
batches_to_highlight (dict, optional) –
Keys are JSON strings parseable by json.loads into a Plotly line specifier. For example:
batches_to_highlight = {'{"width": 2, "color": "rgba(255,0,0,0.5)"}': redlist}
will plot the batch identifiers in redlist with that colour and line width.
x_axis_label (str, optional) – String label for the x-axis, by default "Time [sequence order]"
highlight_width (int, optional) – The width of the highlighted lines; default = 5.
html_image_height (int, optional) – HTML image output height, by default 900
html_aspect_ratio_w_over_h (float, optional) – HTML image aspect ratio: 16/9 (therefore the default width will be 1600 px)
y1_limits (tuple, optional) – Axis limits enforced on the y1 (left) axis. Default is (None, None) which means the data themselves are used to determine the limits. Specify BOTH limits. Plotly requires (at the moment plotly/plotly.js#400) that you specify both. Order: (low limit, high limit)
y2_limits (tuple, optional) – Axis limits enforced on the y2 (right) axis. Default is (None, None) which means the data themselves are used to determine the limits. Specify BOTH limits. Plotly requires (at the moment plotly/plotly.js#400) that you specify both.
- Returns:
Standard Plotly fig object (dictionary-like).
- Return type:
go.Figure
- process_improve.batch.plotting.colours_per_batch_id(batch_ids, batches_to_highlight, default_line_width, use_default_colour=False, colour_map=None)[source]#
Return a colour to use for each trace in the plot. A dictionary: keys are batch ids, and the value is a colour and line width setting for Plotly.
- use_default_colour: bool
If True, then the default colour (grey: 0.5, 0.5, 0.5) is used.
- process_improve.batch.plotting.plot_multitags(df_dict, batch_list=None, tag_list=None, time_column=None, batches_to_highlight=None, settings=None, fig=None)[source]#
Plot all the tags for a batch; or a subset of tags, if specified in tag_list.
- Parameters:
df_dict (dict) – Standard data format for batches.
batch_list (list [default: None, will plot all batches in df_dict]) – Which batches to plot; if provided, must be a list of valid keys into df_dict.
tag_list (list [default: None, will plot all tags in the dataframes]) – Which tags to plot; tags will also be plotted in this order, or in the order of the first dataframe if not specified.
time_column (str, optional) – Which tag on the x-axis. If not specified, creates sequential integers, starting from 0 if left as the default, None.
batches_to_highlight (dict, optional) –
Keys are JSON strings parseable by json.loads into a Plotly line specifier. For example:
batches_to_highlight = {'{"width": 2, "color": "rgba(255,0,0,0.5)"}': redlist}
will plot the batch identifiers in redlist with that colour and line width.
settings (dict) –
Default settings:
{
    "nrows": 1,                               # int: number of subplot rows
    "ncols": None,                            # int or None: columns (None = auto)
    "x_axis_label": "Time, grouped per tag",  # str: x-axis label
    "title": "",                              # str: overall plot title
    "show_legend": True,                      # bool: show legend
    "html_image_height": 900,                 # int: image height in pixels
    "html_aspect_ratio_w_over_h": 16/9,       # float: width as ratio of height
}
fig (go.Figure) – If supplied, uses the existing Plotly figure to draw in.
- Return type:
Figure
- process_improve.batch.plotting.generate_one_frame(df_dict, tag_list, fig, up_to_index, time_column, batch_ids_to_animate, animation_colour_assignment, show_legend=False, hovertemplate='', max_columns=0)[source]#
Return a list of dictionaries.
Each entry in the list is for each subplot; in the order of the subplots. Since each subplot is a tag, we need the tag_list as input.
Getting data into the required format for use with this library.
There are 3 useful ways to represent batch data.
dict: as a Python dictionary. Example:
data = {
"batch 1": data frame with varying number of rows, but same number of columns,
"batch 2": etc,
}
The keys are unique identifiers for each batch, such as integers or strings.
melt: as a single Pandas data frame:
data = pd.DataFrame(...)
Characteristics:
very large number of rows, for all batches stacked vertically on top of each other
some number of columns, one column per tag
one column, usually called batch_id, indicates which batch that row belongs to
another column, usually called time, indicates the time within that batch
typically sorted, but does not have to be
wide: as a single Pandas data frame, as for the “melted” version, but pivoted instead.
These wide dataframes always have a multilevel column index to distinguish the tags
from the time. This representation is only valid for aligned data. Example:
data = pd.DataFrame(...)
Characteristics:
each row is a unique batch number
the multilevel column index has level 0 = column name, level 1 = aligned time
only makes sense if the data are aligned (same number of elements in each level-1 index)
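A minimal sketch of building such a wide frame from melted data with a plain pandas pivot (column names illustrative; the library's own converters add conventions not shown here):

```python
import pandas as pd

melted = pd.DataFrame({
    "batch_id": [1, 1, 2, 2],
    "time": [0, 1, 0, 1],       # aligned time: same samples in every batch
    "temp": [20.0, 21.0, 30.0, 31.0],
})

# Pivot to wide: one row per batch, multilevel columns (tag, aligned time).
wide = melted.pivot(index="batch_id", columns="time", values=["temp"])
```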
- process_improve.batch.data_input.check_valid_batch_dict(in_dict, no_nan=False)[source]#
Check if the incoming dictionary of batch data is a valid dictionary of data.
Checks:
1. All batches in the dictionary have the same number of columns.
2. All columns are numeric.
3. If no_nan is True, also checks that there are no NaNs.
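A hypothetical re-implementation of the three documented checks, to make them concrete (this is a sketch, not the library's actual code):

```python
import pandas as pd

def is_valid_batch_dict(in_dict, no_nan=False):
    """Sketch of the documented checks: equal column counts across batches,
    all-numeric columns, and (optionally) no NaNs anywhere."""
    # Check 1: every batch dataframe has the same number of columns.
    if len({df.shape[1] for df in in_dict.values()}) != 1:
        return False
    for df in in_dict.values():
        # Check 2: all columns are numeric.
        if not all(pd.api.types.is_numeric_dtype(dt) for dt in df.dtypes):
            return False
        # Check 3 (optional): no missing values.
        if no_nan and df.isna().any().any():
            return False
    return True

batches = {
    "batch 1": pd.DataFrame({"temp": [20.0, 21.0]}),
    "batch 2": pd.DataFrame({"temp": [30.0, 31.0]}),
}
ok = is_valid_batch_dict(batches, no_nan=True)
```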
- process_improve.batch.data_input.dict_to_melted(in_df, insert_batch_id_column=True, insert_sequence_column=False)[source]#
Reverse of melted_to_dict.
- process_improve.batch.data_input.dict_to_wide(in_df, group_by_batch=False)[source]#
Convert aligned batch data from dict to wide format.
group_by_batch, if True, means that all the data from the first batch is on the left of the output dataframe, and the last batch is collected on the right.
If group_by_batch is False, then data for the same tag are grouped together, side-by-side.
TODO: group_by_batch is not implemented yet.
- process_improve.batch.data_input.melted_to_dict(in_df, batch_id_col)[source]#
Load a “melted” data set, where one of the columns is the batch_id_col. The data are grouped along the unique values of batch_id_col, and each group is stored in a dictionary. The dictionary keys are the batch identifier, and the corresponding value is a Pandas dataframe of the batch data for that batch.
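The core of this conversion is a pandas groupby; a minimal sketch (here batch_id_col is literally "batch_id", and the column names are illustrative):

```python
import pandas as pd

melted = pd.DataFrame({
    "batch_id": [1, 1, 2],
    "temp": [20.0, 21.0, 30.0],
})

# Split the melted frame into the dict-of-DataFrames format:
# one entry per unique batch identifier.
batch_dict = {bid: frame.drop(columns="batch_id")
              for bid, frame in melted.groupby("batch_id")}
```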
- process_improve.batch.data_input.melted_to_wide(in_df, batch_id_col)[source]#
Convert aligned melted data to wide format.
- process_improve.batch.data_input.wide_to_melted(in_df)[source]#
Convert wide-format batch data to melted format. Not yet implemented.