Batch Data Analysis
- process_improve.batch.features.f_mean(data, tags=None, batch_col=None, phase_col=None)
Feature: mean.
The arithmetic mean of each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
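The per-batch, per-phase grouping that these feature functions describe can be sketched with plain pandas. This is an illustration of the aggregation only, not the library's implementation; the column names batch_id, phase, and temperature are made up for the example.

```python
import pandas as pd

# Hypothetical melted data: two batches, each with two phases.
data = pd.DataFrame(
    {
        "batch_id": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "phase": ["heat", "heat", "cool", "cool", "heat", "heat", "cool", "cool"],
        "temperature": [20.0, 30.0, 28.0, 22.0, 21.0, 35.0, 30.0, 20.0],
    }
)

# One row per (batch, phase) pair, one column per tag.
mean_per_group = data.groupby(["batch_id", "phase"])[["temperature"]].mean()
print(mean_per_group)
```

Swapping .mean() for .median(), .std(), .min(), or .max() gives the analogous grouped statistic.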
- process_improve.batch.features.f_median(data, tags=None, batch_col=None, phase_col=None)
Feature: median.
The median of each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
- process_improve.batch.features.f_std(data, tags=None, batch_col=None, phase_col=None)
Feature: std.
The standard deviation of each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
See also: f_iqr
- process_improve.batch.features.f_iqr(data, tags=None, batch_col=None, phase_col=None)
Feature: iqr.
The InterQuartile Range (IQR) of each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
The IQR is a robust alternative to the standard deviation. It is the difference between the 75th percentile and the 25th percentile of a sample; this is the 25% trimmed range, an example of an L-estimator.
See also: f_std
- process_improve.batch.features.f_robust_mad(data, tags=None, batch_col=None, phase_col=None)
Feature: robust_mad.
The Median Absolute Deviation (MAD) of each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
The MAD is a robust alternative to the standard deviation. It is scaled by the normal-consistency factor (~1.4826), so that for normally distributed data it estimates the same quantity as f_std.
See also: f_std, f_iqr
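To see why the IQR and the scaled MAD are called robust, the three spread measures can be compared on a small sample containing one outlier. The sketch below uses only the Python standard library. The data values and the "inclusive" percentile convention are assumptions for illustration, and the library may compute percentiles differently.

```python
import statistics

# Hypothetical trace with one gross outlier (50.0).
values = [2.0, 4.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0]

# Classical standard deviation: inflated by the outlier.
std = statistics.stdev(values)

# IQR: 75th percentile minus 25th percentile.
q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q75 - q25

# MAD: median absolute deviation from the median, scaled by ~1.4826.
center = statistics.median(values)
mad = statistics.median(abs(v - center) for v in values)
scaled_mad = 1.4826 * mad

print(f"std={std:.2f}  iqr={iqr:.2f}  scaled_mad={scaled_mad:.2f}")
```

The outlier pulls the standard deviation well above the other two measures, which barely notice it.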
- process_improve.batch.features.f_sum(data, tags=None, batch_col=None, phase_col=None)
Feature: sum.
The sum of each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
If the x-axis (time) data are evenly spaced, then this is directly proportional to the area under the trace (curve/trajectory).
See also: f_cumsum
- process_improve.batch.features.f_area(data, time_tag, tags=None, batch_col=None, phase_col=None)
Feature: area.
The area under the curve of each tag listed in tags, taken against the time axis given by time_tag, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
The spacing of the x-axis is taken into account, so the areas remain accurate even if the data are not evenly spaced in time.
The area is calculated using the trapezoidal rule.
See also: f_sum, f_cumsum
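The difference between a plain sum and a trapezoidal area shows up as soon as the time axis is unevenly spaced. The following is a minimal sketch of the trapezoidal rule itself, with made-up data; the helper name trapezoid_area is hypothetical and not part of the library.

```python
def trapezoid_area(x, y):
    """Area under y(x) by the trapezoidal rule; x need not be evenly spaced."""
    return sum(
        0.5 * (y0 + y1) * (x1 - x0)
        for x0, x1, y0, y1 in zip(x, x[1:], y, y[1:])
    )


time = [0.0, 1.0, 2.0, 4.0, 8.0]  # unevenly spaced time stamps
flow = [2.0, 2.0, 2.0, 2.0, 2.0]  # constant signal

area = trapezoid_area(time, flow)  # height 2 over a duration of 8
total = sum(flow)                  # ignores the spacing entirely
print(area, total)
```

Here the area is 16.0 while the sum is 10.0: the sum is only proportional to the area when every interval has the same width.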
- process_improve.batch.features.f_rupture(data, columns=None, batch_col=None, phase_col=None)
Feature: rupture.
The breakpoint in the tag given in columns (usually a single tag), for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
- process_improve.batch.features.f_min(data, tags=None, batch_col=None, phase_col=None)
Feature: min.
The minimum value attained by each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
To get the time-point at which the minimum occurred, use f_agemin.
See also: f_agemin, f_max
- process_improve.batch.features.f_max(data, tags=None, batch_col=None, phase_col=None)
Feature: max.
The maximum value attained by each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
To get the time-point at which the maximum occurred, use f_agemax.
See also: f_min
- process_improve.batch.features.f_agemin(data, tags=None, batch_col=None, phase_col=None)
Feature: agemin.
The age (the index label, i.e. the time stamp or sample number) at which each tag listed in tags attained its minimum value, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
See also: f_min, f_agemax
- process_improve.batch.features.f_agemax(data, tags=None, batch_col=None, phase_col=None)
Feature: agemax.
The age (the index label, i.e. the time stamp or sample number) at which each tag listed in tags attained its maximum value, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
See also: f_max, f_agemin
- process_improve.batch.features.f_last(data, tags=None, batch_col=None, phase_col=None)
Feature: endpoint.
The final value attained by each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
If you want the number of rows (i.e. the position of the last row), consider using the f_count feature.
See also: f_sum, f_count
- process_improve.batch.features.f_count(data, tags=None, batch_col=None, phase_col=None)
Feature: count.
The number of non-missing observations for each tag listed in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
For data without internal gaps this count equals the 1-based index of the final row, so it can also be used as that index for other calculations.
See also: f_sum, f_last
- process_improve.batch.features.f_slope(data, x_axis_tag, tags=None, batch_col=None, phase_col=None, age_col=None)
Feature: slope.
The slope of the given tags for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
The slope is calculated against whichever variable is given by x_axis_tag. If this is the age_col of the batch (i.e. time duration), ensure that age_col is also specified.
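The text does not state how the slope is estimated. As one plausible reading, the sketch below fits an ordinary least-squares line of a tag against the x-axis variable. Both the choice of least squares and the helper name ols_slope are assumptions made for illustration.

```python
def ols_slope(x, y):
    """Slope of the least-squares line through the points (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / sxx


age = [0.0, 1.0, 2.0, 3.0]          # hypothetical batch age (x-axis)
temperature = [1.0, 3.0, 5.0, 7.0]  # rises 2 units per unit of age

slope = ols_slope(age, temperature)
print(slope)
```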
- process_improve.batch.features.cross(series, threshold=0, direction='cross', only_index=False, first_point_only=False)
Given a Series, returns all the index values where the data values cross the threshold value. Missing values are dropped from the series first.
direction can be 'rising' (rising edges only), 'falling' (falling edges only), or 'cross' (both edges).
If only_index is True (default False), the 0-based index at, or just before, each crossing is returned. E.g. if the returned index is 135, then the crossing takes place at, or after, index 135, but before index 136.
If first_point_only is True, only the first crossing is reported and the rest are ignored. By default (first_point_only=False) all crossings are reported.
https://stackoverflow.com/questions/10475488/calculating-crossing-intercept-points-of-a-series-or-dataframe
- process_improve.batch.features.f_crossing(data, tag, time_tag, threshold=0, direction='cross', only_index=False, batch_col=None, phase_col=None, suffix=None)
Feature: cross.
The time (time_tag) value at which tag crosses a given numeric threshold, with direction='rising' (rising edge), direction='falling' (falling edge), or direction='cross' (both edges).
The time when the crossing occurs is found by linear interpolation between the indices. If you prefer the index itself, use only_index=True, but the default for that setting is False.
This is done for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
suffix: the text appended to the tag name to name this feature.
Note: NaN is returned for a given batch and phase, if the crossing is not found.
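The linear interpolation mentioned above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the library's code: the helper name crossing_times is hypothetical, edge cases (such as values exactly equal to the threshold) may be handled differently by the library, and this sketch returns an empty list rather than NaN when nothing crosses.

```python
def crossing_times(time, values, threshold=0.0, direction="cross"):
    """Interpolated times at which `values` passes through `threshold`."""
    found = []
    for t0, t1, v0, v1 in zip(time, time[1:], values, values[1:]):
        rising = v0 < threshold <= v1
        falling = v0 > threshold >= v1
        wanted = (rising and direction in ("rising", "cross")) or (
            falling and direction in ("falling", "cross")
        )
        if wanted:
            # Fraction of the interval [t0, t1] at which the threshold is met.
            fraction = (threshold - v0) / (v1 - v0)
            found.append(t0 + fraction * (t1 - t0))
    return found


time = [0.0, 10.0, 20.0, 30.0, 40.0]
values = [1.0, 3.0, 7.0, 3.0, 1.0]

both = crossing_times(time, values, threshold=5.0)
rising_only = crossing_times(time, values, threshold=5.0, direction="rising")
print(both, rising_only)
```

The signal passes 5.0 halfway between the samples at times 10 and 20 on the way up, and halfway between 20 and 30 on the way down.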
- process_improve.batch.features.f_elbow(data, x_axis_tag, tags=None, only_index=False, batch_col=None, phase_col=None)
Feature: elbow.
The “elbow” of the given tags for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.
The elbow is calculated against whichever variable is given by x_axis_tag (usually a time-based tag).
The function returns the value on the x-axis where the elbow occurs. Sometimes you might want the index of the value, so you can also find the corresponding y-axis value. Use only_index=True for such cases.
- process_improve.batch.preprocessing.determine_scaling(batches, columns_to_align=None, settings=None)
Scales the batch data according to the variable ranges.
- Parameters:
batches (dict[str, pd.DataFrame]) – Batch data, in the standard format (keyed by batch identifier).
columns_to_align (list, optional) – The column names (tags) to be scaled. If None, the columns of the first batch are used.
settings (dict, optional) – Optional overrides. Currently supports the key "robust" (bool, default True), which switches between a robust range (q98 - q02) and the raw (max - min) range.
- Returns:
range_scalers (DataFrame) – J rows, 2 columns: column 1 = range of each tag (approx. q98 - q02), column 2 = typical minimum of each tag (robustly calculated).
TODO (put this in a scikit-learn style: .fit() and .apply() style)
- process_improve.batch.preprocessing.apply_scaling(batches, scale_df, columns_to_align=None)
Scales the batches according to the information in the scaling dataframe.
- process_improve.batch.preprocessing.reverse_scaling(batches, scale_df, columns_to_align=None)
Reverse the scaling applied by apply_scaling.
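The idea behind the robust range (q98 - q02) and the scale/reverse pair can be sketched with the standard library. Several things here are assumptions for illustration: the helper names, the use of q02 as the "typical minimum", and the formula (x - minimum) / range. The text above does not spell out the exact formula the library applies.

```python
import statistics


def robust_range(values):
    """Return (range, minimum) based on the 2nd and 98th percentiles."""
    cuts = statistics.quantiles(values, n=50, method="inclusive")
    q02, q98 = cuts[0], cuts[-1]
    return q98 - q02, q02


def scale(values, rng, minimum):
    return [(v - minimum) / rng for v in values]


def unscale(scaled, rng, minimum):
    return [s * rng + minimum for s in scaled]


# Hypothetical tag: 0..99 plus one spike at 1000.
values = [float(v) for v in range(100)] + [1000.0]

rng, minimum = robust_range(values)
raw_range = max(values) - min(values)
restored = unscale(scale(values, rng, minimum), rng, minimum)
print(rng, minimum, raw_range)
```

The single spike stretches the raw range to 1000, while the robust range stays at 96.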
- class process_improve.batch.preprocessing.DTWresult(synced, penalty_matrix, md_path, warping_path, distance, normalized_distance)
Bases: object
Result class.
- process_improve.batch.preprocessing.align_with_path(md_path, batch)
Align a batch to the reference using the DTW path.
Where several samples of batch map to the same reference index (a compression in the warping path), the synced value for that index is the average of those batch samples. The running temp accumulator is therefore seeded with the first batch sample for the current index (the same value assigned to synced row 0 just below), not with a reference row. A former initial_row argument seeded it from the reference row (in one caller) or from an out-of-space batch index (in the other), which mixed an unrelated row into the row-0 average (#197).
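The averaging rule described above can be illustrated on a single tag. In this sketch the warping path is represented as a list of (batch index, reference index) pairs, which is an assumption about the format rather than the library's actual md_path structure; the helper name is also hypothetical.

```python
from collections import defaultdict


def average_along_path(path, batch):
    """Average all batch samples that map to the same reference index."""
    groups = defaultdict(list)
    for batch_idx, ref_idx in path:
        groups[ref_idx].append(batch[batch_idx])
    return [sum(groups[i]) / len(groups[i]) for i in sorted(groups)]


# Batch samples 0 and 1 both map to reference index 0 (a compression);
# batch sample 2 maps to reference indices 1 and 2 (an expansion).
path = [(0, 0), (1, 0), (2, 1), (2, 2), (3, 3)]
batch = [10.0, 20.0, 30.0, 40.0]

synced = average_along_path(path, batch)
print(synced)
```

Row 0 of the result is the average of batch samples only (10.0 and 20.0); no reference value enters that average.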
- process_improve.batch.preprocessing.dtw_core(test, ref, weight_matrix)
Compute DTW alignment of test batch against reference batch.
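For readers unfamiliar with dynamic time warping, the core recursion can be sketched for two one-dimensional sequences. This is the textbook, unweighted form with an absolute-difference cost; it is not the library's dtw_core, which works on multivariate batches with a weight matrix. The function name and the (test index, reference index) path format are assumptions.

```python
def dtw_distance(test, ref):
    """Textbook DTW: returns (total cost, warping path)."""
    inf = float("inf")
    n, m = len(test), len(ref)
    # acc[i][j] = cheapest cost of aligning test[:i] with ref[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(test[i - 1] - ref[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])

    # Backtrack from the end to recover the path of (test, ref) index pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {
            (i - 1, j - 1): acc[i - 1][j - 1],
            (i - 1, j): acc[i - 1][j],
            (i, j - 1): acc[i][j - 1],
        }
        i, j = min(steps, key=steps.get)
    path.reverse()
    return acc[n][m], path


test = [1.0, 2.0, 3.0, 3.0, 4.0]  # lingers at 3.0 for one extra sample
ref = [1.0, 2.0, 3.0, 4.0]

distance, path = dtw_distance(test, ref)
print(distance, path)
```

The two test samples at 3.0 both map to the single reference sample at 3.0, so the alignment cost is zero.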
- process_improve.batch.preprocessing.one_iteration_dtw(batches_scaled, refbatch_sc, weight_matrix, settings=None)
Perform one iteration of the DTW alignment algorithm.
- process_improve.batch.preprocessing.batch_dtw(batches, columns_to_align, reference_batch, settings=None)
Synchronize, via iterative DTW, with weighting.
Algorithm: Kassidas et al. (1998): https://doi.org/10.1002/aic.690440412
- Parameters:
batches (dict[str, pd.DataFrame]) – Batch data, in the standard format.
columns_to_align (list) – Which columns to use during the alignment process. The others are aligned, but get no weight, and therefore do not influence the objective function.
reference_batch (str) – Which key in the batches is the reference batch to use.
settings (dict) –
Default settings are:
{
    "maximum_iterations": 25,               # stops here, even if not converged
    "tolerance": 0.1,                       # convergence tolerance
    "robust": True,                         # use robust scaling
    "show_progress": True,                  # show progress
    "subsample": 1,                         # use every sample
    "interpolate_time_axis_maximum": 100,   # resample time axis to this scale
    "interpolate_time_axis_delta": 1,       # resolution of resampled axis
    "interpolate_method": "cubic",          # any scipy.interpolate.interp1d method
}
The default settings resample the time axis to 100 data points, starting at 0 and ending at 99. Adjust the delta for more points, or change the maximum.
- Returns:
dict – Various outputs relevant to the alignment. TODO: Document completely later.
Notation:
I = number of batches (index = i)
i = index for the batches
J = number of tags (columns in each batch)
j = index for the tags
k = index into the rows of each batch, the samples (0 … k … K_i)
- process_improve.batch.preprocessing.resample_to_reference(batches, columns_to_align, reference_batch, settings=None)
Resamples all batches (only the columns_to_align) to the duration of the batch identified by reference_batch.
- Returns:
Batch data, in the standard format.
- process_improve.batch.preprocessing.find_average_length(batches, settings=None)
Find the batch in batches with the average length.
- process_improve.batch.preprocessing.find_reference_batch(batches, columns_to_align, settings=None)
Find a reference batch. Assumes NO missing data.
Starts with the batch of average duration and resamples (simple interpolation) all batches to that duration. Unfolds the resampled data and fits a PCA model, with 4 components by default, to the wide, unfolded data. Excludes all batches with Hotelling’s T2 above the 90% limit, then refits the PCA model with 4 components. Finds the batch whose multivariate combination of scores is the smallest (i.e. closest to the model center), and ensures that this batch has SPE < 50% of the model limit.
- Parameters:
batches (dict[str, pd.DataFrame]) – Batch data, in the standard format.
columns_to_align (list) – Which columns to use. Others are ignored.
settings (dict, optional) –
Default settings are:
{
    "robust": True,                       # use robust scaling
    "subsample": 1,                       # use every sample
    "method": "pca_most_average",         # most average batch from a crude PCA
    "n_components": 4,
    "number_of_reference_batches": 1,     # only a single batch returned
}
- Returns:
When settings["number_of_reference_batches"] == 1 (the default), a single dictionary key from batches is returned. When more than one reference batch is requested, a list of that many keys is returned, ordered from most to least central in the PCA model.
- class process_improve.batch.plotting.MultiTagPlotSettings(*, nrows=1, ncols=0, x_axis_label='Time, grouped per tag', title='', show_legend=True, mode='lines', html_image_height=900, html_aspect_ratio_w_over_h=1.7777777777777777, default_line_width=2, colour_map=<function husl_palette>, animate=False, animate_batches_to_highlight=<factory>, animate_show_slider=True, animate_show_pause=True, animate_slider_prefix='Index: ', animate_slider_vertical_offset=-0.3, animate_line_width=4, animate_n_frames=None, animate_framerate_milliseconds=0)[source]#
Bases: BaseModel
Settings for plot_multitags().
All fields have sensible defaults; pass a plain dict of overrides to plot_multitags(..., settings=...).
- Parameters:
nrows (int)
ncols (int)
x_axis_label (str)
title (str)
show_legend (bool)
mode (str)
html_image_height (int)
html_aspect_ratio_w_over_h (float)
default_line_width (float)
colour_map (Callable)
animate (bool)
animate_batches_to_highlight (list)
animate_show_slider (bool)
animate_show_pause (bool)
animate_slider_prefix (str)
animate_slider_vertical_offset (float)
animate_line_width (float)
animate_n_frames (int | None)
animate_framerate_milliseconds (int)
- model_config = {'arbitrary_types_allowed': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- colour_map: Callable
- process_improve.batch.plotting.get_rgba_from_triplet(incolour, alpha=1, as_string=False)
Convert the input colour triplet (list) to a Plotly rgba(r,g,b,a) string if as_string is True. If False it will return the list of 3 integer RGB values.
E.g. [0.9677975592919913, 0.44127456009157356, 0.5358103155058701] -> ‘rgba(246,112,136,1)’
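The documented example can be reproduced with a short sketch. The scaling by 255 and the truncation toward zero are inferred from that one example (rounding would give 247,113,137 instead), so treat both as assumptions rather than a statement of how the library does it. The function name below is deliberately different from the library's.

```python
def rgba_from_triplet(triplet, alpha=1, as_string=False):
    """Convert 0-1 colour fractions to 0-255 integers (truncating)."""
    rgb = [int(channel * 255) for channel in triplet]
    if as_string:
        return f"rgba({rgb[0]},{rgb[1]},{rgb[2]},{alpha})"
    return rgb


triplet = [0.9677975592919913, 0.44127456009157356, 0.5358103155058701]

as_text = rgba_from_triplet(triplet, as_string=True)
as_list = rgba_from_triplet(triplet)
print(as_text, as_list)
```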
- process_improve.batch.plotting.plot_to_HTML(filename, fig)
Export a Plotly figure to an HTML file.
- process_improve.batch.plotting.plot_all_batches_per_tag(df_dict, tag, tag_y2=None, time_column=None, extra_info='', batches_to_highlight=None, x_axis_label='Time [sequence order]', highlight_width=5, html_image_height=900, html_aspect_ratio_w_over_h=1.7777777777777777, y1_limits=(None, None), y2_limits=(None, None), mode='lines')
Plot a particular tag over all batches in the given dictionary of dataframes, df_dict.
- Parameters:
df_dict (dict) – Standard data format for batches.
tag (str) – Which tag to plot? [on the y1 (left) axis]
tag_y2 (str, optional) – Which tag to plot? [on the y2 (right) axis] Tag will be plotted with different scaling on the secondary axis, to allow time-series comparisons to be easier.
time_column (str, optional) – Which tag to use on the x-axis. If left as the default, None, sequential integers starting from 0 are used instead.
extra_info (str, optional) – Used in the plot title to add any extra details, by default “”
batches_to_highlight (dict, optional) –
Keys are JSON strings parseable by json.loads into a Plotly line specifier. For example:
batches_to_highlight = {'{"width": 2, "color": "rgba(255,0,0,0.5)"}': redlist}
will plot the batch identifiers in redlist with that colour and linewidth.
x_axis_label (str, optional) – String label for the x-axis, by default “Time [sequence order]”
highlight_width (int, optional) – The width of the highlighted lines; default = 5.
html_image_height (int, optional) – HTML image output height, by default 900
html_aspect_ratio_w_over_h (float, optional) – HTML image aspect ratio: 16/9 (therefore the default width will be 1600 px)
y1_limits (tuple, optional) – Axis limits enforced on the y1 (left) axis. Default is (None, None) which means the data themselves are used to determine the limits. Specify BOTH limits. Plotly requires (at the moment plotly/plotly.js#400) that you specify both. Order: (low limit, high limit)
y2_limits (tuple, optional) – Axis limits enforced on the y2 (right) axis. Default is (None, None) which means the data themselves are used to determine the limits. Specify BOTH limits. Plotly requires (at the moment plotly/plotly.js#400) that you specify both.
mode (str, optional) – Plotly trace draw mode, by default “lines”. Use “lines+markers” to also show a marker at each data point, or “markers” for markers only.
- Returns:
Standard Plotly fig object (dictionary-like).
- Return type:
go.Figure
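Because the keys of batches_to_highlight are JSON strings, it can be less error-prone to build them with json.dumps than to type them by hand. The sketch below shows the round trip; the batch identifiers in redlist are invented for the example.

```python
import json

redlist = ["batch 3", "batch 7"]  # hypothetical batch identifiers

# Build the key from an ordinary dict describing the Plotly line style.
line_style = {"width": 2, "color": "rgba(255,0,0,0.5)"}
batches_to_highlight = {json.dumps(line_style): redlist}

# Each key parses back into the line specifier it was built from.
for key, batch_ids in batches_to_highlight.items():
    parsed = json.loads(key)
    print(parsed, batch_ids)
```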
- process_improve.batch.plotting.colours_per_batch_id(batch_ids, batches_to_highlight, default_line_width, use_default_colour=False, colour_map=None)
Return a colour to use for each trace in the plot. A dictionary: keys are batch ids, and the value is a colour and line width setting for Plotly.
- use_default_colour: bool
If True, then the default colour is used (grey: 0.5, 0.5, 0.5)
- process_improve.batch.plotting.plot_multitags(df_dict, batch_list=None, tag_list=None, time_column=None, batches_to_highlight=None, settings=None, fig=None)
Plot all the tags for a batch; or a subset of tags, if specified in tag_list.
- Parameters:
df_dict (dict) – Standard data format for batches.
batch_list (list [default: None, will plot all batches in df_dict]) – Which batches to plot; if provided, must be a list of valid keys into df_dict.
tag_list (list [default: None, will plot all tags in the dataframes]) – Which tags to plot; tags will also be plotted in this order, or in the order of the first dataframe if not specified.
time_column (str, optional) – Which tag to use on the x-axis. If left as the default, None, sequential integers starting from 0 are used instead.
batches_to_highlight (dict, optional) –
Keys are JSON strings parseable by json.loads into a Plotly line specifier. For example:
batches_to_highlight = {'{"width": 2, "color": "rgba(255,0,0,0.5)"}': redlist}
will plot the batch identifiers in redlist with that colour and linewidth.
settings (dict) –
Default settings:
{
    "nrows": 1,                               # int: number of subplot rows
    "ncols": None,                            # int or None: columns (None = auto)
    "x_axis_label": "Time, grouped per tag",  # str: x-axis label
    "title": "",                              # str: overall plot title
    "show_legend": True,                      # bool: show legend
    "mode": "lines",                          # str: Plotly trace mode, e.g. "lines+markers"
    "html_image_height": 900,                 # int: image height in pixels
    "html_aspect_ratio_w_over_h": 16/9,       # float: width as ratio of height
}
fig (go.Figure) – If supplied, uses the existing Plotly figure to draw in.
- Return type:
Figure
- process_improve.batch.plotting.generate_one_frame(df_dict, tag_list, fig, up_to_index, time_column, batch_ids_to_animate, animation_colour_assignment, show_legend=False, hovertemplate='', max_columns=0, mode='lines')
Return a list of dictionaries.
Each entry in the list is for each subplot; in the order of the subplots. Since each subplot is a tag, we need the tag_list as input.
Getting data into the required format for use with this library.
There are 3 useful ways to represent batch data.
dict: as a Python dictionary. Example:
data = {
"batch 1": data frame with varying number of rows, but same number of columns,
"batch 2": etc,
}
The keys are unique identifiers for each batch, such as integers or strings.
melt: as a single Pandas data frame:
data = pd.DataFrame(...)
Characteristics:
very large number of rows, for all batches stacked vertically on top of each other
some number of columns, one column per tag
one column, usually called batch_id, indicates what the batch number is for that row
another column, usually called time, indicates what the time is within that batch
typically sorted, but does not have to be
wide: as a single Pandas data frame, as for the “melted” version, but pivoted instead.
These wide dataframes always have a multilevel column index to distinguish the tags from the time. This representation is only valid for aligned data. Example:
data = pd.DataFrame(...)
Characteristics:
each row is a unique batch number
the multilevel column index has level 0 = column name, level 1 = aligned time
only makes sense if the data are aligned (same number of elements in each level-1 index)
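The three representations can be built from one another with ordinary pandas operations. The sketch below is illustrative only: it uses plain pandas rather than the conversion functions of this module, and the tag names and values are made up.

```python
import pandas as pd

# dict format: one dataframe per batch, identical columns.
batches = {
    "batch 1": pd.DataFrame({"temperature": [20.0, 25.0, 30.0], "pressure": [1.0, 1.1, 1.2]}),
    "batch 2": pd.DataFrame({"temperature": [21.0, 26.0, 31.0], "pressure": [1.0, 1.2, 1.4]}),
}

# melted format: batches stacked vertically, plus batch_id and time columns.
melted = pd.concat(
    [df.assign(batch_id=batch_id, time=range(len(df))) for batch_id, df in batches.items()],
    ignore_index=True,
)

# wide format: one row per batch; columns are a 2-level (tag, time) index.
# Only valid here because both batches have the same number of samples.
wide = melted.pivot(index="batch_id", columns="time", values=["temperature", "pressure"])

# Back to dict format: group the melted rows by batch identifier.
recovered = {
    batch_id: group.drop(columns="batch_id")
    for batch_id, group in melted.groupby("batch_id")
}
print(wide)
```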
- process_improve.batch.data_input.check_valid_batch_dict(in_dict, no_nan=False)
Check if the incoming dictionary of batch data is a valid dictionary of data.
Checks:
1. All batches in the dictionary have the same number of columns.
2. All columns are numeric.
3. If no_nan is True, also checks that there are no NaNs.
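The three checks listed above can be approximated with plain pandas, for example to pre-screen data before handing it to the library. This is a sketch under assumptions: the helper name looks_valid is hypothetical, and it returns a boolean, whereas the library function may report failures differently.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def looks_valid(in_dict, no_nan=False):
    """Return True if the three documented checks pass (sketch only)."""
    frames = list(in_dict.values())
    # 1. Same number of columns in every batch.
    if len({df.shape[1] for df in frames}) > 1:
        return False
    # 2. Every column is numeric.
    if not all(is_numeric_dtype(dtype) for df in frames for dtype in df.dtypes):
        return False
    # 3. Optionally, no missing values anywhere.
    if no_nan and any(df.isna().to_numpy().any() for df in frames):
        return False
    return True


with_gap = {
    "b1": pd.DataFrame({"T": [1.0, 2.0]}),
    "b2": pd.DataFrame({"T": [3.0, float("nan")]}),
}
with_text = {
    "b1": pd.DataFrame({"T": [1.0, 2.0]}),
    "b2": pd.DataFrame({"T": ["x", "y"]}),
}
print(looks_valid(with_gap), looks_valid(with_gap, no_nan=True), looks_valid(with_text))
```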
- process_improve.batch.data_input.dict_to_melted(in_df, insert_batch_id_column=True, insert_sequence_column=False)
Reverse of melted_to_dict.
- process_improve.batch.data_input.dict_to_wide(in_df, group_by_batch=False)
Convert aligned batch data from a dict to wide format.
Each row of the output is one batch; the columns are a 2-level ("tag", "sequence") index, so the data are only meaningful for aligned batches (every batch has the same number of samples).
- Parameters:
in_df (dict) – Standard batch-data dictionary: keys are batch identifiers, values are per-batch dataframes with identical columns.
group_by_batch (bool, optional) –
Controls the ordering of the hierarchical column index.
False (default): columns are ordered (tag, sequence), so all time samples for a tag are grouped together, side-by-side.
True: the levels are swapped to (sequence, tag), so all tags for a given time sample are grouped together.
- Returns:
Wide-format dataframe, one row per batch, with a 2-level column index.
- Return type:
pd.DataFrame
- process_improve.batch.data_input.melted_to_dict(in_df, batch_id_col)
Load a “melted” data set, where one of the columns is the batch_id_col. The data are grouped along the unique values of batch_id_col, and each group is stored in a dictionary. The dictionary keys are the batch identifier, and the corresponding value is a Pandas dataframe of the batch data for that batch.
- process_improve.batch.data_input.melted_to_wide(in_df, batch_id_col)
Convert aligned melted data to wide format.
- process_improve.batch.data_input.wide_to_melted(in_df)
Convert wide-format batch data to melted format. Not yet implemented.