Batch Data Analysis#

process_improve.batch.features.f_mean(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: mean.

The arithmetic mean for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

Parameters:
Return type:

DataFrame
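
All the f_* feature functions below share this calling pattern. A minimal sketch, assuming a melted data frame with illustrative column names ("batch_id", "phase", "temp"); tags is assumed here to accept a list of tag (column) names:

import pandas as pd
from process_improve.batch.features import f_mean

# Hypothetical melted data: several time samples per batch and per phase
data = pd.DataFrame({
    "batch_id": [1, 1, 1, 2, 2, 2],
    "phase": ["heat", "heat", "hold", "heat", "hold", "hold"],
    "temp": [20.0, 35.0, 50.0, 22.0, 48.0, 51.0],
})

# One feature value per (batch, phase) combination, for each requested tag
features = f_mean(data, tags=["temp"], batch_col="batch_id", phase_col="phase")

The exact layout and column naming of the returned DataFrame depend on the implementation.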

process_improve.batch.features.f_median(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: median.

The median for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_std(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: std.

The standard deviation for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

See also: f_mad, f_iqr

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_iqr(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: iqr.

The InterQuartile Range (IQR) for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

The IQR is a robust variant of the standard deviation: it is the difference between the 75th percentile and the 25th percentile of a sample. This is the 25% trimmed range, an example of an L-estimator.
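
As a plain NumPy illustration of the quantity itself (not the library function):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 15.0])
iqr = np.percentile(x, 75) - np.percentile(x, 25)  # 75th percentile minus 25th percentile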

See also: f_std, f_mad

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_mad(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: mad.

The MEAN (not MEDIAN) Absolute Deviation for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

The mean absolute deviation (MAD) is a measure of the variability of a univariate sample of quantitative data. For values in a sequence X1, X2, …, Xn, the MAD is the mean of the absolute deviations from the data’s mean.
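
For instance, as a plain NumPy calculation (not the library function):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean_abs_dev = np.mean(np.abs(x - np.mean(x)))  # mean of |x - mean(x)|; here 1.5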

Since the mean is sensitive to outliers, this MAD is too. If a robust (outlier-resistant) estimate is required, see f_robust_mad.

See also: f_std, f_iqr, f_robust_mad

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_robust_mad(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: robust mad.

The MEDIAN (not MEAN) Absolute Deviation for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data.

For a univariate data set X1, X2, …, Xn, the MAD is defined as the median of the absolute deviations from the data’s median.

from scipy.stats import norm as Gaussian

c_MAD_constant = Gaussian.ppf(3 / 4.0)
median = np.nanmedian(x)
mad = np.nanmedian(np.fabs(x - median) / c_MAD_constant)

The constant correction factor is so that MAD agrees with standard deviation for normally distributed data.
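For reference, Gaussian.ppf(3/4) ≈ 0.6745, so dividing the median absolute deviation by this constant is the same as multiplying it by approximately 1.4826; that rescaling makes the estimate consistent with the standard deviation of normally distributed data.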

Warning

This function is not yet implemented. Calling it always raises AssertionError; the underlying group-wise calculation below the raise is a placeholder that still needs to be corrected.

See also: f_mad, f_std, f_iqr,

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_sum(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: sum.

The SUM within each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

If the x-axis (time) data are evenly-spaced, then this is directly proportional to the area under the trace (curve/trajectory).

See also: f_cumsum

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_area(data, time_tag, tags=None, batch_col=None, phase_col=None)[source]#

Feature: area.

The AREA under the time-based curve of each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

The spacing of the x-axis is taken into account, so this will produce accurate areas even if the data are not evenly spaced in time along the x-axis.

The area is calculated using the trapezoidal rule.
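
For illustration, the same rule applied directly with NumPy to an unevenly spaced time axis (not the library function, which also handles the batch and phase grouping):

import numpy as np

t = np.array([0.0, 1.0, 2.5, 4.0])  # unevenly spaced time values
y = np.array([0.0, 2.0, 3.0, 1.0])  # tag values at those times
area = np.trapz(y, x=t)             # trapezoidal rule; spacing is taken into account (7.75 here)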

See also: f_sum, f_cumsum

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_rupture(data, columns=None, batch_col=None, phase_col=None)[source]#

Feature: rupture.

The breakpoint in a given tag in columns (usually it is 1 tag), for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

Parameters:
Return type:

None

process_improve.batch.features.f_min(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: min.

The minimum value attained by each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

To get the time point when the minimum occurred, see f_agemin.

See also: f_agemin, f_max

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_max(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: max.

The maximum value attained by each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

To get the time point when the maximum occurred, see f_agemax.

See also: f_min

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_agemin(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: age at minimum value. Not yet implemented.

Parameters:
Return type:

None

process_improve.batch.features.f_agemax(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: age at maximum value. Not yet implemented.

Parameters:
Return type:

None

process_improve.batch.features.f_last(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: endpoint.

The final value attained by each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

If you want to know how many rows there are [i.e. the index of the last row], consider using the f_count feature.

See also: f_sum, f_count

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_count(data, tags=None, batch_col=None, phase_col=None)[source]#

Feature: count.

The index number of the final value for each tag, for the given tags in tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

Useful for getting the 1-based index (it is a count!), which can then be used for other calculations.

See also: f_sum, f_last

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_slope(data, x_axis_tag, tags=None, batch_col=None, phase_col=None, age_col=None)[source]#

Feature: slope.

The slope of the given tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

The slope is calculated against whichever variable is given by x_axis_tag. If this is the age_col of the batch (i.e. time duration), ensure that age_col is also specified.
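
The exact fitting method is not stated here; as an illustration, the least-squares straight-line slope of one tag within one batch/phase group could be computed like this:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])          # values of the x_axis_tag (e.g. batch age)
y = np.array([10.0, 12.1, 13.9, 16.0])      # values of one tag
slope, intercept = np.polyfit(x, y, deg=1)  # slope of the least-squares line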

Parameters:
Return type:

DataFrame

process_improve.batch.features.cross(series, threshold=0, direction='cross', only_index=False, first_point_only=False)[source]#

Given a Series, return all the index values where the data values equal the threshold value. Missing values are dropped from the series first.

direction can be ‘rising’ (rising edge only), ‘falling’ (falling edge only), or ‘cross’ for both edges.

If only_index is True (default False), the 0-based index just after which the crossing occurs is returned. E.g. if the returned index is 135, the crossing takes place at, or after, index 135, but before index 136.

If first_point_only is set to True, only the first point where a crossing occurs is reported; the rest are ignored. By default all crossings are reported (first_point_only=False).

https://stackoverflow.com/questions/10475488/calculating-crossing-intercept-points-of-a-series-or-dataframe
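
A possible call pattern, using the argument names from the signature above (the exact contents of the returned list depend on the implementation):

import pandas as pd
from process_improve.batch.features import cross

s = pd.Series([1.0, 2.0, 4.0, 3.0, 0.5])
crossings = cross(s, threshold=2.5, direction="rising", first_point_only=True)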

Parameters:
  • series (Series)

  • threshold (int | None)

  • direction (str | None)

  • only_index (bool | None)

  • first_point_only (bool | None)

Return type:

list

process_improve.batch.features.f_crossing(data, tag, time_tag, threshold=0, direction='cross', only_index=False, batch_col=None, phase_col=None, suffix=None)[source]#

Feature: cross.

The time (time_tag) value at which tag crosses a certain numeric threshold, with direction=’rising’ (rising edge), direction=’falling’ (falling edge), or ‘cross’ for both edges.

The time when the crossing occurs is found by linear interpolation between the indices. If you prefer the index itself, use only_index=True, but the default for that setting is False.

Does this for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

suffix: what to append to the tag name, to form the name of this feature.

Note: NaN is returned for a given batch and phase, if the crossing is not found.
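
As an illustration of the linear-interpolation idea (not the library’s internal code): if a rising tag is below the threshold at time t0 and above it at the next sample t1, the crossing time can be estimated as

# Hypothetical neighbouring samples bracketing a rising crossing of threshold = 50
t0, y0 = 10.0, 48.0
t1, y1 = 11.0, 53.0
threshold = 50.0
t_cross = t0 + (threshold - y0) * (t1 - t0) / (y1 - y0)  # = 10.4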

Parameters:
Return type:

DataFrame

process_improve.batch.features.f_elbow(data, x_axis_tag, tags=None, only_index=False, batch_col=None, phase_col=None)[source]#

Feature: elbow.

The “elbow” of the given tags, for each unique batch in the batch_col indicator column, and within each unique phase, per batch, of the phase_col column.

The elbow is calculated against whichever variable is given by x_axis_tag (usually a time- based tag).

The function returns the value on the x-axis where the elbow occurs. Sometimes you might want the index of that value instead, so you can also look up the corresponding y-axis value; use only_index=True for such cases.

Parameters:
Return type:

DataFrame

process_improve.batch.preprocessing.determine_scaling(batches, columns_to_align=None, settings=None)[source]#

Determine the scaling for the batch data, based on the variable ranges.

Parameters:
  • batches (dict[str, pd.DataFrame]) – Batch data, in the standard format (keyed by batch identifier).

  • columns_to_align (list, optional) – The column names (tags) to be scaled. If None, the columns of the first batch are used.

  • settings (dict, optional) – Optional overrides. Currently supports the key "robust" (bool, default True) which switches between a robust range (q98 - q02) and the raw (max - min) range.

Returns:

  • range_scalers (DataFrame) – J rows, 2 columns: column 1 = range of each tag (approx. q98 - q02), column 2 = typical minimum of each tag (robustly calculated).

  • TODO (put this in a scikit-learn style: .fit() and .apply() style)

Return type:

DataFrame

process_improve.batch.preprocessing.apply_scaling(batches, scale_df, columns_to_align=None)[source]#

Scales the batches according to the information in the scaling dataframe.

Parameters:
  • batches (dict[str, pd.DataFrame]) – The batches, in standard format.

  • scale_df (pd.DataFrame) – The scaling dataframe, from determine_scaling

  • columns_to_align (list | None)

Returns:

The scaled batch data.

Return type:

dict

process_improve.batch.preprocessing.reverse_scaling(batches, scale_df, columns_to_align=None)[source]#

Reverse the scaling applied by apply_scaling.

Parameters:
Return type:

dict
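
A sketch of the intended round-trip, inferred from the three signatures above; the batch dictionary and column names are hypothetical:

import pandas as pd
from process_improve.batch.preprocessing import (
    apply_scaling,
    determine_scaling,
    reverse_scaling,
)

batches = {
    "A": pd.DataFrame({"temp": [20.0, 35.0, 50.0], "pH": [7.0, 6.8, 6.5]}),
    "B": pd.DataFrame({"temp": [22.0, 40.0, 49.0], "pH": [7.1, 6.9, 6.6]}),
}
scale_df = determine_scaling(batches, columns_to_align=["temp", "pH"])
scaled = apply_scaling(batches, scale_df, columns_to_align=["temp", "pH"])
restored = reverse_scaling(scaled, scale_df, columns_to_align=["temp", "pH"])  # approximately equal to batches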

class process_improve.batch.preprocessing.DTWresult(synced, penalty_matrix, md_path, warping_path, distance, normalized_distance)[source]#

Bases: object

Container for the results of a DTW alignment.

Parameters:
  • synced (np.ndarray)

  • penalty_matrix (np.ndarray)

  • md_path (np.ndarray)

  • warping_path (np.ndarray)

  • distance (float)

  • normalized_distance (float)

process_improve.batch.preprocessing.align_with_path(md_path, batch, initial_row)[source]#

Align a batch to the reference using the DTW path.

Parameters:
Return type:

DataFrame

process_improve.batch.preprocessing.dtw_core(test, ref, weight_matrix)[source]#

Compute DTW alignment of test batch against reference batch.

Parameters:
Return type:

DTWresult

process_improve.batch.preprocessing.one_iteration_dtw(batches_scaled, refbatch_sc, weight_matrix, settings=None)[source]#

Perform one iteration of the DTW alignment algorithm.

Parameters:
Return type:

tuple[dict, DataFrame]

process_improve.batch.preprocessing.batch_dtw(batches, columns_to_align, reference_batch, settings=None)[source]#

Synchronize, via iterative DTW, with weighting.

Algorithm: Kassidas et al. (1998): https://doi.org/10.1002/aic.690440412

Parameters:
  • batches (dict[str, pd.DataFrame]) – Batch data, in the standard format.

  • columns_to_align (list) – Which columns to use during the alignment process. The others are aligned, but get no weight, and therefore do not influence the objective function.

  • reference_batch (str) – Which key in the batches is the reference batch to use.

  • settings (dict) –

    Default settings are:

    {
        "maximum_iterations": 25,  # stops here, even if not converged
        "tolerance": 0.1,          # convergence tolerance
        "robust": True,            # use robust scaling
        "show_progress": True,     # show progress
        "subsample": 1,            # use every sample
        "interpolate_time_axis_maximum": 100,  # resample time axis to this scale
        "interpolate_time_axis_delta": 1,      # resolution of resampled axis
        "interpolate_method": "cubic",         # any scipy.interpolate.interp1d method
    }
    

    The default settings resample the time axis to 100 data points, starting at 0 and ending at 99. Adjust the delta for more points, or change the maximum.

Returns:

  • dict – Various outputs relevant to the alignment. TODO: Document completely later.

Notation:

  • I = number of batches; i = index for the batches

  • J = number of tags (columns in each batch); j = index for the tags

  • k = index into the rows of each batch, the samples (0 … k … K_i)

Return type:

dict
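
A minimal calling sketch, reusing the hypothetical batches dictionary and column names from the scaling example above; "A" is assumed to be a valid key in batches:

from process_improve.batch.preprocessing import batch_dtw

result = batch_dtw(
    batches,
    columns_to_align=["temp", "pH"],
    reference_batch="A",
    settings={"maximum_iterations": 10, "show_progress": False},
)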

process_improve.batch.preprocessing.resample_to_reference(batches, columns_to_align, reference_batch, settings=None)[source]#

Resamples all batches (only the columns_to_align) to the duration of the batch identified by reference_batch.

Parameters:
  • batches (dict[str, pd.DataFrame]) – Batch data, in the standard format.

  • columns_to_align (list) – Which columns to use. Others are ignored.

  • reference_batch (str) – Which key in the batches is the reference batch.

  • settings (dict, optional) – [description], by default None

Returns:

Batch data, in the standard format.

Return type:

dict

process_improve.batch.preprocessing.find_average_length(batches, settings=None)[source]#

Find the batch in batches with the average length.

Parameters:
  • batches (dict[str, pd.DataFrame]) – Batch data, in the standard format.

  • settings (dict) –

    Default settings are:

    {"robust": True}  # use robust (median) average batch length
    

Return type:

One of the dictionary keys from batches.

process_improve.batch.preprocessing.find_reference_batch(batches, columns_to_align, settings=None)[source]#

Find a reference batch. Assumes NO missing data.

Starts with the batch of average duration and resamples (simple interpolation) all batches to that duration. Unfolds the resampled data and does PCA on the wide, unfolded data, fitting 4 components by default. Excludes all batches with Hotelling’s T2 above the 90% limit and refits the PCA with 4 components. Finds the batch whose multivariate combination of scores is smallest (i.e. closest to the model centre), and ensures this batch has SPE below 50% of the model limit.

Parameters:
  • batches (dict[str, pd.DataFrame]) – Batch data, in the standard format.

  • columns_to_align (list) – Which columns to use. Others are ignored.

  • settings (dict, optional) –

    Default settings are:

    {
        "robust": True,                    # use robust scaling
        "subsample": 1,                    # use every sample
        "method": "pca_most_average",      # most average batch from a crude PCA
        "n_components": 4,
        "number_of_reference_batches": 1,  # only a single batch returned
    }
    

Returns:

When settings["number_of_reference_batches"] == 1 (the default), a single dictionary key from batches is returned. When more than one reference batch is requested, a list of that many keys is returned, ordered from most to least central in the PCA model.

Return type:

str or list[str]

process_improve.batch.plotting.get_rgba_from_triplet(incolour, alpha=1, as_string=False)[source]#

Convert the input colour triplet (a list) to a Plotly rgba(r,g,b,a) string if as_string is True. If False, the list of 3 integer RGB values is returned.

E.g. [0.9677975592919913, 0.44127456009157356, 0.5358103155058701] -> ‘rgba(246,112,136,1)’

Parameters:
Return type:

str | list

process_improve.batch.plotting.plot_to_HTML(filename, fig)[source]#

Export a Plotly figure to an HTML file.

Parameters:
Return type:

str

process_improve.batch.plotting.plot_all_batches_per_tag(df_dict, tag, tag_y2=None, time_column=None, extra_info='', batches_to_highlight=None, x_axis_label='Time [sequence order]', highlight_width=5, html_image_height=900, html_aspect_ratio_w_over_h=1.7777777777777777, y1_limits=(None, None), y2_limits=(None, None))[source]#

Plot a particular tag over all batches in the given dictionary of dataframes, df_dict.

Parameters:
  • df_dict (dict) – Standard data format for batches.

  • tag (str) – Which tag to plot? [on the y1 (left) axis]

  • tag_y2 (str, optional) – Which tag to plot? [on the y2 (right) axis] The tag will be plotted with different scaling on the secondary axis, to make time-series comparisons easier.

  • time_column (str, optional) – Which tag to use on the x-axis. If left as the default, None, sequential integers starting from 0 are created.

  • extra_info (str, optional) – Used in the plot title to add any extra details, by default “”

  • batches_to_highlight (dict, optional) –

    Keys are JSON strings parseable by json.loads into a Plotly line specifier. For example:

    batches_to_highlight = {'{"width": 2, "color": "rgba(255,0,0,0.5)"}': redlist}
    

    will plot the batch identifiers in redlist with that colour and linewidth.

  • x_axis_label (str, optional) – String label for the x-axis, by default “Time [sequence order]”

  • highlight_width (int, optional) – The width of the highlighted lines; default = 5.

  • html_image_height (int, optional) – HTML image output height, by default 900

  • html_aspect_ratio_w_over_h (float, optional) – HTML image aspect ratio: 16/9 (therefore the default width will be 1600 px)

  • y1_limits (tuple, optional) – Axis limits enforced on the y1 (left) axis. Default is (None, None) which means the data themselves are used to determine the limits. Specify BOTH limits. Plotly requires (at the moment plotly/plotly.js#400) that you specify both. Order: (low limit, high limit)

  • y2_limits (tuple, optional) – Axis limits enforced on the y2 (right) axis. Default is (None, None) which means the data themselves are used to determine the limits. Specify BOTH limits. Plotly requires (at the moment plotly/plotly.js#400) that you specify both.

Returns:

Standard Plotly fig object (dictionary-like).

Return type:

go.Figure
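
A usage sketch, with a hypothetical batch dictionary (batches) and tag name:

from process_improve.batch.plotting import plot_all_batches_per_tag, plot_to_HTML

fig = plot_all_batches_per_tag(
    df_dict=batches,
    tag="temp",
    extra_info=" (raw data)",
    batches_to_highlight={'{"width": 5, "color": "rgba(255,0,0,0.5)"}': ["A"]},
)
plot_to_HTML("temperature.html", fig)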

process_improve.batch.plotting.colours_per_batch_id(batch_ids, batches_to_highlight, default_line_width, use_default_colour=False, colour_map=None)[source]#

Return a colour to use for each trace in the plot, as a dictionary: keys are batch ids, and each value is a colour and line-width setting for Plotly.

use_default_colour: bool

If True, then the default colour is used (grey: 0.5, 0.5, 0.5).

Parameters:
  • batch_ids (list)

  • batches_to_highlight (dict)

  • default_line_width (float)

  • use_default_colour (bool)

  • colour_map (Callable | None)

Return type:

dict[Any, dict]

process_improve.batch.plotting.plot_multitags(df_dict, batch_list=None, tag_list=None, time_column=None, batches_to_highlight=None, settings=None, fig=None)[source]#

Plot all the tags for a batch; or a subset of tags, if specified in tag_list.

Parameters:
  • df_dict (dict) – Standard data format for batches.

  • batch_list (list [default: None, will plot all batches in df_dict]) – Which batches to plot; if provided, must be a list of valid keys into df_dict.

  • tag_list (list [default: None, will plot all tags in the dataframes]) – Which tags to plot; tags will also be plotted in this order, or in the order of the first dataframe if not specified.

  • time_column (str, optional) – Which tag to use on the x-axis. If left as the default, None, sequential integers starting from 0 are created.

  • batches_to_highlight (dict, optional) –

    Keys are JSON strings parseable by json.loads into a Plotly line specifier. For example:

    batches_to_highlight = {'{"width": 2, "color": "rgba(255,0,0,0.5)"}': redlist}
    

    will plot the batch identifiers in redlist with that colour and linewidth.

  • settings (dict) –

    Default settings:

    {
        "nrows": 1,                             # int: number of subplot rows
        "ncols": None,                          # int or None: columns (None = auto)
        "x_axis_label": "Time, grouped per tag",# str: x-axis label
        "title": "",                            # str: overall plot title
        "show_legend": True,                    # bool: show legend
        "html_image_height": 900,               # int: image height in pixels
        "html_aspect_ratio_w_over_h": 16/9,     # float: width as ratio of height
    }
    

  • fig (go.Figure) – If supplied, uses the existing Plotly figure to draw in.

Return type:

Figure
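
A usage sketch (batch dictionary and tag names are hypothetical); the return type indicates a Plotly figure, so fig.show() displays it:

from process_improve.batch.plotting import plot_multitags

fig = plot_multitags(
    df_dict=batches,
    tag_list=["temp", "pH"],
    settings={"nrows": 1, "title": "All batches, all tags"},
)
fig.show()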

process_improve.batch.plotting.generate_one_frame(df_dict, tag_list, fig, up_to_index, time_column, batch_ids_to_animate, animation_colour_assignment, show_legend=False, hovertemplate='', max_columns=0)[source]#

Return a list of dictionaries.

Each entry in the list corresponds to one subplot, in the order of the subplots. Since each subplot is a tag, the tag_list is needed as input.

Parameters:
  • df_dict (dict)

  • tag_list (list)

  • fig (Figure)

  • up_to_index (int)

  • time_column (str | None)

  • batch_ids_to_animate (list)

  • animation_colour_assignment (dict)

  • show_legend (bool)

  • hovertemplate (str)

  • max_columns (int)

Return type:

list[dict]

Getting data into the required format for use with this library.

There are 3 useful ways to represent batch data.

dict: as a Python dictionary. Example:

data = {
    "batch 1": data frame with varying number of rows, but same number of columns,
    "batch 2": etc,
}

The keys are unique identifiers for each batch, such as integers or strings.

melt: as a single Pandas data frame:

data = pd.DataFrame(...)

Characteristics:

  • very large number of rows, for all batches stacked vertically on top of each other

  • some number of columns, one column per tag

  • one column, usually called batch_id, indicates what the batch number is for that row

  • another column, usually called time, indicates what the time is within that batch

  • typically sorted, but does not have to be

wide: as a single Pandas data frame, as for the “melted” version, but pivoted instead. These wide dataframes always have a multilevel column index to distinguish the tags from the time. This representation is only valid for aligned data. Example:

data = pd.DataFrame(...)

Characteristics:

  • each row is a unique batch number

  • the multilevel column index has level 0 = column name, level 1 = aligned time

  • only makes sense if the data are aligned (same number of elements in each level-1 index)
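
A sketch of moving between these representations with the functions documented below (column names are illustrative; dict_to_wide is only meaningful here because the two batches share the same aligned time grid):

import pandas as pd
from process_improve.batch.data_input import dict_to_wide, melted_to_dict

# Melted format: all batches stacked, with a batch identifier column
melted = pd.DataFrame({
    "batch_id": [1, 1, 2, 2],
    "time": [0, 1, 0, 1],
    "temp": [20.0, 35.0, 22.0, 40.0],
})

batches = melted_to_dict(melted, batch_id_col="batch_id")  # dict format
wide = dict_to_wide(batches)                               # wide format (aligned data only)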

process_improve.batch.data_input.check_valid_batch_dict(in_dict, no_nan=False)[source]#

Check if the incoming dictionary of batch data is a valid dictionary of data.

Checks:

  1. All batches in the dictionary have the same number of columns.

  2. All columns are numeric.

  3. If no_nan is True, also checks that there are no NaNs.

Parameters:
  • in_dict (dict) – A dictionary of batch data.

  • no_nan (bool) – If True, will also check that no missing values are present.

Returns:

True, if it passes the checks.

Return type:

bool

process_improve.batch.data_input.dict_to_melted(in_df, insert_batch_id_column=True, insert_sequence_column=False)[source]#

Reverse of melted_to_dict.

Parameters:
Return type:

DataFrame

process_improve.batch.data_input.dict_to_wide(in_df, group_by_batch=False)[source]#

Convert aligned batch data from dict to wide format.

If group_by_batch is True, all the data from the first batch are placed on the left of the output dataframe, and the last batch’s data on the right.

If group_by_batch is False, then data for the same tag are grouped together, side-by-side.

TODO: group_by_batch is not implemented yet.

Parameters:
Return type:

DataFrame

process_improve.batch.data_input.melted_to_dict(in_df, batch_id_col)[source]#

Load a “melted” data set, where one of the columns is the batch_id_col. The data are grouped along the unique values of batch_id_col, and each group is stored in a dictionary. The dictionary keys are the batch identifier, and the corresponding value is a Pandas dataframe of the batch data for that batch.

Parameters:
Return type:

dict

process_improve.batch.data_input.melted_to_wide(in_df, batch_id_col)[source]#

Convert aligned melted data to wide format.

Parameters:
Return type:

dict

process_improve.batch.data_input.wide_to_melted(in_df)[source]#

Convert wide-format batch data to melted format. Not yet implemented.

Parameters:

in_df (DataFrame)

Return type:

DataFrame

process_improve.batch.data_input.wide_to_dict()[source]#

Convert wide-format batch data to dict format. Not yet implemented.

Return type:

None

process_improve.batch.data_input.melt_df_to_series(in_df, exclude_columns=None, name=None)[source]#

Return a Series with a multilevel-index, melted from the DataFrame.

Parameters:
Return type:

Series