Data

The focus of these modules is on storing, transforming, cleaning, splitting, merging, retrieving, and removing data.

class data.ConcatedEncodedDs(encoded_ds_arr)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

ConcatedEncodedDs abstracts over multiple encoded datasources (EncodedDs) as if they were a single entity.

This class inherits from torch.utils.data.Dataset.

Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.

Parameters
  • encoded_ds_arr – list of EncodedDs datasources to concatenate.

clear_cache()[source]

See lightwood.data.encoded_ds.EncodedDs.clear_cache().

get_column_original_data(column_name)[source]

See lightwood.data.encoded_ds.EncodedDs.get_column_original_data().

Return type

Series

get_encoded_column_data(column_name)[source]

See lightwood.data.encoded_ds.EncodedDs.get_encoded_column_data().

Return type

Tensor
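The core idea behind concatenating datasources is index dispatch: a global index is mapped onto the sub-dataset that contains it. The following is an illustrative stdlib-only sketch of that pattern, not the Lightwood implementation (`ConcatSketch` is an invented name):

```python
# Illustrative sketch: present several sub-datasets as a single entity by
# dispatching each global index to the sub-dataset that contains it.
class ConcatSketch:
    def __init__(self, datasets):
        self.datasets = datasets  # list of list-like sub-datasets

    def __len__(self):
        return sum(len(ds) for ds in self.datasets)

    def __getitem__(self, idx):
        # Walk the sub-datasets, subtracting their lengths until the
        # index lands inside one of them.
        for ds in self.datasets:
            if idx < len(ds):
                return ds[idx]
            idx -= len(ds)
        raise IndexError(idx)

combined = ConcatSketch([[1, 2], [3, 4, 5]])
```

Global index 3 falls past the first sub-dataset (length 2), so it resolves to position 1 of the second one.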

class data.EncodedDs(encoders, data_frame, target)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Create a Lightwood datasource from a data frame and some encoders. This class inherits from torch.utils.data.Dataset.

Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.

Parameters
  • encoders (List[BaseEncoder]) – list of Lightwood encoders used to encode the data, one per column.

  • data_frame (DataFrame) – original dataframe.

  • target (str) – name of the target column to predict.

clear_cache()[source]

Clears the EncodedDs cache.

get_column_original_data(column_name)[source]

Gets the original data for any given column of the EncodedDs.

Parameters

column_name (str) – name of the column.

Return type

Series

Returns

A pd.Series with the original data stored in the column_name column.

get_encoded_column_data(column_name)[source]

Gets the encoded data for any given column of the EncodedDs.

Parameters

column_name (str) – name of the column.

Return type

Tensor

Returns

A torch.Tensor with the encoded data of the column_name column.

get_encoded_data(include_target=True)[source]

Gets all encoded data.

Parameters

include_target – whether to include the target column in the output or not.

Return type

Tensor

Returns

A torch.Tensor with the encoded dataframe.
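A minimal sketch of the per-column encode-and-cache behavior described above, assuming a plain dict of columns and callable encoders (`EncodedDsSketch` and `str.upper` as an encoder are invented stand-ins, not Lightwood APIs):

```python
# Hypothetical sketch of EncodedDs-style caching: encode a column once,
# store the result, and reuse it on subsequent calls.
class EncodedDsSketch:
    def __init__(self, encoders, columns):
        self.encoders = encoders  # {column_name: encode_fn}
        self.columns = columns    # {column_name: [original values]}
        self._cache = {}          # cached encoded representations

    def get_column_original_data(self, column_name):
        return self.columns[column_name]

    def get_encoded_column_data(self, column_name):
        # Normal behavior: cache so repeated calls avoid duplicated work.
        if column_name not in self._cache:
            encode = self.encoders[column_name]
            self._cache[column_name] = [encode(v) for v in self.columns[column_name]]
        return self._cache[column_name]

    def clear_cache(self):
        # Drop all cached encoded representations.
        self._cache = {}

ds = EncodedDsSketch({'word': str.upper}, {'word': ['a', 'b']})
encoded = ds.get_encoded_column_data('word')
```

After the call, the encoded column lives in the cache until `clear_cache()` is invoked.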

data.cleaner(data, dtype_dict, pct_invalid, identifiers, target, mode, timeseries_settings, anomaly_detection, imputers={}, custom_cleaning_functions={})[source]

The cleaner is a function that takes in the raw data, plus additional information about its types and about the problem. Based on this, it generates a “clean” representation of the data, where each column has an ideal standardized type and all malformed, missing, or otherwise invalid elements are turned into None. Optionally, these None values can be replaced by imputers.

Parameters
  • data (DataFrame) – The raw data

  • dtype_dict (Dict[str, str]) – Type information for each column

  • pct_invalid (float) – The maximum fraction of each column that is allowed to be invalid

  • identifiers (Dict[str, str]) – A dict containing all identifier typed columns

  • target (str) – The target column

  • mode (str) – Can be “predict” or “train”

  • imputers (Dict[str, BaseImputer]) – The key corresponds to the single input column that will be imputed by the object. Refer to the imputer documentation for more details.

  • timeseries_settings (TimeseriesSettings) – Timeseries related settings, only relevant for timeseries predictors, otherwise can be the default object

  • anomaly_detection (bool) – Are we detecting anomalies with this predictor?

Return type

DataFrame

Returns

The cleaned data
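The per-column cleaning idea can be sketched in a few lines, assuming a caster for the column's standardized type and an optional imputer callable (`clean_column` is a hypothetical helper for illustration, not the library function):

```python
# Minimal sketch of the cleaning idea: standardize each value to the
# column's declared type, turn malformed values into None, then
# optionally replace the None values via an imputer.
def clean_column(values, caster, imputer=None):
    cleaned = []
    for v in values:
        try:
            cleaned.append(caster(v))
        except (TypeError, ValueError):
            cleaned.append(None)  # malformed/invalid -> None
    if imputer is not None:
        fill = imputer([v for v in cleaned if v is not None])
        cleaned = [fill if v is None else v for v in cleaned]
    return cleaned

raw = ['1.5', 'oops', '3.5', None]
cleaned = clean_column(raw, float)
imputed = clean_column(raw, float, imputer=lambda xs: sum(xs) / len(xs))
```

Here `'oops'` and the missing value both become None; with a mean imputer, they are filled with the average of the valid entries.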

data.splitter(data, tss, dtype_dict, seed, pct_train, pct_dev, pct_test, target)[source]

Splits data into training, dev and testing datasets.

The proportion of data for each split must be specified (JSON-AI sets defaults to 80/10/10). First, rows in the dataset are shuffled randomly. Then a simple split is done. If a target value is provided and is of data type categorical/binary, then the splits will be stratified to maintain the representative populations of each class.

Parameters
  • data (DataFrame) – Input dataset to be split

  • tss (TimeseriesSettings) – time-series specific details for splitting

  • dtype_dict (Dict[str, str]) – Dictionary with the data type of all columns

  • seed (int) – Random state for pandas data-frame shuffling

  • pct_train (float) – training fraction of data; must be less than 1

  • pct_dev (float) – dev fraction of data; must be less than 1

  • pct_test (float) – testing fraction of data; must be less than 1

  • target (str) – Name of the target column; if specified, data will be stratified on this column

Return type

Dict[str, DataFrame]

Returns

A dictionary containing the keys train, test and dev with their respective data frames, as well as the “stratified_on” key indicating which columns the data was stratified on (None if it wasn’t stratified on anything)
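The shuffle-then-split-with-stratification behavior can be sketched with the standard library alone, splitting each class separately so every subset keeps the class proportions (`stratified_split` is an invented illustration, not the library function):

```python
import random

# Stdlib sketch of stratified splitting: shuffle, then split each class
# separately so train/dev/test keep representative class populations.
def stratified_split(rows, label_of, pct_train, pct_dev, seed=0):
    rng = random.Random(seed)  # fixed random state, as with the `seed` parameter
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, dev, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = int(len(members) * pct_train)
        n_dev = int(len(members) * pct_dev)
        train += members[:n_train]
        dev += members[n_train:n_train + n_dev]
        test += members[n_train + n_dev:]  # remainder goes to test
    return {'train': train, 'dev': dev, 'test': test}

rows = [('a', i) for i in range(10)] + [('b', i) for i in range(10)]
splits = stratified_split(rows, label_of=lambda r: r[0], pct_train=0.8, pct_dev=0.1)
```

With an 80/10/10 split over two balanced classes, the training subset keeps exactly half of its rows in each class.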

data.timeseries_analyzer(data, dtype_dict, timeseries_settings, target)[source]

This module analyzes (pre-processed) time series data and stores a few useful insights used in the rest of Lightwood’s pipeline.

Parameters
  • data (Dict[str, DataFrame]) – dictionary with the dataset split into train, val, test subsets.

  • dtype_dict (Dict[str, str]) – dictionary with inferred types for every column.

  • timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object. For more details, check lightwood.types.TimeseriesSettings.

  • target (str) – name of the target column.

The following things are extracted from each time series inside the dataset:
  • group_combinations: all observed combinations of values for the set of group_by columns. The length of this list determines how many time series are in the data.

  • deltas: inferred sampling interval

  • ts_naive_residuals: Residuals obtained from the data by a naive forecaster that repeats the last-seen value.

  • ts_naive_mae: Mean residual value obtained from the data by a naive forecaster that repeats the last-seen value.

  • target_normalizers: objects that may normalize the data within any given time series for effective learning. See lightwood.encoder.time_series.helpers.common for available choices.

Return type

Dict

Returns

Dictionary with the aforementioned insights and the TimeseriesSettings object for future references.
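The naive-forecaster insights above have a simple definition: the forecast for each point is the previous observed value, so residuals compare neighboring points. A minimal sketch (these helper names are invented for illustration):

```python
# Sketch of the naive-forecaster insights: the forecaster repeats the
# last-seen value, so each residual is |current - previous|.
def naive_residuals(series):
    return [abs(cur - prev) for prev, cur in zip(series, series[1:])]

def naive_mae(series):
    # Mean of the residuals above.
    res = naive_residuals(series)
    return sum(res) / len(res)

series = [10.0, 12.0, 11.0, 15.0]
residuals = naive_residuals(series)
mae = naive_mae(series)
```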

data.transform_timeseries(data, dtype_dict, ts_analysis, timeseries_settings, target, mode)[source]

Block that transforms the dataframe of a time series task into a format convenient for use in later phases, such as model training.

The main transformations performed by this block are:
  • Type casting (e.g. to numerical for order_by column).

  • Windowing functions for historical context based on TimeseriesSettings.window parameter.

  • Explicitly adding target columns according to the TimeseriesSettings.horizon parameter.

  • Flagging all rows that are “predictable” based on all TimeseriesSettings.

  • Handling all logic for the streaming use case (where forecasts are only emitted for the last observed data point).

Parameters
  • data (DataFrame) – Dataframe with data to transform.

  • dtype_dict (Dict[str, str]) – Dictionary with the types of each column.

  • ts_analysis (dict) – dictionary with various insights into each series passed as training input.

  • timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object.

  • target (str) – The name of the target column to forecast.

  • mode (str) – Either “train” or “predict”, depending on what phase is calling this procedure.

Return type

DataFrame

Returns

A dataframe with all the transformations applied.
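The windowing and horizon transformations can be illustrated over a plain list: for each timestep, gather `window` historical values as context and `horizon` future values as explicit targets (`window_rows` is a hypothetical sketch, not the library function):

```python
# Sketch of the windowing transformation: each emitted row pairs
# `window` historical values with `horizon` future target values.
def window_rows(series, window, horizon):
    rows = []
    for t in range(window, len(series) - horizon + 1):
        history = series[t - window:t]   # historical context columns
        targets = series[t:t + horizon]  # explicit target columns
        rows.append((history, targets))
    return rows

series = [1, 2, 3, 4, 5, 6]
rows = window_rows(series, window=2, horizon=2)
```

Only timesteps with a full history window and a full horizon ahead produce a row, which mirrors why some rows are flagged as not "predictable".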