Data¶
The focus of these modules is on storing, transforming, cleaning, splitting, merging, getting and removing data.
- class data.ConcatedEncodedDs(encoded_ds_arr)[source]¶
Bases: Generic[torch.utils.data.dataset.T_co]
ConcatedEncodedDs abstracts over multiple encoded datasources (EncodedDs) as if they were a single entity.
Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.
- Parameters
encoded_ds_arr – list of EncodedDs datasources to concatenate.
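To illustrate the idea of presenting several datasources as one, here is a hedged, self-contained sketch. The `ToyDataset`/`ToyConcatDataset` names are illustrative assumptions, not Lightwood internals; the real class works over EncodedDs objects and torch tensors, while this sketch only shows the global-to-local index mapping.

```python
# Hypothetical sketch: a global index is mapped to (source, local index).
# These classes are illustrative, not Lightwood's actual implementation.

class ToyDataset:
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


class ToyConcatDataset:
    """Presents a list of datasets as a single indexable entity."""

    def __init__(self, datasets):
        self.datasets = datasets

    def __len__(self):
        return sum(len(ds) for ds in self.datasets)

    def __getitem__(self, idx):
        # Walk the sources until the global index falls inside one of them.
        for ds in self.datasets:
            if idx < len(ds):
                return ds[idx]
            idx -= len(ds)
        raise IndexError(idx)
```

For example, concatenating a two-row and a one-row dataset yields a three-row entity where index 2 resolves to the second source.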
- class data.EncodedDs(encoders, data_frame, target)[source]¶
Bases: Generic[torch.utils.data.dataset.T_co]
Create a Lightwood datasource from a data frame and some encoders. This class inherits from torch.utils.data.Dataset.
Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.
- Parameters
encoders (List[BaseEncoder]) – list of Lightwood encoders used to encode the data, one per column.
data_frame (DataFrame) – original dataframe.
target (str) – name of the target column to predict.
- get_column_original_data(column_name)[source]¶
Gets the original data for any given column of the EncodedDs.
- Parameters
column_name (str) – name of the column.
- Return type
Series
- Returns
A pd.Series with the original data stored in the column_name column.
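The pattern of per-column encoders plus cached encoded rows can be sketched as follows. This is an illustrative assumption of the mechanism, not Lightwood's implementation: the real EncodedDs holds a pandas DataFrame and returns torch tensors, whereas this toy version uses plain dicts and lists.

```python
# Illustrative sketch of a datasource that encodes lazily, caches the
# result, and exposes a get_column_original_data-style accessor.

class ToyEncodedDs:
    def __init__(self, encoders, data):
        self.encoders = encoders   # {column_name: callable encoder}
        self.data = data           # {column_name: list of raw values}
        self._cache = {}           # row index -> encoded row

    def __len__(self):
        return len(next(iter(self.data.values())))

    def __getitem__(self, idx):
        if idx not in self._cache:  # encode once, reuse afterwards
            self._cache[idx] = {
                col: enc(self.data[col][idx])
                for col, enc in self.encoders.items()
            }
        return self._cache[idx]

    def get_column_original_data(self, column_name):
        # Returns the raw, un-encoded values of one column.
        return self.data[column_name]
```

Repeated indexing of the same row hits the cache, which is the behavior the note above describes.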
- data.cleaner(data, dtype_dict, pct_invalid, identifiers, target, mode, timeseries_settings, anomaly_detection, imputers={}, custom_cleaning_functions={})[source]¶
The cleaner is a function that takes in the raw data, plus additional information about its types and about the problem. Based on this, it generates a “clean” representation of the data, where each column has an ideal standardized type and all malformed, missing, or otherwise invalid elements are turned into None. Optionally, these None values can be replaced with imputers.
- Parameters
data (DataFrame) – The raw data.
dtype_dict (Dict[str, str]) – Type information for each column.
pct_invalid (float) – How much of each column can be invalid.
identifiers (Dict[str, str]) – A dict containing all identifier-typed columns.
target (str) – The target column.
mode (str) – Can be “predict” or “train”.
imputers (Dict[str, BaseImputer]) – The key corresponds to the single input column that will be imputed by the object. Refer to the imputer documentation for more details.
timeseries_settings (TimeseriesSettings) – Timeseries-related settings, only relevant for timeseries predictors; otherwise can be the default object.
anomaly_detection (bool) – Are we detecting anomalies with this predictor?
- Return type
DataFrame
- Returns
The cleaned data
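The core cleaning idea can be sketched on a single column: cast each value to the declared type, map anything malformed to None, then optionally impute. This is a hedged toy version under stated assumptions; `toy_cleaner` and its imputer signature are hypothetical, and the real cleaner operates on whole DataFrames with dtype dictionaries.

```python
# Hedged sketch of the cleaning step for one column. `caster` plays the
# role of the standardized type; `imputer` is an optional callable that
# maps the partially cleaned column to a fill value.

def toy_cleaner(column, caster, imputer=None):
    cleaned = []
    for value in column:
        try:
            cleaned.append(caster(value))
        except (TypeError, ValueError):
            cleaned.append(None)  # malformed / invalid -> None
    if imputer is not None:
        # Replace the None placeholders with an imputed value.
        cleaned = [imputer(cleaned) if v is None else v for v in cleaned]
    return cleaned
```

Without an imputer, invalid entries stay as None, which is the default behavior described above.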
- data.splitter(data, tss, dtype_dict, seed, pct_train, pct_dev, pct_test, target)[source]¶
Splits data into training, dev and testing datasets.
The proportion of data for each split must be specified (JSON-AI sets defaults to 80/10/10). First, rows in the dataset are shuffled randomly. Then a simple split is done. If a target value is provided and is of data type categorical/binary, then the splits will be stratified to maintain the representative populations of each class.
- Parameters
data (DataFrame) – Input dataset to be split.
tss (TimeseriesSettings) – time-series specific details for splitting.
dtype_dict (Dict[str, str]) – Dictionary with the data type of all columns.
seed (int) – Random state for pandas data-frame shuffling.
pct_train (float) – training fraction of data; must be less than 1.
pct_dev (float) – dev fraction of data; must be less than 1.
pct_test (float) – testing fraction of data; must be less than 1.
target (str) – Name of the target column; if specified, data will be stratified on this column.
- Return type
Dict[str, DataFrame]
- Returns
A dictionary containing the keys train, test and dev with their respective data frames, as well as the “stratified_on” key indicating which columns the data was stratified on (None if it wasn’t stratified on anything)
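The shuffle-then-slice mechanism with the JSON-AI default 80/10/10 proportions can be sketched as below. This is a minimal assumption-laden version: `toy_splitter` is a hypothetical name, it works on plain sequences rather than DataFrames, and stratification (which the real splitter applies for categorical/binary targets) is omitted for brevity.

```python
import random

# Minimal sketch of a deterministic shuffle followed by a simple
# proportional slice into train/dev/test.

def toy_splitter(rows, seed, pct_train=0.8, pct_dev=0.1, pct_test=0.1):
    assert abs(pct_train + pct_dev + pct_test - 1.0) < 1e-9
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded, hence reproducible
    n = len(rows)
    n_train = int(n * pct_train)
    n_dev = int(n * pct_dev)
    return {
        'train': rows[:n_train],
        'dev': rows[n_train:n_train + n_dev],
        'test': rows[n_train + n_dev:],
    }
```

Splitting 100 rows this way yields 80/10/10 rows per subset, with every original row landing in exactly one of them.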
- data.timeseries_analyzer(data, dtype_dict, timeseries_settings, target)[source]¶
This module analyzes (pre-processed) time series data and stores a few useful insights used in the rest of Lightwood’s pipeline.
- Parameters
data (Dict[str, DataFrame]) – dictionary with the dataset split into train, val, test subsets.
dtype_dict (Dict[str, str]) – dictionary with inferred types for every column.
timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object. For more details, check lightwood.types.TimeseriesSettings.
target (str) – name of the target column.
- The following things are extracted from each time series inside the dataset:
group_combinations: all observed combinations of values for the set of group_by columns. The length of this list determines how many time series are in the data.
deltas: inferred sampling interval
ts_naive_residuals: Residuals obtained from the data by a naive forecaster that repeats the last-seen value.
ts_naive_mae: Mean residual value obtained from the data by a naive forecaster that repeats the last-seen value.
target_normalizers: objects that may normalize the data within any given time series for effective learning. See lightwood.encoder.time_series.helpers.common for available choices.
- Return type
Dict
- Returns
Dictionary with the aforementioned insights and the TimeseriesSettings object for future reference.
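The naive-forecaster quantities listed above are straightforward to compute: a naive forecaster predicts the last-seen value, so its residuals are the absolute successive differences and ts_naive_mae is their mean. A short sketch (function names here are illustrative, not the library's):

```python
# Naive forecast: predict series[i-1] for step i, so the residual at
# step i is |series[i] - series[i-1]|.

def naive_residuals(series):
    return [abs(series[i] - series[i - 1]) for i in range(1, len(series))]

def naive_mae(series):
    res = naive_residuals(series)
    return sum(res) / len(res)
```

These values give a baseline error scale against which trained forecasters can be compared.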
- data.transform_timeseries(data, dtype_dict, ts_analysis, timeseries_settings, target, mode)[source]¶
Block that transforms the dataframe of a time series task into a convenient format for use in later phases, such as model training.
- The main transformations performed by this block are:
Type casting (e.g. to numerical for the order_by column).
Windowing functions for historical context based on TimeseriesSettings.window parameter.
Explicitly add target columns according to the TimeseriesSettings.horizon parameter.
Flag all rows that are “predictable” based on all TimeseriesSettings.
Plus, handle all logic for the streaming use case (where forecasts are only emitted for the last observed data point).
- Parameters
data (DataFrame) – Dataframe with data to transform.
dtype_dict (Dict[str, str]) – Dictionary with the types of each column.
ts_analysis (dict) – dictionary with various insights into each series passed as training input.
timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object.
target (str) – The name of the target column to forecast.
mode (str) – Either “train” or “predict”, depending on what phase is calling this procedure.
- Return type
DataFrame
- Returns
A dataframe with all the transformations applied.
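The windowing and horizon transformations described above can be sketched on a plain list. This is a hedged illustration under stated assumptions: `toy_window_transform` is a hypothetical helper on a single series, whereas the real block operates on DataFrames, handles grouped series, and flags predictable rows.

```python
# For each time step, gather the previous `window` values as historical
# context and the next `horizon` values as forecast targets, mirroring
# the role of TimeseriesSettings.window and TimeseriesSettings.horizon.

def toy_window_transform(series, window, horizon):
    rows = []
    for t in range(window, len(series) - horizon + 1):
        rows.append({
            'history': series[t - window:t],   # past context
            'targets': series[t:t + horizon],  # future values to forecast
        })
    return rows
```

Only time steps with a full window behind them and a full horizon ahead of them produce a row; the real block instead flags which rows are "predictable".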