Data
The focus of these modules is on storing, transforming, cleaning, splitting, merging, getting and removing data.
- class data.ConcatedEncodedDs(encoded_ds_arr)
Bases: Generic[torch.utils.data.dataset.T_co]
ConcatedEncodedDs abstracts over multiple encoded datasources (EncodedDs) as if they were a single entity.
Create a Lightwood datasource from a data frame and some encoders. This class inherits from torch.utils.data.Dataset.
Note: by default, encoded representations are cached to avoid duplicated computations. If you want an option to disable this, please open an issue.
- Parameters
encoders – list of Lightwood encoders used to encode the data per each column.
data_frame – original dataframe.
target – name of the target column to predict.
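The idea of treating several datasources as one entity can be sketched with plain Python sequences. This is only an illustration under stated assumptions: the class name `ConcatenatedView` is hypothetical and is not Lightwood's ConcatedEncodedDs, and the lookup via cumulative lengths is one common way to do it, not necessarily the one Lightwood uses.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


class ConcatenatedView:
    """Hypothetical stand-in: presents several sequences as a single one."""

    def __init__(self, parts: List[Sequence]):
        self.parts = parts
        # Running totals of the part lengths, e.g. [3, 5] for lengths 3 and 2.
        self.cumulative = list(accumulate(len(p) for p in parts))

    def __len__(self) -> int:
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, idx: int):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Locate the first part whose running total exceeds idx.
        part_no = bisect_right(self.cumulative, idx)
        offset = idx - (self.cumulative[part_no - 1] if part_no else 0)
        return self.parts[part_no][offset]


view = ConcatenatedView([["a", "b", "c"], ["d", "e"]])
print(len(view), view[3])
```

Callers index `view` with a single global position and never need to know which underlying part holds the item.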
- class data.EncodedDs(encoders, data_frame, target)
Bases: Generic[torch.utils.data.dataset.T_co]
Create a Lightwood datasource from a data frame and some encoders. This class inherits from torch.utils.data.Dataset.
Note: by default, encoded representations are cached to avoid duplicated computations. If you want an option to disable this, please open an issue.
- Parameters
encoders (List[BaseEncoder]) – list of Lightwood encoders used to encode the data per each column.
data_frame (DataFrame) – original dataframe.
target (str) – name of the target column to predict.
- get_column_original_data(column_name)
Gets the original data for any given column of the EncodedDs.
- Parameters
column_name (str) – name of the column.
- Return type
Series
- Returns
A pd.Series with the original data stored in the column_name column.
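The caching behavior and the original-data accessor described for EncodedDs can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: `CachedEncodedRows` is a hypothetical class, encoders are modeled as plain callables, and columns are plain lists rather than a pandas DataFrame. Only the method name `get_column_original_data` is taken from the documentation above.

```python
from typing import Callable, Dict, List


class CachedEncodedRows:
    """Hypothetical stand-in for an encoded datasource with a row cache."""

    def __init__(self, encoders: Dict[str, Callable],
                 columns: Dict[str, List], target: str):
        self.encoders = encoders
        self.columns = columns        # original, unencoded data per column
        self.target = target
        self.cache: Dict[int, tuple] = {}
        self.encode_calls = 0         # counts cache misses, for illustration

    def __len__(self) -> int:
        return len(self.columns[self.target])

    def __getitem__(self, idx: int):
        if idx not in self.cache:
            self.encode_calls += 1
            features = [enc(self.columns[col][idx])
                        for col, enc in self.encoders.items()
                        if col != self.target]
            label = self.encoders[self.target](self.columns[self.target][idx])
            self.cache[idx] = (features, label)
        return self.cache[idx]

    def get_column_original_data(self, column_name: str) -> List:
        # Returns the stored raw values, untouched by any encoder.
        return self.columns[column_name]


ds = CachedEncodedRows(
    encoders={"size": float, "label": str.upper},
    columns={"size": ["1", "2"], "label": ["a", "b"]},
    target="label",
)
first = ds[0]
again = ds[0]  # served from the cache, no second encoding pass
```

The second access returns the cached tuple, so each row is encoded at most once, which is the duplicated computation the note above refers to.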
- data.timeseries_analyzer(data, dtype_dict, timeseries_settings, target)
This module analyzes (pre-processed) time series data and stores a few useful insights used in the rest of Lightwood’s pipeline.
- Parameters
data (Dict[str, DataFrame]) – dictionary with the dataset split into train, val, test subsets.
dtype_dict (Dict[str, str]) – dictionary with inferred types for every column.
timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object. For more details, check lightwood.types.TimeseriesSettings.
target (str) – name of the target column.
- The following things are extracted from each time series inside the dataset:
group_combinations: all observed combinations of values for the set of group_by columns. The length of this list determines how many time series are in the data.
deltas: inferred sampling interval
ts_naive_residuals: Residuals obtained from the data by a naive forecaster that repeats the last-seen value.
ts_naive_mae: Mean absolute error (MAE) obtained from the data by a naive forecaster that repeats the last-seen value.
target_normalizers: objects that may normalize the data within any given time series for effective learning. See lightwood.encoder.time_series.helpers.common for available choices.
- Return type
Dict
- Returns
Dictionary with the aforementioned insights and the TimeseriesSettings object for future reference.
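A few of the insights listed above can be sketched in plain Python. This is not Lightwood's implementation: the helper names `naive_residuals` and `sampling_interval` are hypothetical, residuals are assumed to be absolute differences, the sampling interval is assumed to be the most common gap between consecutive timestamps, and a single group_by column is used for brevity.

```python
from typing import Dict, List, Tuple


def naive_residuals(values: List[float]) -> Tuple[List[float], float]:
    """Absolute errors of a forecaster that repeats the last-seen value."""
    residuals = [abs(curr - prev) for prev, curr in zip(values, values[1:])]
    mae = sum(residuals) / len(residuals) if residuals else 0.0
    return residuals, mae


def sampling_interval(timestamps: List[float]) -> float:
    """Most common gap between consecutive sorted timestamps."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two timestamps")
    return max(set(gaps), key=gaps.count)


# Rows are (group, timestamp, target value); the data is made up.
rows = [
    ("store_a", 0, 10.0), ("store_a", 1, 12.0), ("store_a", 2, 11.0),
    ("store_b", 0, 5.0), ("store_b", 2, 9.0), ("store_b", 4, 9.0),
]
series: Dict[str, List[Tuple[float, float]]] = {}
for group, ts, value in rows:
    series.setdefault(group, []).append((ts, value))

analysis = {
    "group_combinations": list(series),
    "deltas": {g: sampling_interval([t for t, _ in pts])
               for g, pts in series.items()},
    "ts_naive_residuals": {g: naive_residuals([v for _, v in pts])[0]
                           for g, pts in series.items()},
    "ts_naive_mae": {g: naive_residuals([v for _, v in pts])[1]
                     for g, pts in series.items()},
}
```

Each observed group yields one time series, so the length of `group_combinations` equals the number of series in the data.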
- data.transform_timeseries(data, dtype_dict, ts_analysis, timeseries_settings, target, mode, pred_args=None)
Block that transforms the dataframe of a time series task into a format that is convenient for later phases such as model training.
- The main transformations performed by this block are:
Type casting (e.g. to numerical for order_by column).
Windowing functions for historical context based on TimeseriesSettings.window parameter.
Explicitly add target columns according to the TimeseriesSettings.horizon parameter.
Flag all rows that are “predictable” based on all TimeseriesSettings.
Plus, handle all logic for the streaming use case (where forecasts are only emitted for the last observed data point).
- Parameters
data (DataFrame) – Dataframe with data to transform.
dtype_dict (Dict[str, str]) – Dictionary with the types of each column.
ts_analysis (dict) – dictionary with various insights into each series passed as training input.
timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object.
target (str) – The name of the target column to forecast.
mode (str) – Either “train” or “predict”, depending on what phase is calling this procedure.
pred_args (Optional[PredictionArguments]) – Optional prediction arguments to control the transformation process.
- Return type
DataFrame
- Returns
A dataframe with all the transformations applied.
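The windowing, horizon and "predictable" steps listed for transform_timeseries can be sketched on a single series of values. This is a simplified illustration, not the library's code: the function `window_and_horizon` and its output keys are hypothetical, the context is assumed to hold the `window` values before each row (left-padded with None), and a row is assumed to be predictable only when all `horizon` target values exist.

```python
from typing import Dict, List


def window_and_horizon(values: List[float], window: int,
                       horizon: int) -> List[Dict]:
    rows = []
    for i in range(len(values)):
        # Historical context: up to `window` values before position i.
        history = values[max(0, i - window):i]
        # Left-pad so every row carries the same context length.
        context = [None] * (window - len(history)) + history
        # Explicit targets: the value at i plus the following horizon - 1.
        targets = values[i:i + horizon]
        rows.append({
            "context": context,
            "targets": targets,
            # Assumption: incomplete target windows are not predictable.
            "predictable": len(targets) == horizon,
        })
    return rows


rows = window_and_horizon([1.0, 2.0, 3.0, 4.0], window=2, horizon=2)
for row in rows:
    print(row)
```

In this sketch the last row lacks a full target window and is flagged as not predictable, so a training loop could filter on that flag.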