Encoders

Used for encoding data into PyTorch tensors and decoding it from PyTorch tensors.

class encoder.ArrayEncoder(stop_after, window=None, is_target=False, original_type=None)[source]

Fits a normalizer for array data.

To encode, ArrayEncoder returns a normalized window of previous data. It can be used for generic arrays, as well as for handling historical target values in time series tasks.

Currently supported normalizing strategies are minmax for numerical arrays, and a simple one-hot for categorical arrays. See lightwood.encoder.helpers for more details on each approach.
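As a rough illustration of the minmax strategy, the sketch below fits a minimum and maximum on training arrays and scales a fixed-length window into [0, 1]. The class name MinMaxWindowSketch, the use of plain lists instead of tensors, keeping only the last window values, and left-padding short arrays with zeros are all assumptions made for this example; they are not taken from Lightwood's implementation.

```python
from typing import Iterable, List


class MinMaxWindowSketch:
    """Hypothetical stand-in for a minmax array normalizer (not Lightwood code)."""

    def __init__(self, window: int):
        self.window = window
        self.lo = 0.0
        self.hi = 1.0

    def prepare(self, train_arrays: Iterable[Iterable[float]]) -> None:
        # Fit the global minimum and maximum over all training values.
        values = [v for arr in train_arrays for v in arr]
        self.lo, self.hi = min(values), max(values)

    def encode(self, arrays: Iterable[Iterable[float]]) -> List[List[float]]:
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        out = []
        for arr in arrays:
            # Assumption: only the last `window` values are kept.
            recent = list(arr)[-self.window:]
            scaled = [(v - self.lo) / span for v in recent]
            # Assumption: short arrays are left-padded with zeros.
            out.append([0.0] * (self.window - len(scaled)) + scaled)
        return out

    def decode(self, encoded: List[List[float]]) -> List[List[float]]:
        # Invert the scaling to return to the original data space.
        span = (self.hi - self.lo) or 1.0
        return [[v * span + self.lo for v in row] for row in encoded]


enc = MinMaxWindowSketch(window=3)
enc.prepare([[0.0, 5.0], [10.0, 2.5]])
encoded = enc.encode([[0.0, 5.0, 10.0], [2.5]])
decoded = enc.decode(encoded)
```

Note that padding cannot be told apart from real zeros when decoding in this sketch; a real encoder would need to track array lengths separately.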

Parameters
  • stop_after (float) – time budget in seconds.

  • window (Optional[int]) – expected length of array data.

  • original_type (Optional[dtype]) – element-wise data type

decode(data)[source]

Converts data as a list of arrays.

Parameters

data (Tensor) – Encoded data prepared by this array encoder

Return type

List[Iterable]

Returns

A list of iterable sequences in the original data space

encode(column_data)[source]

Encode the properties of a sequence-of-sequence representation

Parameters

column_data (Iterable[Iterable]) – Input column data to be encoded

Return type

Tensor

Returns

a torch-tensor representing the encoded sequence

prepare(train_priming_data, dev_priming_data)[source]

Prepare the array encoder for sequence data.

Parameters
  • train_priming_data (Iterable[Iterable]) – Training data of sequences

  • dev_priming_data (Iterable[Iterable]) – Dev data of sequences

class encoder.BaseEncoder(is_target=False)[source]

Base class for all encoders.

An encoder should return encoded representations of any columnar data. The procedure for this is defined inside the encode() method.

If this encoder is expected to handle an output column, then it also needs to implement the respective decode() method that handles the inverse transformation from encoded representations to the final prediction in the original column space.

For encoders that learn representations (as opposed to rule-based), the prepare() method will handle all learning logic.

The to() method is used to move PyTorch-based encoders to and from a GPU.

Parameters
  • is_target – Whether the data to encode is the target, as per the problem definition.

  • is_timeseries_encoder – Whether encoder represents sequential/time-series data. Lightwood must provide specific treatment for this kind of encoder

  • is_trainable_encoder – Whether the encoder must return learned representations. Lightwood checks whether this flag is present in order to pass data to the feature representation via the prepare statement.

Class Attributes
  • is_prepared – Internal flag to signal that the prepare() method has been successfully executed.

  • is_nn_encoder – Whether the encoder is neural network-based.

  • dependencies – List of additional columns that the encoder might need to encode.

  • output_size – Length of each encoding tensor for a single data point.
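The prepare/encode/decode contract described for this class can be sketched with a toy rule-based encoder. MaxAbsEncoderSketch is a hypothetical class written for this example: it does not inherit from BaseEncoder and returns plain lists rather than tensors, so it only mirrors the shape of the interface.

```python
from typing import Iterable, List


class MaxAbsEncoderSketch:
    """Hypothetical encoder mirroring the prepare/encode/decode contract."""

    def __init__(self, is_target: bool = False):
        self.is_target = is_target
        self.is_prepared = False  # set once prepare() has run
        self.output_size = 1      # one value per data point
        self.scale = 1.0

    def prepare(self, priming_data: Iterable[float]) -> None:
        # Rule-based "learning": remember the largest absolute value seen.
        self.scale = max((abs(x) for x in priming_data), default=1.0) or 1.0
        self.is_prepared = True

    def encode(self, column_data: Iterable[float]) -> List[List[float]]:
        if not self.is_prepared:
            raise RuntimeError("call prepare() before encode()")
        return [[x / self.scale] for x in column_data]

    def decode(self, encoded_data: List[List[float]]) -> List[float]:
        # Inverse transformation, needed when the encoder handles a target.
        return [row[0] * self.scale for row in encoded_data]


enc = MaxAbsEncoderSketch(is_target=True)
enc.prepare([2.0, -4.0, 1.0])
encoded = enc.encode([4.0, -2.0])
decoded = enc.decode(encoded)
```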

decode(encoded_data)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters

encoded_data (Tensor) – The input representation in encoded format

Return type

List[object]

Returns

The decoded representation of data, per column, in the original data-type presented.

encode(column_data)[source]

Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.

After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.
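The concatenation step can be pictured with the following sketch. The function name build_feature_vectors and the column names are hypothetical, and lists stand in for tensors; it only shows how per-column encodings line up into one feature vector per row.

```python
from typing import Dict, List


def build_feature_vectors(
    encoded_columns: Dict[str, List[List[float]]]
) -> List[List[float]]:
    """Concatenate the encoded vectors of every column, row by row."""
    columns = list(encoded_columns.values())
    n_rows = len(columns[0])
    rows = []
    for i in range(n_rows):
        row: List[float] = []
        for col in columns:
            row.extend(col[i])  # append this column's encoding for row i
        rows.append(row)
    return rows


features = build_feature_vectors({
    "age": [[0.3], [0.7]],                # 1-dimensional encoding
    "color": [[1.0, 0.0], [0.0, 1.0]],    # 2-dimensional encoding
})
```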

Parameters

column_data (Iterable[object]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Return type

Tensor

Returns

The encoded representation of data, per column

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.

Parameters

priming_data (Iterable[object]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Return type

None

class encoder.BinaryEncoder(is_target=False, target_weights=None)[source]

Creates a one-hot-encoding for binary class data. Assume two arbitrary categories \(A\) and \(B\); representation for them will be as such:

\[\begin{aligned} A &= [1, 0] \\ B &= [0, 1] \end{aligned}\]

This encoder is a specialized case of one-hot encoding (OHE); unknown categories are explicitly handled as [0, 0]. Unknowns may only be reported if the input row value is NULL (or python None type) or if new data, after the encoder is prepared, has examples outside the feature map.
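A minimal sketch of this behavior, using plain lists instead of tensors. The helper names prepare_binary_map and encode_binary are hypothetical, and assigning indices in sorted order is an assumption of the example.

```python
from typing import Dict, Iterable, List, Optional


def prepare_binary_map(priming_data: Iterable[Optional[str]]) -> Dict[str, int]:
    # Assumption: categories are indexed in sorted order; None is skipped.
    categories = sorted({x for x in priming_data if x is not None})
    if len(categories) != 2:
        raise ValueError("binary data must contain exactly 2 categories")
    return {cat: i for i, cat in enumerate(categories)}


def encode_binary(
    column_data: Iterable[Optional[str]], cat_map: Dict[str, int]
) -> List[List[int]]:
    encoded = []
    for value in column_data:
        row = [0, 0]  # unknown or None stays [0, 0]
        if value in cat_map:
            row[cat_map[value]] = 1
        encoded.append(row)
    return encoded


cat_map = prepare_binary_map(["A", "B", "A", None])
encoded = encode_binary(["A", "B", None, "C"], cat_map)
```

Here both None and the unseen category "C" fall outside the feature map and are encoded as [0, 0].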

When data is typed with Lightwood, this class is only deployed if an input data type is explicitly recognized as binary (i.e. the column has only 2 unique values like True/False). If future data shows a new category (thus the data is no longer truly binary), this encoder will no longer be appropriate unless you are comfortable mapping ALL new classes as [0, 0].

An encoder can represent a feature column or a target column. When it represents a target, is_target is True and target_weights can be supplied. The target_weights parameter enables users to specify how heavily each class should be weighted within a mixer, which is useful for imbalanced classes.

By default, the StatisticalAnalysis phase will provide target_weights as the relative fraction of each class in the data, which is important for imbalanced populations. For example, suppose there is an 80/20 imbalanced representation across 2 different classes; target_weights will then be a mapping such as:

target_weights = {"class1": 0.8, "class2": 0.2}

Users should note that models will be presented with the inverse of the target weights, inv_target_weights, which will perform the 1/target_value_per_class operation. This means large values will result in small weights for the model.
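The inversion can be illustrated in a couple of lines. This is only the arithmetic described above, not the code path Lightwood uses to build inv_target_weights.

```python
# Relative fraction of each class, as in the example above.
target_weights = {"class1": 0.8, "class2": 0.2}

# 1 / fraction: the rarer class receives the larger weight.
inv_target_weights = {
    name: 1.0 / fraction for name, fraction in target_weights.items()
}
```

With these values the minority class (class2) ends up weighted four times more heavily than the majority class.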

decode(encoded_data)[source]

Given encoded data, returns the original category labels. The input to decode makes no presumption about whether the data is already in OHE form, as some models may output a set of probabilities or weights assigned to each class. The decoded value will always be the argmax of such a vector.

If the vector is all 0s, the output is decoded as “UNKNOWN”.
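The decoding rule can be sketched as follows. decode_binary and rev_map are hypothetical names used for this example, and lists stand in for tensors.

```python
from typing import Dict, List, Sequence


def decode_binary(
    encoded_data: Sequence[Sequence[float]], rev_map: Dict[int, str]
) -> List[str]:
    decoded = []
    for row in encoded_data:
        if all(v == 0 for v in row):
            # An all-zero vector carries no class information.
            decoded.append("UNKNOWN")
        else:
            # Argmax works for OHE vectors and for probabilities alike.
            best = max(range(len(row)), key=lambda i: row[i])
            decoded.append(rev_map[best])
    return decoded


rev_map = {0: "A", 1: "B"}
decoded = decode_binary([[1, 0], [0.2, 0.8], [0, 0]], rev_map)
```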

Parameters

encoded_data (Tensor) – the output of a mixer model

Returns

Decoded values for each data point

decode_probabilities(encoded_data)[source]

Provides decoded answers, as well as a probability assignment to each data point.

Parameters

encoded_data (Tensor) – the output of a mixer model

Return type

Tuple[List[str], List[List[float]], Dict[int, str]]

Returns

Decoded values for each data point, Probability vector for each category, and the reverse map of dimension to category name

encode(column_data)[source]

Encodes categories as OHE binary. Unknown/unrecognized classes return [0,0].

Parameters

column_data (Iterable[str]) – Pre-processed data to encode

Return type

Tensor

Returns

Encoded data of form \(N_{rows} \times 2\)

prepare(priming_data)[source]

Given priming data, create a map/inverse-map corresponding category name to index (and vice versa).

Parameters

priming_data (Iterable[str]) – Binary data to encode

class encoder.CatArrayEncoder(stop_after, window=None, is_target=False)[source]
Parameters
  • stop_after (float) – time budget in seconds.

  • window (Optional[int]) – expected length of array data.

  • original_type – element-wise data type

decode(data)[source]

Converts data as a list of arrays.

Parameters

data (Tensor) – Encoded data prepared by this array encoder

Return type

List[Iterable]

Returns

A list of iterable sequences in the original data space

prepare(train_priming_data, dev_priming_data)[source]

Prepare the array encoder for sequence data.

Parameters
  • train_priming_data (Iterable[Iterable]) – Training data of sequences

  • dev_priming_data (Iterable[Iterable]) – Dev data of sequences

class encoder.CategoricalAutoEncoder(stop_after=3600, is_target=False, max_encoded_length=100, desired_error=0.01, batch_size=200)[source]

Trains an autoencoder (AE) to represent categorical information with over 100 categories. This is used to ensure that feature vectors for categorical data with many categories are not excessively large.

The AE defaults to a vector sized 100 but can be adjusted to user preference. It is highly advised NOT to use this encoder to feature engineer your target, as reconstruction accuracy will determine your AE’s ability to decode properly.

Parameters
  • stop_after (float) – Stops training with provided time limit (sec)

  • is_target (bool) – Encoder represents target class (NOT recommended)

  • max_encoded_length (int) – Maximum length of vector represented

  • desired_error (float) – Threshold for reconstruction accuracy error

  • batch_size (int) – Minimum batch size while training

decode(encoded_data)[source]

Decodes the original categories from the embedding space.

Warning: If your reconstruction accuracy is not 100%, the CatAE may not return the correct category.

Parameters

encoded_data (Tensor) – A torch tensor of embeddings for category predictions

Return type

List[str]

Returns

A list of ‘translated’ categories for each embedding

encode(column_data)[source]

Encodes categorical information in column as the compressed vector from the CatAE.

Parameters

column_data (Iterable[str]) – An iterable of category samples from a column

Return type

Tensor

Returns

An embedding for each sample in original input

prepare(train_priming_data, dev_priming_data)[source]

Creates inputs and prepares a categorical autoencoder (CatAE) for input data. Currently, does not support a dev set; inputs for train and dev are concatenated together to train an autoencoder.

Parameters
  • train_priming_data (Series) – Input training data

  • dev_priming_data (Series) – Input dev data (Not supported currently)

class encoder.DatetimeEncoder(is_target=False)[source]

This encoder produces an encoded representation for timestamps.

The approach consists of decomposing each timestamp object into its constituent units (e.g. day-of-week, month, year, etc.), and describing each of those with a single value that represents the magnitude relative to a sensible cycle length.
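One way to picture this decomposition is shown below. The choice of units, their order, and the cycle lengths (12 months, 31 days, 7 weekdays, and so on) are assumptions made for this sketch. The year is left out because it has no natural cycle, and how Lightwood scales it is not shown here.

```python
from datetime import datetime, timezone
from typing import List


def encode_timestamp(unix_timestamp: float) -> List[float]:
    """Describe each unit of a timestamp as a fraction of its cycle."""
    dt = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return [
        dt.month / 12,
        dt.day / 31,
        dt.weekday() / 7,  # Monday is 0, Sunday is 6
        dt.hour / 24,
        dt.minute / 60,
        dt.second / 60,
    ]


# 1970-01-01 00:00:00 UTC, which fell on a Thursday (weekday index 3).
vector = encode_timestamp(0)
```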

decode(encoded_data, return_as_datetime=False)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters

encoded_data – The input representation in encoded format

Returns

The decoded representation of data, per column, in the original data-type presented.

encode(data)[source]
Parameters

data – Data to encode; currently either a list of lists or a pd.Series with lists (a consistent input type is still a TODO in the source).

Returns

encoded data

encode_one(unix_timestamp)[source]

Encodes a list of unix_timestamps, or a list of tensors with unix_timestamps.

Parameters

data – list of unix_timestamps (unix_timestamp resolution is seconds)

Returns

a list of vectors

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.

Parameters

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

class encoder.DatetimeNormalizerEncoder(is_target=False, sinusoidal=False)[source]
decode(encoded_data, return_as_datetime=False)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters

encoded_data – The input representation in encoded format

Returns

The decoded representation of data, per column, in the original data-type presented.

encode(data)[source]
Parameters

data – Data to encode; currently either a list of lists or a pd.Series with lists (a consistent input type is still a TODO in the source).

Returns

encoded data

encode_one(data)[source]

Encodes a list of unix_timestamps, or a list of tensors with unix_timestamps.

Parameters

data – list of unix_timestamps (unix_timestamp resolution is seconds)

Returns

a list of vectors

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.

Parameters

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

class encoder.Img2VecEncoder(stop_after=3600, is_target=False, scale=(224, 224), mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])[source]

Generates encoded representations for images using a pre-trained deep neural network. Inputs must be strings giving the location of each image (a file path or URL).

Without user-specified details, all input images are rescaled to a standard size of 224x224, and normalized using the mean and standard deviation of the ImageNet dataset (as it was used to train the underlying NN).
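The normalization step amounts to a per-channel shift and scale. The sketch below applies it to a single RGB pixel using the default mean and std from the class signature. The helper name normalize_pixel and the assumption that channel values are already scaled to [0, 1] are specific to this example, and the 224x224 resize is not shown.

```python
from typing import List, Sequence

# Defaults from the Img2VecEncoder signature (ImageNet statistics).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(
    rgb: Sequence[float],
    mean: Sequence[float] = IMAGENET_MEAN,
    std: Sequence[float] = IMAGENET_STD,
) -> List[float]:
    # Assumption: each channel value is already in the range [0, 1].
    return [(c - m) / s for c, m, s in zip(rgb, mean, std)]


# A pixel equal to the dataset mean maps to zero in every channel.
normalized = normalize_pixel([0.485, 0.456, 0.406])
```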

This encoder currently does not support a decode() call; models with an image output will not work.

For more information about the neural network this encoder uses, refer to the lightwood.encoder.image.helpers.img_to_vec.Img2Vec.

Parameters
  • stop_after (float) – time budget, in seconds.

  • is_target (bool) – Whether encoder represents target or not

  • scale (Tuple[int, int]) – Resize scale of image (x, y)

  • mean (List[float]) – Mean of pixel values

  • std (List[float]) – Standard deviation of pixel values

decode(encoded_values_tensor)[source]

Currently not supported

encode(images)[source]

Creates encodings for a list of images; each image is referenced by a filepath or url.

Parameters

images (List[str]) – list of images, each image is a path to a file or a url.

Return type

Tensor

Returns

a torch.FloatTensor

prepare(train_priming_data, dev_priming_data)[source]

Sets an Img2Vec object (model) and sets the expected size for encoded representations.

to(device, available_devices)[source]

Changes device of model to support CPU/GPU

Parameters
  • device – will move the model to this device.

  • available_devices – all available devices as reported by lightwood.

Returns

same object but moved to the target device.

class encoder.MultiHotEncoder(is_target=False)[source]
decode(vectors)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters

encoded_data – The input representation in encoded format

Returns

The decoded representation of data, per column, in the original data-type presented.

encode(column_data)[source]

Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.

After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.

Parameters

column_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Returns

The encoded representation of data, per column

prepare(priming_data, max_dimensions=100)[source]

Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.

Parameters

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

class encoder.NumArrayEncoder(stop_after, window=None, is_target=False, positive_domain=False)[source]
Parameters
  • stop_after (float) – time budget in seconds.

  • window (Optional[int]) – expected length of array data.

  • original_type – element-wise data type

class encoder.NumericEncoder(data_type=None, is_target=False, positive_domain=False)[source]

The numeric encoder takes numbers (float or integer) and converts them into tensors of the form: [0 if the number is none, otherwise 1; 1 if the number is positive, otherwise 0; natural_log(abs(number)); number/absolute_mean]

When encoding target values, the representation is instead: [1 if the number is positive, otherwise 0; natural_log(abs(number)); number/absolute_mean], since target values can’t be none.

The absolute_mean is computed in the prepare method and is simply the mean of the absolute values of all numbers fed to prepare (excluding none values).

Here, none stands for an actual Python None value or any sort of non-numeric value (a string, nan, inf).
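The representation can be sketched in plain Python as follows. The function names are hypothetical, lists stand in for tensors, and two behaviors are assumptions of this example rather than documented facts: none-like feature values become an all-zero vector, and the log term is set to 0 when the number is exactly 0.

```python
import math
from typing import Iterable, List


def is_number(x: object) -> bool:
    # Strings, None, nan and inf all count as "none" in the text above.
    return (
        isinstance(x, (int, float))
        and not isinstance(x, bool)
        and math.isfinite(x)
    )


def absolute_mean(priming_data: Iterable[object]) -> float:
    valid = [abs(x) for x in priming_data if is_number(x)]
    return sum(valid) / len(valid)


def encode_number(x: object, abs_mean: float, is_target: bool = False) -> List[float]:
    if not is_number(x):
        if is_target:
            raise ValueError("target values cannot be none")
        return [0.0, 0.0, 0.0, 0.0]  # assumption: all zeros for none
    vec = [
        1.0 if x > 0 else 0.0,
        math.log(abs(x)) if x != 0 else 0.0,  # assumption for x == 0
        x / abs_mean,
    ]
    # Features carry a leading "is not none" flag; targets do not.
    return vec if is_target else [1.0] + vec


abs_mean = absolute_mean([2.0, -4.0, None, "n/a"])
feature_vec = encode_number(-4.0, abs_mean)
missing_vec = encode_number(None, abs_mean)
target_vec = encode_number(1.0, abs_mean, is_target=True)
```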

Parameters
  • data_type (Optional[dtype]) – The data type of the number (integer, float, quantity)

  • is_target (bool) – Indicates whether the encoder refers to a target column or feature column (True==target)

  • positive_domain (bool) – Forces the encoder to always output positive values

decode(encoded_values, decode_log=None)[source]
Parameters
  • encoded_values (Union[List[Union[int, float, bool]], Tensor]) – The encoded values to decode into single numbers

  • decode_log (Optional[bool]) – Whether to decode the log or linear part of the representation, since the encoded vector contains both a log and a linear part

Return type

list

Returns

The decoded number

encode(data)[source]
Parameters

data (Iterable) – An iterable data structure containing the numbers to be encoded

Returns

A torch tensor with the representations of each number

prepare(priming_data)[source]

NumericEncoder uses a rule-based approach to prepare on training (priming) data. Statistics such as the absolute mean are computed from this distribution.

Parameters

priming_data (Iterable) – an iterable data structure containing numbers, which will be used to compute the values used for normalizing the encoded representations

class encoder.OneHotEncoder(is_target=False, target_weights=None, use_unknown=True)[source]

Creates a one-hot encoding (OHE) for categorical data. One-hot encoding represents categorical information as a vector where each individual dimension corresponds to a category. Each category maps 1:1 to a dimension, and membership is indicated by a “1” in that position. For example, imagine 3 categories, \(A\), \(B\), and \(C\); these can be represented as follows:

\[\begin{aligned} A &= [1, 0, 0] \\ B &= [0, 1, 0] \\ C &= [0, 0, 1] \end{aligned}\]

The OHE encoder operates in 2 modes:
  1. “use_unknown=True”: Makes an \(N+1\) length vector for \(N\) categories, the first index always corresponds to the unknown category.

  2. “use_unknown=False”: Makes an \(N\) length vector for \(N\) categories, where an empty vector of 0s indicates an unknown/missing category.
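The two modes can be compared with a small sketch. The helper names prepare_map and encode_one_hot are hypothetical, categories are indexed in sorted order as an assumption, and lists stand in for tensors.

```python
from typing import Dict, Iterable, List


def prepare_map(categories: Iterable[str], use_unknown: bool) -> Dict[str, int]:
    # With use_unknown=True, index 0 is reserved for the unknown category.
    offset = 1 if use_unknown else 0
    return {cat: i + offset for i, cat in enumerate(sorted(set(categories)))}


def encode_one_hot(value: str, cat_map: Dict[str, int], use_unknown: bool) -> List[int]:
    size = len(cat_map) + (1 if use_unknown else 0)
    row = [0] * size
    if value in cat_map:
        row[cat_map[value]] = 1
    elif use_unknown:
        row[0] = 1  # explicit "unknown" dimension
    # With use_unknown=False an unseen value stays all zeros.
    return row


cats = ["A", "B", "C"]
with_unknown = prepare_map(cats, use_unknown=True)
without_unknown = prepare_map(cats, use_unknown=False)

a_plus = encode_one_hot("A", with_unknown, use_unknown=True)
z_plus = encode_one_hot("Z", with_unknown, use_unknown=True)
a_plain = encode_one_hot("A", without_unknown, use_unknown=False)
z_plain = encode_one_hot("Z", without_unknown, use_unknown=False)
```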

An encoder can represent a feature column or a target column. When it represents a target, is_target is True and target_weights can be supplied. The target_weights parameter enables users to specify how heavily each class should be weighted within a mixer, which is useful for imbalanced classes.

By default, the StatisticalAnalysis phase will provide target_weights as the relative fraction of each class in the data, which is important for imbalanced populations. For example, suppose there is an 80/5/15 imbalanced representation across 3 different classes; target_weights will then be a mapping such as:

target_weights = {"class1": 0.8, "class2": 0.05, "class3": 0.15}

Users should note that models will be presented with the inverse of the target weights, inv_target_weights, which will perform the 1/target_value_per_class operation. This means large values will result in small weights for the model.

Parameters
  • is_target (bool) – True if this encoder featurizes the target column

  • target_weights (Optional[Dict[str, float]]) – Percentage of total population represented by each category (between [0, 1]).

  • use_unknown (bool) – True uses an extra dimension to account for unknown/out-of-distribution categories

decode(encoded_data)[source]

Decodes OHE mapping into the original categories. Since this approach uses an argmax, decoding flexibly works either on logits or an explicitly OHE vector.

Parameters

encoded_data – Encoded data to decode, either logits or explicit OHE vectors

Returns

The original category names for the encoded data.

decode_probabilities(encoded_data)[source]

Provides decoded answers, as well as a probability assignment to each data point.

Parameters

encoded_data (Tensor) – the output of a mixer model

Return type

Tuple[List[str], List[List[float]], Dict[int, str]]

Returns

Decoded values for each data point, Probability vector for each category, and the reverse map of dimension to category name

encode(column_data)[source]

Encodes pre-processed data into OHE. Unknown/unrecognized classes are encoded as a vector of all 0s.

Parameters

column_data (Iterable[str]) – Pre-processed data to encode

Return type

Tensor

Returns

Encoded data of form \(N_{rows} x N_{categories}\)

prepare(priming_data)[source]

Prepares the OHE Encoder by creating a dictionary mapping.

Unknown categories must be explicitly handled as python None types.

class encoder.PretrainedLangEncoder(stop_after, is_target=False, batch_size=10, max_position_embeddings=None, frozen=False, epochs=1, output_type=None, embed_mode=True)[source]
Parameters
  • is_target (bool) – Whether this encoder represents the target. NOT functional for text generation yet.

  • batch_size (int) – size of batch while fine-tuning

  • max_position_embeddings (Optional[int]) – max sequence length of input text

  • custom_train – If True, trains model on target provided

  • frozen (bool) – If True, freezes transformer layers during training.

  • epochs (int) – number of epochs to train model with

  • output_type (Optional[str]) – Data dtype of the target; if categorical/binary, the option to return logits is possible.

  • embed_mode (bool) – If True, assumes the output of the encode() step is the CLS embedding (this can be trained or not). If False, returns the logits of the tuned task.

decode(encoded_values_tensor, max_length=100)[source]

Text generation via decoding is not supported.

encode(column_data)[source]

Converts each text example in a column into encoded state. This can be either a vector embedding of the [CLS] token (represents the full text input) OR the logits prediction of the output.

The transformer model is of form: transformer base + pre-classifier linear layer + classifier layer

The embedding returned is of the [CLS] token after the pre-classifier layer; from internal testing, we found the latent space most highly separated across classes.

If the encoder represents the logits in classification, returns a soft-maxed output of the class vector.
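For reference, the softmax operation mentioned here turns a vector of logits into values that are positive and sum to 1. The plain-Python version below is a generic illustration of that operation, not the encoder's own code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
uniform = softmax([0.0, 0.0])
```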

Parameters

column_data (Iterable[str]) – List of text data as strings

Return type

Tensor

Returns

Embedded vector N_rows x N_embed_dim OR logits vector N_rows x N_classes, depending on whether embed_mode is True or not.

is_trainable_encoder: bool = True

Creates a contextualized embedding to represent input text via the [CLS] token vector from DistilBERT (transformers) (Sanh et al. 2019 - https://arxiv.org/abs/1910.01108).

In certain text tasks, this model can use a transformer to automatically fine-tune on a class of interest (providing there is a 2 column dataset, where the input column is text).

prepare(train_priming_data, dev_priming_data, encoded_target_values)[source]

Fine-tunes a transformer on the priming data.

CURRENTLY WIP; train + dev are placeholders for a validation-based approach.

Train + Dev are concatenated together and a transformer is then fine-tuned with weight decay applied on the transformer parameters. The option to freeze the underlying transformer and only train a linear layer exists if frozen=True. This trains faster, but on internal benchmarks the performance is often lower than with fine-tuning.

Parameters
  • train_priming_data (Iterable[str]) – Text data in the train set

  • dev_priming_data (Iterable[str]) – Text data in the dev set (not currently supported; can be empty)

  • encoded_target_values (Tensor) – Encoded target labels in N_rows x N_output_dimension

to(device, available_devices)[source]

Converts encoder models to device specified (CPU/GPU)

Transformers are LARGE models; run them on a GPU where possible for the fastest execution.

class encoder.ShortTextEncoder(is_target=False, mode=None)[source]
Parameters
  • is_target

  • mode – None or “concat” or “mean”. When None, it will be set automatically based on is_target: (is_target) -> ‘concat’ (not is_target) -> ‘mean’
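The automatic mode selection can be sketched as below. resolve_mode and combine are hypothetical helpers. In particular, combine reflects an assumed reading of the two modes (averaging versus concatenating per-token vectors) that this page does not spell out.

```python
from typing import List, Optional, Sequence


def resolve_mode(is_target: bool, mode: Optional[str] = None) -> str:
    # When mode is None it is derived from is_target, as described above.
    if mode is None:
        return "concat" if is_target else "mean"
    if mode not in ("concat", "mean"):
        raise ValueError('mode must be None, "concat" or "mean"')
    return mode


def combine(token_vectors: Sequence[Sequence[float]], mode: str) -> List[float]:
    # Assumption: modes describe how per-token vectors are merged.
    if mode == "mean":
        n = len(token_vectors)
        return [sum(col) / n for col in zip(*token_vectors)]
    return [v for vec in token_vectors for v in vec]


tokens = [[1.0, 2.0], [3.0, 4.0]]
as_feature = combine(tokens, resolve_mode(is_target=False))
as_target = combine(tokens, resolve_mode(is_target=True))
```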

decode(vectors)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters

encoded_data – The input representation in encoded format

Returns

The decoded representation of data, per column, in the original data-type presented.

encode(column_data)[source]

Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.

After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.

Parameters

column_data (List[str]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Return type

Tensor

Returns

The encoded representation of data, per column

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.

Parameters

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

encoder.TextRnnEncoder

alias of lightwood.encoder.text.rnn.RnnEncoder

class encoder.TimeSeriesEncoder(stop_after, window=None, is_target=False, original_type=None)[source]

Time series encoder. This module will pass the normalized series values, along with moving averages taken from the series’ last window values.

Parameters
  • stop_after (float) – time budget in seconds.

  • window (Optional[int]) – expected length of array data.

  • original_type (Optional[dtype]) – element-wise data type
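A loose sketch of the idea of appending moving averages to the normalized values. The function names, the per-series minmax normalization, and the choice of averaging spans are all assumptions made for this example and may differ from what the encoder actually computes.

```python
from typing import List, Sequence


def moving_average(series: Sequence[float], span: int) -> float:
    # Average of the last `span` values of the series.
    recent = list(series)[-span:]
    return sum(recent) / len(recent)


def encode_series(
    series: Sequence[float], window: int, spans: Sequence[int] = (2,)
) -> List[float]:
    values = list(series)[-window:]
    lo, hi = min(values), max(values)
    scale = (hi - lo) or 1.0  # avoid division by zero
    # Assumption: minmax normalization over the window itself.
    normalized = [(v - lo) / scale for v in values]
    averages = [moving_average(normalized, s) for s in spans]
    return normalized + averages


encoded = encode_series([1.0, 2.0, 3.0, 5.0], window=4, spans=(2,))
```

In this layout a decoder would drop the trailing averages before mapping the values back to the original space.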

decode(data)[source]

Converts data as a list of arrays. Removes all encoded moving average information.

Parameters

data (Tensor) – Encoded data prepared by this array encoder

Return type

List[Iterable]

Returns

A list of iterable sequences in the original data space

encode(column_data)[source]

Encodes time series data.

Parameters

column_data (Iterable[Iterable]) – Input column data to be encoded

Return type

Tensor

Returns

a torch tensor representing the encoded time series.

class encoder.TsArrayNumericEncoder(timesteps, is_target=False, positive_domain=False, grouped_by=None)[source]

This encoder handles arrays of numerical time series data by wrapping the numerical encoder with behavior specific to time series tasks.

Parameters
  • timesteps (int) – length of forecasting horizon, as defined by TimeseriesSettings.window.

  • is_target (bool) – whether this encoder corresponds to the target column.

  • positive_domain (bool) – whether the column domain is expected to be positive numbers.

  • grouped_by – what columns, if any, are considered to group the original column and yield multiple time series.

decode(encoded_values, dependency_data=None)[source]

Decodes a list of encoded arrays into values in their original domains.

Parameters
  • encoded_values – encoded slices of numerical time series.

  • dependency_data – used to determine the correct normalizer for the input.

Return type

List[List]

Returns

a list of decoded time series arrays.

decode_one(encoded_value, dependency_data={})[source]

Decodes a single window of a time series into its original domain.

Parameters
  • encoded_value – encoded slice of a numerical time series.

  • dependency_data – used to determine the correct normalizer for the input.

Return type

List

Returns

a list of length TimeseriesSettings.window with decoded values for the forecasted time series.

encode(data, dependency_data={})[source]

Encodes a list of time series arrays using the underlying time series numerical encoder.

Parameters
  • data (Iterable[Iterable]) – list of numerical values to encode. Its length is determined by the tss.window parameter, and all data points belong to the same time series.

  • dependency_data (Optional[Dict[str, str]]) – dict with values of each group_by column for the time series, used to retrieve the correct normalizer.

Return type

Tensor

Returns

list of encoded time series arrays. Tensor is (len(data), N x K)-shaped, where N: self.data_window and K: sub-encoder # of output features.

encode_one(data, dependency_data={})[source]

Encodes a single windowed slice of any given time series.

Parameters
  • data (Iterable) – windowed slice of a numerical time series.

  • dependency_data (Optional[Dict[str, str]]) – used to determine the correct normalizer for the input.

Return type

Tensor

Returns

an encoded time series array, as per the underlying TsNumericEncoder object.

The output of this encoder for all time steps is concatenated, so the final shape of the tensor is (1, NxK), where N: self.data_window and K: sub-encoder # of output features.
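The resulting shape can be checked with a small sketch. encode_step is a hypothetical sub-encoder producing K = 2 features per time step, chosen only to make the (1, N x K) layout visible; lists stand in for tensors.

```python
from typing import List, Sequence


def encode_step(value: float, abs_mean: float) -> List[float]:
    # Hypothetical sub-encoder with K = 2 output features.
    return [1.0 if value > 0 else 0.0, value / abs_mean]


def encode_window(window_values: Sequence[float], abs_mean: float) -> List[List[float]]:
    flat: List[float] = []
    for v in window_values:
        flat.extend(encode_step(v, abs_mean))  # concatenate per-step outputs
    return [flat]  # shape (1, N * K)


# N = 3 time steps, K = 2 features per step.
encoded = encode_window([2.0, -4.0, 6.0], abs_mean=4.0)
```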

prepare(priming_data)[source]

This method prepares the underlying time series numerical encoder.

class encoder.TsCatArrayEncoder(timesteps, is_target=False, grouped_by=None)[source]

This encoder handles arrays of categorical time series data by wrapping the OHE encoder with behavior specific to time series tasks.

Parameters
  • timesteps (int) – length of forecasting horizon, as defined by TimeseriesSettings.window.

  • is_target (bool) – whether this encoder corresponds to the target column.

  • grouped_by – what columns, if any, are considered to group the original column and yield multiple time series.

decode(encoded_values, dependency_data=None)[source]

Decodes a list of encoded arrays into values in their original domains.

Parameters
  • encoded_values – encoded slices of numerical time series.

  • dependency_data – used to determine the correct normalizer for the input.

Return type

List[List]

Returns

a list of decoded time series arrays.

decode_one(encoded_value)[source]

Decodes a single window of a time series into its original domain.

Parameters

encoded_value – encoded slice of a categorical time series.

Return type

List

Returns

a list of length TimeseriesSettings.window with decoded values for the forecasted time series.

encode(data, dependency_data={})[source]

Encodes a list of time series arrays using the underlying OHE encoder.

Parameters
  • data (Iterable[Iterable]) – list of numerical values to encode. Its length is determined by the tss.window parameter, and all data points belong to the same time series.

  • dependency_data (Optional[Dict[str, str]]) – dict with values of each group_by column for the time series, used to retrieve the correct normalizer.

Return type

Tensor

Returns

list of encoded time series arrays. Tensor is (len(data), N x K)-shaped, where N: self.data_window and K: sub-encoder # of output features.

encode_one(data)[source]

Encodes a single windowed slice of any given time series.

Parameters

data (Iterable) – windowed slice of a numerical time series.

Return type

Tensor

Returns

an encoded time series array, as per the underlying OHE encoder.

The output of this encoder for all time steps is concatenated, so the final shape of the tensor is (1, NxK), where N: self.data_window and K: sub-encoder # of output features.

prepare(priming_data)[source]

This method prepares the underlying OHE encoder.

class encoder.TsNumericEncoder(is_target=False, positive_domain=False, grouped_by=None)[source]

Variant of the vanilla numerical encoder that supports dynamic mean re-scaling.

Parameters
  • data_type – The data type of the number (integer, float, quantity)

  • is_target (bool) – Indicates whether the encoder refers to a target column or feature column (True==target)

  • positive_domain (bool) – Forces the encoder to always output positive values

decode(encoded_values, decode_log=None, dependency_data=None)[source]
Parameters
  • encoded_values – The encoded values to decode into single numbers

  • decode_log – Whether to decode the log or linear part of the representation, since the encoded vector contains both a log and a linear part

Returns

The decoded number

encode(data, dependency_data={})[source]
Parameters

dependency_data – dict with grouped_by column info, to retrieve the correct normalizer for each datum

class encoder.VocabularyEncoder(is_target=False)[source]
decode(encoded_values_tensor)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters

encoded_data – The input representation in encoded format

Returns

The decoded representation of data, per column, in the original data-type presented.

encode(column_data)[source]

Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.

After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.

Parameters

column_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Returns

The encoded representation of data, per column

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.

Parameters

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.