Encoders
Used for encoding data into PyTorch tensors and decoding it from PyTorch tensors.
- class encoder.ArrayEncoder(stop_after, window=None, is_target=False, original_type=None)
Fits a normalizer for array data.
To encode, ArrayEncoder returns a normalized window of previous data. It can be used for generic arrays, as well as for handling historical target values in time series tasks.
Currently supported normalizing strategies are minmax for numerical arrays, and a simple one-hot for categorical arrays. See lightwood.encoder.helpers for more details on each approach.
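As a rough illustration of the minmax strategy mentioned above, the sketch below fits a minimum and maximum on training arrays and uses them to scale a fixed-length window into [0, 1]. The class name `MinMaxWindow`, the left-padding of short arrays with zeros, and the use of plain Python lists in place of tensors are assumptions made for this example; the actual normalizers live in lightwood.encoder.helpers and may behave differently.

```python
from typing import Iterable, List


class MinMaxWindow:
    """Hypothetical stand-in for a minmax array normalizer (not Lightwood's class)."""

    def __init__(self, window: int):
        self.window = window
        self.min_ = 0.0
        self.max_ = 1.0

    def prepare(self, arrays: Iterable[Iterable[float]]) -> None:
        # Fit the normalizer on every value seen in the training arrays.
        values = [float(v) for arr in arrays for v in arr]
        self.min_, self.max_ = min(values), max(values)

    def _span(self) -> float:
        # Avoid a division by zero when the training data is constant.
        return (self.max_ - self.min_) or 1.0

    def encode(self, arr: Iterable[float]) -> List[float]:
        # Keep only the last `window` values and scale them.
        scaled = [(float(v) - self.min_) / self._span() for v in list(arr)[-self.window:]]
        # Assumption: short arrays are left-padded with zeros.
        return [0.0] * (self.window - len(scaled)) + scaled

    def decode(self, encoded: Iterable[float]) -> List[float]:
        # Inverse of the scaling step (padding is not removed here).
        return [v * self._span() + self.min_ for v in encoded]
```

A window shorter than `window` is padded on the left, so the most recent values always sit at the end of the encoded vector.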
- Parameters
stop_after (float) – time budget in seconds.
window (Optional[int]) – expected length of array data.
original_type (Optional[dtype]) – element-wise data type
- decode(data)
Converts encoded data into a list of arrays.
- Parameters
data (Tensor) – Encoded data prepared by this array encoder
- Return type
List[Iterable]
- Returns
A list of iterable sequences in the original data space
- class encoder.BaseEncoder(is_target=False)
Base class for all encoders.
An encoder should return encoded representations of any columnar data. The procedure for this is defined inside the encode() method.
If this encoder is expected to handle an output column, then it also needs to implement the respective decode() method that handles the inverse transformation from encoded representations to the final prediction in the original column space.
For encoders that learn representations (as opposed to rule-based), the prepare() method will handle all learning logic.
The to() method is used to move PyTorch-based encoders to and from a GPU.
- Parameters
is_target – Whether the data to encode is the target, as per the problem definition.
is_timeseries_encoder – Whether encoder represents sequential/time-series data. Lightwood must provide specific treatment for this kind of encoder
is_trainable_encoder – Whether the encoder must return learned representations. Lightwood checks whether this flag is present in order to pass data to the feature representation via the prepare statement.
Class Attributes:
- is_prepared: Internal flag to signal that the prepare() method has been successfully executed.
- is_nn_encoder: Whether the encoder is neural network-based.
- dependencies: list of additional columns that the encoder might need to encode.
- output_size: length of each encoding tensor for a single data point.
- decode(encoded_data)
Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)
- Parameters
encoded_data (Tensor) – The input representation in encoded format
- Return type
List[object]
- Returns
The decoded representation of data, per column, in the original data-type presented.
- encode(column_data)
Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.
After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.
- Parameters
column_data (Iterable[object]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- Return type
Tensor
- Returns
The encoded representation of data, per column
- prepare(priming_data)
Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.
- Parameters
priming_data (Iterable[object]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- Return type
None
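The prepare/encode/decode contract described for BaseEncoder can be sketched with a small rule-based encoder. This is only an illustration: `ToyScaleEncoder` is a hypothetical class that does not inherit from the real BaseEncoder, and it returns nested Python lists where a real encoder would return a tensor.

```python
from typing import Iterable, List


class ToyScaleEncoder:
    """Hypothetical rule-based encoder mirroring the BaseEncoder method names."""

    def __init__(self, is_target: bool = False):
        self.is_target = is_target
        self.is_prepared = False
        self.output_size = 1  # one encoded value per data point
        self._scale = 1.0

    def prepare(self, priming_data: Iterable[float]) -> None:
        # Learn the only statistic this encoder needs, using training data only.
        largest = max((abs(float(x)) for x in priming_data), default=1.0)
        self._scale = largest or 1.0  # guard against an all-zero column
        self.is_prepared = True

    def encode(self, column_data: Iterable[float]) -> List[List[float]]:
        if not self.is_prepared:
            raise RuntimeError("call prepare() before encode()")
        return [[float(x) / self._scale] for x in column_data]

    def decode(self, encoded_data: List[List[float]]) -> List[float]:
        # Inverse transformation back to the original column space.
        return [row[0] * self._scale for row in encoded_data]
```

Only an encoder that handles an output column needs decode(); a feature-only encoder could omit it.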
- class encoder.BinaryEncoder(is_target=False, target_weights=None)
Creates a one-hot-encoding for binary class data. Assume two arbitrary categories \(A\) and \(B\); representation for them will be as such:
\[\begin{aligned} A &= [1, 0] \\ B &= [0, 1] \end{aligned}\]
This encoder is a specialized case of one-hot encoding (OHE); unknown categories are explicitly handled as [0, 0]. Unknowns may only be reported if the input row value is NULL (or Python None type) or if new data, after the encoder is prepared, has examples outside the feature map.
When data is typed with Lightwood, this class is only deployed if an input data type is explicitly recognized as binary (i.e. the column has only 2 unique values like True/False). If future data shows a new category (thus the data is no longer truly binary), this encoder will no longer be appropriate unless you are comfortable mapping ALL new classes as [0, 0].
An encoder can represent a feature column or a target column. When it represents a target, is_target is True and target_weights can be supplied. The target_weights parameter lets users specify how heavily each class should be weighted within a mixer, which is useful for imbalanced classes.
By default, the StatisticalAnalysis phase will provide target_weights as the relative fraction of each class in the data, which is important for imbalanced populations. For example, suppose there is an 80/20 imbalanced representation across 2 different classes; target_weights will then be a dictionary such as:
target_weights = {"class1": 0.8, "class2": 0.2}
Users should note that models will be presented with the inverse of the target weights, inv_target_weights, which will perform the 1/target_value_per_class operation. This means large values will result in small weights for the model.
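The representation described for this class, together with the argmax and all-zeros rules documented for decode(), can be sketched as follows. The function names, the sorted category order, and the use of plain lists instead of tensors are assumptions for illustration, not the library's implementation.

```python
from typing import Dict, Iterable, List, Sequence

UNKNOWN = "UNKNOWN"


def fit_binary_map(priming_data: Iterable) -> Dict[str, int]:
    # Map the two observed categories to dimensions 0 and 1 (sorted for determinism).
    categories = sorted({x for x in priming_data if x is not None})
    if len(categories) != 2:
        raise ValueError("expected exactly two categories")
    return {cat: i for i, cat in enumerate(categories)}


def encode_binary(column_data: Iterable, cat_map: Dict[str, int]) -> List[List[int]]:
    encoded = []
    for value in column_data:
        row = [0, 0]  # None and unseen categories stay [0, 0]
        if value in cat_map:
            row[cat_map[value]] = 1
        encoded.append(row)
    return encoded


def decode_binary(encoded: Iterable[Sequence[float]], cat_map: Dict[str, int]) -> List[str]:
    rev_map = {i: cat for cat, i in cat_map.items()}
    decoded = []
    for row in encoded:
        if all(v == 0 for v in row):
            decoded.append(UNKNOWN)  # an all-zero vector has no category
        else:
            # Argmax works for strict one-hot rows and for per-class scores alike.
            decoded.append(rev_map[max(range(len(row)), key=row.__getitem__)])
    return decoded
```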
- decode(encoded_data)
Given encoded data, returns the original category labels. The input to decode makes no presumption on whether the data is already in OHE form or not, as some models may output a set of probabilities or weights assigned to each class. The decoded value will always be the argmax of such a vector.
In the case that the vector is all 0s, the output is decoded as “UNKNOWN”.
- Parameters
encoded_data (Tensor) – the output of a mixer model
- Returns
Decoded values for each data point
- decode_probabilities(encoded_data)
Provides decoded answers, as well as a probability assignment to each data point.
- Parameters
encoded_data (Tensor) – the output of a mixer model
- Return type
Tuple[List[str], List[List[float]], Dict[int, str]]
- Returns
Decoded values for each data point, Probability vector for each category, and the reverse map of dimension to category name
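To make the three-part return value concrete, here is a minimal sketch. How the real method turns mixer outputs into probabilities is not stated in this reference, so the row-wise rescaling below (and the assumption that scores are non-negative) is purely illustrative, as are the function and argument names.

```python
from typing import Dict, List, Sequence, Tuple


def decode_probabilities_sketch(
    encoded_data: Sequence[Sequence[float]], rev_map: Dict[int, str]
) -> Tuple[List[str], List[List[float]], Dict[int, str]]:
    decoded: List[str] = []
    probabilities: List[List[float]] = []
    for row in encoded_data:
        total = sum(row)
        if total > 0:
            # Assumption: scores are non-negative, so rescale the row to sum to 1.
            probs = [v / total for v in row]
            best = max(range(len(probs)), key=probs.__getitem__)
            decoded.append(rev_map[best])
        else:
            # An all-zero row carries no class information.
            probs = [0.0 for _ in row]
            decoded.append("UNKNOWN")
        probabilities.append(probs)
    return decoded, probabilities, rev_map
```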
- class encoder.CatArrayEncoder(stop_after, window=None, is_target=False)
- Parameters
stop_after (float) – time budget in seconds.
window (Optional[int]) – expected length of array data.
original_type – element-wise data type
- class encoder.CategoricalAutoEncoder(stop_after=3600, is_target=False, max_encoded_length=100, desired_error=0.01, batch_size=200)
Trains an autoencoder (AE) to represent categorical information with over 100 categories. This is used to ensure that feature vectors for categorical data with many categories are not excessively large.
The AE defaults to a vector sized 100 but can be adjusted to user preference. It is highly advised NOT to use this encoder to feature engineer your target, as reconstruction accuracy will determine your AE’s ability to decode properly.
- Parameters
stop_after (float) – Stops training with provided time limit (sec)
is_target (bool) – Encoder represents target class (NOT recommended)
max_encoded_length (int) – Maximum length of vector represented
desired_error (float) – Threshold for reconstruction accuracy error
batch_size (int) – Minimum batch size while training
- decode(encoded_data)
Decodes the original categories from the embedding space.
Warning: If your reconstruction accuracy is not 100%, the CatAE may not return the correct category.
- Parameters
encoded_data (Tensor) – A torch tensor of embeddings for category predictions
- Return type
List[str]
- Returns
A list of ‘translated’ categories for each embedding
- encode(column_data)
Encodes categorical information in column as the compressed vector from the CatAE.
- Parameters
column_data (Iterable[str]) – An iterable of category samples from a column
- Return type
Tensor
- Returns
An embedding for each sample in original input
- prepare(train_priming_data, dev_priming_data)
Creates inputs and prepares a categorical autoencoder (CatAE) for input data. Currently, does not support a dev set; inputs for train and dev are concatenated together to train an autoencoder.
- Parameters
train_priming_data (Series) – Input training data
dev_priming_data (Series) – Input dev data (Not supported currently)
- class encoder.DatetimeEncoder(is_target=False)
This encoder produces an encoded representation for timestamps.
The approach consists of decomposing each timestamp object into its constituent units (e.g. day-of-week, month, year, etc.), and describing each of those with a single value that represents the magnitude in a sensible cycle length.
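The decomposition described above might look like the sketch below, where each unit is divided by the length of its natural cycle. The choice of units, their order, the divisor used for the year, and the function name are assumptions for illustration; the encoder's actual layout may differ.

```python
from datetime import datetime, timezone
from typing import List


def encode_timestamp(unix_timestamp: float) -> List[float]:
    # Interpret the timestamp in UTC so the result does not depend on the local timezone.
    dt = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return [
        dt.year / 3000.0,     # assumption: an arbitrary scale for the year
        dt.month / 12.0,      # position within the yearly cycle
        dt.day / 31.0,        # position within the monthly cycle
        dt.weekday() / 7.0,   # Monday is 0
        dt.hour / 24.0,
        dt.minute / 60.0,
        dt.second / 60.0,
    ]
```

Each component lands in a bounded range, which keeps timestamps from dominating the feature vector by sheer magnitude.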
- decode(encoded_data, return_as_datetime=False)
Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)
- Parameters
encoded_data – The input representation in encoded format
- Returns
The decoded representation of data, per column, in the original data-type presented.
- encode(data)
- Parameters
data – currently either a list of lists or a pd.Series of lists (TODO: receive a consistent data type here)
- Returns
encoded data
- encode_one(unix_timestamp)
Encodes a list of unix_timestamps, or a list of tensors with unix_timestamps.
- Parameters
data – list of unix_timestamps (unix_timestamp resolution is seconds)
- Returns
a list of vectors
- prepare(priming_data)
Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.
- Parameters
priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- class encoder.DatetimeNormalizerEncoder(is_target=False, sinusoidal=False)
- decode(encoded_data, return_as_datetime=False)
Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)
- Parameters
encoded_data – The input representation in encoded format
- Returns
The decoded representation of data, per column, in the original data-type presented.
- encode(data)
- Parameters
data – currently either a list of lists or a pd.Series of lists (TODO: receive a consistent data type here)
- Returns
encoded data
- encode_one(data)
Encodes a list of unix_timestamps, or a list of tensors with unix_timestamps.
- Parameters
data – list of unix_timestamps (unix_timestamp resolution is seconds)
- Returns
a list of vectors
- prepare(priming_data)
Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.
- Parameters
priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- class encoder.Img2VecEncoder(stop_after=3600, is_target=False, scale=(224, 224), mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
Generates encoded representations for images using a pre-trained deep neural network. Inputs must be str-based location of the data.
Without user-specified details, all input images are rescaled to a standard size of 224x224, and normalized using the mean and standard deviation of the ImageNet dataset (as it was used to train the underlying NN).
This encoder currently does not support a decode() call; models with an image output will not work.
For more information about the neural network this encoder uses, refer to the lightwood.encoder.image.helpers.img_to_vec.Img2Vec.
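The default normalization mentioned above can be illustrated on a single RGB pixel. This sketch assumes 8-bit channel values that are first scaled to [0, 1] and then standardized per channel with the default mean and std from the signature; the function name is hypothetical and the encoder's real preprocessing pipeline is not shown here.

```python
from typing import List, Sequence

# Defaults taken from the Img2VecEncoder signature (ImageNet statistics).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(
    rgb: Sequence[int],
    mean: Sequence[float] = IMAGENET_MEAN,
    std: Sequence[float] = IMAGENET_STD,
) -> List[float]:
    # Assumption: scale 8-bit channel values to [0, 1], then standardize each channel.
    return [(c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std)]
```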
- Parameters
stop_after (float) – time budget, in seconds.
is_target (bool) – Whether encoder represents target or not
scale (Tuple[int, int]) – Resize scale of image (x, y)
mean (List[float]) – Mean of pixel values
std (List[float]) – Standard deviation of pixel values
- encode(images)
Creates encodings for a list of images; each image is referenced by a filepath or url.
- Parameters
images (List[str]) – list of images, each image is a path to a file or a url.
- Return type
Tensor
- Returns
a torch.FloatTensor
- class encoder.MultiHotEncoder(is_target=False)
- decode(vectors)
Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)
- Parameters
encoded_data – The input representation in encoded format
- Returns
The decoded representation of data, per column, in the original data-type presented.
- encode(column_data)
Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.
After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.
- Parameters
column_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- Returns
The encoded representation of data, per column
- prepare(priming_data, max_dimensions=100)
Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.
- Parameters
priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- class encoder.NumArrayEncoder(stop_after, window=None, is_target=False, positive_domain=False)
- Parameters
stop_after (float) – time budget in seconds.
window (Optional[int]) – expected length of array data.
original_type – element-wise data type
- class encoder.NumericEncoder(data_type=None, is_target=False, positive_domain=False)
The numeric encoder takes numbers (float or integer) and converts them into tensors of the form:
[0 if the number is none, otherwise 1, 1 if the number is positive, otherwise 0, natural_log(abs(number)), number/absolute_mean]
This representation is:
[1 if the number is positive, otherwise 0, natural_log(abs(number)), number/absolute_mean]
if encoding target values, since target values can’t be none.
The absolute_mean is computed in the prepare method and is just the mean of the absolute values of all numbers fed to prepare (which are not none).
none stands for any number that is an actual Python None value or any sort of non-numeric value (a string, nan, inf).
- Parameters
data_type (Optional[dtype]) – The data type of the number (integer, float, quantity)
is_target (bool) – Indicates whether the encoder refers to a target column or feature column (True==target)
positive_domain (bool) – Forces the encoder to always output positive values
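The representation described for NumericEncoder can be sketched in plain Python. The function names are hypothetical, lists stand in for tensors, and two details are assumptions not stated in this reference: a value of 0 gets a log component of 0.0, and a none value is encoded as all zeros.

```python
import math
from typing import Iterable, List


def is_number(x) -> bool:
    # "none" covers Python None, strings, nan and inf.
    return isinstance(x, (int, float)) and math.isfinite(x)


def fit_absolute_mean(priming_data: Iterable) -> float:
    # Mean of the absolute values of all valid numbers seen during prepare.
    valid = [abs(float(x)) for x in priming_data if is_number(x)]
    return (sum(valid) / len(valid) if valid else 1.0) or 1.0


def encode_number(x, absolute_mean: float, is_target: bool = False) -> List[float]:
    if not is_number(x):
        if is_target:
            raise ValueError("target values cannot be none")
        return [0.0, 0.0, 0.0, 0.0]  # assumption: none is encoded as all zeros
    sign = 1.0 if x > 0 else 0.0
    log_part = math.log(abs(x)) if x != 0 else 0.0  # assumption: log(0) replaced by 0
    body = [sign, log_part, x / absolute_mean]
    # Feature columns carry a leading "is not none" flag; targets do not.
    return body if is_target else [1.0] + body
```

Keeping both a log part and a linear part gives downstream models a view of the value at two scales, which is why decode() offers a choice between them.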
- decode(encoded_values, decode_log=None)
- Parameters
encoded_values (Union[List[Union[int, float, bool]], Tensor]) – The encoded values to decode into single numbers
decode_log (Optional[bool]) – Whether to decode the log or linear part of the representation, since the encoded vector contains both a log and a linear part
- Return type
list
- Returns
The decoded number
- encode(data)
- Parameters
data (Iterable) – An iterable data structure containing the numbers to be encoded
- Returns
A torch tensor with the representations of each number
- prepare(priming_data)
NumericEncoder uses a rule-based form to prepare results on training (priming) data. The averages etc. are taken from this distribution.
- Parameters
priming_data (Iterable) – an iterable data structure containing the numbers which will be used to compute the values used for normalizing the encoded representations
- class encoder.OneHotEncoder(is_target=False, target_weights=None, use_unknown=True)
Creates a one-hot encoding (OHE) for categorical data. One-hot encoding represents categorical information as a vector where each individual dimension corresponds to a category. Each category maps 1:1 to a dimension, indicated by a “1” in that position. For example, imagine 3 categories, \(A\), \(B\), and \(C\); these can be represented as follows:
\[\begin{aligned} A &= [1, 0, 0] \\ B &= [0, 1, 0] \\ C &= [0, 0, 1] \end{aligned}\]
The OHE encoder operates in 2 modes:
“use_unknown=True”: Makes an \(N+1\) length vector for \(N\) categories, the first index always corresponds to the unknown category.
“use_unknown=False”: Makes an \(N\) length vector for \(N\) categories, where an empty vector of 0s indicates an unknown/missing category.
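The two modes can be compared side by side in a short sketch. The helper names and the sorted category order are assumptions for illustration, and plain lists stand in for tensors.

```python
from typing import Dict, List, Sequence


def fit_ohe_map(categories: Sequence[str], use_unknown: bool = True) -> Dict[str, int]:
    # With use_unknown=True, index 0 is reserved for the unknown category.
    offset = 1 if use_unknown else 0
    return {cat: i + offset for i, cat in enumerate(sorted(set(categories)))}


def encode_ohe(value, cat_map: Dict[str, int], use_unknown: bool = True) -> List[int]:
    size = len(cat_map) + (1 if use_unknown else 0)
    vec = [0] * size
    if value in cat_map:
        vec[cat_map[value]] = 1
    elif use_unknown:
        vec[0] = 1  # unknown or missing values go to the first index
    return vec      # with use_unknown=False, unknown stays all zeros
```

With three known categories, the first mode yields vectors of length 4 and the second yields vectors of length 3.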
An encoder can represent a feature column or a target column. When it represents a target, is_target is True and target_weights can be supplied. The target_weights parameter lets users specify how heavily each class should be weighted within a mixer, which is useful for imbalanced classes.
By default, the StatisticalAnalysis phase will provide target_weights as the relative fraction of each class in the data, which is important for imbalanced populations. For example, suppose there is an 80/5/15 imbalanced representation across 3 different classes; target_weights will then be a dictionary such as:
target_weights = {"class1": 0.8, "class2": 0.05, "class3": 0.15}
Users should note that models will be presented with the inverse of the target weights, inv_target_weights, which will perform the 1/target_value_per_class operation. This means large values will result in small weights for the model.
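A small worked example of the inversion, using the weights from the example above. The function name is hypothetical and the sketch assumes every weight is strictly positive.

```python
from typing import Dict


def inverse_target_weights(target_weights: Dict[str, float]) -> Dict[str, float]:
    # 1 / fraction per class: frequent classes receive small weights, rare ones large weights.
    return {cls: 1.0 / weight for cls, weight in target_weights.items()}


inv_target_weights = inverse_target_weights(
    {"class1": 0.8, "class2": 0.05, "class3": 0.15}
)
```

Here the rarest class, class2, ends up with the largest weight (about 20), while class1 ends up with the smallest (about 1.25).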
- Parameters
is_target (bool) – True if this encoder featurizes the target column
target_weights (Optional[Dict[str, float]]) – Percentage of total population represented by each category (between [0, 1]).
use_unknown (bool) – True uses an extra dimension to account for unknown/out-of-distribution categories
- decode(encoded_data)
Decodes OHE mapping into the original categories. Since this approach uses an argmax, decoding flexibly works either on logits or an explicitly OHE vector.
- Parameters
encoded_data – the encoded data to decode (logits or an explicit OHE vector)
- Returns
The original category names for encoded data.
- decode_probabilities(encoded_data)
Provides decoded answers, as well as a probability assignment to each data point.
- Parameters
encoded_data (Tensor) – the output of a mixer model
- Return type
Tuple[List[str], List[List[float]], Dict[int, str]]
- Returns
Decoded values for each data point, Probability vector for each category, and the reverse map of dimension to category name
- class encoder.PretrainedLangEncoder(stop_after, is_target=False, batch_size=10, max_position_embeddings=None, frozen=False, epochs=1, output_type=None, embed_mode=True)
- Parameters
is_target (bool) – Whether this encoder represents the target. NOT functional for text generation yet.
batch_size (int) – size of batch while fine-tuning
max_position_embeddings (Optional[int]) – max sequence length of input text
custom_train – If True, trains model on target provided
frozen (bool) – If True, freezes transformer layers during training.
epochs (int) – number of epochs to train model with
output_type (Optional[str]) – Data dtype of the target; if categorical/binary, the option to return logits is possible.
embed_mode (bool) – If True, assumes the output of the encode() step is the CLS embedding (this can be trained or not). If False, returns the logits of the tuned task.
- decode(encoded_values_tensor, max_length=100)
Text generation via decoding is not supported.
- encode(column_data)
Converts each text example in a column into encoded state. This can be either a vector embedding of the [CLS] token (represents the full text input) OR the logits prediction of the output.
The transformer model is of form: transformer base + pre-classifier linear layer + classifier layer
The embedding returned is of the [CLS] token after the pre-classifier layer; from internal testing, we found the latent space most highly separated across classes.
If the encoder represents the logits in classification, returns a soft-maxed output of the class vector.
- Parameters
column_data (Iterable[str]) – List of text data as strings
- Return type
Tensor
- Returns
Embedded vector N_rows x N_embed_dim OR logits vector N_rows x N_classes depending on if embed_mode is True or not.
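For the logits case, the "soft-maxed output of the class vector" refers to the standard softmax function. A minimal, dependency-free version is sketched below for a single row of logits; it illustrates the operation only and is not the encoder's own code.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum first for numerical stability; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Each output row sums to 1, so it can be read as a probability per class.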
- is_trainable_encoder: bool = True
Creates a contextualized embedding to represent input text via the [CLS] token vector from DistilBERT (transformers) (Sanh et al. 2019 - https://arxiv.org/abs/1910.01108).
In certain text tasks, this model can use a transformer to automatically fine-tune on a class of interest (providing there is a 2 column dataset, where the input column is text).
- prepare(train_priming_data, dev_priming_data, encoded_target_values)
Fine-tunes a transformer on the priming data.
CURRENTLY WIP; train + dev are placeholders for a validation-based approach.
Train + Dev are concatenated together and a transformer is then fine-tuned with weight decay applied to the transformer parameters. The option to freeze the underlying transformer and only train a linear layer exists if frozen=True. This trains faster, but on internal benchmarks the performance is often lower than with fine-tuning.
- Parameters
train_priming_data (Iterable[str]) – Text data in the train set
dev_priming_data (Iterable[str]) – Text data in the dev set (not currently supported; can be empty)
encoded_target_values (Tensor) – Encoded target labels in N_rows x N_output_dimension
- class encoder.ShortTextEncoder(is_target=False, mode=None)
- Parameters
is_target –
mode – None or “concat” or “mean”. When None, it will be set automatically based on is_target: (is_target) -> ‘concat’ (not is_target) -> ‘mean’
- decode(vectors)
Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)
- Parameters
encoded_data – The input representation in encoded format
- Returns
The decoded representation of data, per column, in the original data-type presented.
- encode(column_data)
Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.
After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.
- Parameters
column_data (List[str]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- Return type
Tensor
- Returns
The encoded representation of data, per column
- prepare(priming_data)
Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.
- Parameters
priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- encoder.TextRnnEncoder
alias of lightwood.encoder.text.rnn.RnnEncoder
- class encoder.TimeSeriesEncoder(stop_after, window=None, is_target=False, original_type=None)
Time series encoder. This module will pass the normalized series values, along with moving averages taken from the series’ last window values.
- Parameters
stop_after (float) – time budget in seconds.
window (Optional[int]) – expected length of array data.
original_type (Optional[dtype]) – element-wise data type
- class encoder.TsArrayNumericEncoder(timesteps, is_target=False, positive_domain=False, grouped_by=None)
This encoder handles arrays of numerical time series data by wrapping the numerical encoder with behavior specific to time series tasks.
- Parameters
timesteps (int) – length of forecasting horizon, as defined by TimeseriesSettings.window.
is_target (bool) – whether this encoder corresponds to the target column.
positive_domain (bool) – whether the column domain is expected to be positive numbers.
grouped_by – what columns, if any, are considered to group the original column and yield multiple time series.
- decode(encoded_values, dependency_data=None)
Decodes a list of encoded arrays into values in their original domains.
- Parameters
encoded_values – encoded slices of numerical time series.
dependency_data – used to determine the correct normalizer for the input.
- Return type
List[List]
- Returns
a list of decoded time series arrays.
- decode_one(encoded_value, dependency_data={})
Decodes a single window of a time series into its original domain.
- Parameters
encoded_value – encoded slice of a numerical time series.
dependency_data – used to determine the correct normalizer for the input.
- Return type
List
- Returns
a list of length TimeseriesSettings.window with decoded values for the forecasted time series.
- encode(data, dependency_data={})
Encodes a list of time series arrays using the underlying time series numerical encoder.
- Parameters
data (Iterable[Iterable]) – list of numerical values to encode. Its length is determined by the tss.window parameter, and all data points belong to the same time series.
dependency_data (Optional[Dict[str, str]]) – dict with values of each group_by column for the time series, used to retrieve the correct normalizer.
- Return type
Tensor
- Returns
list of encoded time series arrays. Tensor is (len(data), N x K)-shaped, where N: self.data_window and K: sub-encoder # of output features.
- encode_one(data, dependency_data={})
Encodes a single windowed slice of any given time series.
- Parameters
data (Iterable) – windowed slice of a numerical time series.
dependency_data (Optional[Dict[str, str]]) – used to determine the correct normalizer for the input.
- Return type
Tensor
- Returns
an encoded time series array, as per the underlying TsNumericEncoder object.
The output of this encoder for all time steps is concatenated, so the final shape of the tensor is (1, NxK), where N: self.data_window and K: sub-encoder # of output features.
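The (1, N x K) shape can be illustrated by concatenating the per-step outputs of a sub-encoder. In this sketch `encode_window` and `toy_step_encoder` are hypothetical names, nested lists stand in for a tensor, and the two-feature step encoder is a made-up stand-in for the underlying TsNumericEncoder.

```python
from typing import Callable, Iterable, List


def encode_window(
    window: Iterable[float], encode_step: Callable[[float], List[float]]
) -> List[List[float]]:
    flat: List[float] = []
    for value in window:
        flat.extend(encode_step(value))  # K features per time step
    return [flat]  # a single row of length N x K


def toy_step_encoder(value: float) -> List[float]:
    # Hypothetical sub-encoder with K = 2 features: a sign flag and the raw value.
    return [1.0 if value > 0 else 0.0, float(value)]


encoded = encode_window([1, -2, 3], toy_step_encoder)
```

With N = 3 time steps and K = 2 features per step, `encoded` holds one row of 6 values.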
- class encoder.TsCatArrayEncoder(timesteps, is_target=False, grouped_by=None)
This encoder handles arrays of categorical time series data by wrapping the OHE encoder with behavior specific to time series tasks.
- Parameters
timesteps (int) – length of forecasting horizon, as defined by TimeseriesSettings.window.
is_target (bool) – whether this encoder corresponds to the target column.
grouped_by – what columns, if any, are considered to group the original column and yield multiple time series.
- decode(encoded_values, dependency_data=None)
Decodes a list of encoded arrays into values in their original domains.
- Parameters
encoded_values – encoded slices of numerical time series.
dependency_data – used to determine the correct normalizer for the input.
- Return type
List[List]
- Returns
a list of decoded time series arrays.
- decode_one(encoded_value)
Decodes a single window of a time series into its original domain.
- Parameters
encoded_value – encoded slice of a numerical time series.
dependency_data – used to determine the correct normalizer for the input.
- Return type
List
- Returns
a list of length TimeseriesSettings.window with decoded values for the forecasted time series.
- encode(data, dependency_data={})
Encodes a list of time series arrays using the underlying time series numerical encoder.
- Parameters
data (Iterable[Iterable]) – list of numerical values to encode. Its length is determined by the tss.window parameter, and all data points belong to the same time series.
dependency_data (Optional[Dict[str, str]]) – dict with values of each group_by column for the time series, used to retrieve the correct normalizer.
- Return type
Tensor
- Returns
list of encoded time series arrays. Tensor is (len(data), N x K)-shaped, where N: self.data_window and K: sub-encoder # of output features.
- encode_one(data)
Encodes a single windowed slice of any given time series.
- Parameters
data (Iterable) – windowed slice of a numerical time series.
- Return type
Tensor
- Returns
an encoded time series array, as per the underlying TsNumericEncoder object.
The output of this encoder for all time steps is concatenated, so the final shape of the tensor is (1, NxK), where N: self.data_window and K: sub-encoder # of output features.
- class encoder.TsNumericEncoder(is_target=False, positive_domain=False, grouped_by=None)
Variant of vanilla numerical encoder, supports dynamic mean re-scaling
- Parameters
data_type – The data type of the number (integer, float, quantity)
is_target (bool) – Indicates whether the encoder refers to a target column or feature column (True==target)
positive_domain (bool) – Forces the encoder to always output positive values
- decode(encoded_values, decode_log=None, dependency_data=None)
- Parameters
encoded_values – The encoded values to decode into single numbers
decode_log – Whether to decode the log or linear part of the representation, since the encoded vector contains both a log and a linear part
- Returns
The decoded number
- class encoder.VocabularyEncoder(is_target=False)
- decode(encoded_values_tensor)
Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)
- Parameters
encoded_data – The input representation in encoded format
- Returns
The decoded representation of data, per column, in the original data-type presented.
- encode(column_data)
Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.
After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.
- Parameters
column_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.
- Returns
The encoded representation of data, per column
- prepare(priming_data)
Given ‘priming_data’ (i.e. training data), prepares encoders either through a rule-based (ex: one-hot encoding) or learned (ex: DistilBERT for text) model. This works explicitly on only training data.
- Parameters
priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.