Lightwood API Types

Lightwood consists of several high level abstractions to enable the data science/machine learning (DS/ML) pipeline in a step-by-step procedure.

class api.types.Module[source]

Modules are the blocks of code that end up being called from the JSON AI, representing either object instantiations or function calls.

Parameters
  • module – Name of the module (function or class name)

  • args – Arguments to pass to the function or constructor
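To make the shape of a module entry concrete, the following is a minimal sketch using only the standard library. It assumes an entry is a mapping with a `module` name and an `args` mapping, as the parameter list above suggests. The `resolve` helper, the registry, and the `scale` and `Point` targets are hypothetical illustrations, not part of Lightwood.

```python
from typing import Any, Callable, Dict, TypedDict


class Module(TypedDict):
    """Stand-in with the two documented fields (not Lightwood's own class)."""
    module: str            # name of the function or class
    args: Dict[str, Any]   # arguments passed to it


def scale(value: float, factor: float) -> float:
    """Hypothetical function used as a call target."""
    return value * factor


class Point:
    """Hypothetical class used as an instantiation target."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


def resolve(entry: Module, registry: Dict[str, Callable[..., Any]]) -> Any:
    """Hypothetical helper: look the name up and call it with the given args."""
    return registry[entry["module"]](**entry["args"])


registry: Dict[str, Callable[..., Any]] = {"scale": scale, "Point": Point}

# One entry represents a function call, the other an object instantiation.
scaled = resolve({"module": "scale", "args": {"value": 2.0, "factor": 1.5}}, registry)
point = resolve({"module": "Point", "args": {"x": 1, "y": 2}}, registry)
```

Both cases go through the same lookup, which is why a single `module` plus `args` pair can describe either a function call or a constructor.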

class api.types.TypeInformation[source]

For a dataset, provides information on column types, how they’re used, and any other potential identifiers.

TypeInformation is generated within data.infer_types, where small samples of each column are evaluated in a custom framework to determine what data type the column holds. The user may override data types, but it is recommended to do so within a JSON-AI config file.

Parameters
  • dtypes – For each column’s name, the associated data type inferred.

  • additional_info – Any possible sub-categories or additional descriptive information.

  • identifiers – Columns within the dataset highly suspected of being identifiers or IDs. These carry no informative value and are therefore ignored in subsequent training/analysis procedures unless manually indicated otherwise.
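The three fields above can be pictured with a small stand-in data class. This is a sketch under assumptions, not Lightwood's implementation: the field types, the column names, and the `usable_columns` helper are all hypothetical, and only illustrate how flagged identifiers drop out of later steps.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TypeInformation:
    """Stand-in with the three documented fields (types are assumed)."""
    dtypes: Dict[str, str] = field(default_factory=dict)
    additional_info: Dict[str, object] = field(default_factory=dict)
    identifiers: Dict[str, str] = field(default_factory=dict)


def usable_columns(info: TypeInformation) -> List[str]:
    """Hypothetical helper: keep every column not flagged as an identifier."""
    return [col for col in info.dtypes if col not in info.identifiers]


info = TypeInformation(
    dtypes={"user_id": "integer", "age": "integer", "city": "categorical"},
    identifiers={"user_id": "suspected ID column"},
)
kept = usable_columns(info)
```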

class api.types.StatisticalAnalysis(nr_rows, df_target_stddev, train_observed_classes, target_class_distribution, target_weights, histograms, buckets, missing, distinct, bias, avg_words_per_sentence, positive_domain, ts_stats)[source]

The Statistical Analysis data class allows users to consider key descriptors of their data using simple techniques such as histograms, mean and standard deviation, word count, missing values, and any detected bias in the information.

Parameters
  • nr_rows (int) – Number of rows (samples) in the dataset

  • df_target_stddev (Optional[float]) – The standard deviation of the target of the dataset

  • train_observed_classes (object) –

  • target_class_distribution (object) –

  • target_weights (object) – The weight the analysis suggests assigning to each class in classification problems. Note: target_weights in the problem definition overrides this.

  • histograms (object) –

  • buckets (object) –

  • missing (object) –

  • distinct (object) –

  • bias (object) –

  • avg_words_per_sentence (object) –

  • positive_domain (bool) –

  • ts_stats (dict) –
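The kind of simple descriptors mentioned above (row count, standard deviation of the target, missing and distinct values) can be sketched with the standard library. This is an illustration only: the `describe` function, the row format, and the way `missing` and `distinct` are computed are assumptions, since the source does not specify how Lightwood derives or stores them.

```python
import statistics
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def describe(rows: List[Row], target: str) -> Dict[str, object]:
    """Hypothetical helper computing a few basic dataset descriptors."""
    columns = sorted({col for row in rows for col in row})
    # Fraction of rows where the column has no value (assumed definition).
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) / len(rows)
        for col in columns
    }
    # Number of distinct non-missing values per column (assumed definition).
    distinct = {
        col: len({row[col] for row in rows if row.get(col) is not None})
        for col in columns
    }
    target_values = [row[target] for row in rows if row.get(target) is not None]
    return {
        "nr_rows": len(rows),
        # Sample standard deviation; which estimator Lightwood uses is not stated.
        "df_target_stddev": statistics.stdev(target_values),
        "missing": missing,
        "distinct": distinct,
    }


rows = [
    {"price": 10.0, "rooms": 2.0},
    {"price": 14.0, "rooms": None},
    {"price": 12.0, "rooms": 2.0},
    {"price": 16.0, "rooms": 3.0},
]
summary = describe(rows, target="price")
```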

class api.types.DataAnalysis(statistical_analysis, type_information)[source]

Data Analysis wraps StatisticalAnalysis and TypeInformation together. Further details can be found in their respective documentation.

class api.types.TimeseriesSettings(is_timeseries, order_by=None, window=None, group_by=None, use_previous_target=True, horizon=None, historical_columns=None, target_type='', allow_incomplete_history=True, eval_cold_start=True, interval_periods=())[source]

For time-series specific problems, more specific treatment of the data is necessary. The following attributes enable time-series tasks to be carried out properly.

Parameters
  • is_timeseries (bool) – Whether the input data should be treated as time series; if true, this flag is checked in subsequent internal steps to ensure processing is appropriate for time-series data.

  • order_by (Optional[str]) – Column by which the data should be ordered.

  • group_by (Optional[List[str]]) – Optional list of columns by which the data should be grouped. Each different combination of values for these columns will yield a different series.

  • window (Optional[int]) – The number of past rows the model looks back on when making a prediction, after the rows are ordered by the order_by column and split into groups if applicable.

  • horizon (Optional[int]) – The number of points in the future that predictions should be made for, defaults to 1. Once trained, the model will be able to predict up to this many points into the future.

  • historical_columns (Optional[List[str]]) – The temporal dynamics of these columns will be used as additional context to train the time series predictor. Note that non-historical columns are still used to forecast, but without considering how they change through time.

  • target_type (str) – Automatically inferred dtype of the target (e.g. dtype.integer, dtype.float).

  • use_previous_target (bool) – Use the previous values of the target column to generate predictions. Defaults to True.

  • allow_incomplete_history (bool) – Whether predictions can be made for rows with incomplete historical context (i.e. fewer than window rows have been observed for the datetime that has to be forecasted).

  • eval_cold_start (bool) – whether to include predictions with incomplete history (thus part of the cold start region for certain mixers) when evaluating mixer scores with the validation dataset.

  • interval_periods (tuple) – tuple of tuples with user-provided period lengths for time intervals. Default values will be added for intervals left unspecified. For interval options, check the timeseries_analyzer.detect_period() method documentation. e.g.: ((‘daily’, 7),).
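How order_by, group_by, window, and horizon interact can be sketched as follows. This is a simplified illustration, not Lightwood's transformer: the `make_windows` function and the sample column names are hypothetical, and the sketch only emits samples with a complete look-back window (so it does not model allow_incomplete_history).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]
Sample = Tuple[tuple, list, list]  # (group key, past values, future values)


def make_windows(rows: List[Row], order_by: str, group_by: List[str],
                 target: str, window: int, horizon: int) -> List[Sample]:
    """Hypothetical helper: split into series, order them, then slide a window."""
    # Each distinct combination of group_by values yields its own series.
    series = defaultdict(list)
    for row in rows:
        series[tuple(row[col] for col in group_by)].append(row)

    samples = []
    for key, group in series.items():
        group.sort(key=lambda r: r[order_by])
        values = [r[target] for r in group]
        # `window` rows of context, followed by `horizon` rows to predict.
        for end in range(window, len(values) - horizon + 1):
            samples.append((key, values[end - window:end], values[end:end + horizon]))
    return samples


rows = [
    {"store": "A", "day": 3, "sales": 12},
    {"store": "B", "day": 1, "sales": 20},
    {"store": "A", "day": 1, "sales": 10},
    {"store": "B", "day": 4, "sales": 23},
    {"store": "A", "day": 2, "sales": 11},
    {"store": "B", "day": 2, "sales": 21},
    {"store": "A", "day": 5, "sales": 14},
    {"store": "B", "day": 3, "sales": 22},
    {"store": "A", "day": 4, "sales": 13},
]
samples = make_windows(rows, order_by="day", group_by=["store"],
                       target="sales", window=3, horizon=1)
```

Rows from store A never appear in a window for store B, which is the practical effect of grouping before ordering.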

static from_dict(obj)[source]

Creates a TimeseriesSettings object from python dictionary specifications.

Param

obj: A python dictionary with the necessary representation for time-series. The only mandatory keys are order_by and window.

Returns

A populated TimeseriesSettings object.
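As a rough picture of this construction, the sketch below uses a stand-in data class whose defaults are copied from the signature above. It is not Lightwood's code: raising `ValueError` when a mandatory key is absent and defaulting `is_timeseries` to True are assumptions made for illustration, and the column names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class TimeseriesSettings:
    """Stand-in mirroring the documented signature (not the real class)."""
    is_timeseries: bool
    order_by: Optional[str] = None
    window: Optional[int] = None
    group_by: Optional[List[str]] = None
    use_previous_target: bool = True
    horizon: Optional[int] = None
    historical_columns: Optional[List[str]] = None
    target_type: str = ''
    allow_incomplete_history: bool = True
    eval_cold_start: bool = True
    interval_periods: Tuple = ()

    @staticmethod
    def from_dict(obj: Dict[str, Any]) -> "TimeseriesSettings":
        # Assumption: a missing mandatory key is reported with ValueError.
        missing = [key for key in ("order_by", "window") if key not in obj]
        if missing:
            raise ValueError(f"missing mandatory keys: {missing}")
        settings = dict(obj)
        # Assumption: a dictionary of settings implies a time-series task.
        settings.setdefault("is_timeseries", True)
        return TimeseriesSettings(**settings)


ts = TimeseriesSettings.from_dict(
    {"order_by": "date", "window": 7, "group_by": ["store"]}
)
```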

static from_json(data)[source]

Creates a TimeseriesSettings object from JSON specifications via python dictionary.

Param

data: JSON-config file with necessary Time-series specifications

Returns

A populated TimeseriesSettings object.

to_dict(encode_json=False)[source]

Creates a dictionary from TimeseriesSettings object

Return type

Dict[str, Union[dict, list, str, int, float, bool, None]]

Returns

A python dictionary containing the TimeseriesSettings specifications.

to_json()[source]

Creates a JSON config from the TimeseriesSettings object.

Return type

Dict[str, Union[dict, list, str, int, float, bool, None]]

Returns

The JSON config syntax containing the TimeseriesSettings specifications.

class api.types.ProblemDefinition(target, pct_invalid, unbias_target, seconds_per_mixer, seconds_per_encoder, expected_additional_time, time_aim, target_weights, positive_domain, timeseries_settings, anomaly_detection, use_default_analysis, ignore_features, fit_on_all, strict_mode, seed_nr)[source]

The ProblemDefinition object indicates details on how the models that predict the target are prepared. The only required specification from a user is the target, which indicates the column within the input data that the user is trying to predict. Within the ProblemDefinition, the user can specify aspects about how long the feature-engineering preparation may take, and nuances about training the models.

Parameters
  • target (str) – The name of the target column; this is the column that will be used as the goal of the prediction.

  • pct_invalid (float) – Maximum percentage of data points tolerated as invalid/missing/unknown. If the data cleaning process exceeds this threshold, no subsequent steps will be taken.

  • unbias_target (bool) – Whether all classes are automatically weighted inversely to how often they occur.

  • seconds_per_mixer (Optional[int]) – Number of seconds maximum to spend PER mixer trained in the list of possible mixers.

  • seconds_per_encoder (Optional[int]) – Number of seconds maximum to spend when training an encoder that requires data to learn a representation.

  • expected_additional_time (Optional[int]) – Time budget for non-encoder/mixer tasks (ex: data analysis, pre-processing, model ensembling or model analysis)

  • time_aim (Optional[float]) – Time budget (in seconds) to train all needed components for the predictive tasks, including encoders and models.

  • target_weights (Optional[List[float]]) – indicates to the accuracy functions how much to weight every target class.

  • positive_domain (bool) – For numerical tasks, force predictor output to be positive (integer or float).

  • timeseries_settings (TimeseriesSettings) – TimeseriesSettings object for time-series tasks, refer to its documentation for available settings.

  • anomaly_detection (bool) – Whether to conduct unsupervised anomaly detection; currently supported only for time series.

  • ignore_features (List[str]) – The names of the columns the user wishes to ignore in the ML pipeline. Any column name found in this list will be automatically removed from subsequent steps in the ML pipeline.

  • use_default_analysis (bool) – whether default analysis blocks are enabled.

  • fit_on_all (bool) – Whether to fit the model on the held-out validation data. By default, validation data is used strictly to evaluate how well a model is doing and is never trained on. However, in cases where users anticipate new incoming data over time, the user may train the model further using the entire dataset.

  • strict_mode (bool) – crash if an unstable block (mixer, encoder, etc.) fails to run.

  • seed_nr (int) – custom seed to use when generating a predictor from this problem definition.
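The idea behind unbias_target, weighting classes inversely to how often they occur, can be sketched as below. The `inverse_frequency_weights` function is hypothetical, and the normalization (total count divided by class count) is an assumption; the source does not state the exact formula Lightwood applies.

```python
from collections import Counter
from typing import Dict, Hashable, List


def inverse_frequency_weights(labels: List[Hashable]) -> Dict[Hashable, float]:
    """Hypothetical helper: rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    total = len(labels)
    # Assumed normalization: weight = total / count for each class.
    return {cls: total / count for cls, count in counts.items()}


labels = ["a", "a", "a", "b"]
weights = inverse_frequency_weights(labels)
```

In this toy case class "b" occurs a third as often as class "a" and therefore receives three times its weight.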

static from_dict(obj)[source]

Creates a ProblemDefinition object from a python dictionary with necessary specifications.

Parameters

obj (Dict) – A python dictionary with the necessary features for the ProblemDefinition class.

Only requires target to be specified.

Returns

A populated ProblemDefinition object.

static from_json(data)[source]

Creates a ProblemDefinition object from a JSON config file.

Parameters

data (str) –

Returns

A populated ProblemDefinition object.

to_dict(encode_json=False)[source]

Creates a python dictionary from the ProblemDefinition object

Return type

Dict[str, Union[dict, list, str, int, float, bool, None]]

Returns

A python dictionary

to_json()[source]

Creates a JSON config from the ProblemDefinition object

Return type

Dict[str, Union[dict, list, str, int, float, bool, None]]

Returns

TODO

class api.types.JsonAI(encoders, dtype_dict, dependency_dict, model, problem_definition, identifiers, cleaner=None, splitter=None, analyzer=None, explainer=None, imputers=None, analysis_blocks=None, timeseries_transformer=None, timeseries_analyzer=None, accuracy_functions=None)[source]

The JsonAI Class allows users to construct flexible JSON config to specify their ML pipeline. JSON-AI follows a recipe of how to pre-process data, construct features, and train on the target column. To do so, the following specifications are required internally.

Parameters
  • encoders (Dict[str, Module]) – A dictionary of the form: column_name -> encoder module

  • dtype_dict (Dict[str, dtype]) – A dictionary of the form: column_name -> data type

  • dependency_dict (Dict[str, List[str]]) – A dictionary of the form: column_name -> list of columns it depends on

  • model (Dict[str, Module]) – The ensemble and its submodels

  • problem_definition (ProblemDefinition) – The ProblemDefinition criteria.

  • identifiers (Dict[str, str]) – A dictionary of column names and respective data types that are likely identifiers/IDs within the data. Through the default cleaning process, these are ignored.

  • cleaner (Optional[Module]) – The Cleaner object represents the pre-processing step on a dataframe. The user can specify custom subroutines, if they choose, on how to handle preprocessing. Alternatively, “None” suggests Lightwood’s default approach in data.cleaner.

  • splitter (Optional[Module]) – The Splitter object is the method in which the input data is split into training/validation/testing data.

  • analyzer (Optional[Module]) – The Analyzer object is used to evaluate how well a model performed on the predictive task.

  • explainer (Optional[Module]) – The Explainer object deploys explainability tools of interest on a model to indicate how well a model generalizes its predictions.

  • imputers (Optional[List[Module]]) – A list of objects that will impute missing data on each column. They are called inside the cleaner.

  • analysis_blocks (Optional[List[Module]]) – The blocks that get used in both analysis and inference inside the analyzer and explainer blocks.

  • timeseries_transformer (Optional[Module]) – Procedure used to transform any timeseries task dataframe into the format that lightwood expects for the rest of the pipeline.

  • timeseries_analyzer (Optional[Module]) – Procedure that extracts key insights from any timeseries in the data (e.g. measurement frequency, target distribution, etc).

  • accuracy_functions (Optional[List[str]]) – A list of performance metrics used to evaluate the best mixers.
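The required fields above can be pictured as a plain dictionary that round-trips through JSON. This is only a skeleton under assumptions: every module name (`MyNumericEncoder`, `MyCategoricalEncoder`, `MyEnsemble`, `MyMixer`), the `submodels` argument, the column names, and the dtype strings are hypothetical placeholders, and the nesting of `model` is assumed to follow the same module/args shape as the encoders.

```python
import json

# Hypothetical JSON-AI skeleton; names are placeholders, not Lightwood modules.
json_ai = {
    "encoders": {
        "age": {"module": "MyNumericEncoder", "args": {}},
        "city": {"module": "MyCategoricalEncoder", "args": {}},
    },
    "dtype_dict": {"age": "integer", "city": "categorical", "price": "float"},
    "dependency_dict": {"age": [], "city": []},
    "model": {
        "module": "MyEnsemble",
        "args": {"submodels": [{"module": "MyMixer", "args": {}}]},
    },
    "problem_definition": {"target": "price"},
    "identifiers": {},
    # Optional fields left as None, matching the defaults in the signature.
    "cleaner": None,
    "splitter": None,
    "analyzer": None,
    "explainer": None,
}

serialized = json.dumps(json_ai, indent=2)   # text form of the config
restored = json.loads(serialized)            # back to a python dictionary
```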

static from_dict(obj)[source]

Creates a JSON-AI object from dictionary specifications of the JSON-config.

static from_json(data)[source]

Creates a JSON-AI object from JSON config

to_dict(encode_json=False)[source]

Creates a python dictionary with necessary modules within the ML pipeline specified from the JSON-AI object.

Return type

Dict[str, Union[dict, list, str, int, float, bool, None]]

Returns

A python dictionary that has the necessary components of the ML pipeline for a given dataset.

to_json()[source]

Creates JSON config to represent the necessary modules within the ML pipeline specified from the JSON-AI object.

Return type

Dict[str, Union[dict, list, str, int, float, bool, None]]

Returns

A JSON config that has the necessary components of the ML pipeline for a given dataset.

class api.types.SubmodelData(name, accuracy, is_best)[source]

class api.types.ModelAnalysis(accuracies, accuracy_histogram, accuracy_samples, train_sample_size, test_sample_size, column_importances, confusion_matrix, histograms, dtypes, submodel_data)[source]

The ModelAnalysis class stores useful information to describe a model and understand its predictive performance on a validation dataset. For each trained ML algorithm, we store:

Parameters
  • accuracies (Dict[str, float]) – Dictionary with obtained values for each accuracy function (specified in JsonAI)

  • accuracy_histogram (Dict[str, list]) – Dictionary with histograms of reported accuracy by target value.

  • accuracy_samples (Dict[str, list]) – Dictionary with sampled pairs of observed target values and respective predictions.

  • train_sample_size (int) – Size of the training set (data that parameters are updated on)

  • test_sample_size (int) – Size of the testing set (explicitly held out)

  • column_importances (Dict[str, float]) – Dictionary with the importance of each column for the model, as estimated by an approach that closely follows a leave-one-covariate-out strategy.

  • confusion_matrix (object) – A confusion matrix for the validation dataset.

  • histograms (object) – Histogram for each dataset feature.

  • dtypes (object) – Inferred data types for each dataset feature.
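The leave-one-covariate-out idea behind column_importances can be sketched generically: score the model with all columns, then again with each column removed, and record the drop. The `loco_importances` function and the toy scorer are hypothetical; a real pipeline would re-evaluate (or retrain) a model instead of summing fixed contributions, and Lightwood's own procedure is only described as closely following this strategy.

```python
from typing import Callable, Dict, List


def loco_importances(columns: List[str],
                     score: Callable[[List[str]], float]) -> Dict[str, float]:
    """Hypothetical helper: importance = baseline score - score without the column."""
    baseline = score(columns)
    importances = {}
    for col in columns:
        reduced = [c for c in columns if c != col]
        importances[col] = baseline - score(reduced)
    return importances


# Toy scorer: each column adds a fixed amount to the score (illustration only).
CONTRIBUTION = {"age": 0.25, "income": 0.5, "noise": 0.0}


def toy_score(columns: List[str]) -> float:
    return sum(CONTRIBUTION[c] for c in columns)


importances = loco_importances(list(CONTRIBUTION), toy_score)
```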

class api.types.PredictionArguments(predict_proba=True, all_mixers=False, fixed_confidence=None, anomaly_cooldown=1, forecast_offset=0, simple_ts_bounds=False, time_format='')[source]

This class contains all possible arguments that can be passed to a Lightwood predictor at inference time. On each predict call, all arguments included in a parameter dictionary will update the respective fields in the PredictionArguments instance that the predictor will have.

Parameters
  • predict_proba (bool) – Triggers (where supported) predictions in raw probability output form; i.e. for classifiers, instead of returning only the predicted class, the output additionally includes the assigned probability for each class.

  • all_mixers (bool) – Forces an ensemble to return predictions emitted by all its internal mixers.

  • fixed_confidence (Union[int, float, None]) – Used in the ICP analyzer module. Specifies a fixed confidence alpha so that predictions are, on average, correct alpha percent of the time. For unsupervised anomaly detection, this also translates into the expected error rate. Bounded between 0.01 and 0.99 (respectively implies wider and tighter bounds, all other parameters being equal).

  • anomaly_cooldown (int) – Sets the minimum number of timesteps between consecutive firings of the anomaly detector.

  • simple_ts_bounds (bool) – In forecasting contexts, enabling this parameter disables the usual conformal-based bounds (with Bonferroni correction) and resorts to a simpler way of scaling bounds through the horizon based on the uncertainty estimation for the first value in the forecast (see helpers.ts.add_tn_num_conf_bounds for the implementation).

  • time_format (str) – For time series predictors. If set to infer, predicted order_by timestamps will be formatted back to the original dataset’s order_by format. Any other string value will be used as a formatting string, unless empty (‘’), which disables the feature (this is the default behavior).
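The update behavior described for this class, where a parameter dictionary overrides the matching fields on each predict call, can be sketched with a stand-in data class. Defaults are copied from the signature above; the `update_args` helper is hypothetical, and rejecting unknown keys with `KeyError` is an assumption, since the source does not say how unrecognized arguments are handled.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Optional, Union


@dataclass
class PredictionArguments:
    """Stand-in mirroring the documented signature (not the real class)."""
    predict_proba: bool = True
    all_mixers: bool = False
    fixed_confidence: Optional[Union[int, float]] = None
    anomaly_cooldown: int = 1
    forecast_offset: int = 0
    simple_ts_bounds: bool = False
    time_format: str = ''


def update_args(current: PredictionArguments,
                overrides: Dict[str, Any]) -> PredictionArguments:
    """Hypothetical helper: return a copy with the given fields overridden."""
    known = {f.name for f in fields(current)}
    unknown = set(overrides) - known
    if unknown:
        # Assumption: unknown keys are an error rather than silently ignored.
        raise KeyError(f"unknown prediction arguments: {sorted(unknown)}")
    return replace(current, **overrides)


defaults = PredictionArguments()
updated = update_args(defaults, {"all_mixers": True, "fixed_confidence": 0.9})
```

Fields not named in the dictionary keep their previous values, so a call only needs to list what it wants to change.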

static from_dict(obj)[source]

Creates a PredictionArguments object from a python dictionary with necessary specifications.

Parameters

obj (Dict) – A python dictionary with the necessary features for the PredictionArguments class.

Returns

A populated PredictionArguments object.

to_dict(encode_json=False)[source]

Creates a python dictionary from the PredictionArguments object

Return type

Dict[str, Union[dict, list, str, int, float, bool, None]]

Returns

A python dictionary