Predictor Interface¶
The PredictorInterface creates the skeletal structure around the basic functionality of Lightwood.
- class api.predictor.PredictorInterface[source]¶
Abstraction of a Lightwood predictor. The PredictorInterface encompasses how Lightwood interacts with the full ML pipeline. Internally, the PredictorInterface class must have several expected functions:
- analyze_data: Perform a statistical analysis on the unprocessed data; this helps inform downstream encoders and mixers on how to treat the data types.
- preprocess: Apply cleaning functions to each of the columns within the dataset to prepare them for featurization.
- split: Split the input dataset into train/dev/test sets according to your splitter function.
- prepare: Create and, if necessary, train your encoders to create feature representations from each column of your data.
- featurize: Create feature vectors from the pre-processed input data.
- fit: Train your mixer models to yield predictions from featurized data.
- analyze_ensemble: Evaluate the quality of fit for your mixer models.
- adjust: Incorporate new data to update pre-existing model(s).
For simplicity, we offer an end-to-end approach that allows you to input raw data and follow every step of the process until you reach a trained predictor with the learn function:
- learn: An end-to-end technique specifying how to pre-process, featurize, and train the model(s) of interest. The expected input is raw, untrained data. No explicit output is provided, but the Predictor object will “host” the trained model.
You can also use the predictor to estimate new data:
- predict: Deploys the chosen best model and evaluates the given data to provide target estimates.
- save: Saves the Predictor object for further use.
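For instance, a minimal end-to-end sketch of this lifecycle might look as follows. The high-level helpers used to obtain a predictor instance (predictor_from_problem, ProblemDefinition) and the dataset/column names are assumptions; only learn, predict, and save are defined by this interface:

```python
import pandas as pd

# Assumed high-level helpers for generating a PredictorInterface subclass;
# only learn/predict/save below are part of the interface documented here.
from lightwood.api.high_level import predictor_from_problem
from lightwood.api.types import ProblemDefinition

df = pd.read_csv('home_rentals.csv')  # hypothetical training data
pdef = ProblemDefinition.from_dict({'target': 'rental_price'})

predictor = predictor_from_problem(df, pdef)

predictor.learn(df)                                            # end-to-end training
preds = predictor.predict(df.drop(columns=['rental_price']))   # estimate new data
predictor.save('./rental_predictor.pkl')                       # persist for later use
```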
The PredictorInterface is created via J{ai}son’s custom code creation. A problem inherits from this class with pre-populated routines to fill out expected results, given the nature of each problem type.
- adjust(new_data, old_data=None, adjust_args=None)[source]¶
Adjusts a previously trained model on new data. Adopts the same process as learn, with the exception that adjust expects the best model to have already been trained.
Warning: This is experimental and subject to change.
- Parameters
new_data (DataFrame) – New data used to adjust a previously trained model.
old_data (Optional[DataFrame]) – In some situations, the old data is still required to train a model (e.g. the Regression mixer) to ensure the new data doesn’t entirely override it.
adjust_args (Optional[dict]) – Optional dictionary with parameters to customize the finetuning process.
- Return type
None
- Returns
Adjusts the best-fit model in-place; doesn’t return anything.
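A hedged sketch of how this might be called, assuming `predictor` was already trained via learn and using illustrative file names:

```python
import pandas as pd

# `predictor` is assumed to be a previously trained PredictorInterface instance.
new_df = pd.read_csv('new_batch.csv')               # hypothetical new observations
old_df = pd.read_csv('original_training_data.csv')  # hypothetical original data

# Fine-tune the existing best model on the new batch; passing the old data as
# well keeps the new data from entirely overriding what was previously learned
# (needed by some mixers, e.g. the Regression mixer).
predictor.adjust(new_df, old_data=old_df)
```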
- analyze_data(data)[source]¶
Performs a statistical analysis on the data to identify distributions, imbalanced classes, and other nuances within the data.
- Parameters
data (DataFrame) – Data used in training the model(s).
- Return type
None
- analyze_ensemble(enc_data)[source]¶
Evaluate the quality of mixers within an ensemble of models.
- Parameters
enc_data (Dict[str, DataFrame]) – Pre-processed and featurized data, split into the relevant train/test splits.
- Return type
None
- export(file_path, json_ai_code)[source]¶
Exports both the predictor object and its code to a single binary file for later usage.
- Parameters
file_path (str) – Location to store your Predictor instance.
json_ai_code (str) – The code generated by the user’s specification.
- Return type
None
- Returns
Saves Predictor instance.
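A sketch of how export might be used; `code` stands in for the predictor source produced by the J{ai}son code-generation step, and the loading helper mentioned in the comment is an assumption rather than part of this interface:

```python
# `predictor` is a trained instance and `code` is the generated predictor source
# (e.g. the string produced by Lightwood's code-generation step).
predictor.export('./rental_predictor.bin', code)

# The exported binary can later be restored; `predictor_from_state` is named
# here only as an assumption about Lightwood's high-level loading helpers.
# from lightwood.api.high_level import predictor_from_state
# predictor = predictor_from_state('./rental_predictor.bin', code)
```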
- featurize(split_data)[source]¶
Provides an encoded representation for each dataset in split_data. Requires self.encoders to be prepared.
- Parameters
split_data (Dict[str, DataFrame]) – Pre-processed data from the dataset, split into train/test (or any other relevant keys).
- Returns
For each dataset provided in split_data, the encoded representations of the data.
- fit(enc_data)[source]¶
Fits “mixer” models to train predictors on the featurized data. Instantiates a set of trained mixers and an ensemble of them.
- Parameters
enc_data (Dict[str, DataFrame]) – Pre-processed and featurized data, split into the relevant train/test splits. Expected keys are “train”, “dev”, and “test”.
- Return type
None
- learn(data)[source]¶
Trains the attribute model starting from raw data. Raw data is pre-processed and cleaned accordingly. As each column is assigned a particular data type (ex: numerical, categorical, etc.), the respective feature encoder will convert it into a representation usable for training ML models. All requested ML models are then compiled and fit on the training data.
This step amalgamates preprocess -> featurize -> fit, along with the necessary splitting and analyze_data steps that occur in between.
- Parameters
data (DataFrame) – (Unprocessed) Data used in training the model(s).
- Return type
None
- Returns
Nothing; instantiates with best fit model from ensemble.
- predict(data, args={})[source]¶
Intakes raw data to provide predicted values for your trained model.
- Parameters
data (DataFrame) – Data (n_samples, n_columns) that the model(s) will evaluate on and provide the target prediction for.
args (Dict[str, object]) – Parameters used to update the predictor’s PredictionArguments object, which holds any parameters relevant for prediction.
- Return type
DataFrame
- Returns
A dataframe of predictions of the same length as the input.
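A brief prediction sketch; `predictor` is assumed to already be trained, and the 'all_mixers' key is shown only as an assumption about the fields PredictionArguments accepts:

```python
import pandas as pd

unseen = pd.read_csv('unseen_rentals.csv')   # hypothetical hold-out data without the target

preds = predictor.predict(unseen)            # default prediction arguments
print(preds.head())

# Entries in `args` are used to build the PredictionArguments object;
# 'all_mixers' is an assumed field name, included only for illustration.
preds_all = predictor.predict(unseen, args={'all_mixers': True})
```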
- prepare(data)[source]¶
Prepares the encoders for each column of data.
- Parameters
data (Dict[str, DataFrame]) – Pre-processed data that has been split into train/test. Explicitly uses “train” and/or “dev” when preparing the encoders.
- Return type
None
- Returns
Nothing; prepares the encoders for learned representations.
- preprocess(data)[source]¶
Cleans the unprocessed dataset provided.
- Parameters
data (DataFrame) – (Unprocessed) Data used in training the model(s).
- Return type
DataFrame
- Returns
The cleaned data frame
- save(file_path)[source]¶
With a provided file path, saves the Predictor instance for later use.
- Parameters
file_path (str) – Location to store your Predictor instance.
- Return type
None
- Returns
Saves Predictor instance.
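For example (the path is illustrative; unlike export, save stores only the Predictor object, without the generated code):

```python
# Persist the trained Predictor object for later use; the path is illustrative.
predictor.save('./rental_predictor.pkl')
```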
- split(data)[source]¶
Categorizes the data into a training/testing split; if the data corresponds to a classification problem, the split is stratified.
- Parameters
data (DataFrame) – Pre-processed data, but generically any dataset to split into train/dev/test.
- Return type
Dict[str, DataFrame]
- Returns
Dictionary containing the training/testing fractions.
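Taken together, the individual methods above can be chained by hand to reproduce what learn does end to end. The sketch below assumes `predictor` is a freshly generated (untrained) PredictorInterface instance, and the dataset and column names are illustrative:

```python
import pandas as pd

df = pd.read_csv('home_rentals.csv')        # hypothetical raw training data

predictor.analyze_data(df)                  # statistical analysis of the raw data
cleaned = predictor.preprocess(df)          # per-column cleaning
splits = predictor.split(cleaned)           # dict of "train"/"dev"/"test" frames
predictor.prepare(splits)                   # build / train the per-column encoders
enc_data = predictor.featurize(splits)      # encoded representations per split
predictor.fit(enc_data)                     # train mixers and build the ensemble
predictor.analyze_ensemble(enc_data)        # evaluate mixer quality

preds = predictor.predict(df.drop(columns=['rental_price']))  # illustrative target column
```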