Helpers¶
Various helper functions.
- class helpers.LightwoodAutocast(enabled=True)[source]¶
Equivalent to torch.cuda.amp.autocast, but checks device compute capability to activate the feature only when the GPU has tensor cores to leverage AMP.
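The capability gate can be illustrated with a small standalone check. This is a sketch only: the real class wraps torch.cuda.amp.autocast and would query something like torch.cuda.get_device_capability(); the exact threshold it uses is an assumption here.

```python
def gpu_supports_amp(major, minor):
    # Tensor cores first appeared with compute capability 7.0 (Volta),
    # so AMP is assumed to only be worth enabling from there onwards.
    return (major, minor) >= (7, 0)

# e.g. with torch available: gpu_supports_amp(*torch.cuda.get_device_capability())
```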
- helpers.add_tn_num_conf_bounds(data, tss_args)[source]¶
Deprecated. We now opt for the much better solution of having scores for each timestep (see all TS classes in analysis/nc).
Adds confidence (and bounds, if applicable) to t+n predictions, for n>1. TODO: active research question: how to guarantee 1-e coverage for t+n, n>1. For now, this (conservatively) increases the width by the confidence times the log of the time step (and a scaling factor).
- helpers.analyze_sentences(data)[source]¶
- Parameters
data – list of str
- Returns
- tuple(int: nr words total, dict: word_dist, dict: nr_words_dist)
- helpers.bounded_ts_accuracy(true_values, predictions, **kwargs)[source]¶
The normal MASE accuracy inside
evaluate_array_accuracy
has a break point of 1.0: smaller values mean a naive forecast is better, and bigger values imply the forecast is better than a naive one. It is upper-bounded by 1e4.
This 0-1 bounded MASE variant maps the 1.0 breakpoint to 0.5. For worse-than-naive, it scales linearly (with a factor). For better-than-naive, we fix 10 as 0.99, and scaled logarithms (with the 10 and 1e4 cutoffs as respective bases) are used to squash all remaining preimages to values between 0.5 and 1.0.
- Return type
float
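The mapping described above can be sketched as follows. This is an illustration of the stated anchors (0.5 at the breakpoint, 0.99 at 10, 1.0 at the 1e4 cap), not the actual implementation; the linear factor and log bases are assumptions chosen to hit those anchors.

```python
import math

def bounded_mase(raw):
    # Worse-than-naive (raw <= 1.0): linear scaling into [0, 0.5].
    if raw <= 1.0:
        return 0.5 * raw
    # Better-than-naive up to 10: log squash mapping (1, 10] onto (0.5, 0.99].
    if raw <= 10.0:
        return 0.5 + 0.49 * math.log(raw, 10)
    # Beyond 10, up to the 1e4 upper bound: squash (10, 1e4] onto (0.99, 1.0].
    return 0.99 + 0.01 * math.log(raw / 10, 1e4 / 10)
```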
- helpers.cast_string_to_python_type(string)[source]¶
Returns None, an integer, a float, or a string, parsed from the input string.
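A hypothetical sketch of the described casting order (the actual helper may handle more cases, e.g. explicit "None"/"nan" markers):

```python
def cast_string(s):
    # Hypothetical sketch: empty/None -> None, then try int,
    # then float, otherwise return the string unchanged.
    if s is None or s == '':
        return None
    try:
        return int(s)
    except ValueError:
        pass
    try:
        return float(s)
    except ValueError:
        return s
```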
- helpers.evaluate_accuracy(data, predictions, target, accuracy_functions, ts_analysis={}, n_decimals=3)[source]¶
Dispatcher for accuracy evaluation.
- Parameters
data (DataFrame) – original dataframe.
predictions (Series) – output of a lightwood predictor for the input data.
target (str) – target column name.
accuracy_functions (List[str]) – list of accuracy function names. Support currently exists for scikit-learn’s metrics module, plus any custom methods that Lightwood exposes.
ts_analysis (Optional[dict]) – lightwood.data.timeseries_analyzer output, used to compute time series task accuracy.
n_decimals (Optional[int]) – used to round accuracies.
- Return type
Dict[str, float]
- Returns
accuracy metric for a dataset and predictions.
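A minimal dispatcher in the same spirit, assuming a registry of named metric callables (the real function instead resolves names against scikit-learn's metrics module and Lightwood's own methods):

```python
def dispatch_accuracy(true_values, predictions, accuracy_functions, registry, n_decimals=3):
    # Look up each named metric and return a {name: rounded score} dict,
    # mirroring the Dict[str, float] return type described above.
    return {name: round(registry[name](true_values, predictions), n_decimals)
            for name in accuracy_functions}

# usage with a toy metric registry (mean absolute error)
registry = {'mae': lambda t, p: sum(abs(a - b) for a, b in zip(t, p)) / len(t)}
scores = dispatch_accuracy([1, 2, 3], [1, 2, 6], ['mae'], registry)
```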
- helpers.evaluate_array_accuracy(true_values, predictions, **kwargs)[source]¶
Default time series forecasting accuracy method. Returns mean score over all timesteps in the forecasting horizon, as determined by the base_acc_fn (R2 score by default).
- Return type
float
- helpers.evaluate_cat_array_accuracy(true_values, predictions, **kwargs)[source]¶
Evaluate accuracy in categorical time series forecasting tasks.
Balanced accuracy is computed for each timestep (as determined by timeseries_settings.horizon), and the final accuracy is the reciprocal of the average score through all timesteps.
- Return type
float
- helpers.evaluate_multilabel_accuracy(true_values, predictions, **kwargs)[source]¶
Evaluates accuracy for multilabel/tag prediction.
- Returns
weighted f1 score of predictions and ground truths.
- helpers.evaluate_num_array_accuracy(true_values, predictions, **kwargs)[source]¶
Evaluate accuracy in numerical time series forecasting tasks. Defaults to mean absolute scaled error (MASE) if in-sample residuals are available. If this is not the case, R2 score is computed instead.
Scores are computed for each timestep (as determined by timeseries_settings.horizon), and the final accuracy is the reciprocal of the average score through all timesteps.
- Return type
float
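The averaging-then-reciprocal step described above can be sketched as a toy illustration. Here per-timestep errors stand in for MASE values (lower is better), so the reciprocal of their mean yields a higher-is-better score:

```python
def reciprocal_average(per_timestep_errors):
    # Average the per-timestep error over the horizon, then invert
    # so that smaller errors map to larger accuracy scores.
    mean_error = sum(per_timestep_errors) / len(per_timestep_errors)
    return 1.0 / mean_error
```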
- helpers.evaluate_regression_accuracy(true_values, predictions, **kwargs)[source]¶
Evaluates accuracy for regression tasks. If predictions have a lower and upper bound, then within-bound accuracy is computed: whether the ground truth value falls within the predicted region. If not, then a (positive bounded) R2 score is returned instead.
- Returns
accuracy score as defined above.
- helpers.gen_chars(length, character)[source]¶
Generates a string of the given length consisting of a repeating character.
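The described behavior amounts to simple character repetition; a sketch:

```python
def gen_chars(length, character):
    # Repeat `character` `length` times.
    return character * length
```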
- helpers.get_group_matches(data, combination, group_columns)[source]¶
Given a particular group combination, return the data subset that belongs to it.
- Return type
Tuple[list, DataFrame]
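The idea can be illustrated without pandas by treating rows as dicts (a sketch only; the real helper returns the list of matching indices together with a DataFrame subset):

```python
def get_group_matches(rows, combination, group_columns):
    # Keep rows whose group-column values equal the given combination,
    # returning both the matching indices and the matching rows.
    idxs, subset = [], []
    for i, row in enumerate(rows):
        if all(row[col] == val for col, val in zip(group_columns, combination)):
            idxs.append(i)
            subset.append(row)
    return idxs, subset

# usage: select the subset belonging to group ('A',)
rows = [{'store': 'A', 'y': 1}, {'store': 'B', 'y': 2}, {'store': 'A', 'y': 3}]
idxs, subset = get_group_matches(rows, ('A',), ['store'])
```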
- helpers.is_nan_numeric(value)[source]¶
Determines whether value is nan, inf, or some other numeric value (i.e. one that can be cast as float) that is not actually a number.
- Return type
bool
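A hypothetical sketch of that check (the actual implementation may inspect additional string markers):

```python
import math

def is_nan_numeric(value):
    # True when `value` can be cast to float but the result is
    # nan or inf, i.e. numeric yet not an actual number.
    try:
        f = float(value)
    except (TypeError, ValueError):
        return False
    return math.isnan(f) or math.isinf(f)
```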
- helpers.is_none(value)[source]¶
We use pandas :( Pandas has no way to guarantee “stability” for the type of a column; it chooses to arbitrarily change it based on the values. Pandas also changes the values in the columns based on the types. Lightwood relies on having
None
values for cells that represent “missing” or “corrupt”.
When we assign
None
to a cell in a dataframe, it might get turned into nan or other values. This function checks whether a cell is
None
or any other value a pd.DataFrame might convert
None
to.
It also checks some extra values (like
''
) that pandas never converts
None
to (hopefully), but which Lightwood would still consider “None values”; this allows for more generic use later.
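A hypothetical sketch of the described check. The set of string markers is an assumption; the real helper may recognize more or fewer values:

```python
def is_none(value):
    # Treat actual None, float nan (nan != nan holds without numpy),
    # pandas-style missing markers produced from None, and '' (which
    # Lightwood also considers a "None value") as missing.
    if value is None:
        return True
    if isinstance(value, float) and value != value:
        return True
    return str(value) in ('', 'None', 'nan', 'NaT')
```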