Data Types (dtypes)

Lightwood supports several data types used in standard machine learning pipelines. The dtype class labels each column with the input format it should be treated as. The type inferred for a column determines which feature engineering methodology is applied to it.

Currently, the supported way to add a new data type is to include a custom tag in this file and to import a custom cleaning approach. Users may inherit the basic functionality of the cleaner and include their own flag specific to their data type. For steps on how to do this, please see the tutorials.

class api.dtype.dtype

Definitions of all data types currently supported. These include:

  • Numerical: Data that should be represented in the form of a number. Currently integer, float, and quantity are supported.

  • Categorical: Data that represents a class or label and is discrete. Currently binary, categorical, and tags are supported.

  • Date/Time: Time-series data that is temporal or sequential. Currently date and datetime are supported.

  • Text: Data that can be considered as language information. Currently short_text and rich_text are supported. Short text has a small vocabulary (roughly 100 words) and generally a limited number of characters. Rich text is anything with greater complexity.

  • Complex: Data types that require custom techniques. Currently audio, video, and image are available, but they are highly experimental.

  • Array: Data in the form of a sequence where order must be preserved. The tsarray dtypes are for “normal” columns that will be transformed into arrays at the row level because they will be treated as time series.

  • Miscellaneous: Other data descriptors. These include empty, which marks an explicitly unknown value, and invalid, which marks a data type that is not currently supported.
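The grouping in the list above can be sketched as a plain lookup table. This is only an illustration and not the Lightwood source: the names DTYPE_FAMILIES and family_of are hypothetical, and the assumption that each dtype is identified by a plain string label is ours. The array family is left out because the exact array label names are not listed here.

```python
# Illustrative stand-in built from the list above; NOT the Lightwood source.
# Assumption: each dtype is identified by a plain string label.
# The array family is omitted because its exact label names are not given here.
DTYPE_FAMILIES = {
    "numerical": ("integer", "float", "quantity"),
    "categorical": ("binary", "categorical", "tags"),
    "date_time": ("date", "datetime"),
    "text": ("short_text", "rich_text"),
    "complex": ("audio", "video", "image"),
    "miscellaneous": ("empty", "invalid"),
}


def family_of(label: str) -> str:
    """Return the family (hypothetical grouping key) that a dtype label belongs to."""
    for family, labels in DTYPE_FAMILIES.items():
        if label in labels:
            return family
    # Labels outside the table are reported rather than silently guessed.
    raise KeyError(label)


if __name__ == "__main__":
    for label in ("quantity", "tags", "rich_text", "invalid"):
        print(label, "belongs to", family_of(label))
```

A lookup of this kind is one way a pipeline could branch on the label assigned to a column, for example to pick a different cleaning routine per family.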

Custom data types may be implemented here as a flag for subsequent treatment and processing. You are welcome to include your own definitions, so long as they do not override the existing type names. If you do override them, please edit the subsequent parts of the preprocessing pipeline to indicate how you want these data types to be handled.
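As a minimal sketch of the "do not override existing type names" rule, the snippet below adds a custom flag to a stand-in class and checks that nothing already defined is shadowed. Everything here is hypothetical: BaseDtype is a small stand-in with a few of the labels named above (it is not the real api.dtype.dtype class), and phone_number, own_labels, and check_no_override are names invented for illustration.

```python
class BaseDtype:
    """Hypothetical stand-in holding a subset of the labels described above."""
    integer = "integer"
    float = "float"
    categorical = "categorical"
    empty = "empty"
    invalid = "invalid"


def own_labels(cls: type) -> dict:
    """Public string attributes defined directly on cls (inherited ones excluded)."""
    return {
        name: value
        for name, value in vars(cls).items()
        if not name.startswith("_") and isinstance(value, str)
    }


def check_no_override(base: type, custom: type) -> None:
    """Raise ValueError if custom redefines a name or reuses a value from base."""
    existing = own_labels(base)
    added = own_labels(custom)
    name_clashes = sorted(set(added) & set(existing))
    value_clashes = sorted(
        name for name, value in added.items() if value in existing.values()
    )
    if name_clashes or value_clashes:
        raise ValueError(
            f"custom dtypes clash with existing ones: "
            f"names={name_clashes}, values={value_clashes}"
        )


class CustomDtype(BaseDtype):
    # New flag (hypothetical) that leaves every existing label untouched.
    phone_number = "phone_number"


class BadDtype(BaseDtype):
    # Redefines an existing name, which the check below rejects.
    integer = "int64"


if __name__ == "__main__":
    check_no_override(BaseDtype, CustomDtype)  # passes silently
    try:
        check_no_override(BaseDtype, BadDtype)
    except ValueError as err:
        print("rejected:", err)
```

Subclassing keeps the existing labels reachable through the custom class, so downstream code that refers to them keeps working while the new flag is handled by whatever custom cleaning step is added for it.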