Xpdeep Dataset#

This page describes the concepts behind how Xpdeep handles data formatting and preprocessing.

Description Spaces#

A dataset consists of many samples, where each sample is characterized in two description spaces:

  • The raw space, which is the original sample space and can be interpreted directly by human experts (e.g. words, RGB images, dates).

  • The preprocessed space, which is a numerical space suitable for use as the input/output space of deep learning models (e.g. tokens, one-hot vectors).
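As a rough illustration (the values and helper below are made up for this sketch, not part of the Xpdeep API), the same sample can be written in both spaces: raw values stay human-readable, while preprocessed values are numeric vectors.

```python
# Hypothetical example: one sample described in both spaces.
CATEGORIES = ["red", "green", "blue"]

def to_one_hot(color: str) -> list[float]:
    """Map a raw category to a one-hot vector (preprocessed space)."""
    return [1.0 if c == color else 0.0 for c in CATEGORIES]

raw_sample = {"color": "green", "height_cm": 182.0}
preprocessed_sample = {
    "color": to_one_hot(raw_sample["color"]),        # one-hot vector
    "height_cm": [raw_sample["height_cm"] / 200.0],  # scaled to ~[0, 1]
}
```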

Warning

The relationship between the raw space and the preprocessed space is central to the explainability process.

The raw space and the preprocessed space are conveniently divided into two subspaces:

  • The target space.
  • The input space.

The preprocessed input space and the preprocessed target space are limited to N-dimensional tensors, since Xpdeep is currently PyTorch-compatible and uses torch.Tensor under the hood.

Future Release

  • Multimodal preprocessed inputs allowing multiple tensors to be used as preprocessed input.
  • Multimodal preprocessed targets allowing multiple tensors to be used as preprocessed target.

Schema#

The key concept of the Xpdeep Dataset is the Schema object: it defines the structure of the dataset as a list of Xpdeep columns where each column is either an ExplainableFeature or a Metadata.

The Schema is used by Xpdeep to navigate from the raw to the preprocessed space and reciprocally.

Future Release

An optional augmentation_function will be added to the schema to perform data augmentation.

Explainable Features#

In order for Xpdeep explanations to generalize to a large variety of datasets, a dataset sample is divided into a set of explainable features. A feature corresponds to a dataset column that the model requires as an input or a target.

For each explainable feature, a bijective mapping is defined between its value in the raw space and in the preprocessed space. The preprocessor object handles this mapping.

Warning

An explainable feature must always be associated with a preprocessor.

transform is the transformation from the raw space to the preprocessed space, and inverse_transform is the transformation from the preprocessed space to the raw space.
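A minimal sketch of such a bijective preprocessor (the class name and scaling are illustrative, not the actual Xpdeep API): transform maps the raw space to the preprocessed space, and inverse_transform undoes it exactly.

```python
class MinMaxPreprocessor:
    """Illustrative preprocessor: scales raw values into [0, 1]."""

    def __init__(self, low: float, high: float):
        self.low, self.high = low, high

    def transform(self, raw: float) -> float:
        # Raw space -> preprocessed space.
        return (raw - self.low) / (self.high - self.low)

    def inverse_transform(self, preprocessed: float) -> float:
        # Preprocessed space -> raw space.
        return preprocessed * (self.high - self.low) + self.low

scaler = MinMaxPreprocessor(low=0.0, high=256.0)
assert scaler.inverse_transform(scaler.transform(182.0)) == 182.0
```

Because the mapping is bijective, any value computed in the preprocessed space can be reported back to the user in the raw space.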

Specifically, the Xpdeep feature object contains:

  • a name, which is the key indexing the feature in the schema.
  • an is_target boolean indicating whether it represents a target or an input feature.
  • a preprocessor object, which is responsible for mapping the feature from the raw space to the preprocessed space and vice versa.

The schema combines the per-feature preprocessors to define two bijective mappings: from the raw input space to the preprocessed input space, and from the raw target space to the preprocessed target space.
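As a rough sketch (using illustrative names and scalings, not the actual Xpdeep classes), combining per-feature preprocessors into a schema-level mapping could look like this:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Feature:
    """Illustrative stand-in for an explainable feature."""
    name: str
    is_target: bool
    transform: Callable[[Any], Any]          # raw -> preprocessed
    inverse_transform: Callable[[Any], Any]  # preprocessed -> raw

features = [
    Feature("height_cm", False, lambda x: x / 200.0, lambda x: x * 200.0),
    Feature("weight_kg", True, lambda x: x / 150.0, lambda x: x * 150.0),
]

def schema_transform(raw: dict, targets: bool) -> dict:
    """Map a raw sample to the preprocessed input or target space."""
    return {f.name: f.transform(raw[f.name])
            for f in features if f.is_target == targets}

raw = {"height_cm": 182.0, "weight_kg": 75.0}
preprocessed_inputs = schema_transform(raw, targets=False)
preprocessed_targets = schema_transform(raw, targets=True)
```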

Metadata#

Metadata is similar to a feature, as both characterize a column in a dataset. However, metadata is not associated with a preprocessor and cannot be explained.

Future Release

Currently, only preprocessed inputs and targets are available at training time. In the future, it will be possible to specify preprocessed metadata, which are metadata associated with an identity preprocessor that can be used at training time (e.g. attention masks, padding masks) and for which some explanations will be available.

Index Xpdeep#

Each dataset must contain an index_xp_deep column: a metadata column that uniquely identifies each data sample, like a row index.
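One simple way to build such a column (the column name matches the documentation; the surrounding data is made up for illustration) is to enumerate the rows:

```python
# Add a unique "index_xp_deep" metadata column to tabular data.
rows = [
    {"height_cm": 182.0, "weight_kg": 75.0},
    {"height_cm": 165.0, "weight_kg": 60.0},
]
indexed_rows = [{"index_xp_deep": i, **row} for i, row in enumerate(rows)]
```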

Future Release

It will be possible to use an existing index column as "index_xp_deep".

Dataset types#

Depending on the raw data available, two options exist to create your Dataset with Xpdeep.

Note

Each dataset represents a portion of the data, which may be a standard train, validation or test split.

StandardDataset#

If the raw data is in a supported format, it is possible to use a StandardDataset. Xpdeep currently supports the Parquet format, loaded as a pyarrow.Table. Parquet is a columnar storage format optimized for efficiently storing and processing large amounts of structured data. Future releases will include other standard formats (".csv", "directories", ".txt", ".json", etc.). See the API reference for the full list.

Future Release

CsvDataset and ImageDirectoryDataset will be available.

CustomDataset#

Future Release

If the raw data is not in a supported format, it is possible to use a CustomDataset.

CustomDatasets are user-defined datasets that are preprocessed client-side, on the customer's own machines. Data is returned to the Xpdeep server in batches, with each batch containing both raw and preprocessed data.

This way, there is no restriction on the raw data and it can be adapted to any specific need. The client side code is responsible for returning data in a valid format (batch and schema).
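A minimal sketch of client-side batching, assuming a generator-based design (the function name, batch structure, and transform are illustrative, not the actual Xpdeep protocol): each batch carries both the raw and the preprocessed values.

```python
def iter_batches(samples, batch_size, transform):
    """Yield batches containing raw AND preprocessed data (illustrative)."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        yield {
            "raw": chunk,
            "preprocessed": [transform(s) for s in chunk],
        }

samples = [10.0, 20.0, 30.0]
batches = list(iter_batches(samples, batch_size=2,
                            transform=lambda x: x / 100.0))
```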

Preprocessing#

Preprocessing is performed on the fly for each data batch with Xpdeep trusted preprocessor objects.

Future Release

It will be possible to preprocess the entire dataset in advance, prior to training.

Warning

Even with in-advance preprocessing, Xpdeep still needs to call the preprocessor's inverse transform method to generate its explanations.

Therefore:

  • With a StandardDataset, data is preprocessed on the Xpdeep server using only Xpdeep's available preprocessing methods.

  • With a CustomDataset, data is preprocessed client-side with your custom preprocessing methods:

    1. Data is sent to the xpdeep-server in batches, each containing both raw AND preprocessed data.
    2. When the server requests an inverse transform call, the request is forwarded to the client, which returns the corresponding raw data.
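The inverse-transform round trip above can be sketched as follows (all names are illustrative assumptions, not the actual Xpdeep client/server interface): the server only holds preprocessed values, so it delegates inverse_transform back to the client.

```python
def client_inverse_transform(preprocessed_batch):
    """Client-side: the custom preprocessing knows how to undo itself."""
    return [x * 100.0 for x in preprocessed_batch]

def server_explain(preprocessed_batch, inverse_transform_callback):
    """Server-side: phrase explanations in the raw space by asking
    the client to invert the preprocessing."""
    raw_batch = inverse_transform_callback(preprocessed_batch)
    return [f"raw value: {v}" for v in raw_batch]

messages = server_explain([0.25, 0.5], client_inverse_transform)
```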