
Create a Schema#

To use your dataset with the Xpdeep framework, you need a schema object defining the dataset structure.

The schema exists at two levels: an analyzed schema, which is an editable proposal, and a fitted schema, in which each preprocessor has been fitted to the data.

As it can be tedious to find the correct schema, Xpdeep provides an AutoAnalyzer to help you get a first schema version. You can later update it if some feature analysis seems incorrect.

Note

The dataset used during the training process corresponds to the train dataset.

1. Find a Schema#

The first step consists of identifying each feature type and its associated preprocessor. You can find the list of available Feature types in the API reference.

Warning

For security reasons, we do not yet allow arbitrary code to be executed in the framework. Therefore, with StandardDataset, your preprocessing must come from a list of trusted preprocessors. Xpdeep currently supports scikit-learn and PyTorch preprocessing for building your preprocessor.

With the Auto Analyzer#

With a dataset object, you can get a first schema proposal using the analyze method of a ParquetDataset instance.

Set the Target(s)#

The only requirement is to indicate which feature(s) should be considered as target(s).

Please use the target_names parameter to specify which features should be considered as targets prior to the analysis.

analyzed_train_dataset = train_dataset.analyze(target_names=["flower_type"])
👀 Full file preview
"""Create a schema."""

import pyarrow as pa
import pyarrow.parquet as pq
from sklearn.preprocessing import StandardScaler

import xpdeep
from xpdeep import Project
from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.parquet_dataset import ParquetDataset
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor

demo = {"api_key": "your_api_key", "api_url": "your_api_url"}
xpdeep.init(**demo)

xpdeep.set_project(Project.create_or_get(name="toy dataset example", description="tutorial"))

# Create a Dataset

S3_DATASET_ENDPOINT_URL = "S3_DATASET_ENDPOINT_URL"
S3_DATASET_ACCESS_KEY_ID = "S3_DATASET_ACCESS_KEY_ID"
S3_DATASET_SECRET_ACCESS_KEY = "S3_DATASET_SECRET_ACCESS_KEY"
S3_DATASET_BUCKET_NAME = "S3_DATASET_BUCKET_NAME"

raw_data = pa.table({
    "petal_length": [1.4, 1.5, 1.3, 4.5, 4.1, 5.0, 6.0, 5.5],
    "petal_width": [0.2, 0.2, 0.2, 1.5, 1.3, 1.8, 2.5, 2.3],
    "flower_type": ["Setosa", "Setosa", "Setosa", "Versicolor", "Versicolor", "Versicolor", "Virginica", "Virginica"],
})

# Write the table to a Parquet file
pq.write_table(raw_data, "train.parquet")


path = f"s3://{S3_DATASET_BUCKET_NAME}/train.parquet"
storage_options = {
    "key": S3_DATASET_ACCESS_KEY_ID,
    "secret": S3_DATASET_SECRET_ACCESS_KEY,
    "client_kwargs": {
        "endpoint_url": S3_DATASET_ENDPOINT_URL,
    },
    "s3_additional_kwargs": {"addressing_style": "path"},
}


train_dataset = ParquetDataset(name="dataset_name", path=path, storage_options=storage_options)

# Create a Schema

forced_feature = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(preprocess_function=StandardScaler()),
    feature_type=NumericalFeature(),
)

analyzed_train_dataset = train_dataset.analyze(target_names=["flower_type"])
fitted_train_dataset = analyzed_train_dataset.fit()
print(fitted_train_dataset.fitted_schema)

You can also set the target name directly on the analyzed schema, after the analysis.

analyzed_train_dataset = train_dataset.analyze()
analyzed_train_dataset.analyzed_schema["flower_type"].is_target = True

Set the Features#

In addition, you can force a feature type by calling the analyze method with specific features. In the following example, the feature named "petal_length" will be a NumericalFeature.

from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
from sklearn.preprocessing import StandardScaler
from xpdeep.dataset.feature import ExplainableFeature

forced_feature = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(preprocess_function=StandardScaler()),
    feature_type=NumericalFeature()
)

analyzed_train_dataset = train_dataset.analyze(forced_feature)

As the returned schema is only a proposal, you can edit it later if it doesn't correctly match your needs. Any feature can be overwritten or updated.

from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
from sklearn.preprocessing import StandardScaler

# Set feature type name after the schema inference
analyzed_train_dataset = train_dataset.analyze()
analyzed_train_dataset.analyzed_schema["petal_length"] = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(
        preprocess_function=StandardScaler(),
    ),
    feature_type=NumericalFeature()
)

This editable schema can be updated to match any other desired feature and preprocessing types.

You can remove a feature from the schema if needed, using its name:

analyzed_train_dataset.analyzed_schema.remove_feature(feature_name="petal_length")

Or from Scratch#

You can also create your own analyzed schema from scratch, without using the auto-analyzer.

from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
from sklearn.preprocessing import StandardScaler
from xpdeep.dataset.schema import AnalyzedSchema

feature_1 = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(
        preprocess_function=StandardScaler(),
    ),
    feature_type=NumericalFeature()
)

feature_2 = ExplainableFeature(
    name="petal_width",
    is_target=False,
    preprocessor=SklearnPreprocessor(
        preprocess_function=StandardScaler(),
    ),
    feature_type=NumericalFeature()
)

analyzed_schema = AnalyzedSchema(feature_1, feature_2)

Finally, use the analyzed schema to build the AnalyzedParquetDataset.

from xpdeep.dataset.parquet_dataset import AnalyzedParquetDataset

analyzed_train_dataset = AnalyzedParquetDataset(
    name="train_dataset",
    path=directory["train_set_path"],
    analyzed_schema=analyzed_schema
)


2. Fit the Schema#

Once satisfied with the feature types and their preprocessors, the next step is to fit the dataset schema. Each preprocessor is responsible for the mapping between the raw and preprocessed spaces, and must be fitted to establish this association.
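As an illustration of what fitting means here, a scikit-learn StandardScaler (the preprocessor used throughout this page) learns the raw <-> preprocessed mapping from the train data and can invert it. This standalone sketch sits outside the xpdeep API:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fitting learns the raw <-> preprocessed mapping from the train data.
raw = np.array([[1.4], [1.5], [4.5], [6.0]])         # e.g. petal_length values
scaler = StandardScaler().fit(raw)

preprocessed = scaler.transform(raw)                  # raw -> preprocessed space
recovered = scaler.inverse_transform(preprocessed)    # preprocessed -> raw space
print(np.allclose(recovered, raw))                    # True
```

Calling analyzed_train_dataset.fit() performs this fitting step for every feature preprocessor declared in the schema.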

From the AnalyzedParquetDataset#

The schema object can be used to automatically fit each feature preprocessor.

fitted_train_dataset = analyzed_train_dataset.fit()
print(fitted_train_dataset.fitted_schema)

+-----------------------------------------------+
|                Schema Contents                |
+--------------------+--------------+-----------+
| Type               | Name         | Is Target |
+--------------------+--------------+-----------+
| NumericalFeature   | petal_length | ✗         |
| NumericalFeature   | petal_width  | ✗         |
| CategoricalFeature | flower_type  | ✓         |
+--------------------+--------------+-----------+

Or from Scratch#

It is also possible to build a FittedParquetDataset directly from an existing FittedSchema using the default constructor. This can be useful to instantiate a FittedParquetDataset for a test set from another dataset's schema.

from xpdeep.dataset.parquet_dataset import FittedParquetDataset

fitted_validation_dataset = FittedParquetDataset(
    name="validation_dataset",
    path=directory["test_set_path"],
    fitted_schema=fitted_train_dataset.fitted_schema
)

Once you have a suitable fitted schema associated with your FittedParquetDataset, the next step is to build your explainable model.

3. Data augmentation#

You can also add data augmentation functions for image features using ImageFeatureAugmentation. These functions generate new images from raw data and/or preprocessed data. Currently, we only support augmentation on image features using torchvision. The augmentation must be defined on the FittedSchema, not on the AnalyzedSchema, otherwise it won't be taken into account.

from torchvision.transforms import Compose, RandomRotation
from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.augmentation import ImageFeatureAugmentation
from xpdeep.dataset.feature.feature_types import ImageFeature
from xpdeep.dataset.schema import FittedSchema

augmentation = Compose([RandomRotation(90)])
image_rotation_augmentation = ImageFeatureAugmentation(augment_preprocessed=augmentation)

image_feature = ExplainableFeature(
    preprocessor=ScaleImage(input_size=(28, 28)),  # xpdeep's image preprocessor
    name="image",
    feature_type=ImageFeature(),
)

fitted_schema = FittedSchema(image_feature)

fitted_schema["image"].augmentation = image_rotation_augmentation

Future Release

In the future, xpdeep will support augmentation on other feature types.

Warning

The ImageFeature expects the channel-last format, i.e. batch_size x H x W x num_channels.

You may need to use Compose([Permute([0, 3, 1, 2]), YourTransformation(), Permute([0, 2, 3, 1])]) if your augmentation requires the channel-first format, as is usually the case for torchvision.transforms objects. You don't need to add a transform to convert to a torch tensor first, as this is handled automatically by xpdeep.
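The wrapping described above can be sketched with plain torch. The channel_last helper and the flip transform here are illustrative, not part of the xpdeep API:

```python
import torch

def channel_last(transform):
    """Adapt a channel-first transform to channel-last batches (B, H, W, C)."""
    def wrapped(batch):
        batch = batch.permute(0, 3, 1, 2)    # B, H, W, C -> B, C, H, W
        batch = transform(batch)             # transform expects channel-first
        return batch.permute(0, 2, 3, 1)     # back to B, H, W, C
    return wrapped

# A horizontal flip standing in for YourTransformation().
flip = channel_last(lambda x: torch.flip(x, dims=[-1]))
batch = torch.rand(4, 28, 28, 3)             # batch_size x H x W x num_channels
print(flip(batch).shape)                     # torch.Size([4, 28, 28, 3])
```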