parquet_dataset
Parquet datasets, used to work with raw data stored in the ".parquet" format.
Classes:

| Name | Description |
|---|---|
| BaseParquetDataset | Base Parquet Dataset class. |
| ParquetDataset | Parquet Dataset class, to be analyzed via the AutoAnalyzer. |
| AnalyzedParquetDataset | Analyzed Parquet Dataset class, to be created from an existing analyzed schema. |
| FittedParquetDataset | Fitted Parquet Dataset class, to be created from an existing fitted schema. |
Attributes:

| Name | Type | Description |
|---|---|---|
| logger | Logger | Module logger. |
logger = logging.getLogger(__name__)
BaseParquetDataset
Base Parquet Dataset class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Dataset name. | required |
| path | str | The dataset parquet file path, which must be … | required |
| storage_options | dict[str, Any] | Optional storage options to stream data from a cloud storage instance. | dict() |
| num_rows | int \| None | Size of the parquet dataset. If None, it is automatically inferred before database insertion, which may take time on large files. | None |
Methods:

| Name | Description |
|---|---|
| get_num_rows | Get the number of rows in the parquet file. |
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | Dataset name. |
| path | str | The dataset parquet file path. |
| storage_options | dict[str, object] | Optional storage options to stream data from a cloud storage instance. |
| num_rows | int \| None | Size of the parquet dataset. |
name: str

path: str

storage_options: dict[str, object] = field(factory=dict)

num_rows: int | None = None

get_num_rows() -> int

Get the number of rows in the parquet file.
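Example (illustrative sketch, not taken from the reference): construct a dataset over a local parquet file and read its row count. The import path is assumed from the module's source path (src/xpdeep/dataset/parquet_dataset.py), and the dataset name and file path are placeholders; in practice you would more likely use one of the subclasses below.

```python
# Minimal sketch: a dataset over a local parquet file, then its row count.
from xpdeep.dataset.parquet_dataset import BaseParquetDataset  # import path assumed

dataset = BaseParquetDataset(
    name="my-dataset",            # placeholder dataset name
    path="data/train.parquet",    # placeholder path to a ".parquet" file
    storage_options={},           # fill in when streaming from a cloud storage instance
)

print(dataset.get_num_rows())     # number of rows in the parquet file
```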
ParquetDataset
Parquet Dataset class, to be analyzed via the AutoAnalyzer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Dataset name. | required |
| path | str | The dataset parquet file path. | required |
| storage_options | dict[str, object] | Optional storage options to stream data from a cloud storage instance. | dict() |
| num_rows | int \| None | Size of the parquet dataset. If None, it is automatically inferred before database insertion, which may take time on large files. | None |
Methods:

| Name | Description |
|---|---|
| analyze | Analyze the dataset and create an Analyzed Schema. |
analyze(*forced_type: ExplainableFeature | IndexMetadata, target_names: list[str] | None = None) -> AnalyzedParquetDataset

Analyze the dataset and create an Analyzed Schema.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| forced_type | Feature | Feature objects to force a custom feature type for specific column names in the Arrow Table. | () |
| target_names | list[str] \| None | Optional list of column names indicating which columns should be considered targets. | None |
Returns:

| Type | Description |
|---|---|
| AnalyzedParquetDataset | The analyzed dataset, a parquet dataset with an analyzed schema attached. |
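Example (illustrative sketch): build a ParquetDataset and let the AutoAnalyzer produce an analyzed schema. The import path is assumed as above; the dataset name and the "price" target column are placeholders, and forcing custom feature types via forced_type is omitted.

```python
# Minimal sketch of the analyze flow.
from xpdeep.dataset.parquet_dataset import ParquetDataset  # import path assumed

dataset = ParquetDataset(name="houses", path="data/houses.parquet")

# Mark which column(s) should be treated as targets.
analyzed_dataset = dataset.analyze(target_names=["price"])
```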
AnalyzedParquetDataset
Analyzed Parquet Dataset class to be created from an existing analyzed schema.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Dataset name. | required |
| path | str | The dataset parquet file path. | required |
| storage_options | dict[str, object] | Optional storage options to stream data from a cloud storage instance. | dict() |
| num_rows | int \| None | Size of the parquet dataset. If None, it is automatically inferred before database insertion, which may take time on large files. | None |
| analyzed_schema | AnalyzedSchema | The analyzed schema attached to the dataset. | required |
Methods:

| Name | Description |
|---|---|
| fit | Create a Fitted Parquet Dataset object. |
Attributes:

| Name | Type | Description |
|---|---|---|
| analyzed_schema | AnalyzedSchema | The analyzed schema attached to the dataset. |
analyzed_schema: AnalyzedSchema = field(kw_only=True)

fit() -> FittedParquetDataset

Create a Fitted Parquet Dataset object.
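Example (illustrative sketch, continuing the one above): once a dataset carries an analyzed schema, fit() turns it into a FittedParquetDataset. The second form rebuilds an AnalyzedParquetDataset from an already available AnalyzedSchema, here taken from the dataset analyzed previously; names and paths remain placeholders.

```python
# Derive a fitted dataset from the analyzed one.
fitted_dataset = analyzed_dataset.fit()

# Alternatively, rebuild an AnalyzedParquetDataset from an existing analyzed schema.
from xpdeep.dataset.parquet_dataset import AnalyzedParquetDataset  # import path assumed

rebuilt = AnalyzedParquetDataset(
    name="houses",
    path="data/houses.parquet",
    analyzed_schema=analyzed_dataset.analyzed_schema,  # keyword-only field, see above
)
fitted_dataset = rebuilt.fit()
```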
FittedParquetDataset
Fitted Parquet Dataset class to be created from an existing fitted schema.
A fitted parquet dataset can be saved remotely using the save method, or by passing it as a parameter when computing an explanation. Once it has been saved remotely, it can no longer be updated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Dataset name. | required |
| path | str | The dataset parquet file path. | required |
| storage_options | dict[str, object] | Optional storage options to stream data from a cloud storage instance. | dict() |
| num_rows | int \| None | Size of the parquet dataset. If None, it is automatically inferred before database insertion, which may take time on large files. | None |
| fitted_schema | FittedSchema | The fitted schema attached to the dataset. | required |
| size | int \| None | | None |
Methods:

| Name | Description |
|---|---|
| __setattr__ | Set attribute. |
| stable_hash | Build a hash of the fitted parquet dataset. |
| get_last_parquet_modification_datetime_utc | Get the datetime of the last modification of a file or object. Assumes timezone UTC. |
| to_model | Convert to ParquetDatasetArtifactInsert instance. |
| save | Save the Fitted Parquet Dataset remotely. |
| load_all | List all datasets of the current project. |
| get_by_id | Get fitted dataset by its ID. |
| get_by_name | Get fitted dataset by its name. |
| delete | Delete the current object remotely. |
| load_computed_statistics | Load all computed statistics on this dataset. |
Attributes:

| Name | Type | Description |
|---|---|---|
| fitted_schema | FittedSchema | The fitted schema attached to the dataset. |
| size | int \| None | |
| id | str | Artifact id if the object exists remotely. |
fitted_schema: FittedSchema = field(kw_only=True)

size: int | None = None

id: str

Get artifact id if the object exists remotely.
__setattr__(attr: str, value: object) -> None

Set attribute.
stable_hash() -> str

Build a hash of the fitted parquet dataset.
get_last_parquet_modification_datetime_utc() -> datetime | None

Get the datetime of the last modification of a file or object. Assumes timezone UTC.

Returns:

| Type | Description |
|---|---|
| datetime \| None | Datetime of the last modification in UTC, or None if not found. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the file or object does not exist. |
to_model() -> ParquetDatasetArtifactInsert

Convert to ParquetDatasetArtifactInsert instance.
save(*, force: bool = False) -> FittedParquetDataset

Save the Fitted Parquet Dataset remotely.
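Example (illustrative sketch, continuing the earlier one): persist the fitted dataset remotely. Recall from the class description that a dataset saved remotely can no longer be updated. The force flag is shown only because it appears in the signature; its exact semantics are not documented here.

```python
# Persist the fitted dataset remotely; after this, it can no longer be updated.
fitted_dataset = fitted_dataset.save()

# Keyword-only flag taken from the signature above (semantics not documented here):
# fitted_dataset = fitted_dataset.save(force=True)
```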
load_all() -> list[FittedParquetDataset]

List all datasets of the current project.
get_by_id(dataset_id: str) -> FittedParquetDataset

Get fitted dataset by its ID.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_id | str | The ID of the dataset to retrieve. | required |
get_by_name(dataset_name: str) -> FittedParquetDataset

Get fitted dataset by its name.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | The name of the dataset to retrieve. | required |
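Example (illustrative sketch): the three documented retrieval paths, assuming they can be called on the class itself (their signatures take no instance) and that the class is importable as above; the ID and name values are placeholders.

```python
from xpdeep.dataset.parquet_dataset import FittedParquetDataset  # import path assumed

# All fitted datasets of the current project.
all_datasets = FittedParquetDataset.load_all()

# Lookup by ID or by name (placeholder values).
by_id = FittedParquetDataset.get_by_id("0123456789abcdef")
by_name = FittedParquetDataset.get_by_name("houses")
```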
delete() -> None

Delete the current object remotely.
load_computed_statistics() -> list[ExplanationStatisticSelect]

Load all computed statistics on this dataset.
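Example (illustrative sketch, continuing the earlier one): list the statistics computed on a saved dataset, then delete it remotely.

```python
# Statistics computed on this dataset (a list of ExplanationStatisticSelect).
stats = fitted_dataset.load_computed_statistics()
print(len(stats))

# Remove the remote artifact once it is no longer needed.
fitted_dataset.delete()
```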