# parquet_dataset

Parquet datasets, used with raw data stored in the `.parquet` format.
Classes:

| Name | Description |
|---|---|
| `BaseParquetDataset` | Base Parquet Dataset class. |
| `ParquetDataset` | Parquet Dataset class, to be analyzed via the AutoAnalyzer. |
| `AnalyzedParquetDataset` | Analyzed Parquet Dataset class, created from an existing analyzed schema. |
| `FittedParquetDataset` | Fitted Parquet Dataset class, created from an existing fitted schema. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `logger` | `Logger` | Module-level logger. |
`logger = logging.getLogger(__name__)`
## BaseParquetDataset

Base Parquet Dataset class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Dataset name. | *required* |
| `path` | `str` | The dataset parquet file path. | *required* |
| `storage_options` | `dict[str, Any]` | Optional storage options to stream data from a cloud storage instance. | `dict()` |
| `num_rows` | `int \| None` | Size of the parquet dataset. If `None`, it is inferred automatically before database insertion, which may take time on large files. | `None` |
Methods:

| Name | Description |
|---|---|
| `get_num_rows` | Get the number of rows in the parquet file. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | |
| `path` | `str` | |
| `storage_options` | `dict[str, object]` | |
| `num_rows` | `int \| None` | |

`name: str`

`path: str`

`storage_options: dict[str, object] = field(factory=dict)`

`num_rows: int | None = None`

### get_num_rows

`get_num_rows() -> int`

Get the number of rows in the parquet file.
Source code in src/xpdeep/dataset/parquet_dataset.py
## ParquetDataset

Parquet Dataset class, to be analyzed via the AutoAnalyzer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | | *required* |
| `path` | `str` | | *required* |
| `storage_options` | `dict[str, object]` | | `dict()` |
| `num_rows` | `int \| None` | | `None` |
Methods:

| Name | Description |
|---|---|
| `analyze` | Analyze the dataset and create an Analyzed Schema. |

### analyze

`analyze(*forced_type: ExplainableFeature | IndexMetadata, target_names: list[str] | None = None) -> AnalyzedParquetDataset`

Analyze the dataset and create an Analyzed Schema.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `forced_type` | `ExplainableFeature \| IndexMetadata` | Feature objects that force a custom feature type for specific column names in the Arrow Table. | `()` |
| `target_names` | `list[str] \| None` | Optional list of column names indicating which columns should be considered targets. | `None` |

Returns:

| Type | Description |
|---|---|
| `AnalyzedParquetDataset` | The analyzed dataset, a parquet dataset with an analyzed schema attached. |
Source code in src/xpdeep/dataset/parquet_dataset.py
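Conceptually, `analyze` infers a feature type per column, lets `forced_type` entries override the inference for the named columns, and flags the `target_names` columns as targets. The sketch below illustrates that merge logic only; the class name, inference rule, and schema layout are hypothetical stand-ins, not xpdeep's implementation:

```python
# Hypothetical sketch of the analyze() merge logic: inferred types,
# overridden by forced features, with target columns flagged.
from dataclasses import dataclass


@dataclass(frozen=True)
class ForcedFeature:
    """Stand-in for a forced feature type (e.g. an ExplainableFeature)."""
    column: str
    feature_type: str


def build_analyzed_schema(columns, *forced_type, target_names=None):
    overrides = {f.column: f.feature_type for f in forced_type}
    targets = set(target_names or [])
    schema = {}
    for name, values in columns.items():
        # naive inference rule, for illustration only
        inferred = "numerical" if all(isinstance(v, (int, float)) for v in values) else "categorical"
        schema[name] = {"type": overrides.get(name, inferred), "target": name in targets}
    return schema
```

Forcing a type wins over inference, mirroring how `forced_type` lets you correct the AutoAnalyzer for specific columns.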
## AnalyzedParquetDataset

Analyzed Parquet Dataset class, created from an existing analyzed schema.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | | *required* |
| `path` | `str` | | *required* |
| `storage_options` | `dict[str, object]` | | `dict()` |
| `num_rows` | `int \| None` | | `None` |
| `analyzed_schema` | `AnalyzedSchema` | | *required* |
Methods:

| Name | Description |
|---|---|
| `fit` | Create a Fitted Parquet Dataset object. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `analyzed_schema` | `AnalyzedSchema` | |

`analyzed_schema: AnalyzedSchema = field(kw_only=True)`

### fit

`fit() -> FittedParquetDataset`

Create a Fitted Parquet Dataset object.
Source code in src/xpdeep/dataset/parquet_dataset.py
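The `fit` step hands the analyzed schema over to a fitted dataset object. As a rough sketch of that handoff (the class names and fields below are hypothetical stand-ins, not xpdeep's real classes):

```python
# Hypothetical sketch of the analyze -> fit handoff: fit() returns a new
# dataset object carrying a fitted schema derived from the analyzed one.
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AnalyzedSketch:
    name: str
    analyzed_schema: dict[str, Any]

    def fit(self) -> "FittedSketch":
        # fitting freezes the analyzed schema into its fitted form
        return FittedSketch(name=self.name, fitted_schema=dict(self.analyzed_schema))


@dataclass(frozen=True)
class FittedSketch:
    name: str
    fitted_schema: dict[str, Any]
```

The fitted object copies rather than aliases the schema, so later edits to the analyzed schema do not silently change the fitted dataset.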
## FittedParquetDataset

Fitted Parquet Dataset class, created from an existing fitted schema.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | | *required* |
| `path` | `str` | | *required* |
| `storage_options` | `dict[str, object]` | | `dict()` |
| `num_rows` | `int \| None` | | `None` |
| `fitted_schema` | `FittedSchema` | | *required* |
| `size` | `int \| None` | | `None` |
Methods:

| Name | Description |
|---|---|
| `to_model` | Convert to ParquetDatasetArtifactInsert instance. |
| `artifact_id` | Get the artifact id, creating it if not set yet. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `fitted_schema` | `FittedSchema` | |
| `size` | `int \| None` | |

`fitted_schema: FittedSchema = field(kw_only=True)`

`size: int | None = None`

### to_model

`to_model() -> ParquetDatasetArtifactInsert`

Convert to ParquetDatasetArtifactInsert instance.

Source code in src/xpdeep/dataset/parquet_dataset.py

### artifact_id

`artifact_id() -> str`

Get the artifact id, creating it if not set yet.
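"Get artifact id if not set yet" reads like a create-on-first-access accessor. A common way to implement that pattern is a cached property; the sketch below is generic (the id source is an assumption), not xpdeep's code:

```python
# Generic create-on-first-access pattern, as suggested by the artifact_id
# docstring; not xpdeep's implementation.
import uuid


class ArtifactIdSketch:
    def __init__(self) -> None:
        self._artifact_id: str | None = None

    @property
    def artifact_id(self) -> str:
        if self._artifact_id is None:
            # created once on first access, then reused
            self._artifact_id = str(uuid.uuid4())
        return self._artifact_id
```

Every access after the first returns the same id, so callers can rely on the identifier being stable for the object's lifetime.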