LargeScaleXrDataset#
LargeScaleXrDataset is specifically designed to integrate xarray.Dataset objects with TensorFlow machine learning workflows.
It efficiently processes and streams data chunks for training or inference, making it particularly well suited for handling extensive datasets that exceed memory capacity. The class enables dynamic data preprocessing and chunk-based iteration, facilitating performance optimization and seamless integration into TensorFlow pipelines.
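A minimal sketch of this streaming flow is shown below; the Zarr store name and the import path of LargeScaleXrDataset are assumptions for illustration:
import xarray as xr
from ml4xcube.datasets.tensorflow import LargeScaleXrDataset  # assumed import path

# Open a chunked dataset lazily; 'my_cube.zarr' is a hypothetical store
my_xarray_dataset = xr.open_zarr('my_cube.zarr')

# Stream a limited number of chunks instead of loading everything into memory
large_ds = LargeScaleXrDataset(ds=my_xarray_dataset, num_chunks=10)
tf_dataset = large_ds.get_dataset()  # tf.data.Dataset yielding one chunk per element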
Constructor#
def __init__(self, ds: xr.Dataset, rand_chunk: bool = True, drop_nan: str = 'auto', chunk_indices: list = None,
apply_mask: bool = True, drop_sample: bool = False, fill_method: str = None, const: float = None,
filter_var: str = 'filter_mask', num_chunks: int = None, callback_fn = None,
block_size: List[Tuple[str, int]] = None, sample_size: List[Tuple[str, int]] = None,
overlap: List[Tuple[str, float]] = None, process_chunks: bool = False):
Parameters#
- ds (xarray.Dataset): The dataset from which data chunks are processed.
- rand_chunk (bool): If True, chunks are selected randomly; otherwise, they are processed sequentially.
- drop_nan (str): If not None, specifies the method for handling areas with missing values:
  - 'auto': Drops the entire sample if any part contains a NaN.
  - 'if_all_nan': Drops the sample only if it is entirely composed of NaNs.
  - 'masked': Drops valid regions (as defined by filter_var) containing any NaNs. Requires filter_var to be defined.
- chunk_indices (List[int]): Specifies indices of chunks to be processed if not randomly selected.
- apply_mask (bool): Whether to apply a filtering condition defined by the filter_var to the data chunks.
- drop_sample (bool): If True, completely drops any sample that does not contain relevant values according to the filter_var criteria.
- fill_method (str): Method used to fill masked or NaN data (illustrated in the sketch after this list):
  - None: NaNs are not filled.
  - 'mean': NaNs are filled with the mean of the non-NaN values.
  - 'sample_mean': NaNs are filled with the sample mean value.
  - 'noise': NaNs are filled with random noise within the range of the non-NaN values.
  - 'constant': NaNs are filled with the specified constant value (const).
- const (float): Constant value used when fill_method is 'constant'.
- filter_var (str): Name of the filter mask used for filtering data chunks.
- num_chunks (int): Specifies the number of chunks to process if not all chunks are processed.
- callback_fn: Optional function applied to each chunk after initial processing.
- block_size (List[Tuple[str, int]]): Size of the chunks to process, which affects memory usage and performance. If None, the chunk size of ds is used.
- sample_size (List[Tuple[str, int]]): List of tuples specifying the dimensions and their respective sizes.
- overlap (List[Tuple[str, float]]): List of tuples specifying the dimensions and their respective overlap proportion.
- process_chunks (bool): Whether to preprocess each chunk before returning it.
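To make the interplay of the masking and filling options concrete, the following hedged sketch shows two NaN-handling configurations; it assumes ds is a chunked xarray.Dataset containing a boolean mask variable named 'land_mask':
# Strictest option: drop every sample that contains any NaN
strict_ds = LargeScaleXrDataset(ds=ds, drop_nan='auto')

# Keep samples instead, filling NaNs with a constant value
filled_ds = LargeScaleXrDataset(
    ds=ds,
    fill_method='constant',  # fill NaNs ...
    const=0.0,               # ... with this value
    apply_mask=True,
    filter_var='land_mask'   # mask variable assumed to exist in ds
)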
len#
Returns the number of chunks to be processed.
def __len__(self) -> int
Parameters#
None: This method has no input parameters.
Returns#
int: The number of chunks.
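For example, the chunk count can drive simple progress reporting (dataset_processor refers to the instance created in the example below):
n_chunks = len(dataset_processor)
print(f"Streaming {n_chunks} chunks")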
get_dataset#
Creates a TensorFlow dataset from the generator.
def get_dataset(self) -> tf.data.Dataset
Parameters#
None: This method has no input parameters.
Returns#
tf.data.Dataset: TensorFlow dataset object, yielding chunks of data.
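Because the returned object is a regular tf.data.Dataset, standard pipeline transformations can be chained onto it. A small sketch, assuming dataset_processor is the instance from the example below; note that each element is already a chunk-sized batch of samples:
import tensorflow as tf

tf_dataset = dataset_processor.get_dataset()
tf_dataset = tf_dataset.prefetch(tf.data.AUTOTUNE)  # overlap data loading with training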
Example#
# Create an instance of LargeScaleXrDataset
dataset_processor = LargeScaleXrDataset(
    ds=my_xarray_dataset,
    rand_chunk=True,
    drop_nan='auto',
    apply_mask=True,
    filter_var='land_mask',
    sample_size=[('time', 24)],
    overlap=[('time', 1)]
)

# Iterate over the tf.data.Dataset to feed data into a TensorFlow model
for data_chunk in dataset_processor.get_dataset():
    # Assuming 'model' is an instance of a TensorFlow model;
    # 'input_data' is a placeholder for a variable contained in the chunk
    predictions = model.predict(data_chunk['input_data'])
    print("Processed predictions:", predictions)
This example demonstrates how to initialize the LargeScaleXrDataset class and use its get_dataset method to continuously supply data to a TensorFlow model.
The class is intended for applications where large datasets are common and chunks must be loaded into memory and processed successively.
Notes#
- If instances of LargeScaleXrDataset are handed over to the training.tensorflow.Trainer, every processed chunk must contain valid data.
- If the validity of data samples cannot be guaranteed after preprocessing a chunk, prepare an appropriate dataset using the XrDataset or the MultiProcSampler.
- Samples obtained from a chunk serve as a batch of data. If a consistent batch size is required, leverage the [XrDataset](xr-dataset.md) or the MultiProcSampler to prepare data accordingly.