# XrDataset
The `XrDataset` class in the `datasets.xr_dataset` module is tailored to sampling smaller datasets that fit in memory. It is intended for scenarios where direct, rapid in-memory manipulation of the data is feasible and preferable to the more complex, disk-based operations typical of very large datasets. This makes it ideal for machine learning applications in which the dataset size allows quick iterations and immediate feedback on preprocessing and modeling efforts.

The class provides tools to selectively process data chunks based on specified criteria and to apply data cleaning operations such as `NaN` handling and optional filtering based on specific variables. It efficiently concatenates the processed chunks into a single dataset for further analysis or model training.
## Constructor
```python
def __init__(self, ds: xr.Dataset, chunk_indices: List[int] = None, rand_chunk: bool = True, drop_nan: str = 'auto',
             apply_mask: bool = True, drop_sample: bool = False, fill_method: str = None, const: float = None,
             filter_var: str = 'filter_mask', patience: int = 500, block_size: List[Tuple[str, int]] = None,
             sample_size: List[Tuple[str, int]] = None, overlap: List[Tuple[str, float]] = None, callback: Callable = None,
             num_chunks: int = None, to_pred: Union[str, List[str]] = None, scale_fn: str = 'standardize'):
```
### Parameters
- `ds` (`xarray.Dataset`): The input dataset to extract the training samples from.
- `chunk_indices` (`List[int]`): List of indices specifying which chunks to process.
- `rand_chunk` (`bool`): If True, chunks are selected randomly; otherwise, they are processed sequentially.
- `drop_nan` (`str`): If not `None`, specifies the method for handling areas with missing values:
  - `auto`: Drops the entire sample if any part contains a `NaN`.
  - `if_all_nan`: Drops the sample only if it is entirely composed of `NaN`s.
  - `masked`: Drops valid regions (as defined by `filter_var`) containing any `NaN`s. Requires `filter_var` to be defined.
- `apply_mask` (`bool`): Whether to apply a filtering condition to the data chunks.
- `drop_sample` (`bool`): If True, completely drops any sample that does not contain relevant values according to the `filter_var` criteria.
- `fill_method` (`str`): Method used to fill masked or `NaN` data:
  - `None`: `NaN`s are not filled.
  - `mean`: `NaN`s are filled with the mean of the non-`NaN` values.
  - `sample_mean`: `NaN`s are filled with the sample mean value.
  - `noise`: `NaN`s are filled with random noise within the range of the non-`NaN` values.
  - `constant`: `NaN`s are filled with the specified constant value (`const`).
- `const` (`float`): Constant value used when `fill_method` is `'constant'`.
- `filter_var` (`str`): Name of the filter mask used for filtering data chunks.
- `patience` (`int`): Number of consecutive iterations without a valid chunk before stopping.
- `block_size` (`List[Tuple[str, int]]`): Size of the chunks to process, which affects memory usage and performance. If `None`, the chunk size of `ds` is used.
- `sample_size` (`List[Tuple[str, int]]`): List of tuples specifying the sizes of the resulting dataset's samples.
- `overlap` (`List[Tuple[str, float]]`): List of tuples specifying the overlap proportion of the resulting dataset's samples.
- `callback` (`Callable`): Optional function applied to each chunk after initial processing.
- `num_chunks` (`int`): Number of chunks to process if not processing all of them.
- `to_pred` (`Union[str, List[str]]`): Variable or list of variables to construct the dependent variable (the target to predict).
- `scale_fn` (`str`): Feature scaling function to apply (`standardize`, `normalize`, or `None`).
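As an aside, the fill strategies listed above can be made concrete with a small standalone NumPy sketch. The `fill_nans` helper below is hypothetical (not part of ml4xcube); it simply mirrors the documented behavior of the `mean`, `noise`, and `constant` options.

```python
import numpy as np

def fill_nans(arr: np.ndarray, method: str, const: float = None) -> np.ndarray:
    """Illustrative NaN filling, mirroring some documented fill_method options."""
    out = arr.copy()
    mask = np.isnan(out)
    valid = out[~mask]
    if method == 'mean':         # fill with the mean of the non-NaN values
        out[mask] = valid.mean()
    elif method == 'noise':      # random noise within the range of non-NaN values
        out[mask] = np.random.uniform(valid.min(), valid.max(), mask.sum())
    elif method == 'constant':   # fill with a user-supplied constant
        out[mask] = const
    return out

sample = np.array([1.0, np.nan, 3.0, np.nan])
print(fill_nans(sample, 'mean'))                 # NaNs replaced by 2.0
print(fill_nans(sample, 'constant', const=0.0))  # NaNs replaced by 0.0
```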
## get_datasets
Retrieves the fully processed dataset, ready for use in applications or further analysis. This method ensures that all preprocessing steps are applied and the data is returned in a manageable format.
```python
def get_datasets(self) -> Union[Dict[str, np.ndarray], Tuple[Tuple[np.ndarray, np.ndarray], Tuple[np.ndarray, np.ndarray]]]
```
### Parameters
None. This method takes no input parameters.
### Returns
`Union[Dict[str, np.ndarray], Tuple[Tuple[np.ndarray, np.ndarray], Tuple[np.ndarray, np.ndarray]]]`: Returns the processed dataset.

- If `to_pred` is `None`: a dictionary whose keys are variable names and whose values are concatenated NumPy arrays representing the dataset.
- If `to_pred` is provided: a tuple containing:
  - `(X_train, y_train)`: training features and targets.
  - `(X_test, y_test)`: testing features and targets.
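To visualize the two return forms, here is a minimal sketch using mock NumPy arrays rather than output actually produced by `XrDataset`; the shapes and the split point are illustrative assumptions.

```python
import numpy as np

# Form 1: to_pred is None -> dict mapping variable names to concatenated arrays
no_target = {
    'temperature':   np.random.rand(8, 10, 10),
    'precipitation': np.random.rand(8, 10, 10),
}

# Form 2: to_pred given -> ((X_train, y_train), (X_test, y_test))
X = np.random.rand(8, 10, 10)   # features, e.g. temperature samples
y = np.random.rand(8, 10, 10)   # targets, e.g. precipitation samples
split = 6                       # illustrative train/test boundary
train_data, test_data = (X[:split], y[:split]), (X[split:], y[split:])
X_train, y_train = train_data
print(X_train.shape)            # (6, 10, 10)
```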
## Example
```python
import numpy as np
import xarray as xr
from ml4xcube.datasets.xr_dataset import XrDataset

# Example dataset with chunking
data = np.random.rand(100, 200, 300)
dataset = xr.Dataset({
    'temperature': (('time', 'lat', 'lon'), data),
    'precipitation': (('time', 'lat', 'lon'), data)
}).chunk({'time': 10, 'lat': 50, 'lon': 50})

# Initialize the XrDataset
xr_dataset = XrDataset(
    ds          = dataset,
    rand_chunk  = True,
    drop_nan    = 'if_all_nan',
    fill_method = 'mean',
    sample_size = [('time', 1), ('lat', 10), ('lon', 10)],
    overlap     = [('time', 0.), ('lat', 0.8), ('lon', 0.8)],
    to_pred     = 'precipitation',
    num_chunks  = 3
)

# Retrieve the processed dataset
train_data, test_data = xr_dataset.get_datasets()
```
This example initializes the `XrDataset` class with a chunked dataset from which chunks are selected at random. `NaN` values are handled by filling them with the mean of the non-`NaN` values (`fill_method='mean'`), and specific sample sizes and overlaps define how the samples are cut; `precipitation` is predicted based on `temperature`. The processed train and test data are then retrieved, showcasing the class's capability to efficiently manage and preprocess smaller, in-memory datasets for machine learning models and other analytical tasks.
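One common way to read an overlap fraction is that the stride between consecutive samples along a dimension equals the sample size times `(1 - overlap)`. The arithmetic below illustrates that reading for the settings used in the example; it is an assumption about the semantics, not a reproduction of ml4xcube's internals.

```python
# Illustrative arithmetic only; this may differ from ml4xcube's implementation.
sample_size = {'time': 1, 'lat': 10, 'lon': 10}
overlap     = {'time': 0.0, 'lat': 0.8, 'lon': 0.8}

# stride = sample size scaled by the non-overlapping fraction (at least 1)
stride = {dim: max(1, round(sample_size[dim] * (1 - overlap[dim])))
          for dim in sample_size}
print(stride)  # {'time': 1, 'lat': 2, 'lon': 2}
```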
## Notes

- `XrDataset` is designed to obtain data samples from `num_chunks` unique chunks.
- If `num_chunks` is not provided, it is determined automatically, using `chunk_indices` if assigned or the total number of chunks in the dataset otherwise.
- The `patience` parameter is applied until `num_chunks` chunks containing valid data samples are found; it limits the number of attempts to find the next valid chunk before stopping.
- Sampling with `XrDataset` provides the following options:
  - Use specific chunks if `chunk_indices` is set.
  - Limit `num_chunks` to the total number of chunks if exceeded.
  - Default `num_chunks` to the total number of chunks if unspecified.
  - Ensure all `num_chunks` chunks contain valid, non-`NaN` data.