Standardize
standardize#
def standardize(ds: Union[xr.Dataset, Dict[str, np.ndarray]], stats_dict: Dict[str, List[float]], filter_var: str = None) -> Union[xr.Dataset, Dict[str, np.ndarray]]
Description#
The standardize function standardizes every variable in an xarray.Dataset or a dictionary of NumPy arrays
that has an entry in the stats_dict dictionary. This dictionary provides the mean and standard deviation of each variable.
Standardization shifts and scales each data point
so that the resulting distribution of each variable has a mean of 0 and a standard deviation of 1. Many statistical
analyses and machine learning models depend on this: features end up on comparable scales, so variables with large
magnitudes do not dominate the model. The variable named by filter_var is excluded from standardization, which
is useful for non-data variables such as masks or indices.
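As a minimal sketch of the arithmetic described above (not the library's own code), the snippet below standardizes a single NumPy array by hand. The sample data, the seed, and the variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical sample data; the location and scale are arbitrary.
rng = np.random.default_rng(0)
x = rng.normal(loc=15.0, scale=4.0, size=1000)

mean = x.mean()
std = x.std()  # population standard deviation (ddof=0)

# Standardize: subtract the mean, then divide by the standard deviation.
z = (x - mean) / std

print(f"before: mean={mean:.3f}, std={std:.3f}")
print(f"after:  mean={z.mean():.3f}, std={z.std():.3f}")
```

After the transformation the mean is 0 and the standard deviation is 1, up to floating-point rounding.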
Parameters#
- ds (Union[xarray.Dataset, Dict[str, numpy.ndarray]]): The dataset to standardize. It can be either an xarray.Dataset or a dictionary where keys are variable names and values are NumPy arrays.
- stats_dict (Dict[str, List[float]]): A dictionary containing the mean and standard deviation for each variable that requires standardization.
- filter_var (str, optional): The name of a variable to exclude from standardization. Defaults to None. This is useful for excluding non-data variables like mask or index fields.
Returns#
Union[xarray.Dataset, Dict[str, numpy.ndarray]]: The standardized dataset. The return type matches the input: an xarray.Dataset if one was provided, otherwise a dictionary.
Example#
import numpy as np
import xarray as xr
from preprocessing import get_statistics, standardize

# Creating an example dataset
ds = xr.Dataset({
    'temperature': (('time', 'lat', 'lon'), np.random.rand(10, 20, 30)),
    'humidity': (('time', 'lat', 'lon'), np.random.rand(10, 20, 30)),
    'land_mask': (('time', 'lat', 'lon'), np.random.randint(0, 2, size=(10, 20, 30)))
})

# Calculate statistics for 'temperature' and 'humidity'
stats = get_statistics(ds, exclude_vars=['land_mask'])
print("Statistics calculated:", stats)

# Standardize the dataset, excluding 'land_mask' from standardization
standardized_ds = standardize(ds, stats, filter_var='land_mask')
print("Standardized Dataset:")
for var in ['temperature', 'humidity']:
    print(f"Standardized {var}: mean={np.mean(standardized_ds[var].values)}, std={np.std(standardized_ds[var].values)}")
This example standardizes the temperature and humidity variables and leaves land_mask unchanged.
Notes#
- Standardization is carried out by subtracting the mean and dividing by the standard deviation. If the standard deviation is zero (indicating no variability within the variable), the variable values are reduced by the mean alone since division by zero is not feasible.
- This function supports excluding specific variables from the standardization process, which is especially useful for preserving the integrity of certain types of data like binary masks or categorical indices.
- The function handles both xarray.Dataset and dictionary inputs, so it fits different data-handling contexts in scientific computing and machine learning.
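The behaviour described in these notes can be sketched for the dictionary case as follows. This is an illustrative approximation, not the library's implementation: standardize_sketch is a hypothetical name, and the [mean, std] ordering inside each stats_dict entry is an assumption, since the documentation does not state the order.

```python
from typing import Dict, List, Optional

import numpy as np


def standardize_sketch(
    data: Dict[str, np.ndarray],
    stats_dict: Dict[str, List[float]],
    filter_var: Optional[str] = None,
) -> Dict[str, np.ndarray]:
    """Hypothetical sketch of the documented behaviour for dictionary input."""
    out = {}
    for name, values in data.items():
        # Skip the filtered variable and anything without statistics.
        if name == filter_var or name not in stats_dict:
            out[name] = values
            continue
        mean, std = stats_dict[name]  # assumed order: [mean, std]
        centered = values - mean
        # With zero standard deviation, only the mean is subtracted.
        out[name] = centered if std == 0 else centered / std
    return out


data = {
    "temperature": np.array([10.0, 20.0, 30.0]),
    "constant": np.array([5.0, 5.0, 5.0]),  # no variability, std == 0
    "land_mask": np.array([0, 1, 1]),
}
stats = {
    name: [float(arr.mean()), float(arr.std())]
    for name, arr in data.items()
    if name != "land_mask"
}

result = standardize_sketch(data, stats, filter_var="land_mask")
for name, arr in result.items():
    print(name, arr)
```

In this sketch the constant variable is only centered, so it becomes all zeros, while land_mask passes through untouched.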