Chunking of datasets
How to save a data cube with a desired chunking
A DeepESDL example notebook
This notebook demonstrates how to modify the chunking of a dataset before persisting it.
Please also refer to the DeepESDL documentation and visit the platform's website for further information!
Brockmann Consult, 2025
This notebook runs with the Python environment deepesdl-xcube-1.9.1; please check out the documentation for help on changing the environment.
First, let's create a small cube which we can rechunk. We will use ESA CCI data for this. Please head over to "xcube datastores - Generate CCI data cubes" to get more details about the xcube-cci data store :)
import datetime
import os
from xcube.core.store import new_data_store
from xcube.core.chunk import chunk_dataset
store = new_data_store("ccizarr")
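If you are unsure which datasets the ccizarr store provides, you can explore it first. The optional snippet below is just a sketch using the generic xcube data store interface (list_data_ids and describe_data); it is not required for the rest of the notebook.
# Optional: explore which datasets the ccizarr store provides
data_ids = store.list_data_ids()
data_ids[:10]  # show the first ten data identifiers
# store.describe_data(data_ids[0])  # detailed metadata for a single dataset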
Next, we create a cube containing three years of data:
def open_zarrstore(filename, time_range, variables):
    # Open the dataset from the store, subset it in time, and keep only the requested variables
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]
    return subset
dataset = open_zarrstore(
"ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.zarr",
time_range=[datetime.datetime(2013, 10, 1), datetime.datetime(2016, 9, 30)],
variables=["analysed_sst"],
)
dataset
<xarray.Dataset> Size: 9GB
Dimensions:       (time: 1095, lat: 720, lon: 1440)
Coordinates:
  * lat           (lat) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * lon           (lon) float32 6kB -179.9 -179.6 -179.4 ... 179.4 179.6 179.9
  * time          (time) datetime64[ns] 9kB 2013-10-01T12:00:00 ... 2016-09-2...
Data variables:
    analysed_sst  (time, lat, lon) float64 9GB dask.array<chunksize=(4, 720, 720), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.4
    acknowledgment:         Funded by ESA
    cdm_data_type:          grid
    comment:                
    creator_email:          science.leader@esa-sst-cci.org
    creator_name:           SST_cci
    ...                     ...
    summary:                An ensemble product with input from a number ...
    time_coverage_end:      20170101T000000Z
    time_coverage_start:    20161231T000000Z
    title:                  Global SST Ensemble, L4 GMPE
    uuid:                   dc0c5b25-93bf-4943-aba1-7f0de9109620
    westernmost_longitude:  -180.0
dataset.analysed_sst.encoding
{'chunks': (16, 720, 720), 'preferred_chunks': {'time': 16, 'lat': 720, 'lon': 720}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': np.int16(-32768), 'scale_factor': 0.009999999776482582, 'add_offset': 273.1499938964844, 'dtype': dtype('int16')}
In the example above, we can see that the variable analysed_sst is chunked as (16, 720, 720): each chunk contains 16 time values, 720 lat values, and 720 lon values. Chunks that hold only one time value but many spatial values are optimal for visualising or plotting a single time stamp.
For analysing long time series, it is beneficial to chunk a dataset so that each chunk contains many values along the time dimension and fewer along the spatial dimensions.
# time optimised chunking - please note, this is just an example
time_chunksize = 1095
x_chunksize = 10 # or lon
y_chunksize = 10 # or lat
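To get a feeling for how large such chunks are, a quick back-of-the-envelope calculation helps. The variable is stored as int16 on disk (2 bytes per value, see the encoding above), so we can estimate the uncompressed size of one chunk for both layouts. This is only an illustrative calculation, not part of the rechunking workflow.
# Approximate uncompressed chunk sizes, assuming 2 bytes per value (int16 on disk)
bytes_per_value = 2
original_chunk_mb = 16 * 720 * 720 * bytes_per_value / 1e6
time_optimised_chunk_mb = time_chunksize * y_chunksize * x_chunksize * bytes_per_value / 1e6
print(f"original chunk: ~{original_chunk_mb:.1f} MB, time-optimised chunk: ~{time_optimised_chunk_mb:.2f} MB")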
Rechunk the dataset to the desired chunking using xcube's chunk_dataset function.
rechunked_ds = chunk_dataset(
    dataset,
    {"time": time_chunksize, "lat": y_chunksize, "lon": x_chunksize},
    format_name="zarr",
    data_vars_only=True,
)
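Before writing, it can be worth checking that the in-memory (dask) chunking matches the intended layout. The optional check below only uses xarray's standard chunks property.
# Optional sanity check: inspect the dask chunking of the rechunked variable;
# we expect 1095 along time and 10 along lat and lon
rechunked_ds.analysed_sst.chunks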
Save the rechunked dataset to the team's S3 storage.
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
team_store = new_data_store(
"s3",
root=S3_USER_STORAGE_BUCKET,
storage_options=dict(
anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
),
)
team_store.list_data_ids()
['ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked.zarr', 'LC-1x720x1440-0.25deg-2.0.0-v1.zarr', 'LC-1x720x1440-0.25deg-2.0.0-v2.zarr', 'SST.levels', 'SeasFireCube-8D-0.25deg-1x720x1440-3.0.0.zarr', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'analysed_sst.zarr', 'analysed_sst_2.zarr', 'analysed_sst_3.zarr', 'analysed_sst_4.zarr', 'esa-cci-permafrost-1x1151x1641-0.1.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.4.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.5.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.6.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.7.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.8.0.zarr', 'esa-cci-permafrost-1x1151x1641-1.0.0.zarr', 'esa_gda-health_pakistan_ERA5_precipitation_and_temperature_testdata.zarr', 'noise_trajectory.zarr']
output_id = "ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked_AB.zarr"
team_store.write_data(rechunked_ds, output_id, replace=True)
'ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked_AB.zarr'
ds_re = team_store.open_data(output_id)
ds_re
<xarray.Dataset> Size: 9GB
Dimensions:       (time: 1095, lat: 720, lon: 1440)
Coordinates:
  * lat           (lat) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * lon           (lon) float32 6kB -179.9 -179.6 -179.4 ... 179.4 179.6 179.9
  * time          (time) datetime64[ns] 9kB 2013-10-01T12:00:00 ... 2016-09-2...
Data variables:
    analysed_sst  (time, lat, lon) float64 9GB dask.array<chunksize=(1095, 10, 10), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.4
    acknowledgment:         Funded by ESA
    cdm_data_type:          grid
    comment:                
    creator_email:          science.leader@esa-sst-cci.org
    creator_name:           SST_cci
    ...                     ...
    summary:                An ensemble product with input from a number ...
    time_coverage_end:      20170101T000000Z
    time_coverage_start:    20161231T000000Z
    title:                  Global SST Ensemble, L4 GMPE
    uuid:                   dc0c5b25-93bf-4943-aba1-7f0de9109620
    westernmost_longitude:  -180.0
Let's have a look at the chunking of the variable analysed_sst now: (1095, 10, 10). This means each chunk contains 1095 time values, 10 lat values, and 10 lon values, which corresponds to the chunking we defined and passed to xcube's chunk_dataset.
ds_re.analysed_sst.encoding
{'chunks': (1095, 10, 10), 'preferred_chunks': {'time': 1095, 'lat': 10, 'lon': 10}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': np.int16(-32768), 'scale_factor': 0.009999999776482582, 'add_offset': 273.1499938964844, 'dtype': dtype('int16')}
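With this layout, extracting the full time series for a single location touches only one chunk, which is exactly the access pattern the rechunking targets. The snippet below sketches such an extraction; the coordinates are arbitrary placeholders chosen for illustration.
# Extract the full time series for one (arbitrary) location; with
# (1095, 10, 10) chunks this reads only a single chunk from storage
point_ts = ds_re.analysed_sst.sel(lat=0.125, lon=0.125, method="nearest")
point_ts.plot()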
# Clean up test dataset
team_store.delete_data(output_id)