Chunking of datasets
How to save a data cube with a desired chunking
A DeepESDL example notebook
This notebook demonstrates how to modify the chunking of a dataset before persisting it.
Please also refer to the DeepESDL documentation and visit the platform's website for further information!
Brockmann Consult, 2025
This notebook runs with the Python environment deepesdl-xcube-1.9.1; please check out the documentation for help on changing the environment.
First, let's create a small cube which we can rechunk. We will use ESA CCI data for this. Please head over to "xcube datastores - Generate CCI data cubes" to get more details about the xcube-cci data store :)
import datetime
import os
from xcube.core.store import new_data_store
from xcube.core.chunk import chunk_dataset
store = new_data_store("ccizarr")
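If you are unsure which datasets the ccizarr store provides, you can explore it first. The optional snippet below is just a sketch using the generic xcube data store interface (list_data_ids and describe_data); it is not required for the rest of the notebook.
# Optional: explore which datasets the ccizarr store provides
data_ids = store.list_data_ids()
data_ids[:10]  # show the first ten data identifiers
# store.describe_data(data_ids[0])  # detailed metadata for a single dataset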
Next, we create a cube containing three years of data:
def open_zarrstore(filename, time_range, variables):
    # Open the dataset from the store, subset it in time, and keep only the requested variables
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]
    return subset
dataset = open_zarrstore(
"ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.zarr",
time_range=[datetime.datetime(2013, 10, 1), datetime.datetime(2016, 9, 30)],
variables=["analysed_sst"],
)
dataset
<xarray.Dataset> Size: 9GB
Dimensions:       (time: 1095, lat: 720, lon: 1440)
Coordinates:
  * lat           (lat) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * lon           (lon) float32 6kB -179.9 -179.6 -179.4 ... 179.4 179.6 179.9
  * time          (time) datetime64[ns] 9kB 2013-10-01T12:00:00 ... 2016-09-2...
Data variables:
    analysed_sst  (time, lat, lon) float64 9GB dask.array<chunksize=(4, 720, 720), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.4
    acknowledgment:         Funded by ESA
    cdm_data_type:          grid
    comment:                
    creator_email:          science.leader@esa-sst-cci.org
    creator_name:           SST_cci
    ...                     ...
    summary:                An ensemble product with input from a number ...
    time_coverage_end:      20170101T000000Z
    time_coverage_start:    20161231T000000Z
    title:                  Global SST Ensemble, L4 GMPE
    uuid:                   dc0c5b25-93bf-4943-aba1-7f0de9109620
    westernmost_longitude:  -180.0
dataset.analysed_sst.encoding
{'chunks': (16, 720, 720), 'preferred_chunks': {'time': 16, 'lat': 720, 'lon': 720}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': np.int16(-32768), 'scale_factor': 0.009999999776482582, 'add_offset': 273.1499938964844, 'dtype': dtype('int16')}
In the example above, we can see that the variable analysed_sst is chunked as (16, 720, 720): each chunk contains 16 time values, 720 lat values, and 720 lon values. Chunks that hold only one time value but many spatial values are optimal for visualising or plotting a single time stamp.
For analysing long time series, it is beneficial to chunk a dataset so that each chunk contains many values along the time dimension and fewer along the spatial dimensions.
# time optimised chunking - please note, this is just an example
time_chunksize = 1095
x_chunksize = 10 # or lon
y_chunksize = 10 # or lat
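To get a feeling for how large such chunks are, a quick back-of-the-envelope calculation helps. The variable is stored as int16 on disk (2 bytes per value, see the encoding above), so we can estimate the uncompressed size of one chunk for both layouts. This is only an illustrative calculation, not part of the rechunking workflow.
# Approximate uncompressed chunk sizes, assuming 2 bytes per value (int16 on disk)
bytes_per_value = 2
original_chunk_mb = 16 * 720 * 720 * bytes_per_value / 1e6
time_optimised_chunk_mb = time_chunksize * y_chunksize * x_chunksize * bytes_per_value / 1e6
print(f"original chunk: ~{original_chunk_mb:.1f} MB, time-optimised chunk: ~{time_optimised_chunk_mb:.2f} MB")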
Rechunk the dataset to the desired chunking using xcube's chunk_dataset function.
rechunked_ds = chunk_dataset(
    dataset,
    {"time": time_chunksize, "lat": y_chunksize, "lon": x_chunksize},
    format_name="zarr",
    data_vars_only=True,
)
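Before writing, it can be worth checking that the in-memory (dask) chunking matches the intended layout. The optional check below only uses xarray's standard chunks property.
# Optional sanity check: inspect the dask chunking of the rechunked variable;
# we expect 1095 along time and 10 along lat and lon
rechunked_ds.analysed_sst.chunks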
Save the rechunked dataset to the team's S3 storage.
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
team_store = new_data_store(
"s3",
root=S3_USER_STORAGE_BUCKET,
storage_options=dict(
anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
),
)
team_store.list_data_ids()
['ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked.zarr', 'LC-1x720x1440-0.25deg-2.0.0-v1.zarr', 'LC-1x720x1440-0.25deg-2.0.0-v2.zarr', 'SST.levels', 'SeasFireCube-8D-0.25deg-1x720x1440-3.0.0.zarr', 'amazonas_v8.zarr', 'amazonas_v9.zarr', 'analysed_sst.zarr', 'analysed_sst_2.zarr', 'analysed_sst_3.zarr', 'analysed_sst_4.zarr', 'esa-cci-permafrost-1x1151x1641-0.1.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.4.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.5.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.6.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.7.0.zarr', 'esa-cci-permafrost-1x1151x1641-0.8.0.zarr', 'esa-cci-permafrost-1x1151x1641-1.0.0.zarr', 'esa_gda-health_pakistan_ERA5_precipitation_and_temperature_testdata.zarr', 'noise_trajectory.zarr']
output_id = "ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked_AB.zarr"
team_store.write_data(rechunked_ds, output_id, replace=True)
'ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked_AB.zarr'
ds_re = team_store.open_data(output_id)
ds_re
<xarray.Dataset> Size: 9GB
Dimensions:       (time: 1095, lat: 720, lon: 1440)
Coordinates:
  * lat           (lat) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * lon           (lon) float32 6kB -179.9 -179.6 -179.4 ... 179.4 179.6 179.9
  * time          (time) datetime64[ns] 9kB 2013-10-01T12:00:00 ... 2016-09-2...
Data variables:
    analysed_sst  (time, lat, lon) float64 9GB dask.array<chunksize=(1095, 10, 10), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.4
    acknowledgment:         Funded by ESA
    cdm_data_type:          grid
    comment:                
    creator_email:          science.leader@esa-sst-cci.org
    creator_name:           SST_cci
    ...                     ...
    summary:                An ensemble product with input from a number ...
    time_coverage_end:      20170101T000000Z
    time_coverage_start:    20161231T000000Z
    title:                  Global SST Ensemble, L4 GMPE
    uuid:                   dc0c5b25-93bf-4943-aba1-7f0de9109620
    westernmost_longitude:  -180.0
Let's have a look at the chunking of the variable analysed_sst now: (1095, 10, 10). This means each chunk contains 1095 time values, 10 lat values, and 10 lon values, which corresponds to the chunking we defined and passed to xcube's chunk_dataset.
ds_re.analysed_sst.encoding
{'chunks': (1095, 10, 10), 'preferred_chunks': {'time': 1095, 'lat': 10, 'lon': 10}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': np.int16(-32768), 'scale_factor': 0.009999999776482582, 'add_offset': 273.1499938964844, 'dtype': dtype('int16')}
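With this layout, extracting the full time series for a single location touches only one chunk, which is exactly the access pattern the rechunking targets. The snippet below sketches such an extraction; the coordinates are arbitrary placeholders chosen for illustration.
# Extract the full time series for one (arbitrary) location; with
# (1095, 10, 10) chunks this reads only a single chunk from storage
point_ts = ds_re.analysed_sst.sel(lat=0.125, lon=0.125, method="nearest")
point_ts.plot()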
# Clean up test dataset
team_store.delete_data(output_id)