Introduction to xcube's "zenodo" data store
This notebook shows an example of how to access a TIF and a NetCDF file published on https://zenodo.org.
Furthermore, it contains an example of how to preload Zarr files published in compressed form on https://zenodo.org. The compressed files are downloaded and unpacked, and the Zarr files are then made available so that they can subsequently be used by the data store as usual.
For more examples, e.g. how to preload a zip or a nested zip file, please head over to the examples in the xcube-zenodo GitHub repository: https://github.com/xcube-dev/xcube-zenodo/tree/main/examples
Please, also refer to the DeepESDL documentation and visit the platform's website for further information!
Brockmann Consult, 2025
This notebook runs with the Python environment deepesdl-xcube-1.9.1; please check out the documentation for help on changing the environment.
# mandatory imports
from xcube.core.store import new_data_store
from xcube.core.store import get_data_store_params_schema
First, we get the store parameters needed to initialize a zenodo data store.
store_params = get_data_store_params_schema("zenodo")
store_params
<xcube.util.jsonschema.JsonObjectSchema at 0x7f747403f770>
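The returned JsonObjectSchema describes the available store parameters. To inspect its contents rather than just the object repr, it can be converted to a dictionary (a minimal sketch, assuming xcube's standard JsonSchema interface):
store_params.to_dict()  # view the parameter schema as a plain dictionary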
Example of lazy access of a TIF file
We initiate a zenodo data store. Note that the xcube-zenodo plugin is recognized after installation by setting the first argument to "zenodo" in the new_data_store function. To specify the data source, we set the root parameter to the record ID, which can be found in the URL of the corresponding Zenodo publication page. Let's have a look at the data published with the title "Canopy height and biomass map for Europe": https://zenodo.org/records/8154445 with the record ID root="8154445".
%%time
store = new_data_store("zenodo", root="8154445")
CPU times: user 5.6 ms, sys: 0 ns, total: 5.6 ms Wall time: 5.49 ms
The data IDs, which correspond to the filenames in the record's file section, can be streamed by executing the following cell.
%%time
store.list_data_ids()
CPU times: user 15.8 ms, sys: 4.82 ms, total: 20.6 ms Wall time: 480 ms
['planet_canopy_cover_30m_v0.1.tif', 'planet_agb_30m_v0.1.tif', 'planet_canopy_height_30m_v0.1.tif']
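If you already know a filename, you can also check its availability directly (a sketch, assuming the standard xcube data store interface):
store.has_data("planet_agb_30m_v0.1.tif")  # True if the record contains this file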
We can describe the dataset using the describe_data method, as shown below.
store.describe_data("planet_canopy_cover_30m_v0.1.tif")
<xcube.core.store.descriptor.DatasetDescriptor at 0x7f7474002cf0>
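The returned DatasetDescriptor summarizes the dataset's dimensions, variables, and attributes. To view its contents rather than just the object repr, it can be converted to a dictionary (a minimal sketch, assuming xcube's standard descriptor interface):
# view the descriptor's contents as a plain dictionary
store.describe_data("planet_canopy_cover_30m_v0.1.tif").to_dict()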
Next, we can open the data. We first view the available opening parameters, which can be passed to the open_data method in the subsequent cell.
%%time
open_params = store.get_open_data_params_schema(data_id="planet_canopy_cover_30m_v0.1.tif")
open_params
CPU times: user 177 μs, sys: 64 μs, total: 241 μs Wall time: 245 μs
<xcube.util.jsonschema.JsonObjectSchema at 0x7f74740d7a10>
%%time
ds = store.open_data(
    "planet_canopy_cover_30m_v0.1.tif",
    tile_size=(1024, 1024),
)
ds
CPU times: user 45.3 ms, sys: 8.33 ms, total: 53.7 ms Wall time: 52.3 ms
<xarray.Dataset> Size: 25GB
Dimensions:      (x: 170397, y: 149363)
Coordinates:
  * x            (x) float64 1MB 2.555e+06 2.555e+06 ... 7.667e+06 7.667e+06
  * y            (y) float64 1MB 5.82e+06 5.82e+06 ... 1.339e+06 1.339e+06
    spatial_ref  int64 8B 0
Data variables:
    band_1       (y, x) uint8 25GB dask.array<chunksize=(1024, 1024), meta=np.ndarray>
Attributes:
    source:   https://zenodo.org/records/8154445/files/planet_canopy_cover_30...
We plot parts of the opened data as an example below. The data shows the canopy cover fraction within a range of [0, 100].
%%time
ds.band_1[100000:102000, 100000:102000].plot(vmin=0, vmax=100)
CPU times: user 734 ms, sys: 83.5 ms, total: 818 ms Wall time: 2.78 s
<matplotlib.collections.QuadMesh at 0x7f74602e7e00>
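Since the data is opened lazily as a chunked dask array, reductions over subsets only download and process the chunks involved. For example (a sketch):
# compute the mean canopy cover of the plotted window; only the
# required chunks are fetched when the value is evaluated
ds.band_1[100000:102000, 100000:102000].mean().compute()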
We can also open a TIFF as an xcube multi-resolution dataset, where we can select the resolution level. The opened dataset, however, is not cloud-optimized and thus consists of only one level.
%%time
mlds = store.open_data(
    "planet_canopy_cover_30m_v0.1.tif",
    tile_size=(1024, 1024),
    data_type="mldataset"
)
mlds.num_levels
CPU times: user 15.9 ms, sys: 234 μs, total: 16.2 ms Wall time: 326 ms
1
%%time
ds = mlds.get_dataset(0)
ds
CPU times: user 56.2 ms, sys: 0 ns, total: 56.2 ms Wall time: 269 ms
<xarray.Dataset> Size: 25GB
Dimensions:      (x: 170397, y: 149363)
Coordinates:
  * x            (x) float64 1MB 2.555e+06 2.555e+06 ... 7.667e+06 7.667e+06
  * y            (y) float64 1MB 5.82e+06 5.82e+06 ... 1.339e+06 1.339e+06
    spatial_ref  int64 8B 0
Data variables:
    band_1       (y, x) uint8 25GB dask.array<chunksize=(1024, 1024), meta=np.ndarray>
Attributes:
    source:   https://zenodo.org/records/8154445/files/planet_canopy_cover_30...
Example of lazy access of a NetCDF file
We can also use the zenodo data store to open NetCDF files. To do this, we need to initiate a new data store with the corresponding record ID. In the example below we use "Atlas of Tides, North West European Shelf, from NEMO tide and surge model.": https://zenodo.org/records/13882297 with the record ID root="13882297".
%%time
store = new_data_store("zenodo", root="13882297")
CPU times: user 4.92 ms, sys: 171 μs, total: 5.09 ms Wall time: 5.02 ms
We can list all available data IDs again by executing the following cell.
%%time
store.list_data_ids()
CPU times: user 21 ms, sys: 647 μs, total: 21.6 ms Wall time: 543 ms
['gridded_constituents_tideonly.nc', 'gridded_constituents_ERA5weather.nc', 'gridded_tidestats_ERA5weather.nc', 'gridded_tidestats_tideonly.nc']
Next, we open a dataset. Note that if chunks is given, the dataset is loaded lazily as a chunked xr.Dataset.
%%time
ds = store.open_data(
    "gridded_tidestats_ERA5weather.nc",
    chunks={}
)
ds
CPU times: user 73 ms, sys: 29.5 ms, total: 103 ms Wall time: 6.44 s
<xarray.Dataset> Size: 15MB
Dimensions:  (y: 375, x: 297)
Dimensions without coordinates: y, x
Data variables: (12/17)
    nav_lon  (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    nav_lat  (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    z0       (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    HAT      (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    LAT      (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    MHW      (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    ...       ...
    MHHW     (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    MLLW     (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    RangeAT  (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    MSRange  (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    MRange   (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
    MNRange  (y, x) float64 891kB dask.array<chunksize=(375, 297), meta=np.ndarray>
Attributes:
    Author:       Joanne Williams, joll@noc.ac.uk
    Institute:    National Oceanography Centre
    Title:        Tidal constituents for NOCtide from model run
    Modelrun:     ERA5weather
    Modelconfig:  newfriction
    TimeStamp:    11-Jun-2024 13:01:44
    Notes:        Statistics based on ERA5 hindcast run from 1980 to 2022. \n...
We plot the Mean Low Water (MLW) data as an example.
%%time
ds.MLW.plot()
CPU times: user 37.4 ms, sys: 1.01 ms, total: 38.4 ms Wall time: 300 ms
<matplotlib.collections.QuadMesh at 0x7f74430d4410>
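If you want to keep a local copy of the opened dataset, you can write it with a standard "file" data store (a minimal sketch; the target path and data ID are illustrative):
# persist the opened dataset as a local Zarr for faster re-use
local_store = new_data_store("file", root=".")
local_store.write_data(ds, "tidestats_ERA5weather.zarr", replace=True)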
Example of preload access of a zipped Zarr file
We can also use the zenodo data store to access zipped Zarr files from Zenodo. The compressed files are downloaded and unpacked, and the Zarr files are then made available so that they can subsequently be used by the data store as usual. The data is downloaded into a directory called "zenodo_cache" in your current working directory.
We initiate a zenodo data store for the "Dheed : a global database of dry and hot extreme events" record https://zenodo.org/records/11546130 with the record ID root="11546130". Note that the xcube-zenodo plugin is recognized after installation by setting the first argument to "zenodo" in the new_data_store function. We can optionally specify the cache data store's ID and parameters using the cache_store_id and cache_store_params keyword arguments. By default, cache_store_id is set to "file", and cache_store_params defaults to dict(root="zenodo_cache/11546130", max_depth=10).
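For example, the cache location can be changed when creating the store (a sketch; the directory name is illustrative):
# initiate a store that caches preloaded data in a custom directory
custom_store = new_data_store(
    "zenodo",
    root="11546130",
    cache_store_id="file",
    cache_store_params=dict(root="my_zenodo_cache"),
)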
%%time
store = new_data_store("zenodo", root="11546130")
CPU times: user 5.43 ms, sys: 0 ns, total: 5.43 ms Wall time: 5.32 ms
Compressed files can be preloaded using the preload_data method. This approach enables downloading compressed files that cannot be lazily loaded, so that they are stored and readily available for the duration of the project. The method also accepts preload_params, which can be viewed in the next cell.
%%time
preload_params = store.get_preload_data_params()
preload_params
CPU times: user 34 μs, sys: 7 μs, total: 41 μs Wall time: 42.9 μs
<xcube.util.jsonschema.JsonObjectSchema at 0x7f7441d19710>
The preload_data method returns a store which may be used subsequently to access the preloaded data, as shown in the following cells. If no data IDs are given, all available data in compressed format will be preloaded. Note that the preload_data method is new and highly experimental.
Please note: if you see an "Error displaying widget: model not found", don't worry - we are trying to find a solution, but the cell will still execute. Please also note that the dataset below takes several minutes to load.
cache_store = store.preload_data(
    "EventCube_ranked_pot0.01_ne0.1.zarr.zip",
    "mergedlabels.zarr.zip"
)
VBox(children=(HTML(value='<table>\n<thead>\n<tr><th>Data ID </th><th>Status <…
The data IDs can be viewed by listing the data IDs of the cache store, which is returned by the preload_data method. Each new data ID is identical to the original, except that the .zip extension indicating a compressed format has been removed.
cache_store.list_data_ids()
['mergedlabels.zarr', 'EventCube_ranked_pot0.01_ne0.1.zarr']
Next, we want to open one of the datasets. We first view the available parameters to open the data.
%%time
open_params = cache_store.get_open_data_params_schema(
    data_id="EventCube_ranked_pot0.01_ne0.1.zarr"
)
open_params
CPU times: user 0 ns, sys: 717 μs, total: 717 μs Wall time: 721 μs
<xcube.util.jsonschema.JsonObjectSchema at 0x7f7440f14590>
%%time
ds = cache_store.open_data("EventCube_ranked_pot0.01_ne0.1.zarr")
ds
CPU times: user 8.39 ms, sys: 537 μs, total: 8.93 ms Wall time: 31.1 ms
<xarray.Dataset> Size: 111GB
Dimensions:    (latitude: 721, longitude: 1440, time: 26663)
Coordinates:
  * latitude   (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * longitude  (longitude) float32 6kB 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
  * time       (time) datetime64[ns] 213kB 1950-01-01 1950-01-02 ... 2022-12-31
Data variables:
    layer      (latitude, longitude, time) float32 111GB dask.array<chunksize=(6, 120, 5844), meta=np.ndarray>
We plot the opened data at the last time step as an example below.
%%time
ds.layer.isel(time=-1).plot()
CPU times: user 2.96 s, sys: 406 ms, total: 3.36 s Wall time: 1.94 s
<matplotlib.collections.QuadMesh at 0x7f742a27b4d0>
Some additional notes:
- The preload_data function persists the data; if you don't need it anymore, it needs to be actively deleted.
- The close() method only cleans up the download folder in case there were leftovers due to network or I/O issues. The preloaded datasets remain and need to be deleted manually by the users if they are no longer needed.
- If the preload job is run again, even when the dataset is already available in the cache, it will run from scratch in order to avoid treating a half-written Zarr as a complete preloaded dataset.
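To free up disk space once you are done, you can remove the cache directory manually (a sketch, assuming the default cache_store_params root shown above):
import shutil

# delete the preload cache created for this record; this removes the
# unpacked Zarr datasets as well
shutil.rmtree("zenodo_cache/11546130", ignore_errors=True)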