Integration with EarthCODE#
The main tool to achieve a seamless EarthCODE integration for DeepESDL users is deep-code.
deep-code#
deep-code is a lightweight Python tool for publishing datasets and scientific workflows from DeepESDL directly to the EarthCODE Open Science Catalog. It provides both a command-line interface (CLI) and a Python API for flexible use.
Prerequisites#
Before using deep-code, you need to configure a few authentication files and environment variables.
1. GitHub Authentication (.gitaccess file)#
deep-code requires a GitHub Personal Access Token (PAT) to publish your work. You must create a .gitaccess file containing your GitHub credentials.
- Generate a GitHub PAT
  - Go to GitHub → Settings → Developer settings → Personal access tokens.
  - Click Generate new token.
  - Select the following scope: repo (full control of repositories: read, fork, push, pull).
  - Generate the token and copy it (GitHub will not display it again).
- Create a .gitaccess file
  In your project directory or home folder, create a plain text file named .gitaccess:
  github-username: your-git-user
  github-token: your-personal-access-token
  Replace your-git-user and your-personal-access-token with your actual GitHub username and token.
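For example, on a Linux shell you could create the file like this (the values are placeholders; any text editor works just as well):

```
cat > .gitaccess <<'EOF'
github-username: your-git-user
github-token: your-personal-access-token
EOF
chmod 600 .gitaccess   # keep the token readable only by you
```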
2. S3 Configuration (Optional)#
NOTE: If you are working inside DeepESDL, skip this section.
By default, deep-code assumes datasets are stored in:
- deepesdl-public bucket, or
- a DeepESDL-specific team S3 bucket.
If your data is stored elsewhere, configure the following environment variables:
export S3_USER_STORAGE_BUCKET=my-test-bucket
export AWS_DEFAULT_REGION=eu-west-2
In Python, load them with:
from dotenv import load_dotenv
load_dotenv()  # read environment variables from a .env file
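A fuller sketch, assuming the variables above are kept in a local .env file (the file contents shown are illustrative):

```python
import os
from dotenv import load_dotenv

# .env in the working directory, e.g.:
#   S3_USER_STORAGE_BUCKET=my-test-bucket
#   AWS_DEFAULT_REGION=eu-west-2
load_dotenv()

print(os.environ.get("S3_USER_STORAGE_BUCKET"))  # -> my-test-bucket
print(os.environ.get("AWS_DEFAULT_REGION"))      # -> eu-west-2
```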
3. Metadata Input Files#
To publish with deep-code, you need two YAML metadata files:
- Dataset metadata (dataset_config.yaml)
- Workflow metadata (workflow_config.yaml)
Templates for both files can be generated automatically with the CLI command:
deep-code generate-config
deep-code uses these metadata files to generate valid STAC Items following the EarthCODE Open Science Catalog (OSC) convention and automatically submits a pull request to register them in the catalog.
Dataset Metadata (Products)#
Define your dataset metadata in a YAML file:
dataset_id: The name of the dataset object within your S3 bucket
collection_id: A unique identifier for the dataset collection
osc_themes: [wildfires] Open Science theme (choose from https://opensciencedata.esa.int/themes/catalog)
documentation_link: Link to relevant documentation, publication, or handbook
access_link: Public S3 URL to the dataset
dataset_status: Status of the dataset, e.g. 'ongoing', 'completed', or 'planned'
osc_region: Geographical coverage, e.g. 'global'
cf_parameter: The main geophysical variable, ideally matching a CF standard name or OSC variable
Notes:
- osc_themes must match an existing OSC theme (see https://opensciencedata.esa.int/themes/catalog).
- cf_parameter should use well-established variables (from OSC or CF conventions).
Workflow Metadata#
Define your workflow metadata in a separate YAML file:
workflow_id: A unique identifier for your workflow
properties:
  title: Human-readable title of the workflow
  description: A concise summary of what the workflow does
  keywords: Relevant scientific or technical keywords
  themes: Thematic area(s) of focus (e.g. land, ocean, atmosphere) - see the above note
  license: License type (e.g. MIT, Apache-2.0, CC-BY-4.0, proprietary)
  jupyter_kernel_info:
    name: Name of the execution environment or notebook kernel
    python_version: Python version used
    env_file: Link to the environment file (YAML) used to create the notebook environment
  jupyter_notebook_url: Link to the source notebook (e.g. on GitHub)
contact:
  name: Contact person's full name
  organization: Affiliated institution or company
  links:
    rel: "about"
    type: "text/html"
    href: Link to homepage or personal/institutional profile
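The env_file is simply the environment specification used to run the notebook. A minimal illustrative conda environment file (the name and package list are placeholders, not a required format):

```yaml
name: my-workflow-env
channels:
  - conda-forge
dependencies:
  - python=3.11   # should match python_version above
  - xarray        # plus whatever your notebook actually imports
```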
CLI Usage#
deep-code provides a command-line interface with subcommands for different utility functions. Use the --help option with these subcommands to get more details on usage.
$ deep-code --help
Usage: deep-code [OPTIONS] COMMAND [ARGS]...
Deep Code CLI.
Options:
--help Show this message and exit.
Commands:
generate-config
publish Request publishing a dataset along with experiment and...
The CLI retrieves the Git username and personal access token from a hidden file named .gitaccess. Ensure this file is located in the same directory where you execute the CLI command.
Subcommands#
1. deep-code generate-config#
Generates starter configuration templates for publishing to the EarthCODE Open Science Catalog.
Usage#
deep-code generate-config [OPTIONS]
Options#
--output-dir, -o : Output directory (default: current)
Examples#
deep-code generate-config
deep-code generate-config -o ./configs
2. deep-code publish#
Publishes metadata of the experiment, workflow, and dataset to the EarthCODE Open Science Catalog.
Usage#
deep-code publish DATASET_CONFIG WORKFLOW_CONFIG [--environment ENVIRONMENT]
Arguments#
DATASET_CONFIG - Path to the dataset configuration YAML file
(e.g., dataset-config.yaml)
WORKFLOW_CONFIG - Path to the workflow configuration YAML file
(e.g., workflow-config.yaml)
Options#
--environment, -e - Target catalog environment: production (default) | staging | testing
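For example, to submit to the staging catalog instead of production (the config file names are illustrative):
deep-code publish dataset_config.yaml workflow_config.yaml --environment staging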
Python API Usage#
deep-code can also be used directly from Python or inside a Jupyter Notebook:
from deep_code.tools.publish import Publisher
publisher = Publisher(
dataset_config_path="dataset_config.yaml",
workflow_config_path="workflow_config.yaml"
)
publisher.publish_all() # publish both dataset and the workflow
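If your dataset lives outside the default DeepESDL buckets, the S3 configuration from section 2 can be combined with the publisher. A minimal sketch, assuming the S3 variables are provided via a local .env file:

```python
from dotenv import load_dotenv
from deep_code.tools.publish import Publisher

# Pick up S3_USER_STORAGE_BUCKET, AWS_DEFAULT_REGION, etc. from a .env file
load_dotenv()

publisher = Publisher(
    dataset_config_path="dataset_config.yaml",
    workflow_config_path="workflow_config.yaml",
)
publisher.publish_all()  # publish both dataset and workflow
```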
Quickstart Example#
Follow these steps to publish your first dataset and workflow:
- Create a .gitaccess file in your project directory with your GitHub credentials:
  github-username: my-username
  github-token: ghp_xxx123yourPATxxx
- Write dataset metadata (dataset_config.yaml):
  dataset_id: wildfire-sample
  collection_id: wildfire-collection
  osc_themes: [wildfires]
  documentation_link: https://example.org/wildfire-docs
  access_link: https://my-bucket.s3.eu-west-2.amazonaws.com/wildfire
  dataset_status: ongoing
  osc_region: global
  osc_status: completed
  cf_parameter: burned_area
- Write workflow metadata (workflow_config.yaml):
  workflow_id: wildfire-analysis-v1
  properties:
    title: "Wildfire Analysis Workflow"
    description: "Analyzes burned area from EO data."
    keywords: ["wildfire", "burned area", "remote sensing"]
    themes: ["land"]
    license: "CC-BY-4.0"
    jupyter_kernel_info:
      name: "python3"
      python_version: "3.11"
      env_file: "environment.yaml"
    jupyter_notebook_url: https://github.com/my-org/my-repo/blob/main/workflow.ipynb
  contact:
    name: "Jane Doe"
    organization: "Example Institute"
    links:
      rel: "about"
      type: "text/html"
      href: "https://example.org"
- Publish to EarthCODE:
  deep-code publish dataset_config.yaml workflow_config.yaml
Troubleshooting#
Here are some common issues and fixes when using deep-code:
- Authentication failed for 'https://github.com/.../'
  - Ensure your .gitaccess file exists and is in the directory where you run deep-code.
  - Verify the file format (no extra spaces, correct keys github-username and github-token).
  - Check that your GitHub PAT includes the repo scope.
- FileNotFoundError: dataset_config.yaml not found
  - Make sure you provide the correct path when running deep-code publish.
  - Use relative or absolute paths, e.g.:
    deep-code publish ./configs/dataset_config.yaml ./configs/workflow_config.yaml
- AccessDenied when reading S3 data
  - Verify that your AWS credentials are set correctly (S3_USER_STORAGE_KEY, S3_USER_STORAGE_SECRET); see the sketch after this list.
  - Check that the S3_USER_STORAGE_BUCKET environment variable is set.
  - If running inside DeepESDL, skip the S3 config (it is managed internally).
- Notebook environment mismatch
  - Ensure the jupyter_kernel_info in your workflow YAML matches the environment you actually used.
  - The env_file should point to a valid environment YAML (conda/virtualenv).
- My pull request to EarthCODE failed
  - Check the logs printed by deep-code publish.
  - Validation often fails if osc_themes does not match a valid OSC theme, cf_parameter is not recognized, or required metadata fields are missing.
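If you do need to configure S3 access manually outside DeepESDL, a minimal sketch (all values are placeholders):

```
export S3_USER_STORAGE_KEY=your-access-key-id
export S3_USER_STORAGE_SECRET=your-secret-access-key
export S3_USER_STORAGE_BUCKET=my-test-bucket
export AWS_DEFAULT_REGION=eu-west-2
```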