
Integration with EarthCODE#

For DeepESDL users, the main tool for achieving seamless integration with EarthCODE is deep-code.

deep-code#

deep-code is a lightweight Python tool for publishing datasets and scientific workflows from DeepESDL directly to the EarthCODE Open Science Catalog. It provides both a command-line interface (CLI) and a Python API for flexible use.


Prerequisites#

Before using deep-code, you need to configure a few authentication files and environment variables.


1. GitHub Authentication (.gitaccess file)#

deep-code requires a GitHub Personal Access Token (PAT) to publish your work.
You must create a .gitaccess file containing your GitHub credentials.

  1. Generate a GitHub PAT

    • Go to GitHub → Settings → Developer settings → Personal access tokens.
    • Click Generate new token.
    • Select the following scope:
      • repo – Full control of repositories (read, fork, push, pull).
    • Generate the token and copy it (GitHub will not display it again).
  2. Create a .gitaccess file
    In your project directory or home folder, create a plain text file named .gitaccess:

github-username: your-git-user
github-token: your-personal-access-token

Replace your-git-user and your-personal-access-token with your actual GitHub username and token.
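
To confirm that the token works before publishing, you can query the GitHub REST API directly. This is a minimal sketch and not part of deep-code; it only assumes the requests package is available:

import requests

# Optional sanity check: ask GitHub which user the token belongs to.
token = "your-personal-access-token"  # same value as in .gitaccess
resp = requests.get(
    "https://api.github.com/user",
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
if resp.ok:
    print("Token is valid for user:", resp.json()["login"])
else:
    print("Authentication failed:", resp.status_code)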

2. S3 Configuration (Optional)#

NOTE: If you are working inside DeepESDL, skip this section.

By default, deep-code assumes datasets are stored in:

  • deepesdl-public bucket, or
  • a DeepESDL-specific team S3 bucket.

If your data is stored elsewhere, configure the following environment variables:

export S3_USER_STORAGE_BUCKET=my-test-bucket
export AWS_DEFAULT_REGION=eu-west-2

In Python, load them with:

from dotenv import load_dotenv
load_dotenv()  # read environment variables from a .env file
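
Note that load_dotenv comes from the python-dotenv package and reads variables from a .env file (if present); variables already exported in your shell are visible via os.environ without it. A minimal check (bucket name and region below are examples only):

import os
from dotenv import load_dotenv

load_dotenv()  # reads a .env file, if present, without overriding existing variables
print(os.environ.get("S3_USER_STORAGE_BUCKET"))  # e.g. my-test-bucket
print(os.environ.get("AWS_DEFAULT_REGION"))      # e.g. eu-west-2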

3. Metadata Input Files#

To publish with deep-code, you need two YAML metadata files:

  1. Dataset metadata (dataset_config.yaml)

  2. Workflow metadata (workflow_config.yaml)

Templates for both metadata files can be generated automatically with the CLI command:

deep-code generate-config

deep-code uses these metadata files to generate valid STAC Items following the EarthCODE Open Science Catalog (OSC) convention and automatically submits a pull request to register them in the catalog.

Dataset Metadata (Products)#

Define your dataset metadata in a YAML file:

dataset_id: The name of the dataset object within your S3 bucket
collection_id: A unique identifier for the dataset collection
osc_themes: [wildfires] Open Science theme (choose from https://opensciencedata.esa.int/themes/catalog)
documentation_link: Link to relevant documentation, publication, or handbook
access_link: Public S3 URL to the dataset
dataset_status: Status of the dataset, e.g. 'ongoing', 'completed', or 'planned'
osc_region: Geographical coverage, e.g. 'global'
cf_parameter: The main geophysical variable, ideally matching a CF standard name or OSC variable

Notes:

  • osc_themes must match an existing OSC theme (see https://opensciencedata.esa.int/themes/catalog).
  • cf_parameter should use well-established variables (from OSC or CF conventions).

Workflow Metadata#

Define your workflow metadata in a separate YAML file:

workflow_id: A unique identifier for your workflow
properties:
    title: Human-readable title of the workflow
    description: A concise summary of what the workflow does
    keywords: Relevant scientific or technical keywords
    themes: Thematic area(s) of focus (e.g. land, ocean, atmosphere) - see the above note
    license: License type (e.g. MIT, Apache-2.0, CC-BY-4.0, proprietary)
    jupyter_kernel_info:
        name: Name of the execution environment or notebook kernel
        python_version: Python version used
        env_file: Link to the environment file (YAML) used to create the notebook environment
jupyter_notebook_url: Link to the source notebook (e.g. on GitHub)
contact:
    name: Contact person's full name
    organization: Affiliated institution or company
    links:
        rel: "about"
        type: "text/html"
        href: Link to homepage or personal/institutional profile
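
Before publishing, you can sanity-check both YAML files from Python. The snippet below is a minimal sketch, not part of deep-code, and the key lists are assumptions derived from the templates above; deep-code performs its own validation during publishing:

import yaml

# Required keys assumed from the templates in this section.
required_dataset_keys = {
    "dataset_id", "collection_id", "osc_themes", "documentation_link",
    "access_link", "dataset_status", "osc_region", "cf_parameter",
}
required_workflow_keys = {"workflow_id", "properties", "jupyter_notebook_url", "contact"}

with open("dataset_config.yaml") as f:
    dataset_cfg = yaml.safe_load(f)
with open("workflow_config.yaml") as f:
    workflow_cfg = yaml.safe_load(f)

print("Missing dataset keys:", required_dataset_keys - dataset_cfg.keys())
print("Missing workflow keys:", required_workflow_keys - workflow_cfg.keys())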

CLI Usage#

deep-code provides a command-line interface with subcommands for different utility functions. Use the --help option with these subcommands to get more details on usage.

$ deep-code --help
Usage: deep-code [OPTIONS] COMMAND [ARGS]...

  Deep Code CLI.

Options:
  --help  Show this message and exit.

Commands:
  generate-config
  publish          Request publishing a dataset along with experiment and...

The CLI retrieves the Git username and personal access token from a hidden file named .gitaccess. Ensure this file is located in the same directory where you execute the CLI command.

Subcommands#

1. deep-code generate-config#

Generates starter configuration templates for publishing to the EarthCODE Open Science Catalog.

Usage#
deep-code generate-config [OPTIONS]
Options#
--output-dir, -o : Output directory (default: current)
Examples#
deep-code generate-config
deep-code generate-config -o ./configs

2. deep-code publish#

Publishes experiment, workflow, and dataset metadata to the EarthCODE Open Science Catalog.

Usage#
deep-code publish DATASET_CONFIG WORKFLOW_CONFIG [--environment ENVIRONMENT]
Arguments#
DATASET_CONFIG - Path to the dataset configuration YAML file
(e.g., dataset_config.yaml)

WORKFLOW_CONFIG - Path to the workflow configuration YAML file
(e.g., workflow_config.yaml)
Options#
--environment, -e - Target catalog environment: production (default) | staging | testing
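Examples#
deep-code publish dataset_config.yaml workflow_config.yaml
deep-code publish ./configs/dataset_config.yaml ./configs/workflow_config.yaml -e staging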

Python API Usage#

deep-code can also be used directly from Python or inside a Jupyter Notebook:

from deep_code.tools.publish import Publisher

publisher = Publisher(
    dataset_config_path="dataset_config.yaml",
    workflow_config_path="workflow_config.yaml",
)
publisher.publish_all()  # publish both the dataset and the workflow
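
If your data is stored outside the default DeepESDL buckets (see the S3 configuration above), load the environment variables before creating the Publisher. A minimal sketch combining the two pieces shown earlier:

from dotenv import load_dotenv
from deep_code.tools.publish import Publisher

load_dotenv()  # picks up S3_USER_STORAGE_BUCKET, AWS_DEFAULT_REGION, etc.

publisher = Publisher(
    dataset_config_path="dataset_config.yaml",
    workflow_config_path="workflow_config.yaml",
)
publisher.publish_all()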

🚀 Quickstart Example#

Follow these steps to publish your first dataset and workflow:

  1. Create a .gitaccess file in your project directory with your GitHub credentials:

    github-username: my-username
    github-token: ghp_xxx123yourPATxxx
    

  2. Write dataset metadata (dataset_config.yaml):

     dataset_id: wildfire-sample
     collection_id: wildfire-collection
     osc_themes: [wildfires]
     documentation_link: https://example.org/wildfire-docs
     access_link: https://my-bucket.s3.eu-west-2.amazonaws.com/wildfire
     dataset_status: ongoing
     osc_region: global
     cf_parameter: burned_area
    

  3. Write workflow metadata (workflow_config.yaml):

    workflow_id: wildfire-analysis-v1
    properties:
      title: "Wildfire Analysis Workflow"
      description: "Analyzes burned area from EO data."
      keywords: ["wildfire", "burned area", "remote sensing"]
      themes: ["land"]
      license: "CC-BY-4.0"
      jupyter_kernel_info:
        name: "python3"
        python_version: "3.11"
        env_file: "environment.yaml"
    jupyter_notebook_url: https://github.com/my-org/my-repo/blob/main/workflow.ipynb
    contact:
      name: "Jane Doe"
      organization: "Example Institute"
      links:
        rel: "about"
        type: "text/html"
        href: "https://example.org"
    

  4. Publish to EarthCODE:

    deep-code publish dataset_config.yaml workflow_config.yaml
    

๐Ÿ› ๏ธ Troubleshooting#

Here are some common issues and fixes when using deep-code:

  1. Authentication failed for 'https://github.com/.../'

    • Ensure your .gitaccess file exists and is in the directory where you run deep-code.
    • Verify the file format (no extra spaces, correct keys github-username and github-token).
    • Check that your GitHub PAT includes the repo scope.
  2. FileNotFoundError: dataset_config.yaml not found

    • Make sure you provide the correct path when running deep-code publish.
    • Use relative or absolute paths, e.g.:
 deep-code publish ./configs/dataset_config.yaml ./configs/workflow_config.yaml
  3. AccessDenied when reading S3 data

    • Verify that your AWS credentials are set correctly (S3_USER_STORAGE_KEY, S3_USER_STORAGE_SECRET).

    • Check that the S3_USER_STORAGE_BUCKET environment variable is set.

    • If running inside DeepESDL, skip the S3 configuration (it's managed internally).

  4. Notebook environment mismatch

    • Ensure the jupyter_kernel_info in your workflow YAML matches the environment you actually used.

    • The env_file should point to a valid environment YAML (conda/virtualenv).

  5. My pull request to EarthCODE failed

    • Check the logs printed by deep-code publish.

    • Validation often fails if osc_themes doesn't match a valid OSC theme, cf_parameter is not recognized, or required metadata fields are missing.