
Integration with EarthCODE#

For DeepESDL users, the main tool for achieving seamless integration with EarthCODE is deep-code.

deep-code#

deep-code is a lightweight Python tool for publishing datasets and scientific workflows from DeepESDL directly to the EarthCODE Open Science Catalog. It provides both a command-line interface (CLI) and a Python API for flexible use.


Prerequisites#

Before using deep-code, you need to configure a few authentication files and environment variables.


1. GitHub Authentication (.gitaccess file)#

deep-code requires a GitHub Personal Access Token (PAT) to publish your work.
You must create a .gitaccess file containing your GitHub credentials.

  1. Generate a GitHub PAT

    • Go to GitHub → Settings → Developer settings → Personal access tokens.
    • Click Generate new token.
    • Select the following scope:
      • repo – Full control of repositories (read, fork, push, pull).
    • Generate the token and copy it (GitHub will not display it again).
  2. Create a .gitaccess file
    In your project directory or home folder, create a plain text file named .gitaccess:

github-username: your-git-user
github-token: your-personal-access-token

Replace your-git-user and your-personal-access-token with your actual GitHub username and token.
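
To confirm that the token works before publishing, you can query the GitHub REST API directly. This is a minimal sketch and not part of deep-code; it only assumes the requests package is available:

import requests

# Optional sanity check: ask GitHub which user the token belongs to.
token = "your-personal-access-token"  # same value as in .gitaccess
resp = requests.get(
    "https://api.github.com/user",
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
if resp.ok:
    print("Token is valid for user:", resp.json()["login"])
else:
    print("Authentication failed:", resp.status_code)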

2. S3 Configuration (Optional)#

NOTE: If you are working inside DeepESDL, skip this section.

By default, deep-code assumes datasets are stored in:

  • deepesdl-public bucket, or
  • a DeepESDL-specific team S3 bucket.

If your data is stored elsewhere, configure the following environment variables:

export S3_USER_STORAGE_BUCKET=my-test-bucket
export AWS_DEFAULT_REGION=eu-west-2

In Python, load them with:

from dotenv import load_dotenv
load_dotenv()  # read environment variables from a .env file
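
Note that load_dotenv comes from the python-dotenv package and reads variables from a .env file (if present); variables already exported in your shell are visible via os.environ without it. A minimal check (bucket name and region below are examples only):

import os
from dotenv import load_dotenv

load_dotenv()  # reads a .env file, if present, without overriding existing variables
print(os.environ.get("S3_USER_STORAGE_BUCKET"))  # e.g. my-test-bucket
print(os.environ.get("AWS_DEFAULT_REGION"))      # e.g. eu-west-2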

3. Metadata Input Files#

To publish with deep-code, you need two YAML metadata files:

  1. Dataset metadata (dataset_config.yaml)

  2. Workflow metadata (workflow_config.yaml)

Templates for both metadata files can be generated automatically with the CLI command:

deep-code generate-config

deep-code uses these metadata files to generate valid STAC Items following the EarthCODE Open Science Catalog (OSC) convention and automatically submits a pull request to register them in the catalog.

Dataset Metadata (Products)#

Define your dataset metadata in a YAML file:

dataset_id: The name of the dataset object within your S3 bucket
collection_id: A unique identifier for the dataset collection
osc_themes: [wildfires] Open Science theme (choose from https://opensciencedata.esa.int/themes/catalog)
documentation_link: Link to relevant documentation, publication, or handbook
access_link: Public S3 URL to the dataset
dataset_status: Status of the dataset, e.g. 'ongoing', 'completed', or 'planned'
osc_region: Geographical coverage, e.g. 'global'
cf_parameter: The main geophysical variable, ideally matching a CF standard name or OSC variable

Notes:

  • osc_themes must match an existing OSC theme (see https://opensciencedata.esa.int/themes/catalog).
  • cf_parameter should use well-established variables (from OSC or CF conventions).

Workflow Metadata#

Define your workflow metadata in a separate YAML file:

workflow_id: A unique identifier for your workflow
properties:
    title: Human-readable title of the workflow
    description: A concise summary of what the workflow does
    keywords: Relevant scientific or technical keywords
    themes: Thematic area(s) of focus (e.g. land, ocean, atmosphere) - see the above note
    license: License type (e.g. MIT, Apache-2.0, CC-BY-4.0, proprietary)
    jupyter_kernel_info:
        name: Name of the execution environment or notebook kernel
        python_version: Python version used
        env_file: Link to the environment file (YAML) used to create the notebook environment
jupyter_notebook_url: Link to the source notebook (e.g. on GitHub)
contact:
    name: Contact person's full name
    organization: Affiliated institution or company
    links:
        rel: "about"
        type: "text/html"
        href: Link to homepage or personal/institutional profile
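
Before publishing, you can sanity-check both YAML files from Python. The snippet below is a minimal sketch, not part of deep-code, and the key lists are assumptions derived from the templates above; deep-code performs its own validation during publishing:

import yaml

# Required keys assumed from the templates in this section.
required_dataset_keys = {
    "dataset_id", "collection_id", "osc_themes", "documentation_link",
    "access_link", "dataset_status", "osc_region", "cf_parameter",
}
required_workflow_keys = {"workflow_id", "properties", "jupyter_notebook_url", "contact"}

with open("dataset_config.yaml") as f:
    dataset_cfg = yaml.safe_load(f)
with open("workflow_config.yaml") as f:
    workflow_cfg = yaml.safe_load(f)

print("Missing dataset keys:", required_dataset_keys - dataset_cfg.keys())
print("Missing workflow keys:", required_workflow_keys - workflow_cfg.keys())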

CLI Usage#

deep-code provides a command-line interface with subcommands for different utility functions. Use the --help option with these subcommands to get more details on usage.

$ deep-code --help
Usage: deep-code [OPTIONS] COMMAND [ARGS]...

  Deep Code CLI.

Options:
  --help  Show this message and exit.

Commands:
  generate-config
  publish          Request publishing a dataset along with experiment and...

The CLI retrieves the Git username and personal access token from a hidden file named .gitaccess. Ensure this file is located in the same directory where you execute the CLI command.

Subcommands#

1. deep-code generate-config#

Generates starter configuration templates for publishing to the EarthCODE Open Science Catalog.

Usage#
deep-code generate-config [OPTIONS]
Options#
--output-dir, -o : Output directory (default: current)
Examples#
deep-code generate-config
deep-code generate-config -o ./configs

2. deep-code publish#

Publishes experiment, workflow, and dataset metadata to the EarthCODE Open Science Catalog.

Usage#
deep-code publish DATASET_CONFIG WORKFLOW_CONFIG [--environment ENVIRONMENT]
Arguments#
DATASET_CONFIG - Path to the dataset configuration YAML file
(e.g., dataset_config.yaml)

WORKFLOW_CONFIG - Path to the workflow configuration YAML file
(e.g., workflow_config.yaml)
Options#
--environment, -e - Target catalog environment: production (default) | staging | testing
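Examples#
deep-code publish dataset_config.yaml workflow_config.yaml
deep-code publish ./configs/dataset_config.yaml ./configs/workflow_config.yaml -e staging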

Python API Usage#

deep-code can also be used directly from Python or inside a Jupyter Notebook:

from deep_code.tools.publish import Publisher

publisher = Publisher(
    dataset_config_path="dataset_config.yaml",
    workflow_config_path="workflow_config.yaml",
)
publisher.publish_all()  # publish both the dataset and the workflow
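
If your data is stored outside the default DeepESDL buckets (see the S3 configuration above), load the environment variables before creating the Publisher. A minimal sketch combining the two pieces shown earlier:

from dotenv import load_dotenv
from deep_code.tools.publish import Publisher

load_dotenv()  # picks up S3_USER_STORAGE_BUCKET, AWS_DEFAULT_REGION, etc.

publisher = Publisher(
    dataset_config_path="dataset_config.yaml",
    workflow_config_path="workflow_config.yaml",
)
publisher.publish_all()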

🚀 Quickstart Example#

Follow these steps to publish your first dataset and workflow:

  1. Create a .gitaccess file in your project directory with your GitHub credentials:

    github-username: my-username
    github-token: ghp_xxx123yourPATxxx
    

  2. Write dataset metadata (dataset_config.yaml):

     dataset_id: wildfire-sample
     collection_id: wildfire-collection
     osc_themes: [wildfires]
     documentation_link: https://example.org/wildfire-docs
     access_link: https://my-bucket.s3.eu-west-2.amazonaws.com/wildfire
     dataset_status: ongoing
     osc_region: global
     cf_parameter: burned_area
    

  3. Write workflow metadata (workflow_config.yaml):

    workflow_id: wildfire-analysis-v1
    properties:
      title: "Wildfire Analysis Workflow"
      description: "Analyzes burned area from EO data."
      keywords: ["wildfire", "burned area", "remote sensing"]
      themes: ["land"]
      license: "CC-BY-4.0"
      jupyter_kernel_info:
        name: "python3"
        python_version: "3.11"
        env_file: "environment.yaml"
    jupyter_notebook_url: https://github.com/my-org/my-repo/blob/main/workflow.ipynb
    contact:
      name: "Jane Doe"
      organization: "Example Institute"
      links:
        rel: "about"
        type: "text/html"
        href: "https://example.org"
    

  4. Publish to EarthCODE:

    deep-code publish dataset_config.yaml workflow_config.yaml
    

๐Ÿ› ๏ธ Troubleshooting#

Here are some common issues and fixes when using deep-code:

  1. Authentication failed for 'https://github.com/.../'

    • Ensure your .gitaccess file exists and is in the directory where you run deep-code.
    • Verify the file format (no extra spaces, correct keys github-username and github-token).
    • Check that your GitHub PAT includes the repo scope.
  2. FileNotFoundError: dataset_config.yaml not found

    • Make sure you provide the correct path when running deep-code publish.
    • Use relative or absolute paths, e.g.:
 deep-code publish ./configs/dataset_config.yaml ./configs/workflow_config.yaml
  3. AccessDenied when reading S3 data

    • Verify that your AWS credentials are set correctly (S3_USER_STORAGE_KEY, S3_USER_STORAGE_SECRET).

    • Check that the S3_USER_STORAGE_BUCKET environment variable is set.

    • If running inside DeepESDL, skip the S3 configuration (it's managed internally).

  4. Notebook environment mismatch

    • Ensure the jupyter_kernel_info in your workflow YAML matches the environment you actually used.

    • The env_file should point to a valid environment YAML (conda/virtualenv).

  5. My pull request to EarthCODE failed

    • Check the logs printed by deep-code publish.

    • Validation often fails if osc_themes doesn't match a valid OSC theme, cf_parameter is not recognized, or required metadata fields are missing.