PADOCC - User Documentation

padocc (Pipeline to Aggregate Data for Optimised Cloud Capabilities) is a Python package (formerly kerchunk-builder) for aggregating data to enable access methods suited to cloud-based applications.

The pipeline makes it easy to generate data-aggregated access patterns, in the form of Reference Files or Cloud Formats, across many datasets simultaneously, with validation steps to ensure the outputs are correct.

Vast amounts of archival data in a variety of formats can be processed using the package’s group mechanics and automatic deployment to a job submission system.

Input file formats, supported and planned:
  • NetCDF/HDF
  • GeoTIFF (coming soon)
  • GRIB (coming soon)
  • MetOffice (future)

padocc is capable of generating both reference files with Kerchunk (JSON or Parquet) and cloud formats like Zarr.
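
To make the idea of a reference file concrete, the sketch below builds a minimal Kerchunk-style reference set by hand and looks up where one chunk lives. It is illustrative only and does not use padocc itself: the variable name `tas`, the example.org URLs, the byte offsets and the helper `locate_chunk` are all hypothetical, and the layout assumed is the version 1 Kerchunk JSON structure, in which each chunk key maps to a `[url, offset, length]` triple.

```python
import json

# A hand-written, minimal Kerchunk-style reference set (version 1 layout).
# The URLs, variable name, offsets and lengths are hypothetical placeholders.
reference = {
    "version": 1,
    "refs": {
        # Zarr metadata is stored inline as JSON strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        "tas/.zarray": json.dumps({
            "shape": [2, 4],
            "chunks": [1, 4],
            "dtype": "<f4",
            "compressor": None,
            "fill_value": None,
            "filters": None,
            "order": "C",
            "zarr_format": 2,
        }),
        # Chunk keys map to [url, byte offset, byte length] in a source file.
        "tas/0.0": ["https://example.org/data/file_a.nc", 4096, 16],
        "tas/1.0": ["https://example.org/data/file_b.nc", 4096, 16],
    },
}


def locate_chunk(refs, key):
    """Return (url, offset, length) for a chunk key, or None for inline entries."""
    entry = refs["refs"][key]
    if isinstance(entry, list) and len(entry) == 3:
        url, offset, length = entry
        return url, offset, length
    return None


# Round-trip through JSON, as happens when a reference file is saved and reloaded.
loaded = json.loads(json.dumps(reference))
print(locate_chunk(loaded, "tas/1.0"))
```

The point of the sketch is that one small JSON document can present chunks from several source files as a single aggregated array, without copying the underlying data. A cloud format such as Zarr instead rewrites the data into new chunked storage.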

The pipeline consists of four central phases, with an additional phase for ingesting and cataloguing the Kerchunk files it produces. The ingestion phase is not currently part of the pipeline code base but could be added in a future update.

Stages of the Kerchunk Pipeline

Acknowledgements

PADOCC was developed at the Centre for Environmental Data Analysis, supported by the ESA CCI Knowledge Exchange program and contributing to the development of the Earth Observation Data Hub (EODH).