PADOCC - User Documentation

padocc (Pipeline to Aggregate Data for Optimised Cloud Capabilities) is a Python package for aggregating data to enable efficient methods of access for cloud-based applications.

The pipeline makes it easy to generate aggregated access patterns, in the form of reference files or cloud formats, across many datasets simultaneously, with validation steps to ensure the outputs are correct.

Vast amounts of archival data in a variety of formats can be processed using the package’s group mechanics and automatic deployment to a job submission system.

Currently supported input file formats:
  • NetCDF/HDF

  • GeoTIFF

  • GRIB

  • Met Office formats (planned)

padocc is capable of generating both reference files with Kerchunk (JSON or Parquet) and cloud formats like Zarr. Additionally, PADOCC creates CF-compliant aggregation files as part of the standard workflow, so you get CFA-netCDF files as standard. You can find out more about Climate Forecast Aggregations here. These files are denoted by the extension .nca and can be opened using xarray with engine="CFA" if you have the CFAPyX package installed.
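The outputs described above can be opened directly with xarray. The sketch below is illustrative rather than part of the PADOCC API: the helper names (open_cfa, open_kerchunk) and file paths are hypothetical. Opening a .nca aggregation assumes CFAPyX is installed (it registers the "CFA" backend engine with xarray); opening a Kerchunk JSON reference uses fsspec's "reference://" protocol through xarray's zarr engine.

```python
def open_cfa(path):
    """Open a PADOCC-produced CFA-netCDF (.nca) aggregation.

    Hypothetical helper: assumes xarray and CFAPyX are installed,
    since CFAPyX provides the "CFA" backend engine for xarray.
    """
    import xarray as xr  # deferred import; CFAPyX must also be installed
    return xr.open_dataset(path, engine="CFA")


def open_kerchunk(reference_json):
    """Open a Kerchunk JSON reference file as a single dataset.

    Hypothetical helper: uses fsspec's "reference://" protocol via
    xarray's zarr engine; requires xarray, zarr, and fsspec.
    """
    import xarray as xr
    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            # "fo" points the reference filesystem at the Kerchunk JSON
            "storage_options": {"fo": reference_json},
            "consolidated": False,
        },
    )
```

Either helper returns a lazy xarray Dataset, so the underlying archival files are only read when data is actually accessed.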

The pipeline consists of three central phases, with an additional phase for ingesting/cataloging the produced Kerchunk files. These phases represent operations that can be applied across groups of datasets in parallel, depending on the architecture of your system. For further information on configuring PADOCC for parallel deployment, please contact daniel.westwood@stfc.ac.uk.

The ingestion/cataloging phase is not currently implemented for public use but may be added in a future update.

Stages of the PADOCC workflow

Acknowledgements

PADOCC was developed at the Centre for Environmental Data Analysis, supported by the ESA CCI Knowledge Exchange program and contributing to the development of the Earth Observation Data Hub (EODH).
