PADOCC - User Documentation
padocc (Pipeline to Aggregate Data for Optimised Cloud Capabilities) is a Python package for aggregating data to enable methods of access for cloud-based applications.
The padocc tool makes it easy to generate data-aggregated access patterns in the form of Reference Files or Cloud Formats across many datasets simultaneously, with validation steps to ensure the outputs are correct.
Vast amounts of archival data in a variety of formats can be processed using the package’s group mechanics and automatic deployment to a job submission system.
Latest Release: v1.3 05/02/2025: This release adds a large number of additional features to both projects and groups (see the CLI and Interactive sections in this documentation for details). Several alpha-stage features are still untested or not well documented; please report any issues to the GitHub repo (https://github.com/cedadev/padocc).
Formats that can be generated
padocc is capable of generating both reference files with Kerchunk (JSON or Parquet) and cloud formats like Zarr.
Additionally, PADOCC creates CF-compliant aggregation files as part of the standard workflow, which means you get CFA-netCDF files as standard!
You can find out more about Climate Forecast Aggregations in the CFA conventions documentation. These files are denoted with the extension .nca and can be opened using xarray with engine="CFA" if you have the CFAPyX package installed.
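As a rough illustration of opening the outputs described above, the sketch below dispatches on the file extension: .nca files go through xarray with engine="CFA" (which needs CFAPyX installed), and Kerchunk JSON reference files go through xarray's zarr engine using the fsspec "reference://" filesystem. The helper names open_output and kerchunk_open_kwargs are hypothetical and not part of padocc, and the Kerchunk branch assumes a JSON (not Parquet) reference file with fsspec and zarr installed; the exact options may need adjusting for your library versions.

```python
from pathlib import Path


def kerchunk_open_kwargs(ref_file):
    """Build xarray keyword arguments for a Kerchunk JSON reference file.

    Hypothetical helper: the reference file is passed to fsspec's
    "reference://" filesystem through storage_options["fo"].
    """
    return {
        "engine": "zarr",
        "backend_kwargs": {
            "consolidated": False,
            "storage_options": {"fo": str(ref_file)},
        },
    }


def open_output(path):
    """Open a CFA-netCDF (.nca) or Kerchunk JSON (.json) file with xarray.

    Hypothetical helper, not part of the padocc API.
    """
    path = Path(path)
    if path.suffix not in (".nca", ".json"):
        raise ValueError(f"Unsupported output type: {path.suffix!r}")

    # Imported lazily so the extension check works without xarray installed.
    import xarray as xr

    if path.suffix == ".nca":
        # Requires the CFAPyX package, which provides the "CFA" engine.
        return xr.open_dataset(path, engine="CFA")

    # Kerchunk JSON: read through the fsspec reference filesystem.
    return xr.open_dataset("reference://", **kerchunk_open_kwargs(path))
```

For example, open_output("my_dataset.nca") would return an xarray Dataset backed by the aggregation file, where my_dataset.nca is a placeholder file name.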
General usage
The pipeline consists of three central phases, with an additional phase for ingesting/cataloging the produced Kerchunk files. These phases represent operations that can be applied across groups of datasets in parallel, depending on the architecture of your system. The recommended way of running the core phases is to use the command line tool (see the CLI section).
To check the status of various elements of the pipeline, including the progress of any group/project in your working directory, it is recommended that you use padocc through an interactive interface such as a Jupyter Notebook or Python shell (see the Interactive section). Simply import the necessary components and start assessing your projects and groups.
For further information on configuring PADOCC for parallel deployment, please contact daniel.westwood@stfc.ac.uk.
The ingestion/cataloging phase is not currently implemented for public use but may be added in a future update.

Acknowledgements
PADOCC was developed at the Centre for Environmental Data Analysis, supported by the ESA CCI Knowledge Exchange program and contributing to the development of the Earth Observation Data Hub (EODH).

