PADOCC - User Documentation

padocc (Pipeline to Aggregate Data for Optimised Cloud Capabilities) is a Python package for aggregating data to enable efficient methods of access for cloud-based applications.

The pipeline makes it easy to generate aggregated access patterns, in the form of reference files or cloud formats, across many datasets simultaneously, with validation steps to ensure the outputs are correct.

Vast amounts of archival data in a variety of formats can be processed using the package’s group mechanics and automatic deployment to a job submission system.

Currently supported input file formats:
  • NetCDF/HDF

  • GeoTIFF

  • GRIB

  • Met Office formats (planned)

padocc is capable of generating both reference files with Kerchunk (JSON or Parquet) and cloud formats like Zarr. Additionally, PADOCC creates CF-compliant aggregation files as part of the standard workflow, so you get CFA-netCDF files as standard. You can find out more about Climate Forecast Aggregations here. These files are denoted by the extension .nca and can be opened using xarray with engine="CFA" if you have the CFAPyX package installed.
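The outputs described above can be opened directly with xarray. The sketch below is illustrative rather than part of the PADOCC API: the helper names (open_cfa, open_kerchunk) and file paths are hypothetical. Opening a .nca aggregation assumes CFAPyX is installed (it registers the "CFA" backend engine with xarray); opening a Kerchunk JSON reference uses fsspec's "reference://" protocol through xarray's zarr engine.

```python
def open_cfa(path):
    """Open a PADOCC-produced CFA-netCDF (.nca) aggregation.

    Hypothetical helper: assumes xarray and CFAPyX are installed,
    since CFAPyX provides the "CFA" backend engine for xarray.
    """
    import xarray as xr  # deferred import; CFAPyX must also be installed
    return xr.open_dataset(path, engine="CFA")


def open_kerchunk(reference_json):
    """Open a Kerchunk JSON reference file as a single dataset.

    Hypothetical helper: uses fsspec's "reference://" protocol via
    xarray's zarr engine; requires xarray, zarr, and fsspec.
    """
    import xarray as xr
    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            # "fo" points the reference filesystem at the Kerchunk JSON
            "storage_options": {"fo": reference_json},
            "consolidated": False,
        },
    )
```

Either helper returns a lazy xarray Dataset, so the underlying archival files are only read when data is actually accessed.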

The pipeline consists of three central phases, with an additional phase for ingesting/cataloging the produced Kerchunk files. These phases represent operations that can be applied across groups of datasets in parallel, depending on the architecture of your system. For further information on configuring PADOCC for parallel deployment, please contact daniel.westwood@stfc.ac.uk.

The ingestion/cataloging phase is not currently implemented for public use but may be added in a future update.

Stages of the PADOCC workflow

Acknowledgements

PADOCC was developed at the Centre for Environmental Data Analysis, supported by the ESA CCI Knowledge Exchange program and contributing to the development of the Earth Observation Data Hub (EODH).
