PADOCC Command Line Tool

The command-line tool padocc allows quick deployment of serial and parallel processing jobs for your projects and groups within the padocc workspace. The core phases are most readily executed using the command line tool.
Note
For information on how to set up the padocc environment, please see the Installation section of this documentation. Some general tips:
- Ensure you have the padocc package installed and the command line tool is accessible. You can check this by running which padocc in your terminal.
- Set the working directory via the WORKDIR environment variable. All pipeline directories and files will be created under this directory, including all the groups you define. It is suggested to have only one working directory where possible, although if a distinction is needed for different groups of datasets, multiple working directories can be used at your discretion.
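In a bash-like shell, for example, setting the working directory might look like the following (the path shown is just a placeholder):
$ export WORKDIR=/path/to/my/workdir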
Using the CLI Tool
General Command Form
The general form of a command for padocc is to call the command line tool padocc with, at minimum, the phase argument specified afterwards, e.g.:
$ padocc init
In almost all cases, other arguments will be necessary for any particular operation you would like to perform.
usage: padocc phase [-h] [-f] [-v] [-d] [-T] [-b BYPASS] [-w WORKDIR] [-G GROUPID] [-s SUBSET] [-r REPEAT_ID] [-p PROJ_CODE] [-C MODE] [-i INPUT] [-n NEW_VERSION] [-t TIME_ALLOWED] [--mem-allowed MEM_ALLOWED] [-M MEMORY] [-B] [-e VENVPATH] [-A] [--allow-band-increase]
Padocc CLI Flags
The flags above show all the different possible options for operating the pipeline. Listed here are some of the more common flags that can be applied to most or all of the different phased operations for padocc.
-h, --help show this help message and exit
-f, --forceful Force overwrite of steps if previously done
-v, --verbose Print helpful statements while running (add more v's for greater verbosity)
-d, --dryrun Perform dry-run (i.e no new files/dirs created)
-T, --thorough Thorough processing - start from scratch
-b BYPASS, --bypass-errs (See the Deep Dive section for info on this feature)
-w WORKDIR, --workdir WORKDIR
Working directory for pipeline (if not specified as an environment variable.)
-G GROUPID, --groupID GROUPID
Group identifier label
Other flags listed in the command above are described in the Complex Operation section of this documentation.
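As an illustration of combining these flags, a hypothetical verbose dry-run of the scan phase for a single group, with the working directory given explicitly rather than via the environment variable, might look like:
$ padocc scan -G my-group -w /path/to/my/workdir -vv -d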
Create a group from scratch (optional)
This optional first step allows you to create empty groups in the workspace that can be properly initialised later.
$ padocc new -G my-new-group
There is no particular advantage to creating empty groups, but it can be useful for organising multiple new groups while the data is still being collected.
Special Functions
The following accepted options to the phase argument act as shortcuts to specific functions in padocc available via an interactive session. These functions are now available via the CLI in a limited capacity, and use the --special kwarg as a catch-all for providing configuration info to these functions.
- list: Lists all groups in the current workspace and their contents.
- status: Shows the status of all projects in a group (requires the -G flag).
- add: Enables adding projects to a group, including via the moles tags option (requires -G; moles enabled via --special moles).
- check: Check an attribute in all projects across the group (requires -G; supply the attribute via --special <attribute>).
- complete: Enables the completion workflow for complete projects (requires -G; supply the completion directory via --special <dir>).
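As an illustration, and assuming a group called my-group already exists, these special functions might be invoked as follows (the attribute name and completion directory are placeholders):
$ padocc list
$ padocc status -G my-group
$ padocc check -G my-group --special institution
$ padocc complete -G my-group --special /path/to/complete/dir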
Pipeline Functions
The following descriptions are for the main pipeline functions, most of which are parallelisable with the --parallel flag.
Initialise a group
The pipeline takes a CSV (or similar) input file from which to instantiate a GroupOperation, which includes:
- creating subdirectories for all associated datasets (projects)
- creating multiple group files with information regarding this group.
$ padocc init -G my-new-group -i path/to/input_file.csv
An example of the output for this command, when the -v flag is added, can be found below. The test data is composed of two rain datasets, each with 5 NetCDF files filled with arbitrary data. You can access this test data through the github repo (https://github.com/cedadev/padocc), under padocc/tests/data:
INFO [PADOCC-CLI-init]: Starting initialisation
INFO [PADOCC-CLI-init]: Copying input file from relative path - resolved to <your-directory-structure>/file.csv
INFO [PADOCC-CLI-init]: Creating project directories
INFO [PADOCC-CLI-init]: Creating directories/filelists for 1/2
INFO [PADOCC-CLI-init]: Updated new status: init - Success
INFO [PADOCC-CLI-init]: Creating directories/filelists for 2/2
INFO [PADOCC-CLI-init]: Updated new status: init - Success
INFO [PADOCC-CLI-init]: Created 12 files, 4 directories in group rain-example
INFO [PADOCC-CLI-init]: Written as group ID: rain-example
Scan
The first main phase of the pipeline involves scanning a subset of the native source files to determine certain parameters:
- Ensure source files are compatible with one of the available converters for Kerchunk/Zarr etc.
- Calculate expected memory (for job allocation later).
- Calculate estimated chunk sizes and other values.
- Determine the suggested file type, including whether to use JSON or Parquet for Kerchunk references.
- Identify identical/concat dims for use in the Compute phase.
- Determine any other specific parameters for the dataset on creation and concatenation.
A scan operation is performed across a group of datasets/projects to determine specific properties of each project and some estimates of time/memory allocations that will be required in later phases.
The scan phase can be activated with the following:
$ padocc scan -G my-group -C kerchunk
Alternatively, you can run any of the phases interactively in a python shell/notebook environment:
# The import path is assumed here; check the padocc documentation for your installed version.
from padocc import GroupOperation

mygroup = GroupOperation(
    'my-group',
    workdir='path/to/pipeline/directory'
)

# Assuming this group has already been initialised from a file.
mygroup.run('scan', mode='kerchunk')
The above demonstrates why the command line tool is easier to use for phased operations, as most of the configuration is known and handled using the various flags. Interactive operations (like checking specific project properties) are not covered by the CLI tool, so they need to be completed using an interactive environment.
Compute
Building the Cloud/reference product for a dataset requires a multi-step process:
Example for Kerchunk:
- Create Kerchunk references for each archive-type file.
- Save a cache of references for each file prior to concatenation.
- Perform concatenation (abort if concatenation fails; the cache can be loaded on a second attempt).
- Perform metadata corrections (based on updates and removals specified at the start).
- Add Kerchunk history global attributes (creation time, pipeline version etc.).
- Reconfigure each chunk for remote access (replace local paths with https:// download paths).
Computation refers either to outright data conversion to a new format, or to referencing via one of the Kerchunk drivers to create a reference file. In either case the computation may be extensive, requiring processing in the background or deployment and parallelisation across the group of projects.
Computation can be executed in serial for a group with the following:
$ padocc compute -G my-group
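For example, a hypothetical compute call that explicitly selects Kerchunk as the output mode and adds verbose logging might look like:
$ padocc compute -G my-group -C kerchunk -v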
Validate
Cloud products must be validated against equivalent Xarray objects from CF Aggregations (CFA) where possible, or otherwise using the original NetCDF as separate Xarray Datasets. The validation aims to:
- Ensure all variables present in the original files are present in the cloud products (barring exceptions where metadata has been altered/corrected).
- Ensure array shapes are consistent across the products.
- Ensure data representations are consistent (values in array subsets).
The validation step produces a two-section report that outlines validation warnings and errors with the data or metadata of the project. See the documentation on the validation report for more details.
It is advised to run the validator for all projects in a group to determine any issues with the conversion process. Some file types or specific arrangements may produce unwanted effects that result in differences between the original and new representations. This can be identified with the validator which checks the Xarray representations and identifies differences in both data and metadata.
$ padocc validate -G my-group
Next Steps
Cloud products that have been validated are moved to a complete directory, named with the project code plus the revision identifier abX.X (learn more about this in the Extra section).
These can then be linked to a catalog or ingested into the CEDA archive where appropriate.