A Deeper Dive into PADOCC Mechanics

Revision Numbers

The PADOCC revision numbers for each product are auto-generated using the following rules.

  • All projects begin with the revision number 1.1.

  • The first number denotes major updates to the product, for instance where a data source file has been replaced.

  • The second number denotes minor changes like alterations to attributes and metadata.

  • The letters prefixed to the revision number identify the file type of the product. For example, a Zarr store has the letter z applied, while a Kerchunk (parquet) store has kp.
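
The rules above can be sketched as a small parser. This is an illustrative sketch only, not PADOCC's own implementation: the names Revision, parse_revision and bump_minor are hypothetical, and the exact string layout (lowercase prefix letters followed directly by major.minor, as in kp1.1) is an assumption based on the description above. A major bump is left out because the text does not say what happens to the second number when the first one changes.

```python
import re
from typing import NamedTuple


class Revision(NamedTuple):
    """Hypothetical container for a revision string such as 'kp1.1'."""
    prefix: str  # file-type letters, e.g. 'z' (Zarr) or 'kp' (Kerchunk parquet)
    major: int   # first number: major updates, e.g. a replaced source file
    minor: int   # second number: minor changes to attributes and metadata

    def __str__(self) -> str:
        return f"{self.prefix}{self.major}.{self.minor}"


# Assumed layout: lowercase prefix letters, then '<major>.<minor>'.
_REVISION = re.compile(r"^([a-z]+)(\d+)\.(\d+)$")


def parse_revision(text: str) -> Revision:
    """Split a revision string into prefix, major and minor parts."""
    match = _REVISION.match(text)
    if match is None:
        raise ValueError(f"not a revision string: {text!r}")
    prefix, major, minor = match.groups()
    return Revision(prefix, int(major), int(minor))


def bump_minor(rev: Revision) -> Revision:
    """Return a new revision with the second number increased by one."""
    return rev._replace(minor=rev.minor + 1)
```

For example, bump_minor(parse_revision("kp1.1")) yields a revision that prints as kp1.2.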

The Validation Report

The ValidateDatasets class produces a validation report covering both data and metadata validation. The report is designed to be simple to interpret while remaining machine-readable. The headings that may appear in the report have the following meanings:

1. Metadata Report (with Examples)

These are considered non-fatal errors that either need a minor correction or can be ignored.

  • variables.time: {'type':'missing'...} - The time variable is missing from the specified product.

  • dims.all_dims: {'type':'order'} - The ordering of dimensions is not consistent across products.

  • attributes {'type':'ignore'...} - Attributes that have been ignored. These may have already been edited.

  • attributes {'type':'missing'...} - Attributes that are missing from the specified product file.

  • attributes {'type':'not_equal'...} - Attributes that are not equal across products.

2. Data Report

These are considered fatal errors that need a major correction or possibly a fix to the pipeline itself.

  • size_errors - The size of the array is not consistent between products.

  • dim_errors - Arrays have inconsistent dimensions (where not ignored).

  • dim_size_errors - The dimensions are consistent for a variable but their sizes are not.

  • data_errors - The data arrays do not match across products. This is the most serious of all validation errors. The validator should give an idea of which array comparisons failed.

  • data_errors: {'type':'growbox_exceeded'...} - The variable in question could not be validated because no region of the array containing values could be identified.
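
Because the report is machine-readable, the fatal and non-fatal split described above can be checked programmatically. The sketch below is a minimal illustration under stated assumptions: the top-level keys "metadata" and "data", the nested dictionary layout, and the variable name "tas" are all hypothetical stand-ins, and summarise is not a PADOCC function. Only the four data-error headings are taken from the list above.

```python
# Headings from the Data Report section; any non-empty entry counts as fatal.
FATAL_KEYS = ("size_errors", "dim_errors", "dim_size_errors", "data_errors")


def summarise(report: dict) -> dict:
    """Classify report entries as fatal (data) or non-fatal (metadata).

    Assumes a hypothetical layout: {"metadata": {...}, "data": {...}}.
    """
    data = report.get("data", {})
    metadata = report.get("metadata", {})
    # A heading only counts if it actually holds one or more entries.
    fatal = [key for key in FATAL_KEYS if data.get(key)]
    nonfatal = [key for key, value in metadata.items() if value]
    return {"fatal": fatal, "nonfatal": nonfatal, "passed": not fatal}


# Hypothetical report used purely for illustration.
example = {
    "metadata": {
        "variables": {"time": {"type": "missing"}},
        "attributes": {},
    },
    "data": {
        "size_errors": {},
        "data_errors": {"tas": {"type": "growbox_exceeded"}},
    },
}

summary = summarise(example)
```

Here summary["fatal"] lists only data_errors, since size_errors is empty, and summary["passed"] is False.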

BypassSwitch Options

Certain non-fatal errors may be bypassed using the Bypass flag:

Format: -b "D"

Default: "D" # Highlighted by a '*'

"D" - * Skip driver failures - Pipeline tries different options for NetCDF (default).
    -   Only need to turn this skip off if all drivers fail (KerchunkDriverFatalError).
"F" -   Skip scanning (fasttrack) and go straight to compute. Required if running compute before scan
        is attempted.
"L" -   Skip adding links in compute (download links) - this will be required on ingest.
"S" -   Skip errors when running a subset within a group. Record the error then move onto the next dataset.

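The letter codes above can be read as a set of independent switches. The following sketch shows one way to turn such a string into named booleans. It is not PADOCC's own BypassSwitch class: BypassOptions, parse_bypass and the field names are hypothetical, and combining several letters in one string (for example "DF") is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BypassOptions:
    """Hypothetical record of which skips are enabled."""
    skip_driver: bool = False   # "D": skip driver failures
    skip_scan: bool = False     # "F": skip scanning (fasttrack)
    skip_links: bool = False    # "L": skip adding links in compute
    skip_subsets: bool = False  # "S": skip errors within a subset


# Mapping from the documented letters to the hypothetical field names.
_FLAGS = {
    "D": "skip_driver",
    "F": "skip_scan",
    "L": "skip_links",
    "S": "skip_subsets",
}


def parse_bypass(switch: str = "D") -> BypassOptions:
    """Convert a letter string such as "DF" into a BypassOptions record."""
    unknown = sorted(set(switch) - set(_FLAGS))
    if unknown:
        raise ValueError(f"unknown bypass letters: {unknown}")
    return BypassOptions(**{_FLAGS[letter]: True for letter in set(switch)})
```

Calling parse_bypass() with no argument mirrors the documented default of "D", so only the driver skip is enabled.
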
Custom Pipeline Errors

A summary of the custom errors that may be raised while running the pipeline.

exception padocc.core.errors.ArchiveConnectError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Connection to the CEDA Archive could not be established

exception padocc.core.errors.BlacklistProjectCode(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

The project code you are trying to run for is on the list of project codes to ignore.

exception padocc.core.errors.ChunkDataError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Overflow Error from pandas during decoding of chunk information, most likely caused by bad data retrieval.

exception padocc.core.errors.ComputeError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Compute stage failed - likely due to invalid config/use of the classes

exception padocc.core.errors.ConcatFatalError(var: str | None = None, chunk1: int | None = None, chunk2: int | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Chunk sizes differ between refs - files cannot be concatenated

exception padocc.core.errors.ConcatenationError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Variables could not be concatenated over time and are not duplicates - no known solution

exception padocc.core.errors.ExpectMemoryError(required: int = 0, current: str = '', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

The process is expected to run out of memory given size estimates.

exception padocc.core.errors.ExpectTimeoutError(required: int = 0, current: str = '', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

The process is expected to time out given timing estimates.

exception padocc.core.errors.FilecapExceededError(nfiles: int = 0, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

During scanning, could not find suitable files within the set of files specified.

exception padocc.core.errors.FullsetRequiredError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

This project must be validated using the full set of files.

exception padocc.core.errors.IdenticalVariablesError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

All variables were found to be sufficiently identical between files that they cannot be stacked or concatenated.

exception padocc.core.errors.KerchunkDecodeError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Decoding of Kerchunk file failed - likely a time array issue.

exception padocc.core.errors.KerchunkDriverFatalError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

All drivers failed (NetCDF3/Hdf5/Tiff) - run without driver bypass to assess the issue with each driver type.

exception padocc.core.errors.KerchunkException(proj_code: str | None, groupdir: str | None)

Bases: Exception

exception padocc.core.errors.MissingKerchunkError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Kerchunk file not found.

exception padocc.core.errors.MissingVariableError(vtype: str = '$', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

A variable is missing from the environment or set of arguments.

exception padocc.core.errors.NaNComparisonError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Different values were found when comparing NaN values between objects.

exception padocc.core.errors.NoOverwriteError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Output file already exists and the process does not have forceful overwrite (-f) set.

exception padocc.core.errors.NoValidTimeSlicesError(message: str = 'kerchunk', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Unable to find any time slices to test within the object.

exception padocc.core.errors.PartialDriverError(filenums: int | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

All drivers failed (NetCDF3/Hdf5/Tiff) for one or more files within the list

exception padocc.core.errors.ProjectCodeError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Could not find the correct project code from the list of project codes for this run.

exception padocc.core.errors.RemoteProtocolError(filenums: int | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

All drivers failed (NetCDF3/Hdf5/Tiff) for one or more files within the list

exception padocc.core.errors.ShapeMismatchError(var: dict | None = None, first: dict | None = None, second: dict | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Shapes of ND arrays do not match between Kerchunk and Xarray objects - when using a subset of the NetCDF files.

exception padocc.core.errors.SoftfailBypassError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Validation could not be completed because some arrays only contained NaN values which cannot be compared.

exception padocc.core.errors.SourceNotFoundError(sfile: str | None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Source File could not be located.

exception padocc.core.errors.TrueShapeValidationError(message: str = 'kerchunk', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Shapes of ND arrays do not match between Kerchunk and Xarray objects - when using the complete set of files.

exception padocc.core.errors.ValidationError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

One or more checks within validation have failed - most likely elementwise comparison of data.

exception padocc.core.errors.VariableMismatchError(missing: dict | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

During testing, variables present in the NetCDF file are not present in Kerchunk

exception padocc.core.errors.XKShapeToleranceError(tolerance: int = 0, diff: int = 0, dim: str = '', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)

Bases: KerchunkException

Attempted validation using a tolerance for shape mismatch on concat-dims, shape difference exceeds tolerance allowance.

padocc.core.errors.error_handler(err: Exception, logger: Logger, phase: str, dryrun: bool = False, subset_bypass: bool = False, jobid: str | None = None, status_fh: object | None = None)

This function should be used in top-level loops over project codes ONLY - not within the main body of the package.

  1. Single slurm job failed - raise Error

  2. Single serial job failed - raise Error

  3. One of a set of tasks failed - print error for that dataset as traceback.
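
The three cases above amount to a "raise for a single job, record and continue for a set of tasks" pattern. The sketch below illustrates that pattern only. It does not reproduce the body of error_handler: run_all and run_one are hypothetical names, and the two exception classes are local stand-ins that mirror the documented hierarchy so that the example runs without PADOCC installed. Since every custom error derives from KerchunkException, a single except clause covers them all.

```python
import logging
import traceback


class KerchunkException(Exception):
    """Local stand-in for padocc.core.errors.KerchunkException."""


class MissingKerchunkError(KerchunkException):
    """Local stand-in for one of the documented subclasses."""


def run_all(proj_codes, run_one, logger, subset_bypass=False):
    """Run run_one for each project code at the top level of a script.

    With subset_bypass False (single job), the first error is re-raised.
    With subset_bypass True (set of tasks), the traceback is logged and
    the loop moves on to the next project code.
    """
    failed = {}
    for proj_code in proj_codes:
        try:
            run_one(proj_code)
        except KerchunkException as err:
            if not subset_bypass:
                raise
            logger.error("%s failed:\n%s", proj_code, traceback.format_exc())
            failed[proj_code] = type(err).__name__
    return failed


def demo_task(proj_code):
    """Hypothetical task that fails for one project code."""
    if proj_code == "b":
        raise MissingKerchunkError(proj_code)


results = run_all(["a", "b", "c"], demo_task,
                  logging.getLogger("sketch"), subset_bypass=True)
```

In this run the failure for "b" is recorded in results while "a" and "c" still complete.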