A Deeper Dive into PADOCC Mechanics
Revision Numbers
The PADOCC revision numbers for each product are auto-generated using the following rules.
All projects begin with the revision number 1.1. The first number denotes major updates to the product, for instance where a data source file has been replaced.
The second number denotes minor changes like alterations to attributes and metadata.
The letters prefixed to the revision numbers identify the file type for the product. For example, a Zarr store has the letter z applied, while a Kerchunk (parquet) store has kp.
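The rules above can be illustrated with a small parser. This is a minimal sketch, not part of PADOCC: the Revision container, the parse_revision helper and the exact string layout (lowercase letters followed by major.minor, such as kp1.1) are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Revision(NamedTuple):
    """Hypothetical container for a parsed revision (not a PADOCC class)."""
    prefix: str  # file-type letters, e.g. "z" (Zarr) or "kp" (Kerchunk parquet)
    major: int   # major updates, e.g. a replaced data source file
    minor: int   # minor changes, e.g. edited attributes or metadata


# Assumed layout: one or more lowercase letters, then "<major>.<minor>".
_REVISION_RE = re.compile(r"^([a-z]+)(\d+)\.(\d+)$")


def parse_revision(text: str) -> Revision:
    """Split a revision string such as 'kp1.1' into its three parts."""
    match = _REVISION_RE.match(text)
    if match is None:
        raise ValueError(f"Unrecognised revision string: {text!r}")
    prefix, major, minor = match.groups()
    return Revision(prefix, int(major), int(minor))


def format_revision(rev: Revision) -> str:
    """Rebuild the string form from a parsed revision."""
    return f"{rev.prefix}{rev.major}.{rev.minor}"
```

For example, parse_revision("kp1.1") yields the prefix kp with major and minor numbers both equal to 1, and format_revision reverses the operation.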
The Validation Report
The ValidateDatasets class produces a validation report for both data and metadata validations.
This is designed to be fairly simple to interpret, while still being machine-readable.
The headings that may appear in the report have the following meanings:
1. Metadata Report (with Examples)
These are considered non-fatal errors that either need a minor correction or can be ignored.
variables.time: {'type':'missing'...}
- The time variable is missing from the specified product.
dims.all_dims: {'type':'order'}
- The ordering of dimensions is not consistent across products.
attributes {'type':'ignore'...}
- Attributes that have been ignored. These may have already been edited.
attributes {'type':'missing'...}
- Attributes that are missing from the specified product file.
attributes {'type':'not_equal'...}
- Attributes that are not equal across products.
2. Data Report
These are considered fatal errors that need a major correction or possibly a fix to the pipeline itself.
size_errors
- The size of the array is not consistent between products.
dim_errors
- Arrays have inconsistent dimensions (where not ignored).
dim_size_errors
- The dimensions are consistent for a variable but their sizes are not.
data_errors
- The data arrays do not match across products. This is the most fatal of all validation errors. The validator should give an idea of which array comparisons failed.
data_errors: {'type':'growbox_exceeded'...}
- The variable in question could not be validated because no region of the array containing values could be identified.
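Because the report is machine-readable, the fatal and non-fatal headings can be separated programmatically. The sketch below assumes a hypothetical layout in which the report is a dictionary with top-level "metadata" and "data" keys. The real structure produced by ValidateDatasets may differ, and the triage function and the example values (including the variable name tas) are illustrative only.

```python
from typing import Any

# Data-report headings described above; all of them are treated as fatal.
FATAL_HEADINGS = ("size_errors", "dim_errors", "dim_size_errors", "data_errors")


def triage(report: dict[str, Any]) -> dict[str, list[str]]:
    """Split a report into fatal and non-fatal headings.

    Assumes a hypothetical layout: {"metadata": {...}, "data": {...}}.
    Headings whose entry is empty are treated as having no errors.
    """
    data = report.get("data", {})
    metadata = report.get("metadata", {})
    fatal = [name for name in FATAL_HEADINGS if data.get(name)]
    non_fatal = sorted(name for name, entry in metadata.items() if entry)
    return {"fatal": fatal, "non_fatal": non_fatal}


# Illustrative report content, not output from a real validation run.
example = {
    "metadata": {"variables": {"time": {"type": "missing"}}, "attributes": {}},
    "data": {"size_errors": {}, "data_errors": {"tas": {"type": "growbox_exceeded"}}},
}
result = triage(example)
```

Here result lists data_errors as the only fatal heading and variables as the only non-fatal one, since the empty entries are skipped.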
BypassSwitch Options
Certain non-fatal errors may be bypassed using the Bypass flag:
Format: -b "D"
Default: "D" # Highlighted by a '*'
"D" - * Skip driver failures - Pipeline tries different options for NetCDF (default).
- Only need to turn this skip off if all drivers fail (KerchunkDriverFatalError).
"F" - Skip scanning (fasttrack) and go straight to compute. Required if running compute before scan
is attempted.
"L" - Skip adding links in compute (download links) - this will be required on ingest.
"S" - Skip errors when running a subset within a group. Record the error then move onto the next dataset.
Custom Pipeline Errors
A summary of the custom errors that may be raised while running the pipeline.
- exception padocc.core.errors.ArchiveConnectError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Connection to the CEDA Archive could not be established
- exception padocc.core.errors.BlacklistProjectCode(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
The project code you are trying to run for is on the list of project codes to ignore.
- exception padocc.core.errors.ChunkDataError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Overflow Error from pandas during decoding of chunk information, most likely caused by bad data retrieval.
- exception padocc.core.errors.ComputeError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Compute stage failed - likely due to invalid config/use of the classes
- exception padocc.core.errors.ConcatFatalError(var: str | None = None, chunk1: int | None = None, chunk2: int | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Chunk sizes differ between refs - files cannot be concatenated
- exception padocc.core.errors.ConcatenationError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Variables could not be concatenated over time and are not duplicates - no known solution
- exception padocc.core.errors.ExpectMemoryError(required: int = 0, current: str = '', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
The process is expected to run out of memory given size estimates.
- exception padocc.core.errors.ExpectTimeoutError(required: int = 0, current: str = '', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
The process is expected to time out given timing estimates.
- exception padocc.core.errors.FilecapExceededError(nfiles: int = 0, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
During scanning, could not find suitable files within the set of files specified.
- exception padocc.core.errors.FullsetRequiredError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
This project must be validated using the full set of files.
- exception padocc.core.errors.IdenticalVariablesError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
All variables were found to be sufficiently identical between files that they cannot be stacked or concatenated.
- exception padocc.core.errors.KerchunkDecodeError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Decoding of Kerchunk file failed - likely a time array issue.
- exception padocc.core.errors.KerchunkDriverFatalError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
All drivers failed (NetCDF3/Hdf5/Tiff) - run without driver bypass to assess the issue with each driver type.
- exception padocc.core.errors.KerchunkException(proj_code: str | None, groupdir: str | None)
Bases:
Exception
- exception padocc.core.errors.MissingKerchunkError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Kerchunk file not found.
- exception padocc.core.errors.MissingVariableError(vtype: str = '$', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
A variable is missing from the environment or set of arguments.
- exception padocc.core.errors.NaNComparisonError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
When comparing NaN values between objects - different values found
- exception padocc.core.errors.NoOverwriteError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Output file already exists and the process does not have forceful overwrite (-f) set.
- exception padocc.core.errors.NoValidTimeSlicesError(message: str = 'kerchunk', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Unable to find any time slices to test within the object.
- exception padocc.core.errors.PartialDriverError(filenums: int | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
All drivers failed (NetCDF3/Hdf5/Tiff) for one or more files within the list
- exception padocc.core.errors.ProjectCodeError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Could not find the correct project code from the list of project codes for this run.
- exception padocc.core.errors.RemoteProtocolError(filenums: int | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
All drivers failed (NetCDF3/Hdf5/Tiff) for one or more files within the list
- exception padocc.core.errors.ShapeMismatchError(var: dict | None = None, first: dict | None = None, second: dict | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Shapes of ND arrays do not match between Kerchunk and Xarray objects - when using a subset of the NetCDF files.
- exception padocc.core.errors.SoftfailBypassError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Validation could not be completed because some arrays only contained NaN values which cannot be compared.
- exception padocc.core.errors.SourceNotFoundError(sfile: str | None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Source File could not be located.
- exception padocc.core.errors.TrueShapeValidationError(message: str = 'kerchunk', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Shapes of ND arrays do not match between Kerchunk and Xarray objects - when using the complete set of files.
- exception padocc.core.errors.ValidationError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
One or more checks within validation have failed - most likely elementwise comparison of data.
- exception padocc.core.errors.VariableMismatchError(missing: dict | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
During testing, variables present in the NetCDF file are not present in Kerchunk
- exception padocc.core.errors.XKShapeToleranceError(tolerance: int = 0, diff: int = 0, dim: str = '', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases:
KerchunkException
Attempted validation using a tolerance for shape mismatch on concat-dims, shape difference exceeds tolerance allowance.
- padocc.core.errors.error_handler(err: Exception, logger: Logger, phase: str, dryrun: bool = False, subset_bypass: bool = False, jobid: str | None = None, status_fh: object | None = None)
This function should be used at top-level loops over project codes ONLY - not within the main body of the package.
Single slurm job failed - raise Error
Single serial job failed - raise Error
One of a set of tasks failed - print error for that dataset as traceback.
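The pattern described for error_handler, where errors are handled only in the top-level loop over project codes, can be sketched as follows. This is an illustration of the idea rather than the PADOCC implementation: PipelineError is a stand-in for KerchunkException so the example runs without the package, and run_all is a hypothetical helper.

```python
import logging

logger = logging.getLogger("padocc-sketch")


class PipelineError(Exception):
    """Stand-in for KerchunkException so the sketch runs without PADOCC."""


def run_all(proj_codes, task, subset_bypass=False):
    """Run task for each project code, handling errors at the top level only.

    Without subset_bypass the first failure propagates to the caller.
    With subset_bypass the failure is logged with its traceback and the
    loop moves on to the next project code.
    """
    failures = {}
    for code in proj_codes:
        try:
            task(code)
        except PipelineError as err:
            if not subset_bypass:
                raise  # single job: let the error propagate
            logger.error("Project %s failed: %s", code, err, exc_info=True)
            failures[code] = type(err).__name__
    return failures
```

Keeping the try/except in one place means functions deeper in the package can raise freely, and the decision to stop or continue is made once per project code.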