PADOCC Bespoke Features
Climate Forecast Aggregations
PADOCC uses Climate Forecast Aggregations (CFA) as a basis of comparison for any generated cloud formats during the validation process. This base operation can now be disabled using the cfa_enabled property of each project. Set this to False for any group that takes too long to compute in CFA (watch out for this in the scan phase: if a sample takes a very long time, the full dataset will likely take much longer).
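As a minimal sketch, disabling CFA for a single project might look like the following (the GroupOperation import path and the project accessor are assumptions about the padocc API; only the cfa_enabled property is named above):

    from padocc import GroupOperation  # assumed import path

    # Hypothetical group and project names for illustration.
    group = GroupOperation('my-group', workdir='/path/to/workdir')
    project = group.get_project('my-project-code')  # assumed accessor

    # Disable the CFA base comparison for a project that is too slow to compute.
    project.cfa_enabled = False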
Remote connection to object storage
PADOCC now has the capability to write Zarr stores to s3 storage endpoints, using s3 object storage as the immediate storage medium for Zarr datasets. This means that Zarr stores generated via PADOCC can be written to object storage on creation, without filling up local disk space. Future updates will also include transfer mechanisms for Kerchunk datasets, where the Kerchunk data must be edited and then transferred.
Remote s3 configuration
The following configuration details must be passed to one of the entrypoints for remote s3 connections in PADOCC:
- The add_project function, when creating a new project.
- The add_s3_config function, for an existing project.
Remote s3 config:
    {
        "s3_url": "http://<tenancy-name-o>.s3.jc.rl.ac.uk",
        "bucket_id": "my-existing-bucket",
        "s3_kwargs": None,
        "s3_credentials": "/path/to/credentials/json"
    }
For JASMIN object store tenancies, see the Object Store Services Portal, plus the documentation page on how to set up s3 credentials. It is best to keep the credentials in a separate file, as this config info will be copied to all projects being accessed.
Once this config has been added to the project, any subsequent compute operation will generate Zarr data in the given object store space. Note: the creation may produce errors if interrupted partway through. Simply delete the content on the object store and start again - this is a bug and will be fixed in due course.
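As a minimal sketch, attaching this configuration to an existing project might look like the following (the GroupOperation import and project accessor are assumptions; only add_s3_config is named in this documentation):

    from padocc import GroupOperation  # assumed import path

    s3_config = {
        "s3_url": "http://<tenancy-name-o>.s3.jc.rl.ac.uk",
        "bucket_id": "my-existing-bucket",
        "s3_kwargs": None,
        "s3_credentials": "/path/to/credentials/json",
    }

    group = GroupOperation('my-group', workdir='/path/to/workdir')  # hypothetical names
    project = group.get_project('my-project-code')  # assumed accessor
    project.add_s3_config(s3_config)  # entrypoint named above

    # Any subsequent compute operation now writes Zarr data to the bucket.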
The Validation Report
The ValidateDatasets class produces a validation report for both data and metadata validations.
This is designed to be fairly simple to interpret, while still being machine-readable.
The headings that may appear in the report have the following meanings:
1. Metadata Report (with examples)
These are considered non-fatal errors that either need a minor correction or can be ignored.
- variables.time: {'type':'missing'...} - The time variable is missing from the specified product.
- dims.all_dims: {'type':'order'} - The ordering of dimensions is not consistent across products.
- attributes {'type':'ignore'...} - Attributes that have been ignored. These may have already been edited.
- attributes {'type':'missing'...} - Attributes that are missing from the specified product file.
- attributes {'type':'not_equal'...} - Attributes that are not equal across products.
2. Data Report
These are typically considered fatal errors that require further examination, and possibly new developments to the pipeline or changes to the native data structures.
- size_errors - The size of the array is not consistent between products.
- dim_errors - Arrays have inconsistent dimensions (where not ignored).
- dtype/precision - Variables/dimensions have been cast to new dtypes/precisions, most often 32-bit to 64-bit precision.
- dim_size_errors - The dimensions are consistent for a variable but their sizes are not.
- data_errors - The data arrays do not match across products; this is the most fatal of all validation errors. The validator should give an idea of which array comparisons failed.
- data_errors: {'type':'growbox_exceeded'...} - The variable in question could not be validated, as no area could be identified that is not empty of values.
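For illustration, a report combining these headings might look like the following sketch (the exact nesting is an assumption pieced together from the shorthands above; consult a real report for the authoritative layout):

    # Hypothetical validation report, assembled from the shorthands above.
    report = {
        'metadata': {
            'variables.time': {'type': 'missing'},        # non-fatal
            'attributes': {'type': 'not_equal'},          # non-fatal
        },
        'data': {
            'dim_size_errors': {'dims': ['lat', 'lon']},  # hypothetical detail
            'data_errors': {'type': 'growbox_exceeded'},  # fatal
        },
    }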
BypassSwitch Options
Certain non-fatal errors may be bypassed using the Bypass flag:
Format: -b "D"
Default: "D"   (the default switch is highlighted by a '*')

"D" - * Skip driver failures - the pipeline tries different options for NetCDF (default).
      Only turn this skip off if all drivers fail (KerchunkDriverFatalError).
"F" - Skip scanning (fasttrack) and go straight to compute. Required if running compute before scan is attempted.
"L" - Skip adding links in compute (download links) - this will be required on ingest.
"S" - Skip errors when running a subset within a group; record the error, then move on to the next dataset.
Custom Pipeline Errors
A summary of the custom errors that may be encountered while running the pipeline.
- exception padocc.core.errors.AggregationError(agg_dims, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Aggregation dimension(s) are not properly arranged.
- exception padocc.core.errors.ArchiveConnectError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Connection to the CEDA Archive could not be established.
- exception padocc.core.errors.ChunkDataError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Overflow Error from pandas during decoding of chunk information, most likely caused by bad data retrieval.
- exception padocc.core.errors.ComputeError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Compute stage failed - likely due to invalid config/use of the classes.
- exception padocc.core.errors.ConcatFatalError(var: str | None = None, chunk1: int | None = None, chunk2: int | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Chunk sizes differ between refs - files cannot be concatenated.
- exception padocc.core.errors.ConcatenationError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Variables could not be concatenated over time and are not duplicates - no known solution.
- exception padocc.core.errors.ExpectMemoryError(required: int = 0, current: str = '', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
The process is expected to run out of memory given size estimates.
- exception padocc.core.errors.ExpectTimeoutError(required: int = 0, current: str = '', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
The process is expected to time out given timing estimates.
- exception padocc.core.errors.KerchunkDecodeError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Decoding of Kerchunk file failed - likely a time array issue.
- exception padocc.core.errors.KerchunkDriverFatalError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
All drivers failed (NetCDF3/Hdf5/Tiff) - run without driver bypass to assess the issue with each driver type.
- exception padocc.core.errors.KerchunkException(proj_code: str | None, groupdir: str | None)
Bases: Exception
General exception type.
- exception padocc.core.errors.MissingDataError(reason, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Data missing from Kerchunk product.
- exception padocc.core.errors.MissingKerchunkError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Kerchunk file not found.
- exception padocc.core.errors.MissingVariableError(vtype: str = '$', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
A variable is missing from the environment or set of arguments.
- exception padocc.core.errors.NoOverwriteError(verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Output file already exists and the process does not have forceful overwrite (-f) set.
- exception padocc.core.errors.NoValidTimeSlicesError(message: str = 'kerchunk', verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Unable to find any time slices to test within the object.
- exception padocc.core.errors.PartialDriverError(filenums: int | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
All drivers failed (NetCDF3/Hdf5/Tiff) for one or more files within the list.
- exception padocc.core.errors.SourceNotFoundError(sfile: str | None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
Source file could not be located.
- exception padocc.core.errors.ValidationError(report_err: str | None = None, verbose: int = 0, proj_code: str | None = None, groupdir: str | None = None)
Bases: KerchunkException
One or more checks within validation have failed - most likely elementwise comparison of data.
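These exceptions can be caught like any other. A minimal sketch, assuming a project object with a phase-running entrypoint (the run method name is an assumption; the exception classes are those documented above):

    from padocc.core.errors import KerchunkDriverFatalError, ValidationError

    def compute_with_handling(project) -> None:
        # 'project' is assumed to be a padocc project object, as in earlier sketches.
        try:
            project.run('compute')  # assumed entrypoint name
        except KerchunkDriverFatalError:
            # All NetCDF drivers failed; rerun without the 'D' bypass to
            # assess the issue with each driver type.
            raise
        except ValidationError as err:
            # One or more validation checks failed; consult the validation report.
            print(f'Validation failed: {err}')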
- padocc.core.errors.error_handler(err: Exception, logger: Logger, phase: str, subset_bypass: bool = False, jobid: str | None = None, status_fh: object | None = None, agg_shorthand: str = '') -> str
This function should be used in top-level loops over project codes ONLY - not within the main body of the package. Its behaviour depends on the context:
- Single SLURM job failed - raise the error.
- Single serial job failed - raise the error.
- One of a set of tasks failed - print the error for that dataset as a traceback.
- Parameters:
  - err - (Exception) Error raised within some part of the pipeline.
  - logger - (logging.Logger) Logging operator for any messages.
  - subset_bypass - (bool) Skip raising an error if this operation is part of a sequence.
  - jobid - (str) The ID of the SLURM job, if present.
  - status_fh - (object) PADOCC filehandler used to update the status.
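A sketch of how error_handler might sit in such a top-level loop (the project-code list and phase runner are hypothetical; only error_handler and its parameters are documented above):

    import logging

    from padocc.core.errors import error_handler

    logger = logging.getLogger('padocc')

    def run_phase(proj_code: str, phase: str) -> None:
        # Hypothetical stand-in for running one pipeline phase on one project.
        ...

    for proj_code in ['code_a', 'code_b']:  # hypothetical project codes
        try:
            run_phase(proj_code, 'compute')
        except Exception as err:
            # subset_bypass=True records the error and continues with the next
            # dataset rather than raising, since this is one task in a set.
            error_handler(err, logger, 'compute', subset_bypass=True)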
- padocc.core.errors.worst_error(report: dict, bypass: dict = None) -> str
Determine the worst error level in a report and return it as a string.
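For example, given a report dict like the validation report sketched earlier (the report content here is hypothetical):

    from padocc.core.errors import worst_error

    report = {'data': {'data_errors': {'type': 'growbox_exceeded'}}}  # hypothetical
    level = worst_error(report)  # a string naming the most severe error found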