Compute Module

class pipeline.compute.KerchunkConverter(clogger=None, bypass_driver=False, ctype=None, verbose=1)

Bases: object

Class for converting a single file to a Kerchunk reference object

convert_to_zarr(nfile: str, extension=False, **kwargs) → None

Perform conversion to Zarr, with exception handling so that driver errors can be bypassed.

Parameters:
  • nfile – (str) Path to a local native file of an appropriate type to be converted.

  • extension – (str) File extension relating to the file type, if known. For the first file, all extensions/drivers are tried; subsequent files in the same dataset start with whichever extension worked for the first file.

Returns:

The output of the successful driver, or None if the driver is unsuccessful. Driver errors are bypassed if the bypass_driver option is set for this class.
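
A minimal usage sketch (the file path is illustrative, and 'hdf5' as an extension hint is an assumption based on the driver names below):

   from pipeline.compute import KerchunkConverter

   # bypass_driver=True means a failing driver returns None rather
   # than raising the underlying error.
   converter = KerchunkConverter(bypass_driver=True)

   # Illustrative path; the extension hint is optional.
   refs = converter.convert_to_zarr('/data/example.nc', extension='hdf5')
   if refs is None:
       print('All drivers failed for this file')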

grib_to_zarr(gfile: str, **kwargs) → dict

Wrapper for converting GRIB type files to Kerchunk.

hdf5_to_zarr(nfile: str, **kwargs) → dict

Wrapper for converting NetCDF4/HDF5 type files to Kerchunk.

load_individual_ref(cache_ref: str) → dict

Wrapper for retrieving the cached contents of a proj_file cache_ref.

Returns:

Dictionary of refs if successful; None or a raised error otherwise.

ncf3_to_zarr(nfile: str, **kwargs) → dict

Wrapper for converting NetCDF3 type files to Kerchunk.

save_individual_ref(ref: dict, cache_ref: str, forceful=False) → None

Save each individual set of refs created for each file immediately to reduce loss of progress in the event of a failure somewhere in processing.
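
A minimal sketch of the cache round-trip, assuming cache_ref points to a writable location (both paths are illustrative):

   converter = KerchunkConverter(bypass_driver=True)
   cache_ref = '/cache/example_0001.json'

   refs = converter.convert_to_zarr('/data/example.nc')
   if refs is not None:
       # forceful=True would overwrite an existing cached ref.
       converter.save_individual_ref(refs, cache_ref, forceful=False)

   # After a restart, reload the cached refs instead of reconverting.
   cached = converter.load_individual_ref(cache_ref)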

tiff_to_zarr(tfile: str, **kwargs) → dict

Wrapper for converting GeoTIFF type files to Kerchunk.

try_all_drivers(nfile: str, **kwargs) → dict

Attempt conversion safely by trying multiple drivers, allowing for known issues.

Returns:

Dictionary of Kerchunk references if successful; raises an error otherwise.

class pipeline.compute.KerchunkDSProcessor(proj_code, **kwargs)

Bases: ProjectProcessor

Kerchunk Dataset Processor Class, capable of processing a single dataset’s worth of input files into a single concatenated Kerchunk file.

add_download_link(refs: dict) → dict

Add the download link to each of the Kerchunk references.

add_kerchunk_history(attrs: dict) → dict

Add kerchunk variables to the metadata for this dataset, including creation/update date and version/revision number.

combine_and_save(refs: dict, zattrs: dict) → None

Concatenate the ref data for the different Kerchunk schemes (JSON or Parquet).

construct_virtual_dim(refs: dict) → None

Construct a Virtual dimension for stacking multiple files where no suitable concatenation dimension is present.

create_refs() → None

Organise creation and loading of refs:
  • Load existing cached refs
  • Create new refs
  • Combine metadata and global attributes into a single set
  • Coordinate combining and saving of data
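
A minimal driving sketch (the project code and working directory are illustrative; the kwargs follow the ProjectProcessor signature below):

   from pipeline.compute import KerchunkDSProcessor

   processor = KerchunkDSProcessor('my-project-code',
                                   workdir='/work/pipeline',
                                   dryrun=False)

   # Runs the full sequence: load cached refs, create missing ones,
   # merge metadata, then combine and save.
   processor.create_refs()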

data_to_json(refs: dict, zattrs: dict) → None

Concatenate refs into a JSON-format Kerchunk file.

data_to_parq(refs: dict) → None

Concatenate refs into a Parquet-format Kerchunk store.

load_temp_zattrs() → dict

Load global attributes from a ‘temporary’ cache file.

perform_shape_checks(ref: dict) → None

Check the shape of each variable for inconsistencies which will require a thorough validation process.

save_metadata(zattrs: dict) → dict

Cache metadata global attributes in a temporary file.

class pipeline.compute.ProjectProcessor(proj_code, workdir=None, thorough=False, forceful=False, verb=0, mode=None, version_no='trial-', concat_msg='See individual files for more details', bypass=<pipeline.utils.BypassSwitch object>, groupID=None, limiter=None, dryrun=True, ctype=None, fh=None, logid=None, skip_concat=False, logger=None, new_version=None, **kwargs)

Bases: object

Processing for a single project, using Zarr/Kerchunk/COG; all class attributes common to these three processes are kept in one place.

check_time_attributes(times: dict) → dict

Takes a dict of time attributes with lists of values:
  • Sort time arrays
  • Assume time_coverage_start, time_coverage_end, duration (2 or 3 variables)

This class method is common to all zarr-like conversion types.
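
As an illustration of the expected input (key names follow the description above; the list-of-strings shape and values are assumptions), reusing the processor from the earlier sketch:

   times = {
       'time_coverage_start': ['2001-01-01T00:00:00Z', '2000-01-01T00:00:00Z'],
       'time_coverage_end': ['2001-12-31T23:59:59Z', '2000-12-31T23:59:59Z'],
       'duration': ['P1Y', 'P1Y'],
   }
   # Sorting each array lets the earliest start and latest end stand
   # for the combined dataset.
   resolved = processor.check_time_attributes(times)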

clean_attr_array(allzattrs: dict) → dict

Collect global attributes from all refs:
  • Determine which differ between refs and apply changes

This class method is common to all zarr-like conversion types.
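
An independent sketch of the idea (not the implementation): attributes that agree across refs are kept, while differing ones are replaced with the concat_msg placeholder from the class signature above:

   def sketch_clean_attrs(all_attrs: list, concat_msg: str) -> dict:
       # Keep attributes that agree across refs; mark differing ones.
       cleaned = {}
       keys = set().union(*(a.keys() for a in all_attrs))
       for key in keys:
           values = {str(a.get(key)) for a in all_attrs}
           cleaned[key] = all_attrs[0].get(key) if len(values) == 1 else concat_msg
       return cleaned

   attrs = sketch_clean_attrs(
       [{'title': 'Dataset', 'date': '2000-01-01'},
        {'title': 'Dataset', 'date': '2000-01-02'}],
       concat_msg='See individual files for more details',
   )
   # attrs['date'] == 'See individual files for more details'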

clean_attrs(zattrs: dict) → dict

Amend any saved attributes post-combining. Not currently implemented; may be unnecessary.

This class method is common to all zarr-like conversion types.

collect_details() → dict

Collect kwargs for combining and any special attributes, and save them to the detail file. Common class method for all conversion types.

correct_metadata(allzattrs: dict) → dict

General function for correcting metadata:
  • Combine all existing metadata in the standard way (cleaning arrays)
  • Apply the updates and removals specified by the configuration

This class method is common to all zarr-like conversion types.

determine_dim_specs(objs: list) → None

Perform identification of identical_dims and concat_dims here.

find_concat_dims(ds_examples: list, logger=<pipeline.logs.FalseLogger object>) → None

Find dimensions to use when combining for concatenation:
  • Dimensions which change over the set of files must be concatenated together
  • Dimensions which do not change (typically lat/lon) are instead identified as identical_dims

This class method is common to all conversion types.

find_identical_dims(ds_examples: list, logger=<pipeline.logs.FalseLogger object>) → None

Find dimensions and variables that are identical across the set of files:
  • Variables which do not change (typically lat/lon) are identified as identical_dims and not concatenated over the set of files
  • Variables which do change are concatenated as usual

This class method is common to all conversion types.
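
An independent sketch of the distinction using xarray (file paths are illustrative, and only two example files are compared):

   import xarray as xr

   ds_examples = [xr.open_dataset(p) for p in ('/data/a.nc', '/data/b.nc')]
   first, second = ds_examples

   # Coordinates whose values change between files (typically 'time')
   # must be concatenated; those identical in every file (typically
   # 'lat'/'lon') are treated as identical_dims instead.
   concat_dims = [d for d in first.dims
                  if d in first.coords and not first[d].equals(second[d])]
   identical_dims = [d for d in first.dims
                     if d in first.coords and first[d].equals(second[d])]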

get_timings() → dict

Export timed values if refs were all created from scratch. Loading refs invalidates the timings, so this returns None if any refs were loaded rather than created. Common class method for all conversion types.

Returns:

Dictionary of timing values if refs were not loaded; None otherwise, since loaded refs invalidate the timings.

set_filelist() → None

Get the list of files from the filelist for this dataset and set the ‘filelist’ attribute. Common class method for all conversion types.

class pipeline.compute.ZarrDSRechunker(proj_code, mem_allowed='100MB', preferences=None, **kwargs)

Bases: ProjectProcessor

Rechunk input data types directly into Zarr using the Pangeo Rechunker.
  • If refs already exist from previous Kerchunk runs, these can be used to inform the rechunker.
  • Otherwise rechunking starts from scratch.

obtain_file_subset() → None

Quick function for obtaining a subset of the whole fileset. Originally used to open all the files using Xarray for concatenation later.
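
A minimal construction sketch (the project code, working directory and memory limit are illustrative):

   from pipeline.compute import ZarrDSRechunker

   # mem_allowed caps the memory handed to the Pangeo Rechunker,
   # per the signature above.
   rechunker = ZarrDSRechunker('my-project-code',
                               workdir='/work/pipeline',
                               mem_allowed='1GB')
   rechunker.obtain_file_subset()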

pipeline.compute.compute_config(args, logger, fh=None, logid=None, **kwargs) → None

Serves as the main point of configuration for processing/conversion runs. Can set up Kerchunk or Zarr configurations and check that required files are present.

Parameters:
  • args – (obj) Set of command line arguments supplied by argparse.

  • logger – (obj) Logging object for info/debug/error messages. Will create a new logger object if not given one.

  • fh – (str) Path to file for logger I/O when defining new logger.

  • logid – (str) If creating a new logger, will need an id to distinguish this logger from other single processes (typically n of N total processes).

  • overide_type – (str) Set as JSON/parq/zarr to specify output cloud format type to use.

Returns:

None
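
A minimal invocation sketch (the attribute names on args are assumptions standing in for the pipeline's real argparse namespace; paths are illustrative):

   import argparse
   from pipeline.compute import compute_config

   # Stand-in for the pipeline's argparse output; these attribute
   # names are illustrative only.
   args = argparse.Namespace(proj_code='my-project-code',
                             workdir='/work/pipeline',
                             dryrun=True)

   # Passing logger=None lets compute_config create its own logger,
   # configured with fh and logid.
   compute_config(args, None, fh='/logs/compute_0.log', logid='0')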

pipeline.compute.configure_kerchunk(args, logger, fh=None, logid=None)

Configure all required steps for Kerchunk processing:
  • Check if output files already exist
  • Configure timings post-run