ProjectOperation Core and Mixin Behaviours

Source code for individual project operations and mixin behaviours.

class padocc.core.project.ProjectOperation(proj_code: str, workdir: str, groupID: str = None, first_time: bool = None, ft_kwargs: dict = None, logger: ~logging.Logger = None, bypass: ~padocc.core.utils.BypassSwitch = <padocc.core.utils.BypassSwitch object>, label: str = None, fh: str = None, logid: str = None, verbose: int = 0, forceful: bool = None, dryrun: bool = None, thorough: bool = None, mem_allowed: str | None = None, remote_s3: dict | str | None = None)

PADOCC Project Operation class.

Able to access project files and perform some simple functions. Single-project operations always inherit from this class (e.g. Scan, Compute, Validate).

complete_project(move_to: str) → None

Move project to a completeness directory

Parameters: move_to – (str) Path to completeness directory to extract content.

delete_project(ask: bool = True)

Delete a project

Parameters: ask – (bool) Prompts with an ‘are you sure’ message before deleting unless set to False.
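The confirm-before-delete behaviour described above can be sketched as follows. This is a minimal illustration only, not the padocc implementation: `delete_directory` and its `prompt` argument are hypothetical names, and the exact wording of the confirmation message is an assumption.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def delete_directory(path: Path, ask: bool = True,
                     prompt: Callable[[str], str] = input) -> bool:
    """Hypothetical helper: remove `path`, asking first unless ask is False."""
    if ask:
        answer = prompt(f"Delete {path}? Are you sure? (y/n) ")
        if answer.strip().lower() != "y":
            return False  # declined: nothing is removed
    shutil.rmtree(path)
    return True


# Demonstration on throwaway directories, with the prompt stubbed out.
kept = Path(tempfile.mkdtemp())
declined = delete_directory(kept, ask=True, prompt=lambda _msg: "n")
kept_survived = kept.exists()
shutil.rmtree(kept)  # tidy up the demo directory

gone = Path(tempfile.mkdtemp())
removed = delete_directory(gone, ask=False)
```

Passing the prompt function as an argument keeps the sketch usable in non-interactive scripts, where `ask=False` would normally be set.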

property dir: Project directory property, relative to workdir.

file_exists(file: str)

Check if a named file exists (without extension).

This can be any generic filehandler attached to the project.

classmethod help(func: ~typing.Callable = <function print_fmt_str>)

Public user functions for the project operator.

Parameters: func – (Callable) provide an alternative to ‘print’ function for displaying help information.
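As a rough illustration of the `func` parameter, the sketch below passes `list.append` in place of `print` so that help text is collected rather than written to the terminal. `HelpDemo` is a hypothetical stand-in class and the help lines are invented for the example; only the idea of a swappable display callable comes from the description above.

```python
from typing import Callable


class HelpDemo:
    """Hypothetical stand-in for a class exposing a `help` classmethod."""

    @classmethod
    def help(cls, func: Callable = print) -> None:
        # Every line of help text goes through `func` rather than print().
        func(f"{cls.__name__}:")
        func(" > demo.info() - Display some info about this project")


lines = []
HelpDemo.help(func=lines.append)  # collect the lines instead of printing
```

The same approach would allow help output to be routed to a logger method.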

info(): Display some info about this particular project.

migrate(newgroupID: str)

Migrate this project to a new group.

Moves the whole project directory on the filesystem and moves all associated filehandlers (individually).

Parameters: newgroupID – (str) ID of new group to move this project to.
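The filesystem part of a migration can be pictured with the sketch below. It assumes a purely hypothetical layout of `workdir/<groupID>/<proj_code>`, which may not match the real padocc directory structure, and it does not model the per-filehandler moves mentioned above.

```python
import shutil
import tempfile
from pathlib import Path


def migrate_directory(workdir: Path, proj_code: str,
                      old_group: str, new_group: str) -> Path:
    """Hypothetical helper: move workdir/old_group/proj_code under new_group."""
    source = workdir / old_group / proj_code
    target = workdir / new_group / proj_code
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target


# Build a throwaway working directory containing one project.
workdir = Path(tempfile.mkdtemp())
(workdir / "group_a" / "proj_1").mkdir(parents=True)
(workdir / "group_a" / "proj_1" / "example.json").write_text("{}")

new_dir = migrate_directory(workdir, "proj_1", "group_a", "group_b")
moved_ok = (new_dir / "example.json").exists()
old_gone = not (workdir / "group_a" / "proj_1").exists()
shutil.rmtree(workdir)  # tidy up the demo directory
```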

run(mode: str = 'kerchunk', bypass: BypassSwitch | None = None, forceful: bool = None, thorough: bool = None, verbose: bool = None, dryrun: bool = None, parallel: bool = False, **kwargs) → str

Main function for running any project operation.

All subclasses act as plugins for this function, and require a _run method called from here. This means all error handling with status logs and log files can be dealt with here.

To find the parameters for a specific operation (e.g. compute with kerchunk mode), see the additional parameters of run in the source code for the phase you are running. In this example, see padocc.phases.compute:KerchunkDS._run

Parameters:

mode – (str) Cloud format to use for any operations. Default value is ‘kerchunk’ and any changes via the ‘cloud_format’ parameter to this project are taken into account. Note: Setting the mode for a specific operation using this argument will reset the stored cloud format property for this class.
bypass – (BypassSwitch) instance of BypassSwitch class containing multiple bypass/skip options for specific events. See utils.BypassSwitch.
forceful – (bool) Continue with processing even if final output file already exists.
dryrun – (bool) If True will prevent output files being generated or updated and instead will demonstrate commands that would otherwise happen.
thorough – (bool) From args.quality - if True will create all files from scratch, otherwise saved refs from previous runs will be loaded.
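The relationship between run and the _run plugin method described above resembles the template-method pattern. The sketch below is an assumption-laden illustration of that pattern, not the padocc source: `OperationSketch`, `GoodPhase`, `BadPhase` and the status strings are hypothetical, and whether the real run swallows or re-raises exceptions is not stated here.

```python
import logging


class OperationSketch:
    """Hypothetical base class: run() wraps the subclass-provided _run()."""

    def __init__(self):
        self.logger = logging.getLogger("sketch")
        self.status_log = []

    def run(self, mode: str = "kerchunk", dryrun: bool = False, **kwargs) -> str:
        # Error handling and status logging live here, once, for all phases.
        try:
            status = self._run(mode=mode, dryrun=dryrun, **kwargs)
        except Exception as err:
            self.logger.error("Operation failed: %s", err)
            status = f"Failed: {type(err).__name__}"
        self.status_log.append(status)
        return status

    def _run(self, **kwargs) -> str:
        raise NotImplementedError


class GoodPhase(OperationSketch):
    def _run(self, mode: str = "kerchunk", **kwargs) -> str:
        return f"Success ({mode})"


class BadPhase(OperationSketch):
    def _run(self, **kwargs) -> str:
        raise ValueError("no source files")


good_status = GoodPhase().run(mode="zarr")
bad_status = BadPhase().run()
```

Keeping the try/except in the base class means each phase only has to implement its own work and return a status string.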

save_files() → None: Save all filehandlers associated with this group.

class padocc.core.mixins.dataset.DatasetHandlerMixin

Mixin class for properties relating to opening products.

This is a behavioural Mixin class and thus should not be directly accessed. Where possible, encapsulated classes should contain all relevant parameters for their operation as per convention; however, this is not the case for mixin classes. The mixin classes here will explicitly state where they are designed to be used, as an extension of an existing class.

Use case: ProjectOperation [ONLY]
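To illustrate what "behavioural mixin" means in practice, the sketch below shows a mixin that relies on an attribute it does not define itself, which is why it only works as an extension of a suitable host class. `DescribeMixin` and `HostOperation` are hypothetical names invented for this example.

```python
class DescribeMixin:
    """Hypothetical mixin: expects the host class to provide `proj_code`."""

    def describe(self) -> str:
        return f"Project {self.proj_code}"


class HostOperation(DescribeMixin):
    """Hypothetical host class supplying the attributes the mixin needs."""

    def __init__(self, proj_code: str):
        self.proj_code = proj_code


description = HostOperation("proj_1").describe()

# Used on its own, the mixin has no `proj_code` to work with.
try:
    DescribeMixin().describe()
    standalone_ok = True
except AttributeError:
    standalone_ok = False
```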

add_s3_config(remote_s3: dict | str | None = None) → None

Add remote_s3 configuration for this project

Parameters: remote_s3 – (dict | str) Remote s3 config argument, either dictionary or path to a json file on disk. It is not advised to enter credentials here, see the documentation in Extra Features for more details.
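One way to accept either a dictionary or a path to a JSON file, as the remote_s3 argument does, is sketched below. `load_s3_config` is a hypothetical helper, and the key used in the example is a placeholder rather than a real padocc configuration key.

```python
import json
import os
import tempfile
from typing import Optional, Union


def load_s3_config(remote_s3: Union[dict, str, None] = None) -> Optional[dict]:
    """Hypothetical helper: normalise a dict or JSON file path to a dict."""
    if remote_s3 is None:
        return None
    if isinstance(remote_s3, dict):
        return dict(remote_s3)  # copy so the caller's dict is not shared
    with open(remote_s3) as handle:
        return json.load(handle)


example = {"example_key": "example_value"}  # placeholder content only

# Write the same content to a temporary JSON file and load it by path.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(example, tmp)
from_path = load_s3_config(tmp.name)
from_dict = load_s3_config(example)
os.remove(tmp.name)
```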

catalog_ceda(final_location: str, api_key: str, collection: str, name_replace: str | None = None): Catalog the output product of this project.

property cfa_dataset: Dataset

Gets the product filehandler for the CFA dataset.

The CFA filehandler is currently read-only, and can be used to open an xarray representation of the dataset.

property cfa_path: str: Path to the CFA object for this project.

property dataset: KerchunkFile | GenericStore | CFADataset | None

Gets the product filehandler corresponding to cloud format.

Generic dataset property, links to the correct cloud format, given the Project’s cloud_format property with other configurations applied.
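The idea of a generic property that links to the filehandler matching the cloud format can be sketched as a simple lookup. Everything here is illustrative: `ProjectSketch` is hypothetical, the handlers are plain strings, and the format names other than ‘kerchunk’ are assumptions about how formats might be labelled.

```python
class ProjectSketch:
    """Hypothetical class with one placeholder handler per cloud format."""

    def __init__(self, cloud_format: str = "kerchunk"):
        self.cloud_format = cloud_format
        self.kfile = "<kerchunk file handler>"
        self.zstore = "<zarr store handler>"
        self.cfa_dataset = "<CFA dataset handler>"

    @property
    def dataset(self):
        # Map each (assumed) format label to its handler; None if unknown.
        handlers = {
            "kerchunk": self.kfile,
            "zarr": self.zstore,
            "CFA": self.cfa_dataset,
        }
        return handlers.get(self.cloud_format)


default_handler = ProjectSketch().dataset
zarr_handler = ProjectSketch(cloud_format="zarr").dataset
unknown_handler = ProjectSketch(cloud_format="other").dataset
```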

property dataset_attributes: dict: Fetch a dictionary of the metadata for the dataset.

classmethod help(func: ~typing.Callable = <built-in function print>)

Helper function to describe basic functions from this mixin

Parameters: func – (Callable) provide an alternative to ‘print’ function for displaying help information.

property kfile: KerchunkFile | None: Retrieve the kfile filehandler or create if not present

property kstore: KerchunkStore | None: Retrieve the kstore filehandler or create if not present

remove_attribute(attribute: str, target: str = 'dataset') → None

Remove an attribute within a dataset representation’s metadata.

Parameters:

attribute – (str) The name of an attribute within the metadata property of the corresponding filehandler.
target – (str) The target product filehandler, uses the generic dataset filehandler if not otherwise specified.

remove_s3_config(): Remove remote_s3 configuration from this project

save_ds_filehandlers()

Save all dataset files that already exist

Product filehandlers include kerchunk files, stores (via parquet) and zarr stores. The CFA filehandler is not currently editable, so is not included here.

update_attribute(attribute: str, value: Any, target: str = 'dataset') → None

Update an attribute within a dataset representation’s metadata.

Parameters:

attribute – (str) The name of an attribute within the metadata property of the corresponding filehandler.
value – (Any) The new value to set for this attribute.
target – (str) The target product filehandler, uses the generic dataset filehandler if not otherwise specified.
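The update and removal of attributes on a chosen target, as described for update_attribute and remove_attribute, can be pictured with plain dictionaries standing in for filehandler metadata. `AttributeSketch` is hypothetical, and silently ignoring a missing attribute on removal is an assumption rather than documented behaviour.

```python
from typing import Any


class AttributeSketch:
    """Hypothetical class: one metadata dict per target filehandler."""

    def __init__(self):
        self.metadata = {
            "dataset": {"title": "old title", "history": "created"},
            "kfile": {},
        }

    def update_attribute(self, attribute: str, value: Any,
                         target: str = "dataset") -> None:
        self.metadata[target][attribute] = value

    def remove_attribute(self, attribute: str,
                         target: str = "dataset") -> None:
        # Assumption: a missing attribute is ignored rather than an error.
        self.metadata[target].pop(attribute, None)


proj = AttributeSketch()
proj.update_attribute("title", "new title")          # default target
proj.update_attribute("note", "kept", target="kfile")
proj.remove_attribute("history")
```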

write_to_s3(credentials: dict | str, bucket_id: str, name_overwrite: str | None = None, dataset_type: str = 'zstore', write_as: str = 'zarr', s3_kwargs: dict = None, **zarr_kwargs) → None: Write one of the active dataset objects to an s3 zarr store

property zstore: ZarrStore | None: Retrieve the filehandler for the zarr store

class padocc.core.mixins.directory.DirectoryMixin(workdir: str, groupID: str = None, forceful: bool = None, dryrun: bool = None, thorough: bool = None, logger: Logger = None, bypass: BypassSwitch = None, label: str = None, fh: str = None, logid: str = None, verbose: int = 0)

Container class for Operations which require functionality to create directories (workdir, groupdir, cache etc.)

This Mixin gives all child classes the ability to manipulate the filesystem to create new directories as required, and handles the so-called fh-kwargs, which relate to forceful overwrites of filesystem objects, skipping creation, or starting from scratch.

This is a behavioural Mixin class and thus should not be directly accessed. Where possible, encapsulated classes should contain all relevant parameters for their operation as per convention; however, this is not the case for mixin classes. The mixin classes here will explicitly state where they are designed to be used, as an extension of an existing class.

Use case: ProjectOperation, GroupOperation

property groupdir: Group directory property

classmethod help(func: ~typing.Callable = <built-in function print>): No public methods

class padocc.core.mixins.properties.PropertiesMixin

Properties relating to the ProjectOperation class that are stored separately for convenience and easier debugging.

This is a behavioural Mixin class and thus should not be directly accessed. Where possible, encapsulated classes should contain all relevant parameters for their operation as per convention; however, this is not the case for mixin classes. The mixin classes here will explicitly state where they are designed to be used, as an extension of an existing class.

Use case: ProjectOperation [ONLY]

apply_defaults(defaults: dict, target: str = 'dataset', remove: list | None = None): Apply a default selection of attributes to a dataset.

property cloud_format: str

Obtain the cloud format for this project.

Check multiple options from base and detail configs to find the cloud format for this project. The default is to use kerchunk.
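A lookup that checks several configurations before falling back to ‘kerchunk’ might look like the sketch below. The function name, the `cloud_format` dictionary key and the order in which the two configs are consulted are all assumptions; only the default value comes from the description above.

```python
def resolve_cloud_format(base_cfg: dict, detail_cfg: dict,
                         default: str = "kerchunk") -> str:
    """Hypothetical helper: first non-empty value wins, else the default."""
    # Assumed precedence: detail config first, then base config.
    for cfg in (detail_cfg, base_cfg):
        value = cfg.get("cloud_format")
        if value:
            return value
    return default


from_default = resolve_cloud_format({}, {})
from_base = resolve_cloud_format({"cloud_format": "zarr"}, {})
from_detail = resolve_cloud_format({"cloud_format": "zarr"},
                                   {"cloud_format": "CFA"})
```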

property complete_product: str

Return the name of the actual dataset.

Products are referred to by revision only within the project directory, but on completion these will be copied out of the pipeline, where they are renamed with the project code and revision for the actual dataset.

property file_type: str: Return the file type for this project.

get_stac_representation(stac_mapping: dict) → dict

Apply all required substitutions to the stac representation.

Parameters: stac_mapping – (dict) A padocc-map-compliant dictionary for extracting properties into a dictionary for STAC record-making.

classmethod help(func: ~typing.Callable = <built-in function print>)

Helper function displays basic functions for use.

Parameters: func – (Callable) provide an alternative to ‘print’ function for displaying help information.

major_version_increment()

Increment the major X.y part of the version number.

Use this function for major changes to the cloud file - e.g. replacement of source file data.

minor_version_increment(addition: str | None = None)

Increment the minor x.Y number for the version.

Use this function when properties of the cloud file have been changed.

Parameters: addition – (str) Reason for version change; attribute change or otherwise.
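The arithmetic behind the two increment methods above can be sketched on a plain "X.y" version string. This is an illustration under stated assumptions: both parts are taken to be integers, and resetting the minor part to 0 on a major increment is a common convention that is assumed here, not confirmed for padocc.

```python
def major_increment(version: str) -> str:
    """Hypothetical helper: bump X in "X.y" (assumed to reset y to 0)."""
    major, _minor = version.split(".")
    return f"{int(major) + 1}.0"


def minor_increment(version: str) -> str:
    """Hypothetical helper: bump y in "X.y", leaving X unchanged."""
    major, minor = version.split(".")
    return f"{major}.{int(minor) + 1}"


after_major = major_increment("1.3")
after_minor = minor_increment("1.3")
```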

property outpath: str

Path to the output product.

Takes into account the cloud format and type. Extension is applied via the Filehandler that this string is applied to.

property outproduct: str

File/directory name for the output product.

Revision takes into account cloud format and type where applicable.

property remote: bool: Determine if this project is remotely-accessible

property revision: str: Revision takes into account cloud format and type.

set_concat_dims(dims: list): Function to override the concat_kwargs for this project.

set_identical_dims(dims: list): Function to override the concat_kwargs for this project.

property source_format: str

Get the source format of the files.

This is determined during the scanning process. Note: This returns the driver used in the kerchunk scanning process if that step has been completed.

property version_no: str

Get the version number from the base config file.

This property is read-only, but currently can be forcibly overwritten by editing the base config.

class padocc.core.mixins.status.StatusMixin

Methods relating to the ProjectOperation class, in terms of determining the status of previous runs.

This is a behavioural Mixin class and thus should not be directly accessed. Where possible, encapsulated classes should contain all relevant parameters for their operation as per convention; however, this is not the case for mixin classes. The mixin classes here will explicitly state where they are designed to be used, as an extension of an existing class.

Use case: ProjectOperation [ONLY]

get_last_run() → tuple: Get the tuple-value for this project’s last run.

get_last_status() → str: Gets the last line of the correct log file

get_log_contents(phase: str) → str

Get the contents of the log file as a string

Parameters: phase – (str) Phased operation from which to pull logs.

get_report() → dict: Get the validation report if present for this project.

classmethod help(func: ~typing.Callable = <built-in function print>)

Helper function displays basic functions for use.

Parameters: func – (Callable) provide an alternative to ‘print’ function for displaying help information.

set_last_run(phase: str, time: str) → None

Set the phase and time of the last run for this project.

Parameters:

phase – (str) Phased operation of last run.
time – (str) Timestamp for operation.

show_log_contents(phase: str, halt: bool = False, func: ~typing.Callable = <built-in function print>)

Format the contents of the log file to print.

Parameters:

phase – (str) Phased operation to pull log data from.
halt – (bool) Stop and display log data, wait for input before continuing.
func – (Callable) provide an alternative to ‘print’ function for displaying the log contents.

update_status(phase: str, status: str, jobid: str = '') → None

Update the status of a project

Status updates are performed via the status log filehandler during phased operation of the pipeline.

Parameters:

phase – (str) Phased operation being performed.
status – (str) Status of phased operation outcome.
jobid – (str) ID of SLURM job in which this operation has taken place.
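A status update of this kind amounts to appending one record per phased operation. The sketch below assumes a comma-separated record of phase, status, timestamp and job ID; that layout, the timestamp format and the `format_status` helper are hypothetical and may differ from the real status log.

```python
from datetime import datetime


def format_status(phase: str, status: str, jobid: str = "") -> str:
    """Hypothetical helper: build one comma-separated status record."""
    timestamp = datetime.now().strftime("%H:%M %d/%m/%y")
    return ",".join([phase, status, timestamp, jobid])


# A plain list stands in for the status log filehandler.
status_log = []
status_log.append(format_status("scan", "Success", jobid="12345"))
status_log.append(format_status("compute", "Pending"))
```

With an empty jobid, the final field is simply left blank, which keeps every record the same width.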