Padocc Filehandlers

Filehandlers are an integral component of PADOCC on the filesystem. The filehandlers connect directly to files within the pipeline directories for different groups and projects and provide a seamless environment for fetching and saving values to these files.

Filehandlers act like their respective data types in most or all methods. For example, the JSONFileHandler acts like a dictionary, but with extra methods to close and save the loaded data. Filehandlers can also be migrated or removed from the filesystem as part of other processes.

class padocc.core.filehandlers.CFADataset(filepath: str, identifier: str, **kwargs)

Bases: LoggedOperation

Basic handler for CFA dataset

Added behaviours

  1. Open dataset - opens the CFA dataset

close() None

Set the meta attribute for this dataset.

get_meta() dict

Get the metadata/attributes for this dataset.

open_dataset(**kwargs) Dataset

Open the CFA Dataset [READ-ONLY]

set_meta(new_value: dict) None

Set the whole meta attribute for this dataset.

Parameters:

new_value – (dict) New metadata contents.

spawn_copy(copy: str)

Spawn a copy of this file (not filehandler)

Parameters:

copy – (str) For the CFA filehandler, copy should be the full path to the new location, minus the extension. This should include the version number at the point of release.

update_history(addition: str, new_version: str) None

Update the history with a new addition.

Sets the new version/revision automatically.

Parameters:
  • addition – (str) Message to add to dataset history.

  • new_version – (str) New version the message applies to.

class padocc.core.filehandlers.CSVFileHandler(dir: str, filename: str, **kwargs)

Bases: ListFileHandler

CSV File handler for padocc config files

update_status(phase: str, status: str, jobid: str = '') None

Update formatted status for this log with the phase and status

Parameters:
  • phase – (str) The phase for which this project is being operated.

  • status – (str) The status of the current run (e.g. Success, Failed, Fatal)

  • jobid – (str) The jobID of this run if present.

class padocc.core.filehandlers.FileIOMixin(dir: str, filename: str, logger: Logger | FalseLogger | None = None, label: str | None = None, fh: str | None = None, logid: str | None = None, dryrun: bool = False, forceful: bool = False, thorough: bool = False, verbose: int = 0)

Bases: LoggedOperation

Class for containing Filehandler behaviour which is exactly identical for all Filehandler subclasses.

Identical behaviour

  1. Contains:

    'item' in fh

  2. Create/save file:

    Filehandlers intrinsically know the file they are attached to so there are no attributes passed to either of these.

    fh.create_file()
    fh.close()

  3. Get/set:

    contents = fh.get()
    fh.set(contents)

create_file() None

Create the file if not on dryrun.

property file: str

Returns the full filename attribute.

file_exists() bool

Return true if the file is found.

property filepath: str

Returns the full filepath attribute.

move_file(new_dir: str, new_name: str | None = None, new_extension: str | None = None) None

Migrate the file to a new location.

Parameters:
  • new_dir – (str) New directory for filehandler being moved.

  • new_name – (str) New name for filehandler if required.

  • new_extension – (str) New extension if required (e.g. changing log-type).

remove_file() None

Remove the file on the filesystem if not on dryrun
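As a rough illustration of the dryrun behaviour mentioned for create_file and remove_file, the sketch below guards filesystem changes behind a flag. `SketchFileHandler` is a hypothetical stand-in written for this example, not the padocc FileIOMixin, and it only mirrors the method names listed above.

```python
import os
import tempfile


class SketchFileHandler:
    """Hypothetical stand-in; not the padocc FileIOMixin implementation."""

    def __init__(self, dir: str, filename: str, dryrun: bool = False):
        self.dir = dir
        self.filename = filename
        self.dryrun = dryrun

    @property
    def filepath(self) -> str:
        return os.path.join(self.dir, self.filename)

    def file_exists(self) -> bool:
        return os.path.isfile(self.filepath)

    def create_file(self) -> None:
        # On a dry run nothing is written to the filesystem.
        if self.dryrun:
            return
        os.makedirs(self.dir, exist_ok=True)
        with open(self.filepath, "a"):
            pass

    def remove_file(self) -> None:
        # On a dry run (or if the file is absent) nothing is removed.
        if self.dryrun or not self.file_exists():
            return
        os.remove(self.filepath)


with tempfile.TemporaryDirectory() as tmp:
    dry = SketchFileHandler(tmp, "status.log", dryrun=True)
    dry.create_file()
    dry_created = dry.file_exists()

    live = SketchFileHandler(tmp, "status.log")
    live.create_file()
    live_created = live.file_exists()
    live.remove_file()
    live_removed = not live.file_exists()
```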

class padocc.core.filehandlers.GenericStore(parent_dir: str, store_name: str, metadata_name: str = '.zattrs', extension: str = 'zarr', logger: Logger | FalseLogger | None = None, label: str | None = None, fh: str | None = None, logid: str | None = None, dryrun: bool = False, forceful: bool = False, thorough: bool = False, verbose: int = 0)

Bases: LoggedOperation

Filehandler for Generic stores in Padocc - enables Filesystem operations on component files.

Behaviours (Applies to Metadata)

  1. Length - length of metadata keyset

  2. Contains - metadata contains key (as with dict)

  3. Indexable - Get/set a specific property.

  4. Get/set_meta - Get/set the whole metadata set.

  5. Clear - clears all files in the store.

clear() None

Remove all components of the store

close() None

Close the meta filehandler for this store

get_meta()

Obtain the metadata dictionary

property is_empty: bool

Check if the store contains any data

set_meta(values: dict)

Reset the metadata dictionary

Parameters:

values – (dict) Complete set of metadata for this store.

spawn_copy(copy: str)

Spawn a copy of this store (not filehandler)

Parameters:

copy – (str) New full path + name for external copy of the store (minus extension).

property store_path: str

Assemble the store path

update_history(addition: str, new_version: str) None

Update the history with a new addition.

Sets the new version/revision automatically.

Parameters:
  • addition – (str) Message to add to dataset history.

  • new_version – (str) New version the message applies to.

class padocc.core.filehandlers.JSONFileHandler(dir: str, filename: str, conf: dict | None = None, init_value: dict | None = None, **kwargs)

Bases: FileIOMixin

JSON File handler for padocc config files.

Dictionary Behaviour

  1. Indexable - index by key (as normal)

  2. Contains - key in dict (as normal)

  3. Length - length of the key set (as normal)

Added Behaviour

  1. Iterable - iterate over the keys.

  2. Get/set - get/set the whole value.

  3. Create_file - Specific for JSON files.

close() None

Save the content of the filehandler

create_file() None

JSON files require entry of a single dict on creation.

get(index: str | None = None, default: str | None = None) str | dict | None

Safe method to get a value from this filehandler.

Parameters:
  • index – (str) Key in dictionary.

  • default – (str) Default value for this item in the dictionary.

pop(index: str, default: Any = None) Any

Wrapper for pop function of a dict.

set(value: dict) None

Set the value of the whole dictionary.

Parameters:

value – (dict) New value to set for this filehandler.

class padocc.core.filehandlers.KerchunkFile(dir: str, filename: str, conf: dict | None = None, init_value: dict | None = None, **kwargs)

Bases: JSONFileHandler

Filehandler for Kerchunk file, enables substitution/replacement for local/remote links, and updating content.

Add the download link to this Kerchunk File.

Parameters:
  • sub – (str) Substitution value to be replaced.

  • replace – (str) Replacement value in download links.

get_meta() dict | None

Obtain the metadata dictionary

open_dataset(fsspec_kwargs: dict | None = None, retry: bool = False, **kwargs) Dataset

Open the kerchunk file as a dataset

Parameters:
  • fsspec_kwargs – (dict) Kwargs applied to fsspec mapper.

  • retry – (bool) Unused property for multiple tries when searching for kerchunk dataset.

set_meta(values: dict)

Reset the metadata dictionary.

Parameters:

values – (dict) Fully replace all zattrs in kerchunk dataset.

spawn_copy(copy: str)

Spawn a copy of this file (not filehandler)

Parameters:

copy – (str) Path to new copy location and filename (minus extension).

update_history(addition: str, new_version: str) None

Update the history with a new addition.

Sets the new version/revision automatically.

Parameters:
  • addition – (str) Message to add to dataset history.

  • new_version – (str) Specific version number for the history entry being applied.

class padocc.core.filehandlers.KerchunkStore(parent_dir: str, store_name: str, **kwargs)

Bases: GenericStore

Filehandler for Kerchunk stores using parquet in PADOCC. Enables setting metadata attributes and will allow combining stores in future.

Added behaviours

  1. Open dataset - opens the kerchunk store.

open_dataset(rfs_kwargs: dict | None = None, **parquet_kwargs) Dataset

Open the Parquet Store as an xarray dataset

class padocc.core.filehandlers.ListFileHandler(dir: str, filename: str, extension: str | None = None, init_value: list | None = None, **kwargs)

Bases: FileIOMixin

Filehandler for string-based Lists in Padocc.

List Behaviour

  1. Append - works the same as with normal lists.

  2. Pop - remove a specific value (works as normal).

  3. Contains - (x in y) works as normal.

  4. Length - (len(x)) works as normal.

  5. Iterable - (for x in y) works as normal.

  6. Indexable - (x[0]) works as normal

Added behaviour

  1. Close - close and save the file.

  2. Get/Set - Get or set the whole value.

append(newvalue: str | list) None

Add a new value to the internal list.

Parameters:

newvalue – (str|list) New value to append to current list.

close() None

Save the content of the filehandler

get() list

Get the current value

remove(oldvalue: str) None

Remove a value from the internal list

Parameters:

oldvalue – (str) Value to remove from the list.

set(value: list[str, list]) None

Reset the value as a whole for this filehandler.

Parameters:

value – (list) Reset the _value property for this filehandler to the new value.

class padocc.core.filehandlers.LogFileHandler(dir: str, filename: str, extra_path: str = '', **kwargs)

Bases: ListFileHandler

Log File handler for padocc phase logs.

property filepath: str

Returns the full filepath attribute.

class padocc.core.filehandlers.ZarrStore(parent_dir: str, store_name: str, remote_s3: dict | None = None, **kwargs)

Bases: GenericStore

Filehandler for Zarr stores in PADOCC. Enables manipulation of Zarr store on filesystem and setting metadata attributes.

Added Behaviours

  1. Open dataset - open the zarr store.

  2. Write to s3 - write a disk-based zarr store to s3.

get_meta() dict

Override super function in case of remote s3.

open_dataset(**zarr_kwargs) Dataset

Open the ZarrStore as an xarray dataset

property store: str | object

Returns the store path or s3 store object as required.

write_to_s3(credentials: dict | str, bucket_id: str, name_overwrite: str | None = None, s3_kwargs: dict = None, ds: Dataset | None = None, **zarr_kwargs)

Write zarr store to an S3 Object Store bucket directly from padocc

Utilities

class padocc.core.utils.BypassSwitch(switch: str = 'D')

Bases: object

Switch container class for multiple error switches.

Class to represent all bypass switches throughout the pipeline. Requires a switch string which is used to enable/disable specific pipeline switches stored in this class.

padocc.core.utils.apply_substitutions(subkey: str, subs: dict | None = None, content: list | None = None)

Apply substitutions to all elements in the provided content list.

Parameters:
  • subkey – (str) The key to extract from the provided set of substitutions. This applies where substitutions are specified for different levels of input files.

  • subs – (dict) The substitutions applied to the content.

  • content – (list) The set of filepaths to apply substitutions.

padocc.core.utils.deformat_float(item: str) str

Format byte-value with proper units.

Parameters:

item – (str) Byte value to format into a float.

padocc.core.utils.extract_file(input_file: str) list

Extract content from a padocc-external file.

Use filehandlers for files within the pipeline.

Parameters:

input_file – (str) Pipeline-external file.

padocc.core.utils.extract_json(input_file: str) list

Extract content from a padocc-external file.

Use filehandlers for files within the pipeline.

Parameters:

input_file – (str) Pipeline-external file.

padocc.core.utils.find_closest(num: int, closest: float) int

Find divisions for a dimension for rechunking purposes.

Used in Zarr rechunking and conversion.

Parameters:
  • num – (int) Typically the size of the dimension

  • closest – (float) Find a divisor closest to this value.
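The idea of picking a division for a dimension can be sketched as a search over exact divisors. The function `closest_divisor` below is a hypothetical brute-force version written for illustration; padocc's own search may differ in method and in how ties are resolved.

```python
def closest_divisor(num: int, closest: float) -> int:
    """Hypothetical helper: exact divisor of num nearest to the target value."""
    # Collect every integer that divides num without a remainder.
    divisors = [d for d in range(1, num + 1) if num % d == 0]
    # Pick the divisor with the smallest distance to the target.
    return min(divisors, key=lambda d: abs(d - closest))


daily = closest_divisor(365, 100)
hourly = closest_divisor(120, 45)
```

For a dimension of size 365 the exact divisors are 1, 5, 73 and 365, so a target of 100 selects 73.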

padocc.core.utils.format_float(value: float) str

Format byte-value with proper units.

Parameters:

value – (float) Number of bytes (avg), format to a string representation.
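Formatting a byte count with units can be sketched as repeated division by a base. The function `format_bytes` is hypothetical. The base of 1000, the unit names and the two-decimal output are assumptions, not padocc's confirmed format.

```python
def format_bytes(value: float) -> str:
    """Hypothetical helper: render a byte count with a decimal unit suffix."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    index = 0
    # Step up one unit for every factor of 1000 (assumed base).
    while value >= 1000 and index < len(units) - 1:
        value /= 1000
        index += 1
    return f"{value:.2f} {units[index]}"


small = format_bytes(512)
medium = format_bytes(1500)
large = format_bytes(2_500_000_000)
```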

padocc.core.utils.format_str(string: Any, length: int, concat: bool = False, shorten: bool = False) str

Simple function to format a string to a correct length.

Parameters:
  • string – (str) Message to format into a string of exact length.

  • length – (int) Accepted length of string.

  • concat – (bool) If True, will add ellipses for overrunning strings.

  • shorten – (bool) If True will allow shorter messages, otherwise will fill with whitespace.
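The interaction of length, concat and shorten can be illustrated with a small sketch. The function `fit_string` is hypothetical. It assumes that overlong strings are cut to the given length, that three dots are used when concat is True, and that padding is added on the right.

```python
def fit_string(string, length: int, concat: bool = False, shorten: bool = False) -> str:
    """Hypothetical helper: fit any object's string form to a target length."""
    text = str(string)
    if len(text) > length:
        # Overrunning text is cut, optionally ending in three dots.
        if concat and length > 3:
            return text[: length - 3] + "..."
        return text[:length]
    if shorten:
        # Shorter messages are allowed through unchanged.
        return text
    # Otherwise pad with whitespace up to the exact length.
    return text.ljust(length)


padded = fit_string("scan", 8)
kept_short = fit_string("scan", 8, shorten=True)
trimmed = fit_string("a very long message", 10, concat=True)
```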

padocc.core.utils.format_tuple(tup: tuple[list[int]]) str

Transform tuple to string representation

Parameters:

tup – (tuple) Tuple object to be rendered to string.

padocc.core.utils.get_attribute(env: str, args, value: str) str

Assemble environment variable or take from passed argument. Finds the value of the variable from the environment or the ParseArgs object, or reports failure.

Parameters:
  • env – (str) Name of environment variable.

  • args – (obj) Set of command line arguments supplied by argparse.

  • value – (str) Name of argparse parameter to check.

Returns:

Value of either environment variable or argparse value.
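A lookup of this kind can be sketched with the standard library. The function `attribute_from_env_or_args` is hypothetical. Giving the command line argument priority over the environment variable, and raising ValueError on failure, are assumptions made for this example. The variable name `SKETCH_WORKDIR` is invented.

```python
import argparse
import os


def attribute_from_env_or_args(env: str, args, value: str) -> str:
    """Hypothetical helper: prefer the argparse value, then the environment."""
    from_args = getattr(args, value, None)
    if from_args:
        return from_args
    from_env = os.getenv(env)
    if from_env:
        return from_env
    # Neither source supplied a value, so report the failure.
    raise ValueError(f"Set ${env} or supply the '{value}' argument")


os.environ["SKETCH_WORKDIR"] = "/env/workdir"

no_flag = argparse.Namespace(workdir=None)
with_flag = argparse.Namespace(workdir="/cli/workdir")

env_result = attribute_from_env_or_args("SKETCH_WORKDIR", no_flag, "workdir")
cli_result = attribute_from_env_or_args("SKETCH_WORKDIR", with_flag, "workdir")
```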

padocc.core.utils.list_groups(workdir: str, func: ~typing.Callable = <built-in function print>)

List groups in the existing working directory

padocc.core.utils.make_tuple(item: Any) tuple

Make any object into a tuple.

Parameters:

item – (Any) Insert item into a tuple if not already one.
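The behaviour described above amounts to a single check, sketched here with a hypothetical function `as_tuple`. How padocc treats lists or other iterables is not stated, so this sketch simply wraps them as one element.

```python
def as_tuple(item) -> tuple:
    """Hypothetical helper: wrap item in a tuple unless it already is one."""
    return item if isinstance(item, tuple) else (item,)


wrapped = as_tuple("time")
unchanged = as_tuple((1, 2))
```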

padocc.core.utils.mem_to_val(value: str) float

Convert a byte value with units (XB) to a number of bytes.

Parameters:

value – (str) Convert number of bytes (XB) to float.
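Parsing a value such as "2.5 GB" back into bytes can be sketched with a regular expression and a unit table. The function `bytes_from_string` is hypothetical, and the decimal multipliers (1 KB = 1000 B) are an assumption. The accepted input format in padocc may be stricter or looser.

```python
import re


def bytes_from_string(value: str) -> float:
    """Hypothetical helper: convert '<number> <unit>' into a number of bytes."""
    multipliers = {"B": 1.0, "KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12, "PB": 1e15}
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*", value)
    if match is None:
        raise ValueError(f"Unrecognised memory value: {value!r}")
    number, unit = match.groups()
    # Unit lookup is case-insensitive; an unknown unit raises KeyError.
    return float(number) * multipliers[unit.upper()]


gigabytes = bytes_from_string("2.5 GB")
megabytes = bytes_from_string("750MB")
```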

padocc.core.utils.print_fmt_str(string: str, help_length: int = 40, concat: bool = True, shorten: bool = False)

Replacement for callable function in help methods.

This print-replacement adds whitespace between functions and their help descriptions.

Parameters:
  • string – (str) Message to format into a string of exact length.

  • help_length – (int) Accepted length of string.

  • concat – (bool) If True, will add ellipses for overrunning strings.

  • shorten – (bool) If True will allow shorter messages, otherwise will fill with whitespace.

Logging

class padocc.core.logs.FalseLogger

Bases: object

Supplementary class where a logger is not wanted but is required for some operations.

class padocc.core.logs.LoggedOperation(logger: Logger | FalseLogger | None = None, label: str | None = None, fh: str | None = None, logid: str | None = None, forceful: bool = None, dryrun: bool = None, thorough: bool = None, verbose: int = 0)

Bases: object

Allows inheritance of logger objects without creating new ones.

classmethod help(func: ~typing.Callable = <built-in function print>)

No public methods.

padocc.core.logs.init_logger(verbose: int, name: str, fh: str = None, logid: str = None) Logger

Initialise and configure a logger object with standardised formatting.

Parameters:
  • verbose – (int) Level of verbosity for log messages (see core.init_logger).

  • name – (str) The label to apply to the logger object.

  • fh – (str) Path to logfile for logger object generated in this specific process.

  • logid – (str) ID of the process within a subset, which is then added to the name of the logger - prevents multiple processes with different logfiles getting loggers confused.

Returns:

A new logger object.

padocc.core.logs.reset_file_handler(logger: Logger, verbose: int, fh: str) Logger

Reset the file handler for an existing logger object.

Parameters:
  • logger – (logging.Logger) An existing logger object.

  • verbose – (int) The logging level to reapply.

  • fh – (str) Path to the new logfile for the file handler.

Returns:

A new logger object with a new file handler.
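Swapping the logfile on an existing logger can be sketched as removing any file handlers and attaching a new one. The function `reset_file_handler_sketch` is hypothetical and uses only standard logging calls. The level mapping is the same assumption as in any verbose-based setup, and the logger name `sketch_reset` is invented.

```python
import logging
import os
import tempfile


def reset_file_handler_sketch(logger: logging.Logger, verbose: int, fh: str) -> logging.Logger:
    """Hypothetical helper: replace existing file handlers with one for fh."""
    levels = [logging.WARNING, logging.INFO, logging.DEBUG]
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            # Close the old logfile before detaching its handler.
            handler.close()
            logger.removeHandler(handler)
    logger.setLevel(levels[max(0, min(verbose, len(levels) - 1))])
    new_handler = logging.FileHandler(fh)
    new_handler.setFormatter(logging.Formatter("%(levelname)s [%(name)s]: %(message)s"))
    logger.addHandler(new_handler)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.log")
    second = os.path.join(tmp, "second.log")

    log = logging.getLogger("sketch_reset")
    log.addHandler(logging.FileHandler(first))

    log = reset_file_handler_sketch(log, 2, second)
    log.debug("Now writing to the second logfile")

    current_files = [h.baseFilename for h in log.handlers if isinstance(h, logging.FileHandler)]
    expected_file = os.path.abspath(second)

    # Close handlers so the temporary directory can be removed cleanly.
    for h in list(log.handlers):
        h.close()
        log.removeHandler(h)
```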