Scanner Module
- pipeline.scan.format_float(value: int, logger) str
Format a byte value with appropriate units
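A minimal sketch of how such a byte formatter might work, assuming decimal (factor-of-1000) unit scaling; the actual unit boundaries and precision used in `pipeline.scan` may differ:

```python
def format_float(value, logger=None):
    """Format a byte value with appropriate units (illustrative sketch)."""
    if value is None:
        return None
    # Assumed decimal scaling; the real implementation may use 1024-based units.
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if abs(value) < 1000:
            return f"{value:.2f} {unit}"
        value /= 1000
    return f"{value:.2f} PB"

print(format_float(2_500_000))  # → "2.50 MB"
```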
- pipeline.scan.format_seconds(seconds: int) str
Convert time in seconds to MM:SS
- pipeline.scan.get_seconds(time_allowed: str) int
Convert time in MM:SS to seconds
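The pair of time helpers above can be sketched as follows; the exact zero-padding convention here is an assumption:

```python
def format_seconds(seconds: int) -> str:
    """Convert a duration in seconds to an MM:SS string."""
    mins, secs = divmod(seconds, 60)
    return f"{mins}:{secs:02d}"

def get_seconds(time_allowed: str) -> int:
    """Convert an MM:SS string back to a duration in seconds."""
    mins, secs = time_allowed.split(":")
    return int(mins) * 60 + int(secs)

print(format_seconds(150))  # → "2:30"
print(get_seconds("2:30"))  # → 150
```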
- pipeline.scan.perform_safe_calculations(std_vars: list, cpf: list, volms: list, nfiles: int, logger) tuple
Perform all calculations safely to mitigate errors that arise during data collation.
- Parameters:
std_vars – (list) A list of the variables collected, which should be the same across all input files.
cpf – (list) The chunks per file recorded for each input file.
volms – (list) The total data size recorded for each input file.
nfiles – (int) The total number of files for this dataset.
logger – (obj) Logging object for info/debug/error messages.
- Returns:
A tuple of average and total values: average chunks per file (cpf), number of variables (num_vars), average chunk size (avg_chunk), spatial resolution of each chunk assuming a 2:1 lat/lon ratio (spatial_res), total NetCDF and Kerchunk estimated data sizes, number of files, total number of chunks, and the addition percentage.
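An illustrative sketch of the defensive-averaging idea behind this function. It computes only a subset of the documented return values, and the strategy of skipping None entries is an assumption:

```python
def perform_safe_calculations(std_vars, cpf, volms, nfiles, logger=None):
    """Illustrative sketch: average collated metrics defensively."""
    def safe_avg(values):
        # Ignore None entries so a single failed file does not break the average.
        vals = [v for v in values if v is not None]
        return sum(vals) / len(vals) if vals else None

    avg_cpf = safe_avg(cpf)    # average chunks per file
    avg_vol = safe_avg(volms)  # average data volume per file
    num_vars = len(std_vars) if std_vars else None
    # Average chunk size follows from volume and chunk count per file.
    avg_chunk = avg_vol / avg_cpf if (avg_vol and avg_cpf) else None
    return avg_cpf, num_vars, avg_chunk

perform_safe_calculations(['tas', 'pr'], [10, 20, None], [100.0, 200.0, None], 3)
```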
- pipeline.scan.safe_format(value: int, fstring: str) str
Attempt to format a value according to a given fstring template. Handles formatting issues by returning an empty string, typically when value is None.
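A plausible implementation of this helper; the `value` placeholder name in the template is an assumption:

```python
def safe_format(value, fstring: str) -> str:
    """Attempt to format value with an fstring template; return '' on failure."""
    try:
        return fstring.format(value=value)
    except (ValueError, TypeError, KeyError):
        # Typically triggered when value is None and the template
        # requires a numeric format spec.
        return ''

print(safe_format(3.14159, '{value:.2f}'))  # → "3.14"
print(safe_format(None, '{value:.2f}'))     # → ""
```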
- pipeline.scan.scan_config(args, logger, fh=None, logid=None, **kwargs) None
Configure scanning and access the main section; ensure a few key variables are set, then run scan_dataset.
- Parameters:
args – (obj) Set of command line arguments supplied by argparse.
logger – (obj) Logging object for info/debug/error messages. Will create a new logger object if not given one.
fh – (str) Path to file for logger I/O when defining new logger.
logid – (str) If creating a new logger, an id is needed to distinguish this logger from other single processes (typically n of N total processes).
- Returns:
None
- pipeline.scan.scan_dataset(args, logger) None
Main process handler for the scanning phase.
- pipeline.scan.scan_kerchunk(args, logger, nfiles, limiter)
Function to perform scanning with output Kerchunk format.
- pipeline.scan.scan_zarr(args, logger, nfiles, limiter)
Function to perform scanning with output Zarr format.
- pipeline.scan.summarise_json(identifier, ctype: str, logger=None, proj_dir=None) tuple
Open previously written JSON cached files and perform analysis.
- pipeline.scan.write_skip(proj_dir: str, proj_code: str, logger) None
Quick function to write a ‘skipped’ detail file.
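A minimal sketch of this helper; the file name `skipped.txt` and the file's contents are assumptions for illustration:

```python
import os

def write_skip(proj_dir: str, proj_code: str, logger=None) -> None:
    """Illustrative sketch: record that this project was skipped during scanning."""
    # Hypothetical file name and format; the real detail file may differ.
    with open(os.path.join(proj_dir, 'skipped.txt'), 'w') as f:
        f.write(f'{proj_code} - skipped\n')
```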