CFA Creator
The cfapyx package also contains a class for creating CFA-netCDF files. Files produced this way automatically follow the Aggregation conventions introduced in CF-1.12, and will record CF-1.12 as the convention value in their Conventions attribute.
Note
The CF-1.12 convention is applied to the aggregation file only; it does not apply to the fragment files. If the fragment files are compliant with a convention version later than 1.12, that convention is carried over to the aggregation file, but as a minimum all files produced by the Creator tool are CF-1.12 compliant. This also enables these files to be interpreted by cfdm (at that version) and by cf-python.
Logging
To enable the logger for cfapyx, import the logging package and set the basic config for logging operations, specifying the log level and message format:
import logging
logging.basicConfig(level=logging.DEBUG, format='%(levelname)s [%(name)s]: %(message)s')
Create
To then use the creator functionality from this package, use the following.
from cfapyx import CFANetCDF
ds = CFANetCDF(
    set_of_files,  # The set of files to be aggregated
    concat_msg='See individual files for more details',  # Replaces attributes that differ
)
The aggregated dataset can then be created using the create method. Additional kwargs are:
- updates: Update the values of global attributes with new values.
- removals: Remove/ignore some attributes in the aggregated file.
- agg_dims: If the aggregation dimensions are known, state them here. This will improve performance if there are many dimensions that are not aggregated.
ds.create(
    updates={'tracking_id': '1'},
    removals=["PI's dog's name"],
    agg_dims=['time'],
)
It is advised that you check how variables and dimensions have been arranged by cfapyx, which can be inspected via the properties below:
ds.agg_dims
# - The dimensions that are aggregated across the set of files
ds.coord_dims
# - Dimensions which also link to a variable (e.g lat/lon)
ds.pure_dims
# - Dimensions which have a size but no variable/array component.
ds.aggregated_vars
# - Variables which change across the aggregation dimension(s)
ds.identical_vars
# - Variables which do not change across the aggregation dimension(s)
ds.scalar_vars
# - Single-valued variables with no dimensions.
Write
Finally the created dataset can be written to a CFA-netCDF file using the following.
ds.write(output_file)
This file may be read into Xarray as a familiar xarray dataset with:
xarray_ds = xarray.open_dataset(output_file, engine='CFA')
The engine is required to decode the aggregation instructions contained in the CFA-netCDF file. Note that without this engine the aggregation instructions will be displayed but not decoded.
Extension/Parallelisation
The CFANetCDF object now supports the extension/concatenation of multiple aggregation files, enabling parallelisation of the aggregation process. Here 'extension' applies to dimensions that are already aggregated and will be extended by combining multiple aggregations. All aggregation dimensions must already be established in each partial aggregation file (i.e. all files must be time-aggregated, or similar), as aggregating different dimensions in parallel is not supported.
To make use of this functionality, simply apply the create/write process above to batches of fragment files at any size required, then feed the resulting aggregation files back into a new CFANetCDF creator object to merge the aggregations.
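As a sketch of the batching step described above, the helper below splits a list of fragment files into batches of roughly sqrt(n) files each (the batch size recommended in the note further down). The batch_files function and the file names are illustrative, not part of the cfapyx API; the comments indicate where the CFANetCDF create/write calls from earlier sections would run in each parallel job.

```python
import math

def batch_files(files, batch_size=None):
    """Split a list of fragment files into batches for parallel aggregation.

    If batch_size is not given, default to roughly sqrt(n) files per batch,
    producing about sqrt(n) partial aggregation files.
    """
    if batch_size is None:
        batch_size = max(1, math.isqrt(len(files)))
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

# Each batch would then be aggregated in its own parallel job, e.g.:
#   ds = CFANetCDF(batch, concat_msg='See individual files for more details')
#   ds.create()
#   ds.write(f'partial_{job_id}.nca')
# and the resulting partial aggregation files merged in a final dependent job.

files = [f'file_{i}.nc' for i in range(100)]
batches = batch_files(files)
print(len(batches), len(batches[0]))  # 10 batches of 10 files
```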
from cfapyx import CFANetCDF
ds = CFANetCDF(
    set_of_agg_files,  # The set of aggregation files to be combined/extended
    concat_msg='See individual files for more details',  # Replaces attributes that differ
)
Given a suitable setup using parallel jobs (with a dependent final job), it is now possible to parallelise the aggregation process across an arbitrary arrangement of jobs.
Note
For the most time-efficient parallel arrangement, use file batches of size sqrt(n), where n is the total number of files. This creates a set of sqrt(n) partial aggregation files, which minimises the effective total aggregation time to roughly 2*sqrt(n).
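The arithmetic behind this note can be checked with a few lines of Python (the file count is hypothetical). With sqrt(n) parallel jobs each aggregating sqrt(n) files, and one final job aggregating the sqrt(n) partial files, the critical-path work is sqrt(n) + sqrt(n) = 2*sqrt(n) file-aggregations:

```python
import math

n = 10_000  # hypothetical total number of fragment files
batch = math.isqrt(n)  # sqrt(n) = 100 files per batch

# Each of the sqrt(n) parallel jobs processes `batch` files, and the
# final dependent job processes the n // batch partial files, so the
# effective serial work is the sum of those two stages.
effective = batch + (n // batch)
print(effective)  # 200, versus 10_000 for a single serial aggregation
```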