The CFA-Xarray Engine
cfapyx is registered as an available backend to xarray when installed into your virtual environment, which means that xarray will collect the package and add it to its list of backends on performing import xarray.
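For example, assuming a CFA-netCDF file named example.cfa, and assuming the backend is registered under the engine name 'CFA' (check xarray.backends.list_engines() for the exact name in your installation):

    import xarray as xr

    # cfapyx registers itself via an entrypoint, so no explicit import of
    # cfapyx is needed; selecting the engine routes the open through CFA.
    ds = xr.open_dataset('example.cfa', engine='CFA')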
1. Entrypoint
The CFANetCDFBackendEntrypoint() class provides a method called open_dataset which feeds the output of open_cfa_dataset directly back into xarray.
There are other methods which may be useful to implement in the future, but for now only the basic opener is used.
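A minimal sketch of this arrangement, assuming the standard xarray BackendEntrypoint interface (the keyword arguments and the import location of open_cfa_dataset are assumptions here):

    from xarray.backends import BackendEntrypoint

    from cfapyx import open_cfa_dataset  # assumed import location

    class CFANetCDFBackendEntrypoint(BackendEntrypoint):
        def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
            # Defer all CFA handling to the package opener and hand the
            # finished Dataset straight back to xarray.
            return open_cfa_dataset(
                filename_or_obj, drop_variables=drop_variables, **kwargs
            )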
Note
Most CFA classes in this module inherit directly from the equivalent NetCDF4 classes in Xarray, since only a small number of additional or override methods are required. There is therefore a large number of methods within the CFA classes that are not used directly by this package.
Opening a CFA dataset requires opening the CFA-netCDF file into a CFADataStore object (which inherits from Xarray's NetCDF4DataStore), which is then used as the store attribute in later methods inherited from the parent Xarray/NetCDF classes.
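As a rough sketch of that relationship (the filename below is hypothetical):

    from xarray.backends import NetCDF4DataStore

    class CFADataStore(NetCDF4DataStore):
        """Opens the CFA-netCDF file; file handling comes from the
        parent class, with CFA decoding layered on top (section 2)."""

    # Later xarray machinery consumes the opened store via its `store`
    # attribute.
    store = CFADataStore.open('aggregation.cfa')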
2. Datastore
The CFA Datastore handles global and group variables where a group is present, and determines which variables are aggregated variables; these are decoded in open_cfa_variable. The set of variables is returned from get_variables as an xarray FrozenDict of xarray Variable objects. Each Variable is defined using the dimensions, attributes and encoding from the CFA file, but the data is more complicated.
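A sketch of what get_variables does under these constraints; the _is_aggregated helper and the open_cfa_variable signature are assumptions standing in for however the store flags and decodes aggregated variables:

    from xarray.backends import NetCDF4DataStore
    from xarray.core.utils import FrozenDict

    class CFADataStore(NetCDF4DataStore):
        def get_variables(self):
            # Aggregated variables go through the CFA decoder; anything
            # else falls back to the standard NetCDF4 path.
            return FrozenDict(
                (name,
                 self.open_cfa_variable(name, var)
                 if self._is_aggregated(name)
                 else self.open_store_variable(name, var))
                for name, var in self.ds.variables.items()
            )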
3. Data from Fragments
Xarray includes a LazilyIndexedArray wrapper for containing indexable Array-like objects. The data is indexed via this first level of wrapper, so the data isn't loaded immediately upon creating the Xarray variable.
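Illustratively, following xarray's usual variable-construction pattern (fragment_array_wrapper, dims, attributes and encoding are placeholders for values built from the CFA file):

    from xarray import Variable
    from xarray.core.indexing import LazilyIndexedArray

    # Indexing `wrapped` records the selection; no fragment file is
    # touched until the values are actually needed.
    wrapped = LazilyIndexedArray(fragment_array_wrapper)
    variable = Variable(dims, wrapped, attrs=attributes, encoding=encoding)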
cfapyx provides an Array-like wrapper class called FragmentArrayWrapper for handling calls to the whole array of all fragments. At this point we are still dealing with what looks like a single Array in xarray, rather than a set of distributed Fragments. cfapyx handles these fragments as individual CFAPartition objects when no chunks are present, and ChunkWrapper objects otherwise. These are provided to the higher-level Xarray wrapper by way of a Dask array, which is created as the data attribute of each variable.
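A heavily simplified sketch of that construction, treating each partition as one Dask chunk (the partitions list is a placeholder, and one dimension is shown for brevity; real CFA fragments tile an N-dimensional grid):

    import dask.array as da

    # Each partition is array-like (shape, dtype, __getitem__), which is
    # all Dask needs to wrap it without reading any data.
    blocks = [da.from_array(part, chunks=part.shape) for part in partitions]
    data = da.concatenate(blocks, axis=0)  # becomes the variable's `data`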
This means most of the complicated maths to relate an array partition to specific fragments is actually handled by Dask; the only thing the CFAPartition objects have to do is act like indexable Arrays that only load the data when they are indexed.
Since there are several Array-like objects in cfapyx, these all inherit from classes in cfapyx.partition, in the order ArrayLike -> SuperLazyArrayLike -> ArrayPartition. The CFAPartition class is an example of an ArrayPartition, while the FragmentArrayWrapper and ChunkWrapper classes are merely SuperLazyArrayLike children.
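In outline, the hierarchy described above looks like this (the docstrings summarise the roles given in this section; they are not the package's own):

    class ArrayLike:
        """Carries shape and dtype: enough to look like an array."""

    class SuperLazyArrayLike(ArrayLike):
        """Adds deferred slicing: __getitem__ records the requested
        selection instead of reading any data."""

    class ArrayPartition(SuperLazyArrayLike):
        """Adds file access: materialises the recorded selection from
        a single source file when the values are finally requested."""

    class CFAPartition(ArrayPartition):
        """One CFA fragment."""

    class FragmentArrayWrapper(SuperLazyArrayLike):
        """Fronts the whole set of fragments for xarray."""

    class ChunkWrapper(SuperLazyArrayLike):
        """Used in place of CFAPartition when chunks are present."""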
Loading each array partition is done using the Python netCDF4 library: the associated fragment file for the partition is opened, the specific variable (from location) is selected, and a section of the array is loaded as a numpy array, depending on the partition structure. Performing the slice operations on the netCDF4 dataset at the latest possible point ensures that the minimum number of chunks is loaded into memory from the file, since any slicing operations are optimised by Dask before being applied to the source data.
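A sketch of that final load step as a hypothetical helper (netCDF4 applies the slice on read, so only the required chunks leave the file):

    import netCDF4

    def load_fragment(filename, address, selection):
        # `address` names the variable inside the fragment file (from
        # the CFA location/address information); `selection` is the
        # accumulated slice for this partition.
        with netCDF4.Dataset(filename, mode='r') as ds:
            return ds.variables[address][selection]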
4. Result
All of the above is abstracted from the user into the simple open_dataset command from Xarray. The dataset which is built looks like any other xarray dataset, with the loaded attributes and a Dask array for each variable. This only becomes an array of 'real' data following a compute() or plot() method (or equivalent); until then the data has yet to be loaded. Slicing and reduction methods are handled entirely by Dask, such that the operations are combined and applied at the latest stage for optimal performance.
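For example (the file name, variable name and engine name are hypothetical):

    import xarray as xr

    ds = xr.open_dataset('aggregation.cfa', engine='CFA')

    # Purely lazy: builds a Dask graph, reads no fragment data.
    monthly_mean = ds['tas'].isel(time=slice(0, 12)).mean(dim='time')

    # Fragments are opened, sliced and combined only here.
    result = monthly_mean.compute()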
If you would like to read more about lazy loading in general, see this SaturnCloud Blog or visit the Dask Documentation.
For more detail specifically on CFA, see either the CFA specifications or our page on the Inspiration for CFA. The CFA specifications version 0.6.2 will be added to CF-conventions 1.12 in November 2024 with some minor changes.
Finally, if you would like to find out about alternative ways of handling distributed data, specifically for cloud applications, see one of the following:
- padocc (CEDA): Pipeline to Aggregate Data for Optimal Cloud Capabilities
- Kerchunk: Reference format
- Zarr: Cloud Optimised format