Orientation
The Item Generator is part of a framework defined by the Asset Scanner and is built to be modular and extensible. This can be confusing for new users, so this guide acts as an orientation to help them understand what this package can do and how the pieces fit together.
There are various pluggable pieces:
- Input plugins
- Output plugins
- Processors
- Pre/Post processors
These pieces should allow you to construct a workflow which works for your use case, and Python entry points allow you to write your own plugins. The Asset Scanner package stores some common input and output plugins (PRs welcome). This package, Item Generator, contains some processors which extract attributes from files and pass them to the output plugin. You can read more about the processors, and how pre/post processors work, here.
The Item Generator has two levels of configuration. The first is the global configuration, passed at the command line on invocation, which defines the input and output plugins and things like logging configuration.
An example can be found here.
The second level of configuration comes in the form of item-descriptions. These YAML files describe the workflow for extracting facets and other metadata to build the STAC Item. Background for item-descriptions can be found here, and a guide for how to build and test these files is here.
The processors available for constructing these workflows are listed here.
The CEDA repository containing these item-descriptions can be used as an example. One example, which includes extracting metadata from the NetCDF header, is sentinel5:
datasets:
  - /neodc/sentinel5p/data
collection:
  id: Ic93XnsBhuk7QqVbSFwS
facets:
  extraction_methods:
    - name: regex
      description: Extract facets from the file path
      inputs:
        regex: '^\/(?:[^/]*/)(?P<platform>\w*)(?:[^/]*/){3}(?P<product_version>[0-9v.]+)/'
    - name: regex
      description: Extract facets from the filename
      inputs:
        regex: '^(?:[^_]*_){2}(?P<processing_level>[^_]+)__(?P<variable>[^_]+)_{4}(?P<start_datetime>[0-9T]+)_(?P<end_datetime>[0-9T]+)_(?P<orbit>\d+)(?:[^_]*_){3}(?P<datetime>[0-9T]+)'
      pre_processors:
        - name: filename_reducer
      post_processors:
        - name: isodate_processor
          inputs:
            date_keys:
              - start_datetime
              - end_datetime
              - datetime
    - name: header_extract
      description: Extract header metadata
      inputs:
        attributes:
          - institution
          - sensor
  aggregation_facets:
    - platform
    - processing_level
    - variable
    - product_version
    - datetime
The “extraction_methods” are the workflow. In the example above, I extract some facets from the file path, some from the filename and some from the header. To run the regex on the filename rather than the full path, I use the filename_reducer pre-processor, and to convert the extracted dates to ISO 8601 format, I run the isodate_processor post-processor.
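The two regex steps can be tried outside the framework with plain Python. The following is a minimal sketch, not the Item Generator's own code: the function `extract_facets` is hypothetical, `os.path.basename` stands in for what filename_reducer is assumed to do, `datetime.strptime` with an assumed `YYYYMMDDTHHMMSS` input format stands in for isodate_processor, and the header_extract step is omitted because it needs a real NetCDF file. Only the two regular expressions are taken from the item-description above.

```python
import os
import re
from datetime import datetime

# Regular expressions copied from the item-description above.
PATH_REGEX = r'^\/(?:[^/]*/)(?P<platform>\w*)(?:[^/]*/){3}(?P<product_version>[0-9v.]+)/'
FILENAME_REGEX = (
    r'^(?:[^_]*_){2}(?P<processing_level>[^_]+)__(?P<variable>[^_]+)_{4}'
    r'(?P<start_datetime>[0-9T]+)_(?P<end_datetime>[0-9T]+)_(?P<orbit>\d+)'
    r'(?:[^_]*_){3}(?P<datetime>[0-9T]+)'
)
DATE_KEYS = ("start_datetime", "end_datetime", "datetime")


def extract_facets(path):
    """Hypothetical stand-in for the regex steps of the workflow."""
    facets = {}

    # Step 1: regex applied to the full file path.
    match = re.search(PATH_REGEX, path)
    if match:
        facets.update(match.groupdict())

    # Step 2: reduce the path to the filename first (assumed behaviour of
    # filename_reducer), then apply the filename regex.
    match = re.search(FILENAME_REGEX, os.path.basename(path))
    if match:
        facets.update(match.groupdict())
        # Assumed behaviour of isodate_processor: rewrite the listed
        # date keys as ISO 8601 strings. The input format is an assumption.
        for key in DATE_KEYS:
            parsed = datetime.strptime(facets[key], "%Y%m%dT%H%M%S")
            facets[key] = parsed.isoformat()

    return facets


# Hypothetical path: the directory layout below /neodc/sentinel5p/data
# is invented for illustration.
example_path = (
    "/neodc/sentinel5p/data/L2_NO2/v1.3/2020/01/01/"
    "S5P_OFFL_L2__NO2____20200101T000000_20200101T014130_11487_01_010302_20200102T134512.nc"
)
example_facets = extract_facets(example_path)
```

With this hypothetical path, `example_facets` holds `platform`, `product_version`, `processing_level`, `variable`, `orbit` and the three date keys, with the dates rewritten as ISO 8601 strings.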
As all of these “assets” are treated individually, we need a way to make sure they end up together. The aggregation facets are used to generate a STAC Item ID. So for the example above, all assets which return the same value for platform, processing_level, variable, product_version and datetime will be considered one STAC Item and be assigned the same ID.
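The idea of deriving one ID from the aggregation facets can be sketched as follows. This is an illustration only: the function `item_id`, the choice of SHA-256, the `.` separator and the handling of missing facets are all assumptions, not the scheme the Item Generator actually uses.

```python
import hashlib

# Aggregation facets from the item-description above.
AGGREGATION_FACETS = [
    "platform",
    "processing_level",
    "variable",
    "product_version",
    "datetime",
]


def item_id(facets, aggregation_facets=AGGREGATION_FACETS):
    """Hypothetical ID function: hash the aggregation facet values in order.

    Facets that are not in the aggregation list (for example orbit) do not
    influence the ID. Missing facets are treated as empty strings, which is
    an assumption made for this sketch.
    """
    values = [str(facets.get(key, "")) for key in aggregation_facets]
    return hashlib.sha256(".".join(values).encode("utf-8")).hexdigest()


# Two assets with identical aggregation facets but different extra facets.
asset_a = {
    "platform": "sentinel5p",
    "processing_level": "L2",
    "variable": "NO2",
    "product_version": "v1.3",
    "datetime": "2020-01-02T13:45:12",
    "orbit": "11487",
}
asset_b = dict(asset_a, orbit="11488")
# A third asset that differs in one aggregation facet.
asset_c = dict(asset_a, variable="SO2")
```

Here `item_id(asset_a)` and `item_id(asset_b)` are equal, so both assets would land in the same STAC Item, while `item_id(asset_c)` differs.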
This works in Elasticsearch because the documents generated for each asset share the same ID and are merged in an upsert. If you are using another storage system, it will require an aggregation step to join them together. Even with Elasticsearch, lists are not merged in an upsert, but we have not had to deal with this yet.
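For a storage system without upserts, the aggregation step could look something like the sketch below. It uses a plain dictionary as a stand-in for the store and mimics the merge behaviour described above: nested objects are merged key by key, while lists and scalars are overwritten by the latest document. The functions `upsert` and `merge_into`, and the field names in the example documents, are hypothetical.

```python
def merge_into(target, source):
    """Recursively merge source into target.

    Nested dicts are merged key by key; any other value, including a
    list, replaces what was stored before.
    """
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            merge_into(target[key], value)
        else:
            target[key] = value


def upsert(store, doc_id, partial_doc):
    """Insert the document if the ID is new, otherwise merge into it."""
    existing = store.setdefault(doc_id, {})
    merge_into(existing, partial_doc)


store = {}
# Two partial documents that share an ID (field names are invented).
upsert(store, "item-1", {"properties": {"platform": "sentinel5p"}, "files": ["a.nc"]})
upsert(store, "item-1", {"properties": {"variable": "NO2"}, "files": ["b.nc"]})
```

After both calls, `store["item-1"]["properties"]` contains both `platform` and `variable`, but `store["item-1"]["files"]` only holds the list from the second document. Joining lists would need explicit handling in the merge step.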