Processors
Processors take a file and return a dictionary of extracted information. They can be chained one after another, and their results are merged: arrays are appended to, and key:value pairs are overwritten by subsequent write operations.
Some processors can also take pre-processors and post-processors. Pre-processors modify the input arguments, whilst post-processors modify the output from the main processor.
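The merge rule for chained processors can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name `merge_results` is hypothetical, and the recursive merging of nested dictionaries is an assumption that the text does not confirm.

```python
def merge_results(base: dict, update: dict) -> dict:
    """Merge `update` into a copy of `base`.

    Lists are appended to and plain values are overwritten by the later
    processor. Nested dicts are merged recursively (an assumption).
    """
    merged = dict(base)
    for key, value in update.items():
        current = merged.get(key)
        if isinstance(current, list) and isinstance(value, list):
            merged[key] = current + value
        elif isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_results(current, value)
        else:
            merged[key] = value
    return merged


# Hypothetical outputs from two chained processors.
first = {"properties": {"platform": "sat-a", "tags": ["raw"]}}
second = {"properties": {"platform": "sat-b", "tags": ["checked"]}}
combined = merge_results(first, second)
```

Here `platform` is overwritten by the second processor, while the `tags` array keeps the entries from both.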
Core Processors
Processor Name | Description |
---|---|
header_extract | Takes a filepath string and a list of attributes and returns a dictionary of the values extracted from the file header. |
regex | Takes an input string and a regex with named capture groups and returns a dictionary of the values extracted using the named capture groups. |
iso19115 | Extracts attributes from an XML-formatted ISO 19115 record at a given URL. Supports URL templating. |
xml_extract | Processes XML documents to extract metadata. |
Header Extract
- class item_generator.extraction_methods.header_extract.HeaderExtract(**kwargs)
Processor Name
header_extract
Accepts Pre-processors
Accepts Post-processors
- Description:
Takes a filepath string and a list of attributes and returns a dictionary of the values extracted from the file header.
- Configuration Options:
attributes
: A list of attributes to match from the file header.
post_processors
: List of post-processors to apply.
output_key
: When the metadata is returned, this key determines where the metadata is placed in the response. Dot-separated strings can be used to create nested attributes. Default: 'properties'
- Example configuration:
    - name: header_extract
      inputs:
        attributes:
          - institution
          - sensor
          - platform
Regex
- class item_generator.extraction_methods.regex_extract.RegexExtract(**kwargs)
Processor Name
regex
Accepts Pre-processors
Accepts Post-processors
- Description:
Takes an input string and a regex with named capture groups and returns a dictionary of the values extracted using the named capture groups.
- Configuration Options:
regex
: The regular expression to match against the filepath.
pre_processors
: List of pre-processors to apply.
post_processors
: List of post-processors to apply.
output_key
: When the metadata is returned, this key determines where the metadata is placed in the response. Dot-separated strings can be used to create nested attributes. Default: 'properties'
- Example configuration:
    - name: regex
      inputs:
        regex: '^(?:[^_]*_){2}(?P<datetime>\d*)'
      pre_processors:
        - name: filename_reducer
      post_processors:
        - name: isodate_processor
          inputs:
            date_key: datetime
- run(filepath: str, source_media: str = 'POSIX', **kwargs) -> dict
Runs the processor and returns its output.
:param filepath: Path to the object
:param source_media: Media type for the target object
:param kwargs: Free kwargs passed to the processor
:return: dict
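To make the named-capture-group behaviour of the regex processor concrete, the sketch below applies the pattern from the example configuration with Python's standard `re` module. It is an illustration only: the function `extract_named_groups` and the sample path are hypothetical, and reducing the path with `os.path.basename` stands in for the `filename_reducer` pre-processor.

```python
import os
import re

# The pattern from the example configuration.
PATTERN = re.compile(r"^(?:[^_]*_){2}(?P<datetime>\d*)")


def extract_named_groups(filepath: str) -> dict:
    """Return the named capture groups matched against the file name."""
    # Stand-in for the filename_reducer pre-processor.
    filename = os.path.basename(filepath)
    match = PATTERN.search(filename)
    return match.groupdict() if match else {}


# Hypothetical path, used only for illustration.
result = extract_named_groups("/data/sensor_l2_20210329_v1.nc")
no_match = extract_named_groups("/data/plain.nc")
```

The pattern skips the first two underscore-delimited fields and captures the digits that follow as `datetime`. A file name without two underscores yields an empty dictionary.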
ISO 19115 Extract
- class item_generator.extraction_methods.iso19115_extract.ISO19115Extract(**kwargs)
Processor Name
iso19115
Accepts Pre-processors
Accepts Post-processors
- Description:
Takes a URL template and calls out to the URL to retrieve the ISO 19115 record. Use pre-processors to inject additional kwargs, which are passed to the URL template.
- Configuration Options:
url_template
: REQUIRED. String template to build the URL. The template uses the Python string template format.
extraction_keys
: List of keys to retrieve from the response.
pre_processors
: List of pre-processors to apply.
post_processors
: List of post-processors to apply.
output_key
: When the metadata is returned, this key determines where the metadata is placed in the response. Dot-separated strings can be used to create nested attributes. Default: 'properties'
- Extraction Keys:
Each extraction key should be a map with the following fields.
Name | Description |
---|---|
name | Name of the outputted attribute. |
key | Access key to extract the required data. Passed to xml.etree.ElementTree.find(); XPath-formatted accessors are also supported. |
- Example:
    - name: start_datetime
      key: './/gml:beginPosition'
- Example configuration:
    - name: iso19115
      inputs:
        url_template: 'api.catalogue.ceda.ac.uk/export/xml/$uuid.xml'
        extraction_keys:
          - name: start_datetime
            key: './/gml:beginPosition'
      pre_processors:
        - name: ceda_observation
      post_processors:
        - name: isodate_processor
          inputs:
            date_key: datetime
- run(filepath: str, source_media: str = 'POSIX', **kwargs) -> dict
Runs the processor and returns its output.
:param filepath: Path to the object
:param source_media: Media type for the target object
:param kwargs: Free kwargs passed to the processor
:return: dict
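The two mechanisms the ISO 19115 processor relies on, URL templating and key lookup with `xml.etree.ElementTree.find()`, can be tried out with the standard library alone. The sketch below is illustrative rather than the package's code: the uuid value, the inline sample record and the namespace map are assumptions, and no network request is made.

```python
import xml.etree.ElementTree as ET
from string import Template

# Build the URL from the template in the example configuration.
url_template = Template("api.catalogue.ceda.ac.uk/export/xml/$uuid.xml")
url = url_template.substitute(uuid="1234abcd")  # hypothetical uuid

# A tiny stand-in for the retrieved record (assumed structure).
SAMPLE_RECORD = """<root xmlns:gml="http://www.opengis.net/gml/3.2">
  <gml:TimePeriod>
    <gml:beginPosition>2021-03-29T00:00:00</gml:beginPosition>
  </gml:TimePeriod>
</root>"""

# Prefix-to-URI map so the 'gml:' prefix in the key can be resolved.
NAMESPACES = {"gml": "http://www.opengis.net/gml/3.2"}
extraction_keys = [{"name": "start_datetime", "key": ".//gml:beginPosition"}]

root = ET.fromstring(SAMPLE_RECORD)
extracted = {}
for item in extraction_keys:
    element = root.find(item["key"], NAMESPACES)
    if element is not None:
        extracted[item["name"]] = element.text
```

Any additional kwargs injected by a pre-processor would be passed to `substitute()` in the same way as `uuid` here.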
XML Extract
- class item_generator.extraction_methods.xml_extract.XMLExtract(**kwargs)
Processor Name
xml_extract
Accepts Pre-processors
Accepts Post-processors
- Description:
Processes XML documents to extract metadata
- Configuration Options:
extraction_keys
: List of keys to retrieve from the document.
filter_expr
: Regex to match against file paths, limiting extraction attempts to known files.
namespaces
: Map of namespaces.
pre_processors
: List of pre-processors to apply.
post_processors
: List of post-processors to apply.
output_key
: When the metadata is returned, this key determines where the metadata is placed in the response. Dot-separated strings can be used to create nested attributes. Default: 'properties'
- Extraction Keys:
Each extraction key should be a map with the following fields.
Name | Description |
---|---|
name | Name of the outputted attribute. |
key | Access key to extract the required data. Passed to xml.etree.ElementTree.find(); XPath-formatted accessors are also supported. |
attribute | Selects an attribute of the matched element. In the absence of this value, the default behaviour is to access the text value of the element. Use it when the value you need is stored in an attribute rather than in the element text. |
- Example:
    - name: start_datetime
      key: './/gml:beginPosition'
- Example configuration:
    - name: xml_extract
      inputs:
        filter_expr: '\.manifest$'
        extraction_keys:
          - name: start_datetime
            key: './/gml:beginPosition'
- run(filepath: str, source_media: str = 'POSIX', **kwargs) -> dict
Runs the processor and returns its output.
:param filepath: Path to the object
:param source_media: Media type for the target object
:param kwargs: Free kwargs passed to the processor
:return: dict
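The difference between reading an element's text and reading one of its attributes, together with the `filter_expr` check, can be sketched as follows. This is an assumed illustration, not the XMLExtract implementation: the `extract` function, the sample document and the file paths are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# filter_expr from the example configuration.
FILTER_EXPR = re.compile(r"\.manifest$")

# Hypothetical document with both text and attribute values.
SAMPLE = '<manifest><product level="L2">Sentinel</product></manifest>'


def extract(filepath: str, xml_text: str, extraction_keys: list) -> dict:
    """Extract values from xml_text if filepath passes the filter."""
    if not FILTER_EXPR.search(filepath):
        return {}  # not a known file, so no extraction is attempted
    root = ET.fromstring(xml_text)
    output = {}
    for item in extraction_keys:
        element = root.find(item["key"])
        if element is None:
            continue
        attribute = item.get("attribute")
        # With 'attribute' set, read that attribute; otherwise read the text.
        output[item["name"]] = element.get(attribute) if attribute else element.text
    return output


keys = [
    {"name": "product", "key": ".//product"},
    {"name": "level", "key": ".//product", "attribute": "level"},
]
matched = extract("/data/scene.manifest", SAMPLE, keys)
skipped = extract("/data/scene.nc", SAMPLE, keys)
```

Both keys point at the same element. The first returns its text and the second returns its `level` attribute.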
Third-Party Processors
The plugin architecture lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.
Processor Name | Description | Vendor |
---|---|---|
Pre Processors
Pre-processors operate on the input arguments for the main processor.
They can be used to manipulate the input arguments for a given processor to modify its behaviour.
Processor Name | Description |
---|---|
filename_reducer | Takes a file path and returns the filename using os.path.basename. |
ceda_observation | Takes a file path and returns the uuid from the CEDA Catalogue. |
Filename Reducer
- class item_generator.extraction_methods.preprocessors.ReducePathtoName(**kwargs)
Processor Name:
filename_reducer
- Description:
Takes a file path and returns the filename using os.path.basename.
Example Configuration:
    pre_processors:
      - name: filename_reducer
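As a rough picture of what a pre-processor does to the input arguments, the sketch below reduces the file path with `os.path.basename` and passes any other keyword arguments through unchanged. The function name, its return shape and the sample path are assumptions made for illustration. The real class interface may differ.

```python
import os


def reduce_path_to_name(filepath: str, **kwargs):
    """Return the arguments with the path reduced to its final component."""
    return os.path.basename(filepath), kwargs


# Hypothetical call: only the filepath argument is modified.
new_path, extra = reduce_path_to_name(
    "/badc/data/file_a_2021.nc", source_media="POSIX"
)
```

The main processor then works on `file_a_2021.nc` rather than the full path, which keeps patterns such as a filename regex independent of the directory layout.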
CEDA Observation
- class item_generator.extraction_methods.preprocessors.CEDAObservation(**kwargs)
Processor Name:
ceda_observation
- Description:
Takes a file path and returns the CEDA observation record.
- Configuration Options:
url_template
: REQUIRED. URL string template to build the URL. The template uses the Python string template format.
Example Configuration:
    pre_processors:
      - name: ceda_observation
        inputs:
          url_template: http://api.catalogue.ceda.ac.uk/api/v0/obs/get_info$filepath
Third-Party Pre-processors
The plugin architecture lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.
Processor Name | Description | Vendor |
---|---|---|
Post Processors
Post-processors operate on the output from a main processor.
They use the same interface as a main processor, but they also accept the result of the previous step as part of their call signature.
Processor Name | Description |
---|---|
facet_map | In some cases, you may wish to map the header attributes to different facets. This method takes a map and converts the facet labels into those specified. |
isodate_processor | Takes the source dict and the key to access the date and converts the date to ISO 8601 format. |
date_combinator | Automatically converts year, month, day, hour, minute, second keys into an ISO 8601 date. |
stac_bbox | Converts coordinates from a dictionary into RFC 7946, section 5 formatted coordinates. |
string_join | Join facets together to create a new value. |
Facet Map Processor
- class item_generator.extraction_methods.postprocessors.FacetMapProcessor(**kwargs)
Processor Name:
facet_map
- Description:
In some cases, you may wish to map the header attributes to different facets. This method takes a map and converts the facet labels into those specified.
- Configuration Options:
term_map
: Dictionary of terms to map
Example Configuration:
    post_processors:
      - name: facet_map
        inputs:
          term_map:
            time_coverage_start: start_time
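The effect of a `term_map` can be illustrated with a small key-renaming helper. This is a sketch under assumptions, not the FacetMapProcessor code: the function `apply_term_map` and the sample header values are hypothetical.

```python
def apply_term_map(source: dict, term_map: dict) -> dict:
    """Rename keys found in term_map and leave all other keys untouched."""
    return {term_map.get(key, key): value for key, value in source.items()}


# Hypothetical output from a main processor.
header = {"time_coverage_start": "2021-03-29", "platform": "sat-a"}
mapped = apply_term_map(header, {"time_coverage_start": "start_time"})
```

With the term map from the example configuration, `time_coverage_start` is relabelled as `start_time` while `platform` passes through unchanged.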
ISO Date Processor
- class item_generator.extraction_methods.postprocessors.ISODateProcessor(**kwargs)
Processor Name:
isodate_processor
- Description:
Takes the source dict and the key to access the date and converts the date to ISO 8601 format, e.g.
- YYYY-MM-DDTHH:MM:SS.ffffff, if microsecond is not 0
- YYYY-MM-DDTHH:MM:SS, if microsecond is 0

If the date cannot be parsed, it is removed from the source dict and an error is logged.
- Configuration Options:
date_keys
: REQUIRED. List of keys to the date values. Using a list allows processing of multiple dates.
format
: Optional format string. The default behaviour uses dateutil.parser.parse. If a format string is supplied, datetime.datetime.strptime is used instead.
Example Configuration:
    post_processors:
      - name: isodate_processor
        inputs:
          date_keys:
            - key: date
              format: '%Y%m'
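The conversion behaviour described for this processor can be approximated with the standard library. The sketch below covers only the format-string path through `datetime.datetime.strptime`. The `dateutil` default is left out so that the example has no third-party dependency. The helper `convert_dates` and the sample values are assumptions, not the package's code.

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def convert_dates(source: dict, date_keys: list, fmt: str) -> dict:
    """Convert each listed key to ISO 8601, dropping unparseable values."""
    result = dict(source)
    for key in date_keys:
        if key not in result:
            continue
        try:
            result[key] = datetime.strptime(result[key], fmt).isoformat()
        except ValueError:
            # Mirror the described behaviour: log an error, remove the key.
            logger.error("Could not parse %r for key %r", result[key], key)
            result.pop(key)
    return result


converted = convert_dates({"date": "202103", "platform": "sat-a"}, ["date"], "%Y%m")
dropped = convert_dates({"date": "not-a-date"}, ["date"], "%Y%m")

# isoformat() only includes microseconds when they are non-zero.
with_micro = datetime(2021, 3, 29, 12, 30, 5, 250000).isoformat()
```

`datetime.isoformat()` produces the two output shapes listed in the description: the fractional part appears only when the microsecond value is non-zero.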
Date Combinator Processor
- class item_generator.extraction_methods.postprocessors.DateCombinatorProcessor(**kwargs)
Processor Name:
date_combinator
- Description:
Used to automatically join date components to create an ISO 8601 date. The components are:
- year (required)
- month
- day
- hour
- minutes
- seconds

Note
If you are only expecting to extract <year>-<month>, make sure to include a format string. dateutil.parser.parse will use the current day to fill the blank rather than 01, e.g. 2021-03 -> 2021-03-29T00:00:00. Using format: %Y-%m will result in 2021-03 -> 2021-03-01T00:00:00.
- Configuration Options:
destructive
: Whether the keys are removed from the output when combined. Default: true
output_key
: Name of the key you would like to output. Default: datetime
format
: Format string to parse the date to an ISO date. The date template is: ${year}-${month}-${day}T${hour}:${minute}:${second}. The format string is passed to datetime.datetime.strptime.
Example Configuration:
    post_processors:
      - name: date_combinator
        inputs:
          destructive: true
          format: '%Y-%m'
          output_key: datetime
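The year-and-month case with an explicit format string can be sketched as below. This is a simplified illustration, not the DateCombinatorProcessor itself: the helper `combine_date` only handles `year` and `month`, and how the real processor builds the date string from the template is not shown in this document.

```python
from datetime import datetime


def combine_date(source: dict, fmt: str, output_key: str = "datetime",
                 destructive: bool = True) -> dict:
    """Join year and month, parse with fmt and store an ISO 8601 string."""
    result = dict(source)
    date_string = f"{result['year']}-{result['month']}"
    # strptime fills the missing day with 01, not the current day.
    result[output_key] = datetime.strptime(date_string, fmt).isoformat()
    if destructive:
        for key in ("year", "month"):
            result.pop(key, None)
    return result


# Hypothetical facets from an earlier processor.
facets = {"year": "2021", "month": "03", "platform": "sat-a"}
combined = combine_date(facets, "%Y-%m")
kept = combine_date(facets, "%Y-%m", destructive=False)
```

With `destructive` left at its default, the `year` and `month` keys are removed once they have been combined. Setting it to false keeps them alongside the new key.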
STAC BBOX Processor
- class item_generator.extraction_methods.postprocessors.BBOXProcessor(**kwargs)
Processor Name:
stac_bbox
- Description:
Accepts a dictionary of coordinate values and converts to RFC 7946, section 5 formatted bbox.
- Configuration Options:
key_list
: REQUIRED. List of keys to convert to a bbox array. Ordering is respected.
Example Configuration:
    post_processors:
      - name: stac_bbox
        inputs:
          key_list:
            - west
            - south
            - east
            - north
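Because the ordering of `key_list` is respected, the list order decides the layout of the resulting array. The sketch below shows the idea with a hypothetical helper `to_bbox` and made-up coordinate values. Converting the values with `float` is an assumption about how string inputs would be handled.

```python
def to_bbox(source: dict, key_list: list) -> list:
    """Build a bbox array by reading the keys in the order given."""
    return [float(source[key]) for key in key_list]


# Hypothetical coordinate facets, stored as strings.
coords = {"north": "60.5", "south": "49.9", "east": "1.8", "west": "-8.6"}
bbox = to_bbox(coords, ["west", "south", "east", "north"])
```

RFC 7946, section 5 orders a two-dimensional bounding box as west, south, east, north, which is why the example configuration lists the keys in that order.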
String Join Processor
- class item_generator.extraction_methods.postprocessors.StringJoinProcessor(**kwargs)
Processor Name:
string_join
- Description:
Accepts a dictionary. String values are popped from the dictionary, joined, and put back into the dictionary under the output_key specified.
- Configuration Options:
key_list
: REQUIRED. List of keys whose values are joined. Ordering is respected.
delimiter
: REQUIRED. Text delimiter to put between strings.
output_key
: REQUIRED. Name of the key you would like to output.
Example Configuration:
    post_processors:
      - name: string_join
        inputs:
          key_list:
            - year
            - month
            - day
          delimiter: "-"
          output_key: datetime
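The pop-and-join behaviour can be pictured with a short helper. It is a sketch under assumptions rather than the StringJoinProcessor source: the function name `string_join` mirrors the processor name for readability only, and skipping keys that are absent is an assumed behaviour.

```python
def string_join(source: dict, key_list: list, delimiter: str,
                output_key: str) -> dict:
    """Pop the listed keys, join their values and store the result."""
    result = dict(source)
    # Keys are popped in key_list order; missing keys are skipped (assumed).
    parts = [str(result.pop(key)) for key in key_list if key in result]
    result[output_key] = delimiter.join(parts)
    return result


# Hypothetical facets from an earlier processor.
facets = {"year": "2021", "month": "03", "day": "29", "platform": "sat-a"}
joined = string_join(facets, ["year", "month", "day"], "-", "datetime")
```

Using the values from the example configuration, the three date components are replaced by a single `datetime` entry.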
Third-Party Post-processors
The plugin architecture lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.
Processor Name | Description | Vendor |
---|---|---|