Processors

Processors take a file and return a dictionary of extracted information. They can be chained one after another, and the results are merged so that arrays are appended to and key:value pairs are overwritten by later write operations.
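The merge rule described above can be sketched as follows. This is a minimal illustration and not the library's own code: the function name `merge_results` and the sample values are hypothetical, and the recursion into nested dictionaries is an assumption.

```python
def merge_results(base: dict, update: dict) -> dict:
    """Merge `update` into a copy of `base` (hypothetical helper).

    Lists are appended to, nested dicts are merged recursively
    (an assumption), and any other value is overwritten.
    """
    merged = dict(base)
    for key, value in update.items():
        current = merged.get(key)
        if isinstance(value, list) and isinstance(current, list):
            merged[key] = current + value
        elif isinstance(value, dict) and isinstance(current, dict):
            merged[key] = merge_results(current, value)
        else:
            merged[key] = value
    return merged


# Output of two chained processors (sample values are made up)
first = {"properties": {"platform": "faam", "tags": ["a"]}}
second = {"properties": {"platform": "bae-146", "tags": ["b"]}}
combined = merge_results(first, second)
```

Here `platform` takes the value from the second processor, while `tags` holds the values from both.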

Some processors can also take pre-processors and post-processors. Pre-processors modify the input arguments, whilst post-processors modify the output from the main processor.
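The order in which the three stages run can be sketched as below. This is an illustration under stated assumptions, not the package's interface: `run_with_hooks`, `stem_processor` and `upper_case_values` are hypothetical names, and real processors are configured by name in YAML rather than passed as functions.

```python
import os


def run_with_hooks(main, filepath, pre_processors=(), post_processors=()):
    """Apply pre-processors to the input, run the main processor,
    then apply post-processors to its output (hypothetical helper)."""
    for pre in pre_processors:
        filepath = pre(filepath)
    output = main(filepath)
    for post in post_processors:
        output = post(output)
    return output


def stem_processor(filepath: str) -> dict:
    # Stand-in for a main processor: returns a dict of extracted values
    return {"stem": filepath.split(".")[0]}


def upper_case_values(source: dict) -> dict:
    # Stand-in for a post-processor: receives the main processor's output
    return {key: value.upper() for key, value in source.items()}


result = run_with_hooks(
    stem_processor,
    "/data/2021/sensor_a.nc",
    pre_processors=[os.path.basename],
    post_processors=[upper_case_values],
)
```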

Core Processors

Processor Name	Description
header_extract	Takes a filepath string and a list of attributes and returns a dictionary of the values extracted from the file header.
regex	Takes an input string and a regex with named capture groups and returns a dictionary of the values extracted using the named capture groups.
iso19115	Extracts attributes from an xml formatted ISO19115 record at a given URL. Supports URL templating.
xml_extract	Processes XML documents to extract metadata.

Header Extract

class item_generator.extraction_methods.header_extract.HeaderExtract(**kwargs)

Processor Name	`header_extract`
Accepts Pre-processors
Accepts Post-processors

Description:

Takes a filepath string and a list of attributes and returns a dictionary of the values extracted from the file header.

Configuration Options:

attributes: A list of attributes to match for from the file header
post_processors: List of post_processors to apply
output_key: Determines where the extracted metadata is placed in the response. Dot-separated strings can be used to create nested attributes. default: 'properties'
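As an illustration of the dot-separated output_key, the sketch below nests extracted metadata under each part of the key. The helper name `place_at_key` and the sample values are hypothetical; the package may build the nested structure differently.

```python
def place_at_key(metadata: dict, output_key: str = "properties") -> dict:
    """Wrap `metadata` in nested dicts, one level per dot-separated part."""
    nested = metadata
    for part in reversed(output_key.split(".")):
        nested = {part: nested}
    return nested


default_placement = place_at_key({"platform": "faam"})
nested_placement = place_at_key({"platform": "faam"}, "properties.extra")
```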

Example configuration:

- name: header_extract
  inputs:
    attributes:
        - institution
        - sensor
        - platform

Regex

class item_generator.extraction_methods.regex_extract.RegexExtract(**kwargs)

Processor Name	`regex`
Accepts Pre-processors
Accepts Post-processors

Description:

Takes an input string and a regex with named capture groups and returns a dictionary of the values extracted using the named capture groups.

Configuration Options:

regex: The regular expression to match against the filepath
pre_processors: List of pre-processors to apply
post_processors: List of post_processors to apply
output_key: Determines where the extracted metadata is placed in the response. Dot-separated strings can be used to create nested attributes. default: 'properties'

Example configuration:

- name: regex
  inputs:
    regex: '^(?:[^_]*_){2}(?P<datetime>\d*)'
  pre_processors:
    - name: filename_reducer
  post_processors:
    - name: isodate_processor
      inputs:
        date_key: datetime
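To see what the regex in this configuration captures, the sketch below applies it to a filename with Python's `re` module. The file path is made up for illustration, and `extract_named_groups` is a hypothetical helper, not part of the package.

```python
import os
import re

# Same pattern as in the example configuration above
PATTERN = re.compile(r"^(?:[^_]*_){2}(?P<datetime>\d*)")


def extract_named_groups(text: str) -> dict:
    """Return the named capture groups as a dict, or {} when there is no match."""
    match = PATTERN.search(text)
    return match.groupdict() if match else {}


# The filename_reducer pre-processor reduces the path to its basename first
filename = os.path.basename("/archive/sat_l1_20210329_v1.nc")
extracted = extract_named_groups(filename)
no_match = extract_named_groups("nounderscores.nc")
```

The pattern skips the first two underscore-delimited fields and captures the digits that follow, so `extracted` holds the date string under the `datetime` key.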

run(filepath: str, source_media: str = 'POSIX', **kwargs) → dict

The action of running the processor and returning an output.

Parameters:

filepath: Path to object
source_media: Media type for the target object
kwargs: Free kwargs passed to the processor

Returns: dict

ISO 19115 Extract

class item_generator.extraction_methods.iso19115_extract.ISO19115Extract(**kwargs)

Processor Name	`iso19115`
Accepts Pre-processors
Accepts Post-processors

Description:

Takes a URL template and calls out to the resulting URL to retrieve the ISO 19115 record. Use pre-processors to inject additional kwargs, which are passed to the URL template.

Configuration Options:

url_template: REQUIRED String template to build the URL. The template uses the Python string template format.
extraction_keys: List of keys to retrieve from the response.
pre_processors: List of pre-processors to apply
post_processors: List of post_processors to apply
output_key: Determines where the extracted metadata is placed in the response. Dot-separated strings can be used to create nested attributes. default: 'properties'
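The url_template option uses `$name` placeholders, which can be illustrated with `string.Template` from the standard library. The uuid value below is made up, and this sketch assumes the kwargs injected by pre-processors are passed straight to the template.

```python
from string import Template

url_template = Template("api.catalogue.ceda.ac.uk/export/xml/$uuid.xml")

# kwargs as a pre-processor might supply them (hypothetical value)
kwargs = {"uuid": "0123abcd"}
url = url_template.substitute(**kwargs)

# substitute() raises KeyError when a placeholder has no value
try:
    url_template.substitute()
    missing_placeholder = False
except KeyError:
    missing_placeholder = True
```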

Extraction Keys:

Extraction keys should be a map.

Name	Description
`name`	Name of the outputted attribute
`key`	Access key to extract the required data. Passed to xml.etree.ElementTree.find() and also supports XPath-formatted accessors

Example:

- name: start_datetime
  key: './/gml:beginPosition'

Example configuration:

- name: iso19115
  inputs:
    url_template: 'api.catalogue.ceda.ac.uk/export/xml/$uuid.xml'
    extraction_keys:
      - name: start_datetime
        key: './/gml:beginPosition'
  pre_processors:
    - name: ceda_observation
  post_processors:
    - name: isodate_processor
      inputs:
        date_key: datetime

run(filepath: str, source_media: str = 'POSIX', **kwargs) → dict

The action of running the processor and returning an output.

Parameters:

filepath: Path to object
source_media: Media type for the target object
kwargs: Free kwargs passed to the processor

Returns: dict

XML Extract

class item_generator.extraction_methods.xml_extract.XMLExtract(**kwargs)

Processor Name	`xml_extract`
Accepts Pre-processors
Accepts Post-processors

Description:

Processes XML documents to extract metadata

Configuration Options:

extraction_keys: List of keys to retrieve from the document.
filter_expr: Regex to match against files to limit the attempts to known files
namespaces: Map of namespaces
pre_processors: List of pre-processors to apply
post_processors: List of post_processors to apply
output_key: Determines where the extracted metadata is placed in the response. Dot-separated strings can be used to create nested attributes. default: 'properties'

Extraction Keys:

Extraction keys should be a map.

Name	Description
`name`	Name of the outputted attribute
`key`	Access key to extract the required data. Passed to xml.etree.ElementTree.find() and also supports XPath-formatted accessors
`attribute`	Allows you to select from the element attributes. In the absence of this value, the default behaviour is to access the text value of the key. In some cases, you might want to access an attribute of the element instead.

Example:

- name: start_datetime
  key: './/gml:beginPosition'
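How an extraction key is resolved can be sketched with `xml.etree.ElementTree` directly. The XML document, the `frame` attribute and the `extract` helper are hypothetical, and the `namespaces` map is passed to `find()` so that the `gml:` prefix resolves.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, used only for illustration
DOCUMENT = """<record xmlns:gml="http://www.opengis.net/gml/3.2">
  <gml:TimePeriod>
    <gml:beginPosition frame="#ISO-8601">2021-03-29T00:00:00</gml:beginPosition>
  </gml:TimePeriod>
</record>"""

NAMESPACES = {"gml": "http://www.opengis.net/gml/3.2"}


def extract(root, extraction_keys, namespaces):
    """Resolve each extraction key against the parsed document."""
    output = {}
    for item in extraction_keys:
        element = root.find(item["key"], namespaces)
        if element is None:
            continue  # key not present in this document
        if "attribute" in item:
            value = element.get(item["attribute"])
        else:
            value = element.text
        if value is not None:
            output[item["name"]] = value
    return output


keys = [
    {"name": "start_datetime", "key": ".//gml:beginPosition"},
    {"name": "time_frame", "key": ".//gml:beginPosition", "attribute": "frame"},
    {"name": "end_datetime", "key": ".//gml:endPosition"},
]
result = extract(ET.fromstring(DOCUMENT), keys, NAMESPACES)
```

Skipping keys that are not found is a choice made for this sketch; the package may handle missing elements differently.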

Example configuration:

- name: xml_extract
  inputs:
    filter_expr: '\.manifest$'
    extraction_keys:
      - name: start_datetime
        key: './/gml:beginPosition'

run(filepath: str, source_media: str = 'POSIX', **kwargs) → dict

The action of running the processor and returning an output.

Parameters:

filepath: Path to object
source_media: Media type for the target object
kwargs: Free kwargs passed to the processor

Returns: dict

Third-Party Processors

The plugin nature lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.

Processor Name	Description	Vendor

Pre Processors

Pre-processors operate on the input arguments for the main processor. They can be used to manipulate those arguments and so modify the processor's behaviour.

Processor Name	Description
filename_reducer	Takes a file path and returns the filename using `os.path.basename`.
ceda_observation	Takes a file path and returns the uuid from the CEDA Catalogue.

Filename Reducer

class item_generator.extraction_methods.preprocessors.ReducePathtoName(**kwargs)

Processor Name: filename_reducer

Description: Takes a file path and returns the filename using os.path.basename.

Example Configuration:

pre_processors:
  - name: filename_reducer

CEDA Observation

class item_generator.extraction_methods.preprocessors.CEDAObservation(**kwargs)

Processor Name: ceda_observation

Description:

Takes a file path and returns the ceda observation record.

Configuration Options:

url_template: REQUIRED URL string template to build the URL. The template uses the Python string template format.

Example Configuration:

pre_processors:
  - name: ceda_observation
    inputs:
      url_template: http://api.catalogue.ceda.ac.uk/api/v0/obs/get_info$filepath

Third-Party Pre-processors

The plugin nature lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.

Processor Name	Description	Vendor

Post Processors

Post-processors operate on the output from a main processor. They use the same interface as a main processor, but they also accept the result of the previous step as part of their signature.

Processor Name	Description
facet_map	In some cases, you may wish to map the header attributes to different facets. This method takes a map and converts the facet labels into those specified.
isodate_processor	Takes the source dict and the key to access the date and converts the date to ISO 8601 Format.
date_combinator	Automatically converts year, month, day, hour, minute, second keys into an ISO 8601 date.
stac_bbox	Converts coordinates from a dictionary into RFC 7946, section 5 formatted coordinates
string_join	Join facets together to create a new value.

ISO Date Processor

class item_generator.extraction_methods.postprocessors.ISODateProcessor(**kwargs)

Processor Name: isodate_processor

Description:

Takes the source dict and the key to access the date and converts the date to ISO 8601 Format.

e.g.

YYYY-MM-DDTHH:MM:SS.ffffff, if microsecond is not 0
YYYY-MM-DDTHH:MM:SS, if microsecond is 0

If the date format cannot be parsed, it is removed from the source dict with an error logged.

Configuration Options:

date_keys: REQUIRED List of keys to the date values. Using a list allows processing of multiple dates.
format: Optional format string. The default behaviour uses dateutil.parser.parse. If a format string is supplied, datetime.datetime.strptime is used instead.

Example Configuration:

post_processors:
    - name: isodate_processor
      inputs:
        date_keys:
          - key: date
        format: '%Y%m'
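The format-string path can be sketched with the standard library alone. `to_isodate` is a hypothetical helper: it takes a plain list of key names (an assumption about the shape of date_keys), covers only the `strptime` branch, and drops values that cannot be parsed, as described above.

```python
from datetime import datetime


def to_isodate(source: dict, date_keys: list, fmt: str) -> dict:
    """Convert the values at `date_keys` to ISO 8601 strings.

    Values that do not match `fmt` are removed from the result.
    """
    output = dict(source)
    for key in date_keys:
        if key not in output:
            continue
        try:
            output[key] = datetime.strptime(output[key], fmt).isoformat()
        except ValueError:
            # Unparseable date: drop it (the real processor also logs an error)
            output.pop(key)
    return output


converted = to_isodate(
    {"date": "202103", "bad_date": "March", "platform": "faam"},
    ["date", "bad_date"],
    "%Y%m",
)
```

`datetime.isoformat()` omits the microsecond part when it is 0, which matches the two output forms listed in the description.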

Date Combinator Processor

class item_generator.extraction_methods.postprocessors.DateCombinatorProcessor(**kwargs)

Processor Name: date_combinator

Description:

Used to automatically join date components to create an ISO 8601 date. E.g.

- year (required)
- month
- day
- hour
- minutes
- seconds

Note

If you are only expecting to extract <year>-<month>, make sure to include a format string. dateutil.parser.parse will use the current day to fill the blank rather than 01, e.g. 2021-03 -> 2021-03-29T00:00:00. Using format: %Y-%m will result in 2021-03 -> 2021-03-01T00:00:00.

Configuration Options:

destructive: Whether the keys are removed from the output when combined. DEFAULT: true
output_key: Name of the key you would like to output. DEFAULT: datetime
format: Format string used to parse the combined date into an ISO date. The date template is ${year}-${month}-${day}T${hour}:${minute}:${second}. The format string is passed to datetime.datetime.strptime.

Example Configuration:

post_processors:
    - name: date_combinator
      inputs:
        destructive: true
        format: '%Y-%m'
        output_key: datetime
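The effect of this configuration can be sketched as below. `combine_date` is a hypothetical helper, and the way missing components are left out of the date string is an assumption made for this sketch; only the `strptime` path with an explicit format is shown.

```python
from datetime import datetime

DATE_KEYS = ("year", "month", "day")
TIME_KEYS = ("hour", "minute", "second")


def combine_date(source, fmt, output_key="datetime", destructive=True):
    """Join the date components found in `source` and parse them with `fmt`."""
    output = dict(source)
    date_part = "-".join(str(output[k]) for k in DATE_KEYS if k in output)
    time_part = ":".join(str(output[k]) for k in TIME_KEYS if k in output)
    date_string = f"{date_part}T{time_part}" if time_part else date_part
    output[output_key] = datetime.strptime(date_string, fmt).isoformat()
    if destructive:
        # Remove the individual components once they have been combined
        for key in DATE_KEYS + TIME_KEYS:
            output.pop(key, None)
    return output


combined = combine_date({"year": "2021", "month": "03", "platform": "faam"}, "%Y-%m")
kept = combine_date({"year": "2021", "month": "03"}, "%Y-%m", destructive=False)
```

With the `%Y-%m` format the day defaults to 01, which is the behaviour described in the note above.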

STAC BBOX Processor

class item_generator.extraction_methods.postprocessors.BBOXProcessor(**kwargs)

Processor Name: stac_bbox

Description:

Accepts a dictionary of coordinate values and converts to RFC 7946, section 5 formatted bbox.

Configuration Options:

key_list: REQUIRED list of keys to convert to bbox array. Ordering is respected.

Example Configuration:

post_processors:
    - name: stac_bbox
      inputs:
        key_list:
           - west
           - south
           - east
           - north
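A minimal sketch of the conversion, assuming the coordinate values arrive as strings or numbers keyed by the names in key_list. `to_bbox` and the sample coordinates are hypothetical. The order of key_list determines the order of the array, and RFC 7946 expects west, south, east, north for a 2D bbox.

```python
def to_bbox(source: dict, key_list: list) -> list:
    """Build a bbox array from `source`, following the order of `key_list`."""
    return [float(source[key]) for key in key_list]


coordinates = {"north": "60.5", "south": "49.9", "east": "1.8", "west": "-8.6"}
bbox = to_bbox(coordinates, ["west", "south", "east", "north"])
```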

String Join Processor

class item_generator.extraction_methods.postprocessors.StringJoinProcessor(**kwargs)

Processor Name: string_join

Description:

Accepts a dictionary. The string values for the listed keys are popped from the dictionary, joined with the delimiter, and put back into the dictionary under the specified output_key.

Configuration Options:

key_list: REQUIRED list of keys to join. Ordering is respected.
delimiter: REQUIRED text delimiter to put between strings
output_key: REQUIRED name of the key you would like to output

Example Configuration:

post_processors:
    - name: string_join
      inputs:
        key_list:
           - year
           - month
           - day
        delimiter: "-"
        output_key: datetime
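The behaviour of this configuration can be sketched as follows. `string_join` here is a hypothetical stand-alone function, not the processor class, and skipping keys that are absent from the dictionary is an assumption.

```python
def string_join(source: dict, key_list: list, delimiter: str, output_key: str) -> dict:
    """Pop the listed keys, join their values and store the result at `output_key`."""
    output = dict(source)
    values = [str(output.pop(key)) for key in key_list if key in output]
    output[output_key] = delimiter.join(values)
    return output


joined = string_join(
    {"year": "2021", "month": "03", "day": "29", "platform": "faam"},
    ["year", "month", "day"],
    "-",
    "datetime",
)
```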

Third-Party Post-processors

The plugin nature lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.

Processor Name	Description	Vendor
