Processors

Processors take a file and return a dictionary of extracted information. They can be chained one after the other, and their results are merged: arrays are appended to, and key:value pairs are overwritten by later processors.
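The merge behaviour described above can be sketched as follows. This is a minimal illustration, not the library's own code: the helper name merge_results and the recursive handling of nested dictionaries are assumptions.

```python
def merge_results(base: dict, update: dict) -> dict:
    """Merge `update` into a copy of `base` (hypothetical helper)."""
    merged = dict(base)
    for key, value in update.items():
        existing = merged.get(key)
        if isinstance(existing, dict) and isinstance(value, dict):
            # Assumption: nested dictionaries are merged recursively.
            merged[key] = merge_results(existing, value)
        elif isinstance(existing, list) and isinstance(value, list):
            # Arrays are appended to.
            merged[key] = existing + value
        else:
            # Plain key:value pairs are overwritten by the later processor.
            merged[key] = value
    return merged


# Made-up outputs of two chained processors.
first = {"properties": {"platform": "sat-a", "tags": ["raw"]}}
second = {"properties": {"platform": "sat-b", "tags": ["l1"]}}
combined = merge_results(first, second)
print(combined)
# {'properties': {'platform': 'sat-b', 'tags': ['raw', 'l1']}}
```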

Some processors can also take pre-processors and post-processors. Pre-processors modify the input arguments, whilst post-processors modify the output from the main processor.

Core Processors

Processor Name | Description
header_extract | Takes a filepath string and a list of attributes and returns a dictionary of the values extracted from the file header.
regex | Takes an input string and a regex with named capture groups and returns a dictionary of the values extracted using the named capture groups.
iso19115 | Extracts attributes from an XML-formatted ISO 19115 record at a given URL. Supports URL templating.
xml_extract | Processes XML documents to extract metadata.

Header Extract

class item_generator.extraction_methods.header_extract.HeaderExtract(**kwargs)

Processor Name: header_extract

Accepts Pre-processors

Accepts Post-processors

Description:

Takes a filepath string and a list of attributes and returns a dictionary of the values extracted from the file header.

Configuration Options:
  • attributes: A list of attributes to extract from the file header

  • post_processors: List of post-processors to apply

  • output_key: When the metadata is returned, this key determines where the metadata is placed in the response. Dot-separated strings can be used to create nested attributes. Default: 'properties'
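As a rough illustration of how a dot-separated output_key could produce nested attributes, consider the sketch below. The helper place_at_output_key is hypothetical and is not taken from the library.

```python
def place_at_output_key(metadata: dict, output_key: str = "properties") -> dict:
    """Wrap `metadata` under a dot-separated key path (hypothetical helper)."""
    result = metadata
    # Build from the innermost key outwards: 'a.b' -> {'a': {'b': metadata}}
    for part in reversed(output_key.split(".")):
        result = {part: result}
    return result


# Made-up extracted metadata.
extracted = {"institution": "Example Institute"}
print(place_at_output_key(extracted))
# {'properties': {'institution': 'Example Institute'}}
print(place_at_output_key(extracted, "properties.header"))
# {'properties': {'header': {'institution': 'Example Institute'}}}
```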

Example configuration:
- name: header_extract
  inputs:
    attributes:
        - institution
        - sensor
        - platform

Regex

class item_generator.extraction_methods.regex_extract.RegexExtract(**kwargs)

Processor Name: regex

Accepts Pre-processors

Accepts Post-processors

Description:

Takes an input string and a regex with named capture groups and returns a dictionary of the values extracted using the named capture groups.

Configuration Options:
  • regex: The regular expression to match against the filepath

  • pre_processors: List of pre-processors to apply

  • post_processors: List of post-processors to apply

  • output_key: When the metadata is returned, this key determines where the metadata is placed in the response. Dot-separated strings can be used to create nested attributes. Default: 'properties'

Example configuration:
- name: regex
  inputs:
    regex: '^(?:[^_]*_){2}(?P<datetime>\d*)'
  pre_processors:
    - name: filename_reducer
  post_processors:
    - name: isodate_processor
      inputs:
        date_key: datetime

run(filepath: str, source_media: str = 'POSIX', **kwargs) -> dict

Runs the processor and returns its output.

Parameters:
  • filepath: Path to the object
  • source_media: Media type of the target object
  • kwargs: Free keyword arguments passed to the processor

Returns: dict
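The named-capture-group extraction used by the regex processor can be sketched with the standard re module. The pattern is the one from the example configuration; the file path and the helper extract_named_groups are made up for illustration, and os.path.basename stands in for the filename_reducer pre-processor.

```python
import os
import re

# Pattern from the example configuration: skip two underscore-terminated
# fields, then capture the following run of digits as `datetime`.
PATTERN = re.compile(r"^(?:[^_]*_){2}(?P<datetime>\d*)")


def extract_named_groups(filepath: str) -> dict:
    """Return the named capture groups found in the file name (hypothetical helper)."""
    # Stand-in for the filename_reducer pre-processor: keep only the file name.
    filename = os.path.basename(filepath)
    match = PATTERN.search(filename)
    return match.groupdict() if match else {}


# Hypothetical file path, used only for illustration.
result = extract_named_groups("/data/example/sensor_l1_20210329_v1.nc")
print(result)  # {'datetime': '20210329'}
```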

ISO 19115 Extract

class item_generator.extraction_methods.iso19115_extract.ISO19115Extract(**kwargs)

Processor Name: iso19115

Accepts Pre-processors

Accepts Post-processors

Description:

Takes a URL template and requests the resulting URL to retrieve the ISO 19115 record. Use pre-processors to inject additional kwargs, which are passed to the URL template.

Configuration Options:
  • url_template: REQUIRED String template used to build the URL. The template uses the Python string template format.

  • extraction_keys: List of keys to retrieve from the response.

  • pre_processors: List of pre-processors to apply

  • post_processors: List of post-processors to apply

  • output_key: When the metadata is returned, this key determines where the metadata is placed in the response. Dot-separated strings can be used to create nested attributes. Default: 'properties'

Extraction Keys:

Each extraction key should be a map with the following entries.

Name | Description
name | Name of the output attribute
key | Access key to extract the required data. Passed to xml.etree.ElementTree.find() and also supports XPath-formatted accessors

Example:
- name: start_datetime
  key: './/gml:beginPosition'

Example configuration:
- name: iso19115
  inputs:
    url_template: 'api.catalogue.ceda.ac.uk/export/xml/$uuid.xml'
    extraction_keys:
      - name: start_datetime
        key: './/gml:beginPosition'
  pre_processors:
    - name: ceda_observation
  post_processors:
    - name: isodate_processor
      inputs:
        date_key: datetime

run(filepath: str, source_media: str = 'POSIX', **kwargs) -> dict

Runs the processor and returns its output.

Parameters:
  • filepath: Path to the object
  • source_media: Media type of the target object
  • kwargs: Free keyword arguments passed to the processor

Returns: dict
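The URL templating step can be sketched with Python's string.Template. This assumes that the "python string template format" refers to string.Template, which the $uuid placeholder in the example suggests. The helper build_url and the uuid value are placeholders, not part of the library.

```python
from string import Template

# Template from the example configuration.
URL_TEMPLATE = Template("api.catalogue.ceda.ac.uk/export/xml/$uuid.xml")


def build_url(template: Template, **kwargs: str) -> str:
    """Fill the template with kwargs (hypothetical helper)."""
    # substitute() raises KeyError if a placeholder has no matching kwarg.
    return template.substitute(**kwargs)


# Placeholder value standing in for what a pre-processor might inject.
url = build_url(URL_TEMPLATE, uuid="EXAMPLE-UUID")
print(url)  # api.catalogue.ceda.ac.uk/export/xml/EXAMPLE-UUID.xml
```

Using substitute() rather than safe_substitute() makes a missing kwarg fail loudly instead of producing a URL that still contains the placeholder.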

XML Extract

class item_generator.extraction_methods.xml_extract.XMLExtract(**kwargs)

Processor Name: xml_extract

Accepts Pre-processors

Accepts Post-processors

Description:

Processes XML documents to extract metadata

Configuration Options:
  • extraction_keys: List of keys to retrieve from the document.

  • filter_expr: Regex to match against files to limit the attempts to known files

  • namespaces: Map of namespaces

  • pre_processors: List of pre-processors to apply

  • post_processors: List of post-processors to apply

  • output_key: When the metadata is returned, this key determines where the metadata is placed in the response. Dot-separated strings can be used to create nested attributes. Default: 'properties'

Extraction Keys:

Each extraction key should be a map with the following entries.

Name | Description
name | Name of the output attribute
key | Access key to extract the required data. Passed to xml.etree.ElementTree.find() and also supports XPath-formatted accessors
attribute | Selects an attribute of the matched element. If this value is absent, the default behaviour is to use the text value of the element.

Example:
- name: start_datetime
  key: './/gml:beginPosition'

Example configuration:
- name: xml_extract
  inputs:
    filter_expr: '\.manifest$'
    extraction_keys:
      - name: start_datetime
        key: './/gml:beginPosition'

run(filepath: str, source_media: str = 'POSIX', **kwargs) -> dict

Runs the processor and returns its output.

Parameters:
  • filepath: Path to the object
  • source_media: Media type of the target object
  • kwargs: Free keyword arguments passed to the processor

Returns: dict
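How extraction keys, namespaces and the optional attribute entry might interact can be sketched with xml.etree.ElementTree. The XML document, the namespace URI, the frame attribute and the helper extract are all made up for illustration; only ElementTree.find() and the shape of the extraction keys come from the description above.

```python
import xml.etree.ElementTree as ET

# Made-up document and namespace URI, for illustration only.
DOCUMENT = """
<root xmlns:gml="http://example.org/gml">
  <gml:TimePeriod>
    <gml:beginPosition frame="UTC">2021-03-29T00:00:00</gml:beginPosition>
  </gml:TimePeriod>
</root>
""".strip()

NAMESPACES = {"gml": "http://example.org/gml"}


def extract(root: ET.Element, extraction_keys: list, namespaces: dict) -> dict:
    """Apply each extraction key to the document (hypothetical helper)."""
    output = {}
    for entry in extraction_keys:
        element = root.find(entry["key"], namespaces)
        if element is None:
            continue  # Skip keys that do not match anything.
        attribute = entry.get("attribute")
        # Default to the element text; use the attribute when one is named.
        value = element.get(attribute) if attribute else element.text
        if value is not None:
            output[entry["name"]] = value
    return output


keys = [
    {"name": "start_datetime", "key": ".//gml:beginPosition"},
    {"name": "time_frame", "key": ".//gml:beginPosition", "attribute": "frame"},
]
result = extract(ET.fromstring(DOCUMENT), keys, NAMESPACES)
print(result)
# {'start_datetime': '2021-03-29T00:00:00', 'time_frame': 'UTC'}
```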

Third-Party Processors

The plugin architecture lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.

Processor Name | Description | Vendor

Pre Processors

Pre-processors operate on the input arguments for the main processor. They can be used to manipulate the input arguments for a given processor to modify its behaviour.

Processor Name | Description
filename_reducer | Takes a file path and returns the filename using os.path.basename.
ceda_observation | Takes a file path and returns the uuid from the CEDA Catalogue.

Filename Reducer

class item_generator.extraction_methods.preprocessors.ReducePathtoName(**kwargs)

Processor Name: filename_reducer

Description:

Takes a file path and returns the filename using os.path.basename.

Example Configuration:

pre_processors:
  - name: filename_reducer

CEDA Observation

class item_generator.extraction_methods.preprocessors.CEDAObservation(**kwargs)

Processor Name: ceda_observation

Description:

Takes a file path and returns the CEDA observation record.

Configuration Options:

Example Configuration:

pre_processors:
  - name: ceda_observation
    inputs:
      url_template: http://api.catalogue.ceda.ac.uk/api/v0/obs/get_info$filepath

Third-Party Pre-processors

The plugin architecture lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.

Processor Name | Description | Vendor

Post Processors

Post-processors operate on the output from a main processor. They use the same interface as a main processor, but they also accept the result of the previous step as part of the signature.

Processor Name | Description
facet_map | In some cases, you may wish to map the header attributes to different facets. This method takes a map and converts the facet labels into those specified.
isodate_processor | Takes the source dict and the key to access the date and converts the date to ISO 8601 format.
date_combinator | Automatically converts year, month, day, hour, minute, second keys into an ISO 8601 date.
stac_bbox | Converts coordinates from a dictionary into RFC 7946, section 5 formatted coordinates.
string_join | Joins facets together to create a new value.

Facet Map Processor

class item_generator.extraction_methods.postprocessors.FacetMapProcessor(**kwargs)

Processor Name: facet_map

Description:

In some cases, you may wish to map the header attributes to different facets. This method takes a map and converts the facet labels into those specified.

Configuration Options:
  • term_map: Dictionary of terms to map

Example Configuration:

post_processors:
    - name: facet_map
      inputs:
        term_map:
            time_coverage_start: start_time

ISO Date Processor

class item_generator.extraction_methods.postprocessors.ISODateProcessor(**kwargs)

Processor Name: isodate_processor

Description:

Takes the source dict and the key to access the date and converts the date to ISO 8601 Format.

e.g.
  • YYYY-MM-DDTHH:MM:SS.ffffff, if microsecond is not 0
  • YYYY-MM-DDTHH:MM:SS, if microsecond is 0

If the date format cannot be parsed, it is removed from the source dict with an error logged.

Configuration Options:
  • date_keys: REQUIRED List of keys to the date values. Using a list allows processing of multiple dates.

  • format: Optional format string. The default behaviour uses dateutil.parser.parse. If a format string is supplied, datetime.datetime.strptime is used instead.

Example Configuration:

post_processors:
    - name: isodate_processor
      inputs:
        date_keys:
          - key: date
        format: '%Y%m'

Date Combinator Processor

class item_generator.extraction_methods.postprocessors.DateCombinatorProcessor(**kwargs)

Processor Name: date_combinator

Description:

Used to automatically join date components to create an ISO 8601 date, e.g.:
  • year (required)
  • month
  • day
  • hour
  • minutes
  • seconds

Note

If you are only expecting to extract <year>-<month>, make sure to include a format string. dateutil.parser.parse will use the current day to fill the blank rather than 01, e.g. 2021-03 -> 2021-03-29T00:00:00. Using format: %Y-%m will result in 2021-03 -> 2021-03-01T00:00:00.

Configuration Options:
  • destructive: Whether the keys are removed from the output when combined. DEFAULT: true

  • output_key: Name of the key you would like to output. DEFAULT: datetime

  • format: Format string used to parse the date into an ISO date. The date template is ${year}-${month}-${day}T${hour}:${minute}:${second}. The format string is passed to datetime.datetime.strptime.

Example Configuration:

post_processors:
    - name: date_combinator
      inputs:
        destructive: true
        format: '%Y-%m'
        output_key: datetime

STAC BBOX Processor

class item_generator.extraction_methods.postprocessors.BBOXProcessor(**kwargs)

Processor Name: stac_bbox

Description:

Accepts a dictionary of coordinate values and converts to RFC 7946, section 5 formatted bbox.

Configuration Options:
  • key_list: REQUIRED list of keys to convert to bbox array. Ordering is respected.

Example Configuration:

post_processors:
    - name: stac_bbox
      inputs:
        key_list:
           - west
           - south
           - east
           - north
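The conversion can be sketched as reading the listed keys in order into an array. The helper to_bbox and the coordinate values are made up, and casting the values to float is an assumption. With the key order shown in the example configuration, the result follows the west, south, east, north ordering that RFC 7946 uses for a 2D bounding box.

```python
def to_bbox(source: dict, key_list: list) -> list:
    """Collect coordinate values in the order given by key_list (hypothetical helper)."""
    # Assumption: values may arrive as strings, so cast them to float.
    return [float(source[key]) for key in key_list]


# Made-up coordinate values.
coordinates = {"north": "61.0", "south": "49.0", "east": "2.0", "west": "-10.5"}
bbox = to_bbox(coordinates, ["west", "south", "east", "north"])
print(bbox)  # [-10.5, 49.0, 2.0, 61.0]
```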

String Join Processor

class item_generator.extraction_methods.postprocessors.StringJoinProcessor(**kwargs)

Processor Name: string_join

Description:

Accepts a dictionary. The string values for the keys in key_list are popped from the dictionary, joined with the delimiter, and put back into the dictionary under the specified output_key.

Configuration Options:
  • key_list: REQUIRED list of keys to join. Ordering is respected.

  • delimiter: REQUIRED text delimiter to put between strings

  • output_key: REQUIRED name of the key you would like to output

Example Configuration:

post_processors:
    - name: string_join
      inputs:
        key_list:
           - year
           - month
           - day
        delimiter: "-"
        output_key: datetime
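The pop-and-join behaviour can be sketched as follows. The helper join_strings and the sample values are hypothetical. Skipping keys that are missing from the input is an assumption.

```python
def join_strings(source: dict, key_list: list, delimiter: str, output_key: str) -> dict:
    """Pop the listed keys and store their joined values (hypothetical helper)."""
    output = dict(source)
    # Pop values in key_list order; missing keys are skipped (an assumption).
    parts = [str(output.pop(key)) for key in key_list if key in output]
    output[output_key] = delimiter.join(parts)
    return output


# Made-up facets, joined as in the example configuration.
facets = {"year": "2021", "month": "03", "day": "29", "platform": "sat-a"}
joined = join_strings(facets, ["year", "month", "day"], "-", "datetime")
print(joined)  # {'platform': 'sat-a', 'datetime': '2021-03-29'}
```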

Third-Party Post-processors

The plugin architecture lends itself to third-party plugins. If you develop a plugin which might be useful for others’ workflows, please make a pull request to add it to this table.

Processor Name | Description | Vendor