Orientation

The STAC Generator is a framework for generating STAC catalogs, built to be modular and extensible. That flexibility can be confusing at first, so this guide acts as an orientation: it explains what the package can do and how the pieces fit together.

There are various pluggable pieces:

Inputs
Outputs
Extraction Methods
Mappings

These pieces let you construct a workflow that fits your use case, and the package provides Python entry points so that you can write your own plugins. The STAC Generator package includes some inputs, which read messages from a range of different sources describing the STAC objects to generate. The item and collection generators take these messages and extract the required facets to build the relevant STAC object using a variety of extraction methods. The generated objects can then be passed to a range of outputs.
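
As a rough illustration of how entry-point plugins are discovered in Python in general, the sketch below lists the entry points registered under one group using the standard library. The group name `example.extraction_methods` is a hypothetical placeholder, not a name taken from the STAC Generator package; check the package's own packaging metadata for the real group names.

```python
from importlib.metadata import entry_points


def discover_plugins(group):
    """Return a mapping of plugin name -> entry point for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


# Hypothetical group name, used for illustration only
plugins = discover_plugins("example.extraction_methods")
for name, ep in plugins.items():
    print(name, "->", ep.value)
```

Calling `ep.load()` on one of the returned entry points imports and returns the object it points to, which is how a plugin named in configuration could be turned into a class or function.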

The generators have two levels of configuration. The first is global configuration, passed at the command line on invocation, which defines the inputs, outputs, logging and so on.

An example can be found here.

The second level of configuration comes in the form of recipes. These YAML files describe the workflow for extracting facets and other metadata to build the items and collections of the STAC Catalog. Background for recipes can be found here, and a guide to building and testing these files is here.

The available extraction methods, from which these workflows are constructed, are found here.

The CEDA repository containing these recipes can be used as an example. One example, which includes extracting metadata from the NetCDF header, is sentinel5. The recipe below is an item recipe for Sentinel 2 ARD data:

paths:
  - /neodc/sentinel_ard/data/sentinel_2

type: item

# This will be run over the meta files, example: neodc/sentinel_ard/data/sentinel_2/2018/07/05/S2A_20180705_lat57lon375_T30VVJ_ORB123_utm30n_osgb_vmsk_sharp_rad_srefdem_stdsref_meta.xml
id:
  # Use full path minus the extension for ID
  - method: default
    inputs:
      defaults:
        item_id: $instance_id

extraction_methods:
  # Extract information from the meta file
  - method: xml
    inputs:
      extraction_keys:
        - name: east
          key: .//gmd:eastBoundLongitude/gco:Decimal
        - name: west
          key: .//gmd:westBoundLongitude/gco:Decimal
        - name: north
          key: .//gmd:northBoundLatitude/gco:Decimal
        - name: south
          key: .//gmd:southBoundLatitude/gco:Decimal
        - name: start_datetime
          key: .//gml:beginPosition
        - name: end_datetime
          key: .//gml:beginPosition
        - name: supInfo
          key: .//gmd:supplementalInformation/gco:CharacterString
        - name: EPSG
          key: .//gmd:referenceSystemInfo/gmd:MD_ReferenceSystem/gmd:referenceSystemIdentifier/gmd:RS_Identifier/gmd:code/gco:CharacterString
      namespaces:
        gmd: http://www.isotc211.org/2005/gmd
        gml: http://www.opengis.net/gml
        gco: http://www.isotc211.org/2005/gco

  # Extract the variables from the supInfo field
  - method: regex
    inputs:
      regex: 'ESA file name: (?P<esa_file_name>.*)'
      input_term: supInfo

  - method: regex
    inputs:
      regex: 'Mean_Sun_Angle_Zenith: (?P<Mean_Sun_Angle_Zenith>.*)'
      input_term: supInfo

  - method: regex
    inputs:
      regex: 'Mean_Sun_Angle_Azimuth: (?P<Mean_Sun_Angle_Azimuth>.*)'
      input_term: supInfo

  # Extract the manifest path info
  - method: regex
    inputs:
      regex: 'neodc\/sentinel_ard\/data\/sentinel_2\/(?P<year>\d{4})\/(?P<month>\d{2})\/(?P<day>\d{2})\/S2(?P<satellite>[abAB]{1}).*'
      input_term: uri

  - method: lambda
    inputs:
      function: 'lambda satellite: satellite.lower()'
      input_args:
        - $satellite
      output_key: satellite

  # Generate path to the manifest file
  - method: string_template
    inputs:
      template: '/neodc/sentinel2{satellite}/data/L1C_MSI/{year}/{month}/{day}/{esa_file_name}.manifest'
      output_key: manifest_file

  # Extract information from the manifest file
  - method: xml
    inputs:
      input_term: manifest_file
      extraction_keys:
        - name: Instrument Family Name
          key: .//safe:platform/safe:instrument/safe:familyName
        - name: Instrument Family Name Abbreviation
          key: .//safe:platform/safe:instrument/safe:familyName
          attribute: abbreviation
        - name: Platform Number
          key: .//safe:platform/safe:number
        - name: NSSDC Identifier
          key: .//safe:platform/safe:nssdcIdentifier
        - name: Start Relative Orbit Number
          key: .//safe:orbitReference/safe:relativeOrbitNumber
        - name: Start Orbit Number
          key: .//safe:orbitReference/safe:orbitNumber
        - name: Ground Tracking Direction
          key: .//safe:orbitReference/safe:orbitNumber
          attribute: groundTrackDirection
        - name: Instrument Mode
          key: .//safe:platform/safe:instrument/safe:mode
        - name: Coordinates
          key: .//safe:frameSet/safe:footPrint/gml:coordinates
      namespaces:
        safe: http://www.esa.int/safe/sentinel/1.1
        gml: http://www.opengis.net/gml

  - method: regex
    inputs:
      regex: '(?P<path_root>.+?)_vmsk_sharp_rad_srefdem_stdsref_meta\.'

  - method: lambda
    inputs:
      function: 'lambda coords_string: [[float(i), float(k)]for i,k in zip(coords_string.strip().split()[1::2], coords_string.strip().split()[0::2])]'
      input_args:
        - $Coordinates
      output_key: coords

  - method: geometry_polygon
    inputs:
      coordinates_term: coords

  - method: geometry_to_bbox
    inputs:
      type: polygon

  - method: string_template
    inputs:
      template: '{esa_file_name}.SAFE/MTD_MSIL1C.xml'
      output_key: inner_file

  - method: string_template
    inputs:
      template: '/neodc/sentinel2{satellite}/data/L1C_MSI/{year}/{month}/{day}/{esa_file_name}.zip'
      output_key: zip_file

  - method: open_zip
    inputs:
      zip_file: $zip_file
      inner_file: $inner_file
      output_key: esa_product

  - method: xml
    inputs:
      input_term: esa_product
      extraction_keys:
        - name: Cloud Coverage Assessment
          key: .//psd-14:Quality_Indicators_Info/Cloud_Coverage_Assessment
        - name: Product Type
          key: .//psd-14:General_Info/Product_Info/PRODUCT_TYPE
        - name: Datatake Type
          key: .//psd-14:General_Info/Product_Info/Datatake/DATATAKE_TYPE
      namespaces:
        psd-14: https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd


  - method: string_template
    inputs:
      template: '{path_root}.*.tif'
      output_key: data_regex

  - method: string_template
    inputs:
      template: '{path_root}.*_thumbnail.jpg'
      output_key: thumbnail_regex

  - method: string_template
    inputs:
      template: '{path_root}.*_meta.xml'
      output_key: metadata_regex

  - method: elasticsearch_assets
    inputs:
      search_field: path
      regex_term: data_regex
      fields:
        - name: size
        - name: location
      extraction_methods:
        - method: default
          inputs:
            defaults:
              roles: ["data"]

  - method: elasticsearch_assets
    inputs:
      search_field: path
      regex_term: thumbnail_regex
      fields:
        - name: size
        - name: location
      extraction_methods:
        - method: default
          inputs:
            defaults:
              roles: ["thumbnail"]

  - method: elasticsearch_assets
    inputs:
      search_field: path
      regex_term: metadata_regex
      fields:
        - name: size
        - name: location
      extraction_methods:
        - method: default
          inputs:
            defaults:
              roles: ["metadata"]

  - method: rename_assets
    inputs:
      rename:
        - name: cog
          regex: '.*_stdsref.tif'
        - name: cloud
          regex: '.*_clouds.tif'
        - name: cloud_probability
          regex: '.*_clouds_prob.tif'
        - name: topographic_shadow
          regex: '.*_toposhad.tif'
        - name: metadata
          regex: '.*_meta.xml'
        - name: thumbnail
          regex: '.*_thumbnail.jpg'
        - name: saturated_pixels
          regex: '.*_sat.tif'
        - name: valid_pixels
          regex: '.*_valid.tif'
      output_key: data_regex

  - method: lambda
    inputs:
      function: 'lambda assets: {asset_key: asset_value | {"href": "https://dap.ceda.ac.uk" + asset_value["href"]} for asset_key, asset_value in sorted(assets.items())}'
      input_args:
        - $assets
      output_key: assets

  - method: lambda
    inputs:
      function: 'lambda path_root: path_root.replace("/badc/sentinel1b/data", "").replace("/badc/sentinel1a/data", "").strip("/").replace("/", ".")'
      input_args:
        - $path_root
      output_key: instance_id

  - method: iso_date
    inputs:
      date_keys:
        - start_datetime
        - end_datetime
      formats:
        - '%Y-%m-%dT%H%M%SZ'

  - method: datetime_bound_to_centroid

  # Clean up unneeded terms
  - method: remove
    inputs:
      keys:
        - supInfo
        - year
        - month
        - day
        - manifest_file
        - west
        - south
        - east
        - north
        - path_root
        - data_regex
        - thumbnail_regex
        - metadata_regex
        - Coordinates
        - coords
        - satellite
        - zip_file
        - inner_file
        - esa_product
        - uri

member_of:
  - recipes/collection/sentinel2_ARD.yaml
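
Two of the lambda steps in the recipe above are dense, so here they are run on their own in plain Python. The lambda bodies are copied from the recipe; the footprint string, asset paths and variable names are hypothetical sample values, and the sample assumes each whitespace-separated pair in the footprint is latitude followed by longitude. The `|` dictionary merge needs Python 3.9 or later.

```python
# Coordinate lambda copied from the recipe: swaps the order within each pair
to_coords = lambda coords_string: [[float(i), float(k)]for i,k in zip(coords_string.strip().split()[1::2], coords_string.strip().split()[0::2])]

# Hypothetical footprint string (assumed "lat lon lat lon ...")
sample = "57.0 -3.5 57.0 -2.0 58.0 -2.0 58.0 -3.5 57.0 -3.5"
coords = to_coords(sample)
print(coords[0])  # [-3.5, 57.0]

# Asset lambda copied from the recipe: prefixes each href with the host
add_host = lambda assets: {asset_key: asset_value | {"href": "https://dap.ceda.ac.uk" + asset_value["href"]} for asset_key, asset_value in sorted(assets.items())}

# Hypothetical assets, shaped only as far as the lambda requires
assets = {
    "thumbnail": {"href": "/neodc/example/a_thumbnail.jpg", "roles": ["thumbnail"]},
    "cog": {"href": "/neodc/example/a_stdsref.tif", "roles": ["data"]},
}
updated = add_host(assets)
print(list(updated))           # ['cog', 'thumbnail']
print(updated["cog"]["href"])  # https://dap.ceda.ac.uk/neodc/example/a_stdsref.tif
```

The second lambda also sorts the assets by key and leaves every other field, such as `roles`, untouched.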

The “extraction_methods” section defines the workflow. In this example, the xml extraction method first extracts some facets from a metadata file; this information is then manipulated by several other extraction methods, including one that retrieves a list of assets from CEDA’s Elasticsearch index.
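
The idea of a workflow as a chain of steps, each reading and adding to a shared set of facets, can be sketched in a few lines of Python. This is a conceptual illustration only: the helper functions below are hypothetical and are not the package's API. The two patterns and the template are copied from the recipe, the `supInfo` value is made up, and the use of `re.search` (rather than `re.match`) is an assumption.

```python
import re

# Patterns and template copied from the recipe
PATH_REGEX = r'neodc\/sentinel_ard\/data\/sentinel_2\/(?P<year>\d{4})\/(?P<month>\d{2})\/(?P<day>\d{2})\/S2(?P<satellite>[abAB]{1}).*'
ESA_REGEX = r'ESA file name: (?P<esa_file_name>.*)'
MANIFEST_TEMPLATE = '/neodc/sentinel2{satellite}/data/L1C_MSI/{year}/{month}/{day}/{esa_file_name}.manifest'


def regex_step(pattern, input_term):
    """Hypothetical step: add the named groups of a match to the facets."""
    def step(facets):
        match = re.search(pattern, facets[input_term])
        if match:
            facets.update(match.groupdict())
        return facets
    return step


def lower_satellite(facets):
    """Hypothetical step mirroring the recipe's satellite.lower() lambda."""
    facets["satellite"] = facets["satellite"].lower()
    return facets


def template_step(template, output_key):
    """Hypothetical step: fill a template from the facets gathered so far."""
    def step(facets):
        facets[output_key] = template.format(**facets)
        return facets
    return step


def run(facets, steps):
    """Apply each step in order, passing the facets along the chain."""
    for step in steps:
        facets = step(facets)
    return facets


facets = {
    "uri": "/neodc/sentinel_ard/data/sentinel_2/2018/07/05/S2A_20180705_lat57lon375_T30VVJ_ORB123_utm30n_osgb_vmsk_sharp_rad_srefdem_stdsref_meta.xml",
    "supInfo": "ESA file name: EXAMPLE_PRODUCT_NAME",  # made-up value
}
result = run(facets, [
    regex_step(ESA_REGEX, "supInfo"),
    regex_step(PATH_REGEX, "uri"),
    lower_satellite,
    template_step(MANIFEST_TEMPLATE, "manifest_file"),
])
print(result["manifest_file"])
# /neodc/sentinel2a/data/L1C_MSI/2018/07/05/EXAMPLE_PRODUCT_NAME.manifest
```

Order matters in such a chain: the template step can only run once the earlier steps have produced every facet it refers to.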

The extraction methods can also be used for collection generation, but typically a collection will be an aggregation of its items.
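
As a minimal sketch of what aggregating items into a collection could look like, the function below merges item bounding boxes and date ranges into a single extent. The function name, the item dictionaries and their values are hypothetical and do not reflect how the package implements aggregation; the sketch also ignores bounding boxes that cross the antimeridian.

```python
from datetime import datetime


def aggregate_extent(items):
    """Hypothetical helper: combine item bboxes and date ranges."""
    bboxes = [item["bbox"] for item in items]
    starts = [datetime.fromisoformat(item["start_datetime"]) for item in items]
    ends = [datetime.fromisoformat(item["end_datetime"]) for item in items]
    return {
        # bbox order is [west, south, east, north]
        "bbox": [
            min(b[0] for b in bboxes),
            min(b[1] for b in bboxes),
            max(b[2] for b in bboxes),
            max(b[3] for b in bboxes),
        ],
        "start_datetime": min(starts).isoformat(),
        "end_datetime": max(ends).isoformat(),
    }


# Made-up items for illustration
items = [
    {"bbox": [-3.5, 57.0, -2.0, 58.0],
     "start_datetime": "2018-07-05T11:33:21",
     "end_datetime": "2018-07-05T11:33:21"},
    {"bbox": [-2.5, 56.0, -1.0, 57.5],
     "start_datetime": "2018-07-08T11:20:00",
     "end_datetime": "2018-07-08T11:20:00"},
]
extent = aggregate_extent(items)
print(extent["bbox"])  # [-3.5, 56.0, -1.0, 58.0]
```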