Building a workflow

Building an item-generation workflow consists of 4 main steps:
  1. Write an item-description file to describe the workflow

  2. Test the workflow on a subset of data

  3. Index that subset of data to check it works as expected

  4. Index the full dataset

Steps 1 and 2 will likely go round in a loop while you are developing the workflow file, with several iterations until you are happy with the output.

1. Write a workflow file (item-description)

A basic item-description consists of 3 sections:
  1. datasets

  2. collection

  3. facets

An example item-description can be found here.

The extraction methods describe how the facets are extracted from the data.

Files are aggregated into items based on the aggregation facets.

Warning

All files that you want to end up in the same item should have the same aggregation facets. If you extract your aggregation facets from the filename, and not all of the files you want to group together follow the same filename convention (e.g. metadata files), then those files will end up in separate items.
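For orientation, here is a rough sketch of the shape such a file can take, following the three sections listed above. The individual key names (aggregation_facets, the regex extraction method, the collection id) are illustrative assumptions and may differ in your version of the item-generator, so check the linked example for the exact schema.

datasets:
  - /badc/faam/data             # paths this description applies to
collection:
  id: faam                      # illustrative collection identifier
facets:
  aggregation_facets:           # assumed key: facets used to group files into items
    - platform
    - flight_number
  extraction_methods:           # assumed key: how facet values are pulled from the data
    - method: regex             # e.g. extract facets from the filename
      inputs:
        regex: '.*/(?P<platform>\w+)_(?P<flight_number>b\d+).*'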

To check your item-description works as expected, you will need to run it.

2. Running the item-description on a subset of data

To run your workflow, you will need to create a config file. This defines the input path and, while testing, sends the output to standard out.

Example configuration

item_descriptions:
  root_directory: /etc/item-generator/item_descriptions/descriptions
inputs:
  - name: file_system
    path: /badc/faam/data/2005/b069-jan-05
outputs:
  - name: standard_out
    namespace: assets
  - name: standard_out
    namespace: facets

You should choose a path containing a relatively small number of files, so that iteration is quick and you can make tweaks easily.

The item-generator outputs two things:
  1. The item content, with the result of the extraction_methods

  2. An ID for the asset. This is the item ID that is assigned to the asset so that the two can be linked.
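The exact structure of these outputs depends on the output plugin; purely for illustration (all field names and values below are assumptions), the two records might look something like:

# facets namespace: item content built from the extraction_methods
item_id: 3b65f1c2
properties:
  platform: faam
  flight_number: b069

# assets namespace: the asset record, carrying the same item ID so the two can be linked
path: /badc/faam/data/2005/b069-jan-05/core_data.nc
item_id: 3b65f1c2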

Note

If you wish to hide the asset output while testing, include only the output with the namespace facets; the asset output will then be ignored.
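For example, to do this with the configuration above, trim the outputs section to just the facets namespace:

outputs:
  - name: standard_out
    namespace: facets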

You can then run your workflow using:

asset_scanner <path_to_config_file>

usage: asset_scanner [-h] conf

Run the asset scanner as configured

positional arguments:
  conf        Path to a yaml configuration file

optional arguments:
  -h, --help  show this help message and exit

Note

It is likely that this will be an iterative process to make sure that the correct assets end up together and that all the facets are extracted as desired.

3. Indexing the data

Caution

Have you indexed the assets? Things may not work fully if the assets have not been indexed as well.

This step is as simple as changing your output plugin to point to the final destination. If you had ignored the asset output, make sure to add it back in at this stage.

Here is an example for the elasticsearch output, making use of additional kwargs:

- name: elasticsearch
  namespace: facets
  connection_kwargs:
    hosts: [host1]
    headers:
      x-api-key: <api_key>
    use_ssl: true
    verify_certs: false
    ssl_show_warn: false
  index:
    name: ceda-items-2021-06-09
- name: elasticsearch
  namespace: assets
  connection_kwargs:
    hosts: [host1]
    headers:
      x-api-key: <api_key>
    use_ssl: true
    verify_certs: false
    ssl_show_warn: false
  index:
    name: ceda-assets-2021-06-09

Once this works as expected, you can move on to indexing the full dataset.

4. Indexing the full dataset

This is done by increasing the scope of the input plugin. In the example above we used the path /badc/faam/data/2005/b069-jan-05. If our description file covers /badc/faam/data, we can now expand the input to that path, as shown below.
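Continuing the example, only the inputs section of the configuration needs to change:

inputs:
  - name: file_system
    path: /badc/faam/data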

Note

The higher up the tree you put the input, the longer it will take. You might wish to consider splitting the run into smaller segments and running them in parallel.
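One way to do this is to run several copies of the configuration in parallel, each pointing at a different sub-directory. The year-based split below is illustrative and assumes the archive is organised by year:

# e.g. one config per year, run alongside similar configs for the other years
inputs:
  - name: file_system
    path: /badc/faam/data/2005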