Building a workflow
Building an item-generation workflow consists of 4 main steps:
1. Write an item-description file to describe the workflow
2. Test the workflow on a subset of data
3. Index that subset of data to check it works as expected
4. Index the full dataset
Steps 1 and 2 will likely form a loop while you develop the workflow file, with several iterations until you are happy with the result.
1. Write a workflow file (item-description)
A basic item-description consists of 3 sections:
- datasets
- collection
- facets
An example item-description can be found here
The extraction methods section describes how the facets are extracted from the data.
Files are aggregated into items based on the aggregation facets: files that share the same values for these facets end up in the same item.
Warning
All files that you want to end up together should have the same aggregation facets. If you get your aggregation facets from the filename, and some of the files you want to group together do not follow the same filename convention (e.g. metadata files), those files will end up as independent items.
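The effect described in this warning can be sketched in a few lines of Python. This is an illustration only, not the item-generator's implementation: the regular expression, the facet names (`date`, `flight`) and the filenames are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical filename convention: <instrument>_<date>_<flight>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<instrument>[a-z-]+)_(?P<date>\d{8})_(?P<flight>b\d{3})\.\w+$"
)

# Hypothetical choice of aggregation facets
AGGREGATION_FACETS = ("date", "flight")


def aggregation_key(filename):
    """Return the tuple of aggregation facet values, or None if extraction fails."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return tuple(match.group(facet) for facet in AGGREGATION_FACETS)


def group_files(filenames):
    """Group filenames by aggregation key; files with no key stay on their own."""
    groups = defaultdict(list)
    for name in filenames:
        key = aggregation_key(name)
        # A file whose facets cannot be extracted gets a key unique to itself
        groups[key if key is not None else ("unmatched", name)].append(name)
    return dict(groups)


files = [
    "core-faam_20050124_b069.nc",
    "video-faam_20050124_b069.mp4",
    "00README",  # metadata file that does not follow the convention
]
groups = group_files(files)
for key, members in groups.items():
    print(key, members)
```

The two data files share the same `date` and `flight` values and are grouped together, while the metadata file yields no facets and is left on its own. In a real item-description, such a file would need another extraction method that produces the same aggregation facets.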
To check your item-description works as expected, you will need to run it.
2. Running the item-description on a subset of data
To run your workflow, you will need to create a config file that defines an input path and sends the output to standard out.
Example configuration
    item_descriptions:
      root_directory: /etc/item-generator/item_descriptions/descriptions

    inputs:
      - name: file_system
        path: /badc/faam/data/2005/b069-jan-05

    outputs:
      - name: standard_out
        namespace: assets
      - name: standard_out
        namespace: facets
You should choose a filepath with a relatively small number of files to make iteration quick and allow you to make tweaks.
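One way to find a suitably small path is to count the files below each candidate directory before committing to it. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and the root path is simply the parent of the example input above.

```python
import os


def count_files(path):
    """Count the files below path, recursing into subdirectories."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(path):
        total += len(filenames)
    return total


def smallest_subdirectories(root, limit=5):
    """Return up to `limit` (file_count, path) pairs for the immediate
    subdirectories of root, smallest first."""
    counts = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                counts.append((count_files(entry.path), entry.path))
    return sorted(counts)[:limit]


if __name__ == "__main__":
    # Assumed location; adjust to the archive area you are working on
    root = "/badc/faam/data/2005"
    if os.path.isdir(root):
        for n, path in smallest_subdirectories(root):
            print(f"{n:8d}  {path}")
```

Any of the listed directories can then be used as the `path` of the `file_system` input while you iterate.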
The item-generator outputs two things:
- Item content, with the result of the extraction_methods
- An ID for the asset. This is the item ID to be assigned to the asset so that the two can be linked.
Note
If you wish to hide the asset output while testing, include only the output with the namespace facets. The asset output will then be ignored.
You can then run your workflow using:
    asset_scanner <path_to_config_file>

    usage: asset_scanner [-h] conf

    Run the asset scanner as configured

    positional arguments:
      conf        Path to a yaml configuration file

    optional arguments:
      -h, --help  show this help message and exit
Note
It is likely that this will be an iterative process to make sure that the correct assets end up together and that all the facets are extracted as desired.
3. Indexing the data
Caution
Have you indexed the assets? Things may not work fully if the assets have not been indexed as well.
This step is as simple as changing your output plugin to point to the final destination. If you ignored the asset output while testing, make sure to add it back in at this stage.
Here is an example for the Elasticsearch output, making use of additional kwargs:
    - name: elasticsearch
      namespace: facets
      connection_kwargs:
        hosts: [host1]
        headers:
          x-api-key: <api_key>
        use_ssl: true
        verify_certs: false
        ssl_show_warn: false
      index:
        name: ceda-items-2021-06-09
    - name: elasticsearch
      namespace: assets
      connection_kwargs:
        hosts: [host1]
        headers:
          x-api-key: <api_key>
        use_ssl: true
        verify_certs: false
        ssl_show_warn: false
      index:
        name: ceda-assets-2021-06-09
Once this works as expected…
4. Indexing the full dataset
This is done by increasing the scope of the input plugin.
In the example we used the path /badc/faam/data/2005/b069-jan-05. If our description file covered /badc/faam/data, we could now expand our input to cover /badc/faam/data.
Note
The higher up the directory tree you put the input, the longer the run will take. You might wish to consider splitting the run into smaller segments and running them in parallel.
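Splitting a run into segments could look something like the sketch below. It writes one config file per immediate subdirectory of the input root, then runs `asset_scanner <conf>` for several of them at a time. The helper names, the file naming and the worker count are assumptions made for illustration. The config template copies the example configuration with standard_out outputs, so for a real indexing run you would swap in your final output plugins.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Template mirroring the example configuration; only the input path changes.
CONFIG_TEMPLATE = """\
item_descriptions:
  root_directory: /etc/item-generator/item_descriptions/descriptions
inputs:
  - name: file_system
    path: {path}
outputs:
  - name: standard_out
    namespace: assets
  - name: standard_out
    namespace: facets
"""


def segment_paths(root):
    """Return the immediate subdirectories of root, sorted, as run segments."""
    with os.scandir(root) as entries:
        return sorted(e.path for e in entries if e.is_dir())


def write_configs(paths, config_dir):
    """Write one config file per segment path and return the file names."""
    config_files = []
    for i, path in enumerate(paths):
        config_file = os.path.join(config_dir, f"segment_{i:04d}.yaml")
        with open(config_file, "w") as handle:
            handle.write(CONFIG_TEMPLATE.format(path=path))
        config_files.append(config_file)
    return config_files


def run_segments(config_files, max_workers=4):
    """Run `asset_scanner <conf>` once per config, a few at a time."""
    def run(conf):
        return subprocess.run(["asset_scanner", conf], check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, config_files))


if __name__ == "__main__":
    root = "/badc/faam/data"
    if os.path.isdir(root):
        with tempfile.TemporaryDirectory() as config_dir:
            configs = write_configs(segment_paths(root), config_dir)
            print(run_segments(configs))
```

Note that files sitting directly in the root directory are not covered by any segment in this sketch, so they would need a separate run. On a batch cluster, the same per-segment config files could be submitted as individual jobs instead of being run through a thread pool.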