Deployment

The NLDS is deployed as a collection of containers on the JASMIN rancher cluster wigbiorg, a public-facing Kubernetes cluster hosted by JASMIN.

Due to the microservice architecture, the level at which to apply containerisation was immediately clear: each microservice sits in its own container. There are therefore nine different containers that make up the deployment of the NLDS, eight for the consumers and one additional container for the FastAPI server:

  1. FastAPI Server

  2. Worker (router)

  3. Indexer

  4. Catalog

  5. Transfer-Put

  6. Transfer-Get

  7. Logging

  8. Archive-Put

  9. Archive-Get

The FastAPI server is defined and deployed in the nlds-server-deploy repository on GitLab, and the latter eight are similarly defined and deployed in the nlds-consumers-deploy repository. All have subtly different configurations and dependencies, which can be gleaned in detail by looking at the configuration yaml files and helm charts in those repos, but are also, mercifully, described below.

Note

All of the following describes the deployment setup for the production environment. The setup for the staging/beta-testing environment is very similar but not quite the same, so the differences are summarised in the Staging Deployment section.

Images

The above containers do not all run on the same image, but are sub-divided into three specific roles:

  1. Generic Server: nlds/app

  2. Generic Consumer: nlds-consumers/consumer

  3. Tape Consumer: nlds-consumers/archiver

The string after each of these corresponds to the image’s location on CEDA’s Harbor registry (and therefore the tag/registry address to use to docker pull each of them). As may be obvious, the FastAPI server runs on the Generic Server image, which builds upon the ASGI base-image and contains an installation of an ASGI server to actually run the API. The rest run on the Generic Consumer image, which has an installation of the NLDS repo, along with its dependencies, to allow it to run a given consumer. The only dependency which isn’t included is xrootd, as its installation is very large and slow, and it is unnecessary for running the non-tape consumers. The Tape Consumer image was therefore created, which builds upon the Generic Consumer image with an additional installation of xrootd with which to run tape commands. The two tape consumers, Archive-Put and Archive-Get, run on containers using this image.

Containers running on the two consumer images run as the user NLDS, which is an official JASMIN user with uid=7054096 and is baked into the image (i.e. unconfigurable). Relatedly, every container runs with config associating the NLDS user with supplemental groups, the list of which constitutes every group workspace on JASMIN. The list was generated with the command:

ldapsearch -LLL -x -H ldap://homer.esc.rl.ac.uk -b "ou=ceda,ou=Groups,o=hpc,dc=rl,dc=ac,dc=uk"

This will need to be periodically rerun and the output reformatted to update the list of supplementalGroups in this config file.
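
As a rough sketch, the resulting pod-level config looks something like the following (the gids shown are placeholders; the real list and key layout live in the nlds-consumers-deploy config yamls):

securityContext:
    supplementalGroups:             # one gid per group workspace on JASMIN,
        - <gid-of-first-gws>        # regenerated from the ldapsearch output above
        - <gid-of-second-gws>
        # ... and so on for every group workspace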

Each of the containers will also have specific config and specific deployment setup to help the container perform its particular task.

Common Deployment Configurations

There are several common deployment configurations (CDC) required to perform tasks, which some, or all, of the containers make use of to function.

The most commonly used is the nslcd sidecar, which provides the containers with up-to-date uid and gid information from the LDAP servers. This directly uses the nslcd image developed for the notebook server, and runs as a sidecar in every deployed pod, periodically polling the LDAP servers to provide name and permission information to the main container in the pod (the consumer) so that file permissions can be handled properly. In other words, it ensures the passwd file on the consumer container is up to date, and therefore that the aforementioned supplementalGroups are properly respected.
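
As a hedged sketch (container names and image references are illustrative, and the exact wiring of shared volumes/sockets is defined in the helm charts), the sidecar appears in each pod spec roughly as:

containers:
    - name: consumer                                           # the main NLDS consumer container
      image: <harbor-registry>/nlds-consumers/consumer:<tag>
    - name: nslcd                                              # sidecar polling LDAP to keep uid/gid info fresh
      image: <harbor-registry>/<nslcd-image>:<tag>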

Another CDC used across all pods is the rabbit configuration, details of which can be found in Server config.

An additional CDC, used by the microservices which require reading from or writing to the JASMIN filesystem, is the set of filesystem mounts, which mount the group workspaces (in either read or write mode) onto the appropriate path (/gws or /group_workspaces). This is used by the following containers (see the sketch after the list):

  • Transfer-Put

  • Transfer-Get

  • Indexer
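
A minimal sketch of what those mounts look like in a consumer's pod spec (volume names and types are illustrative; the authoritative definitions are in the helm charts):

volumes:
    - name: group-workspaces
      hostPath:                         # illustrative volume type
          path: /gws
containers:
    - name: consumer
      volumeMounts:
          - name: group-workspaces
            mountPath: /gws
            readOnly: true              # read mode; set to false for consumers that need write access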

Note

It is the intention to eventually include several more directories in the mounting (/xfc, /home) but this is not currently possible with the version of Kubernetes installed on wigbiorg.

A further CDC is the PostgreSQL configuration, which is obviously required by the database-interacting consumers (Catalog and Monitor) and is, again, fully described in Server config. The production system uses the databases nlds_catalog and nlds_monitor on the Postgres server db5.ceda.ac.uk, hosted and maintained by CEDA. An additional part of this configuration is running any database migrations so that the database schema is kept up to date. This is discussed in more detail in the Migrations section.
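
As a loose illustration (the key names here are hypothetical; the authoritative settings are described in Server config and held in the deployment repos), the database part of the config amounts to pointing each consumer at the right database on db5:

# Hypothetical layout for illustration only
catalog_db:
    server: db5.ceda.ac.uk
    name: nlds_catalog
monitor_db:
    server: db5.ceda.ac.uk
    name: nlds_monitor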

There are some slightly more complex deployment configurations involved in the rest of the setup, which are described below.

API Server

The NLDS API server, as mentioned above, was written using FastAPI. In a local development environment this is served using uvicorn, but for the production deployment the base-image is used, which instead runs the server with gunicorn. They are functionally identical so this is not a problem per se, just something to be aware of. The NLDS API helm deployment is an extension of the standard FastAPI helm chart.

On production, this API server faces the public internet behind an NGINX reverse-proxy, handled by the standard nginx helm chart in the cedaci/helm-charts repo. It is served at the domain https://nlds.jasmin.ac.uk, with the standard NLDS API endpoints extending from that (such as /docs and /system/status). The NLDS API also has an additional endpoint (/probe/healthz) for the Kubernetes liveness probe to periodically ping to ensure the API is alive, and that the appropriate party is notified if it goes down. Please note, this is not a deployment-specific endpoint and will also exist on any local development instance.
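
In the helm chart this corresponds to a standard Kubernetes liveness probe, roughly as below (the port and timing values are illustrative, not the deployed settings):

livenessProbe:
    httpGet:
        path: /probe/healthz
        port: 8000                  # illustrative - whichever port the server listens on
    periodSeconds: 30               # illustrative polling interval
    failureThreshold: 3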

Tape Keys

The CERN Tape Archive (CTA) instance at STFC requires authentication to access the different tape pools and tape instances. This is done through Kerberos on the backend and requires the use of a forwardable keytab file with appropriate permissions. From the perspective of the NLDS this is actually quite simple: Scientific Computing (SCD) provide a string to put into a keytab (text) file, which describes the CTA user and authentication and must have unix octal permissions 600 (i.e. strictly user read-writable). Finally, two xrootd-specific environment variables must be set:

XrdSecPROTOCOL=sss
XrdSecSSSKT=path/to/keytab/file

The problem arises with the use of Kubernetes, wherein the keytab content string must be kept secret. This is handled in the CEDA gitlab deployment process through the use of git-crypt to encrypt the string (see here for more details) and Kubernetes secrets to decrypt it at deployment time. Unfortunately permissions can’t be set, or changed, on files made by Kubernetes secrets, so to get the keytab in the right place with the right permissions the deployment utilises an init-container to copy the secret key to a new file and then alter its permissions to 600.
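
A rough sketch of that arrangement (container names, images and mount paths are illustrative):

initContainers:
    - name: fix-keytab-perms
      image: <harbor-registry>/nlds-consumers/archiver:<tag>     # any image with cp/chmod would do
      command: ["sh", "-c", "cp /secret/keytab /keytab/keytab && chmod 600 /keytab/keytab"]
      volumeMounts:
          - name: keytab-secret         # the Kubernetes secret, decrypted via git-crypt at deploy time
            mountPath: /secret
          - name: keytab                # emptyDir shared with the archive consumer container
            mountPath: /keytab
containers:
    - name: archive-consumer
      env:
          - name: XrdSecPROTOCOL
            value: sss
          - name: XrdSecSSSKT
            value: /keytab/keytab
      volumeMounts:
          - name: keytab
            mountPath: /keytab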

Migrations

As described in Database Migrations with Alembic, the NLDS uses Alembic for database migrations. During the deployment these are run as an initial step, before any of the consumers are updated, so that nothing attempts to use the new schema before the database has been migrated. This is implemented through two mechanisms in the deployment:

  1. An init-container on the catalog, which has the config for both the catalog and monitoring DBs, has alembic installed, and calls:

    alembic upgrade head
    
  2. The catalog container deployment running first (alongside the logging container), before all the other container deployments.

This means that if the database migration fails for whatever reason, the whole deployment stops and the migration issue can be investigated through the logs.
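
A condensed sketch of mechanism 1 (names and mount paths are illustrative; the real definition is in the nlds-consumers-deploy helm chart):

initContainers:
    - name: db-migrate
      image: <harbor-registry>/nlds-consumers/consumer:<tag>     # has alembic installed
      command: ["alembic", "upgrade", "head"]                    # migrates both the catalog and monitoring DBs
      volumeMounts:
          - name: server-config          # provides the catalog and monitoring DB connection settings
            mountPath: <path-to-config>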

Logging with Fluentbit

The logging for the NLDS, as laid out in the specification, was originally designed to concentrate logs onto a single container for ease of perusal. Unfortunately, due to constraints of the Kubernetes version employed, the container has only limited, temporary storage capacity (the memory assigned from the cluster controller) and no means of attaching a more persistent volume in which to store logs long-term.

The relatively new solution that exists on the CEDA cluster is the use of fluentd, and more precisely fluentbit, to aggregate logs from the NLDS logging microservice and send them to a single external location running fluentd – currently the stats-collection virtual machine run on JASMIN. Each log sent to the fluentd service is tagged with a string representing the particular microservice log file it was collected from, e.g. the logs from the indexer microservice on the staging deployment are tagged as:

nlds_staging_index_q_log

This is practically achieved through the use of a sidecar – a further container running in the same pod as the logging container – running the fluentbit image as defined by the fluentbit helm chart. The full fluentbit config, including the full list of tags, can be found in the logging config yamls. When received by the fluentd server, each tagged log is collated into a larger log file for help with debugging at some later date. The log files on the logging microservice’s container are rotated according to size, and so should not exceed the pod’s allocated memory limit.
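
Schematically, and with a hypothetical values layout (the real config, paths and full tag list are in the logging config yamls), the sidecar tails the per-microservice log files and forwards them with their tags:

# Hypothetical layout for illustration only
fluentbit:
    inputs:
        - path: <path-to>/index_q.log       # one tailed input per microservice log file
          tag: nlds_staging_index_q_log
    output:
        host: <stats-collection-vm>         # the fluentd aggregator run on JASMIN
        port: 24224                         # assumed standard fluentd forward port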

Note

The fluentbit service is still in its infancy and subject to change at short notice as the system & helm chart get more widely adopted. For example, the length of time log files are kept on the stats machine has not been finalised yet.

While the above is true for long term log storage, the rancher interface for the Kubernetes cluster can still be used to check the output logs of each consumer in the standard way for quick diagnosis of problems with the NLDS.

Scaling

A core part of the design philosophy of the NLDS was its microservice architecture, which allows any of the microservices to be scaled out in an embarrassingly parallel way to meet changing demand. This is easily achieved in Kubernetes by simply spinning up additional containers for a given microservice using the replicaCount parameter. By default this value is 1, but it has been increased for certain microservices deemed to be bottlenecks during beta testing, notably the Transfer-Put microservice where it is set to 8 and the Transfer-Get where it is set to 2.
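
In the helm values this is just the standard per-consumer replicaCount override, along the lines of the following (the values file layout is illustrative; the figures are those quoted above):

# Illustrative layout - the real structure is in nlds-consumers-deploy
transfer-put:
    replicaCount: 8
transfer-get:
    replicaCount: 2
indexer:
    replicaCount: 1       # the default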

Note

While correct at time of writing, these values are subject to change – it may be that other microservices are found which require scaling and those above do not require as many replicas as currently allocated.

An ideal solution would be to automatically scale the deployments based on the size of the Rabbit queue for a given microservice. While this is possible in theory, it is not possible with the current installation of Kubernetes without additional plugins, namely Prometheus.

The other aspect of scaling is the resources requested by each of the pods, which currently have default values, with the exception of greater resources for the transfer processors. The values for these were arrived at by using the command:

kubectl top pod -n {NLDS_NAMESPACE}

within the kubectl shell on the appropriate rancher cluster (accessible via the shell button in the top right, or shortcut Ctrl + `). {NLDS_NAMESPACE} will need to be replaced with the appropriate namespace for the cluster you are on, i.e.:

kubectl top pod -n nlds                     # on wigbiorg
kubectl top pod -n nlds-consumers-master    # for consumers on staging cluster
kubectl top pod -n nlds-api-master          # for api-server on staging cluster

and, as before, these will likely need to be adjusted as understanding of the actual resource use of each of the microservices evolves.
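
The resulting settings take the usual Kubernetes form, e.g. (the numbers are purely illustrative placeholders rather than the deployed values, which were informed by the kubectl top observations above):

resources:
    requests:
        cpu: 250m
        memory: 512Mi
    limits:
        cpu: "1"
        memory: 1Gi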

Changing ownership of files

A unique problem arose in beta testing: the NLDS was not able to change ownership of the files downloaded during a get to the user that requested them, from within a container that was not allowed to run as root. As such, a solution was required which allowed a very specific set of privileges to be escalated without leaving any security vulnerabilities open.

The solution found was to include an additional binary in the Generic Consumer image - chown_nlds - which has the setuid permission bit set and is therefore able to change file ownership. To minimise the exposed attack surface, the binary was compiled from a rust script which allows only the chown-ing of files owned by the NLDS user (on JASMIN, uid=7054096). Additionally, the target must be a file or directory and the uid being changed to must be greater than 1024 to avoid clashes with system uids. This binary will only execute on containers where the appropriate security context is set, notably:

securityContext:
    allowPrivilegeEscalation: true
    capabilities:
        add:
            - CHOWN

which in the NLDS deployment helm chart is only set for the Transfer-Get containers/pods.

Archive Put Cronjob

The starting of the archive process has been automated for this deployment, running as a Kubernetes cronjob every 12 hours, at midnight and midday. The Helm config controlling this can be seen here. The cronjob simply calls the send_archive_next() entry point, which sends a message directly to the RabbitMQ exchange for routing to the Catalog.
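
A condensed sketch of that cronjob (assuming send_archive_next is exposed as a console script of the same name; names and image are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
    name: nlds-archive-put-cron
spec:
    schedule: "0 0,12 * * *"                # midnight and midday
    jobTemplate:
        spec:
            template:
                spec:
                    restartPolicy: OnFailure
                    containers:
                        - name: send-archive-next
                          image: <harbor-registry>/nlds-consumers/consumer:<tag>
                          command: ["send_archive_next"]    # messages the RabbitMQ exchange for routing to the Catalog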

Staging Deployment

As alluded to earlier, there are two versions of the NLDS running: (a) the production system on wigbiorg, and (b) the staging/beta-testing system on the staging cluster (ceda-k8s). These have similar but slightly different configurations, the details of which are summarised in the table below. Like everything on this page, this was true at the time of writing (2024-03-06).

Staging vs. Production Config

| System       | Staging                                                   | Production                                                        |
|--------------|-----------------------------------------------------------|-------------------------------------------------------------------|
| Tape         | Pre-production instance (antares-preprod-fac.stfc.ac.uk)  | Pre-production instance (antares-preprod-fac.stfc.ac.uk)          |
| Database     | on db5 - nlds_{db_name}_staging                           | on db5 - nlds_{db_name}                                           |
| Logging      | To fluentbit with tags nlds_staging_{service_name}_log    | To fluentbit with tags nlds_prod_{service_name}_log               |
| Object store | Uses the cedaproc-o tenancy                               | Uses the nlds-cache-02-o tenancy (nlds-cache-01-o also available) |
| API Server   | https://nlds-master.130.246.130.221.nip.io/ (firewalled)  | https://nlds.jasmin.ac.uk/ (public, SSL secured)                  |

Updating the deployment

Updating instructions can be found in the READMEs in the deployment repos, but essentially boil down to changing the git hash in the relevant Dockerfiles, i.e.

  1. Finalise and commit your changes to the nlds github repo.

  2. Take the hash from that commit (the first 8 characters are fine) and replace the value already at ARG GIT_VERSION= in the Dockerfile under /images.

  3. Commit the changes and let the CI pipeline do its magic.

If this doesn’t work then larger changes have likely been made that require changes to the helm chart.