Deployment
==========

The NLDS is deployed as a collection of containers on the JASMIN rancher
cluster ``wigbiorg``, a public-facing Kubernetes cluster hosted by JASMIN. Due
to the microservice architecture, the level at which to apply containerisation
was immediately clear: each microservice sits in its own container. There are
therefore nine different containers that make up the deployment of the NLDS,
eight for the consumers and one additional container for the FastAPI server:

1. FastAPI Server
2. Worker (router)
3. Indexer
4. Catalog
5. Transfer-Put
6. Transfer-Get
7. Logging
8. Archive-Put
9. Archive-Get

The FastAPI server is defined and deployed in the `nlds-server-deploy
<https://gitlab.ceda.ac.uk/cedadev/nlds-server-deploy>`_ repository on gitlab,
and the latter eight are similarly defined and deployed in the
`nlds-consumers-deploy
<https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy>`_ repository. All
have subtly different configurations and dependencies, which can be gleaned in
detail by looking at the configuration yaml files and helm charts in the
repos, but are also, mercifully, described below.

.. note::
    All of the following describes the deployment setup for the `production`
    environment. The setup for the staging/beta testing environment is very
    similar but not `quite` the same, so the differences are summarised in the
    :ref:`staging` section.

Images
------

The above containers do not all run on the same image, but are sub-divided
into three specific roles:

1. `Generic Server <https://gitlab.ceda.ac.uk/cedadev/nlds-server-deploy/-/tree/master/images/Dockerfile>`_: ``nlds/app``
2. `Generic Consumer <https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/tree/master/images/consumer/Dockerfile>`_: ``nlds-consumers/consumer``
3. `Tape Consumer <https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/tree/master/images/archiver/Dockerfile>`_: ``nlds-consumers/archiver``

The string after each of these is the image's location on CEDA's Harbor
registry (and therefore the tag/registry address to use to ``docker pull``
each of them). As may be obvious, the FastAPI server runs on the ``Generic
Server`` image, which contains an installation of ``asgi``, building upon the
``asgi`` `base-image
<https://gitlab.ceda.ac.uk/cedaci/base-images/-/tree/main/asgi>`_, to actually
run the server. The rest run on the ``Generic Consumer`` image, which has an
installation of the NLDS repo, along with its dependencies, to allow it to run
a given consumer. The only dependency not included is ``xrootd``, as it has a
very large and long installation process and is unnecessary for running the
non-tape consumers. The ``Tape Consumer`` image was therefore created, which
builds upon the ``Generic Consumer`` image with an additional installation of
``xrootd`` with which to run tape commands. The two tape consumers,
``Archive-Put`` and ``Archive-Get``, run on containers using this image.

The consumer containers run as the user NLDS, an official JASMIN user with
``uid=7054096``, which is baked into the container (i.e. unconfigurable).
Relatedly, every container runs with config associating the NLDS user with
supplemental groups, the list of which constitutes every group-workspace on
JASMIN.

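For orientation, the supplemental-groups part of that config takes roughly the
following shape (the gids shown are placeholders and the exact key layout in
the helm values may differ; the authoritative list lives in the ``common.yaml``
linked below)::

    # Illustrative sketch only -- see conf/common.yaml in nlds-consumers-deploy
    securityContext:
      supplementalGroups:
        - 123456    # placeholder gid for one group workspace
        - 123457    # ...and so on, one gid per group workspace on JASMIN
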
The list was generated with the command::

    ldapsearch -LLL -x -H ldap://homer.esc.rl.ac.uk -b "ou=ceda,ou=Groups,o=hpc,dc=rl,dc=ac,dc=uk"

This will need to be periodically rerun, and the output reformatted, to update
the list of ``supplementalGroups`` in `this config file
<https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/blob/master/conf/common.yaml?ref_type=heads#L14-515>`_.
Each of the containers also has specific config and deployment setup to help
it perform its particular task.

Common Deployment Configurations
--------------------------------

There are several common deployment configurations (CDCs) which some, or all,
of the containers make use of to function. The most commonly used is the
``nslcd`` pod, which provides the containers with up-to-date uid and gid
information from the LDAP servers. This directly uses the `nslcd
<https://gitlab.ceda.ac.uk/jasmin-notebooks/jasmin-notebooks/-/tree/master/images/nslcd>`_
image developed for the notebook server, and runs as a side-car in every
deployed pod, periodically polling the LDAP servers to provide name and
permission information to the main container in the pod (the consumer) so that
file permissions can be handled properly. In other words, it ensures the
``passwd`` file on the consumer container is up to date, and therefore that
the aforementioned ``supplementalGroups`` are properly respected.

Another CDC, used across all pods, is the rabbit configuration, details of
which can be found in :doc:`server-config/server-config`.

An additional CDC, used by the microservices which require reading from or
writing to the JASMIN filesystem, is the set of filesystem mounts, which mount
the group workspaces (in either read or write mode) onto the appropriate path
(``/gws`` or ``/group_workspaces``). This is used by the following containers:

* Transfer-Put
* Transfer-Get
* Indexer

.. note::
    It is the intention to eventually include several more directories in the
    mounting (``/xfc``, ``/home``) but this is not currently possible with the
    version of Kubernetes installed on wigbiorg.

A further CDC is the PostgreSQL configuration, which is required by the
database-interacting consumers (Catalog and Monitor) and, again, fully
described in :doc:`server-config/server-config`. The production system uses
the databases ``nlds_catalog`` and ``nlds_monitor`` on the Postgres server
``db5.ceda.ac.uk``, hosted and maintained by CEDA. An additional part of this
configuration is running any database migrations so that the database schema
is kept up to date; this is discussed in more detail in the
:ref:`db_migration` section.

There are some slightly more complex deployment configurations involved in the
rest of the setup, which are described below.

.. _api_server:

API Server
----------

The NLDS API server, as mentioned above, was written using FastAPI. In a local
development environment this is served using ``uvicorn``, but the production
deployment uses the ``asgi`` `base-image
<https://gitlab.ceda.ac.uk/cedaci/base-images/-/tree/main/asgi>`_, which runs
the server with ``gunicorn`` instead. They are functionally identical so this
is not a problem per se, just something to be aware of. The NLDS API helm
deployment is an extension of the standard `FastAPI helm chart
<https://gitlab.ceda.ac.uk/cedaci/base-images/-/tree/main/fast-api>`_.

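In practice the difference is only in how the ASGI application is launched.
Assuming the application object is exposed at ``nlds.main:app`` (a hypothetical
module path, so check the NLDS repo for the real one), the two invocations look
roughly like::

    # local development: uvicorn serving the FastAPI app directly
    uvicorn nlds.main:app --host 0.0.0.0 --port 8000

    # production-style launch, as typically done by an asgi base-image:
    # gunicorn managing a pool of uvicorn worker processes
    gunicorn nlds.main:app -k uvicorn.workers.UvicornWorker -w 4 -b 0.0.0.0:8000
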
On production, this API server sits facing the public internet behind an NGINX
reverse-proxy, handled by the standard `nginx helm chart
<https://gitlab.ceda.ac.uk/cedaci/helm-charts/-/tree/master/nginx>`_ in the
``cedaci/helm-charts`` repo. It is served at the domain
`https://nlds.jasmin.ac.uk <https://nlds.jasmin.ac.uk>`_, with the standard
NLDS API endpoints extending from that (such as ``/docs``,
``/system/status``). The NLDS API also has an additional endpoint
(``/probe/healthz``) for the Kubernetes liveness probe to periodically ping,
to ensure the API is alive and that the appropriate party is notified if it
goes down. Please note, this is not a deployment-specific endpoint and will
also exist on any local development instances.

.. _tape_keys:

Tape Keys
---------

The CERN Tape Archive (CTA) instance at STFC requires authentication to access
the different tape pools and tape instances. This is done through Kerberos on
the backend and requires the use of a forwardable keytab file with appropriate
permissions. From the perspective of the NLDS this is actually quite simple:
Scientific Computing (SCD) provide a string to put into a keytab (text) file,
which describes the CTA user and authentication and must have unix octal
permissions 600 (i.e. strictly user read-writable). Finally, two
xrootd-specific environment variables must be created::

    XrdSecPROTOCOL=sss
    XrdSecSSSKT=path/to/keytab/file

The problem arises with the use of Kubernetes, wherein the keytab content
string must be kept secret. This is handled in the CEDA gitlab deployment
process through the use of git-crypt (see `here
<https://gitlab.ceda.ac.uk/cedaci/ci-tools/-/blob/master/docs/setup-kubernetes-project.md#including-deployment-secrets-in-a-project>`__
for more details) to encrypt the keytab string, and Kubernetes secrets to
decrypt it at deployment time. Unfortunately, permissions can't be set, or
changed, on files made by Kubernetes secrets, so to get the keytab in the
right place with the right permissions the deployment utilises an
init-container to copy the secret key to a new file and then alter its
permissions to 600.

.. _db_migration:

Migrations
----------

As described in :doc:`development/alembic-migrations`, the NLDS uses Alembic
for database migrations. During the deployment these are done as an initial
step before any of the consumers are updated, so that nothing attempts to use
the new schema before the database has been migrated. This is implemented
through two mechanisms in the deployment:

1. An init-container on the catalog pod, which has the config for both the
   catalog and monitoring DBs as well as an installation of alembic, and
   calls::

       alembic upgrade head

2. The catalog container deployment running first (alongside the logging)
   before all the other container deployments.

This means that if the database migration fails for whatever reason, the whole
deployment stops and the migration issue can be investigated through the logs.

.. _logging:

Logging with Fluentbit
----------------------

The logging for the NLDS, as laid out in the specification, was originally
designed to concentrate logs onto a single container for ease of perusal.
Unfortunately, due to constraints of the Kubernetes version employed, the
container has only limited, temporary storage capacity (the memory assigned
from the cluster controller) and no means of attaching a more persistent
volume to store logs in long-term.

The relatively new solution that exists on the CEDA cluster is the use of
``fluentd``, and more precisely `fluentbit
<https://fluentbit.io/how-it-works/>`_, to aggregate logs from the NLDS
logging microservice and send them to a single external location running
``fluentd`` – currently the stats-collection virtual machine run on JASMIN.
Each log sent to the ``fluentd`` service is tagged with a string representing
the particular microservice log file it was collected from, e.g. the logs from
the indexer microservice on the staging deployment are tagged as::

    nlds_staging_index_q_log

This is practically achieved through the use of a sidecar – a further
container running in the same pod as the logging container – running the
``fluentbit`` image as defined by the `fluentbit helm chart
<https://gitlab.ceda.ac.uk/cedaci/helm-charts>`_. The full ``fluentbit``
config, including the full list of tags, can be found `in the logging config
yamls
<https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/tree/master/conf/logger>`_.
When received by the fluentd server, each tagged log is collated into a larger
log file to help with debugging at some later date. The log files on the
logging microservice's container are rotated according to size, and so should
not exceed the pod's allocated memory limit.

.. note::
    The ``fluentbit`` service is still in its infancy and subject to change at
    short notice as the system & helm chart get more widely adopted. For
    example, the length of time log files are kept on the stats machine has
    not been finalised yet.

While the above is true for long-term log storage, the rancher interface for
the Kubernetes cluster can still be used to check the output logs of each
consumer in the standard way for quick diagnosis of problems with the NLDS.

.. _scaling:

Scaling
-------

A core part of the design philosophy of the NLDS was its microservice
architecture, which allows any of the microservices to be scaled out in an
embarrassingly parallelisable way to meet changing demand. This is easily
achieved in Kubernetes by simply spinning up additional containers for a given
microservice using the ``replicaCount`` `parameter
<https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/blob/master/chart/values.yaml?ref_type=heads#L21>`_.
By default this value is 1, but it has been increased for certain
microservices deemed to be bottlenecks during beta testing, notably the
`Transfer-Put microservice
<https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/blob/master/conf/transfer_put/common.yaml?ref_type=heads#L17>`_,
where it is set to 8, and the Transfer-Get, where it is set to 2.

.. note::
    While correct at time of writing, these values are subject to change – it
    may be that other microservices are found which require scaling, and those
    above may not require as many replicas as currently allocated.

An ideal solution would be to automatically scale the deployments based on the
size of a ``Rabbit`` queue for a given microservice, and while this is `in
theory` `possible <https://ryanbaker.io/2019-10-07-scaling-rabbitmq-on-k8s/>`_,
it was not possible with the current installation of Kubernetes without
additional plugins, namely ``Prometheus``.

The other aspect of scaling is the resources requested by each of the pods,
which currently have `default values
<https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/blob/master/conf/common.yaml?ref_type=heads#L7>`_,
with an exception of greater resources for the transfer processors.

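For orientation, a pod resource stanza in the helm config takes roughly the
following form (the numbers here are placeholders rather than the values
actually deployed; the real defaults are in the linked ``common.yaml``)::

    # Illustrative values only -- see conf/common.yaml in nlds-consumers-deploy
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 1Gi
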
.. |sc| raw:: html

    <code class="code docutils literal notranslate">Ctrl + `</code>

The values for these were arrived at by running the following command within
the kubectl shell on the appropriate rancher cluster (accessible via the shell
button in the top right, or with the shortcut |sc|)::

    kubectl top pod -n {NLDS_NAMESPACE}

``{NLDS_NAMESPACE}`` will need to be replaced with the appropriate namespace
for the cluster you are on, i.e.::

    kubectl top pod -n nlds                    # on wigbiorg
    kubectl top pod -n nlds-consumers-master   # for consumers on staging cluster
    kubectl top pod -n nlds-api-master         # for api-server on staging cluster

and, as before, these will likely need to be adjusted as understanding of the
actual resource use of each of the microservices evolves.

.. _chowning:

Changing ownership of files
---------------------------

A unique problem arose in beta testing: the NLDS was not able to change
ownership of the files downloaded during a ``get`` to the user that requested
them, from within a container that was not allowed to run as root. As such, a
solution was required which allowed a very specific set of privileges to be
escalated without leaving any security vulnerabilities open. The solution
found was to include an additional binary, ``chown_nlds``, in the ``Generic
Consumer`` image, which has the ``setuid`` permission bit set and is therefore
able to change the ownership of files. To minimise the exposed attack surface,
the binary was compiled from a `rust script
<https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/blob/master/images/consumer/chown_nlds.rs?ref_type=heads>`_
which allows only the ``chown``-ing of files owned by the NLDS user (on JASMIN
``uid=7054096``). Additionally, the target must be a file or directory and the
``uid`` being changed to must be greater than 1024 to avoid clashes with
system ``uid``\ s. This binary will only execute on containers where the
appropriate security context is set, notably::

    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
          - CHOWN

which in the NLDS deployment helm chart is only set for the ``Transfer-Get``
containers/pods.

.. _archive_put:

Archive Put Cronjob
-------------------

The starting of the archive process has been automated for this deployment,
running as a `Kubernetes cronjob
<https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/>`_ every
12 hours, at midnight and midday. The Helm config controlling this can be seen
`here
<https://gitlab.ceda.ac.uk/cedadev/nlds-consumers-deploy/-/blob/master/conf/archive_put/common.yaml?ref_type=heads#L1-3>`_.
This cronjob simply calls the ``send_archive_next()`` entry point, which sends
a message directly to the RabbitMQ exchange for routing to the Catalog.

.. _staging:

Staging Deployment
------------------

As alluded to earlier, there are two versions of the NLDS running: (a) the
production system on wigbiorg, and (b) the staging/beta testing system on the
staging cluster (``ceda-k8s``). These have similar but slightly different
configurations, the details of which are summarised in the table below. Like
everything on this page, this was true at the time of writing (2024-03-06).

.. list-table:: Staging vs. Production Config
    :widths: 20 40 40
    :header-rows: 1

    * - System
      - Staging
      - Production
    * - Tape
      - Pre-production instance (``antares-preprod-fac.stfc.ac.uk``)
      - Pre-production instance (``antares-preprod-fac.stfc.ac.uk``)
    * - Database
      - ``nlds_{db_name}_staging`` on ``db5``
      - ``nlds_{db_name}`` on ``db5``
    * - Logging
      - To ``fluentbit`` with tags ``nlds_staging_{service_name}_log``
      - To ``fluentbit`` with tags ``nlds_prod_{service_name}_log``
    * - Object store
      - Uses the ``cedaproc-o`` tenancy
      - Uses the ``nlds-cache-02-o`` tenancy (``nlds-cache-01-o`` also available)
    * - API Server
      - `https://nlds-master.130.246.130.221.nip.io/ <https://nlds-master.130.246.130.221.nip.io/docs>`_ (firewalled)
      - `https://nlds.jasmin.ac.uk/ <https://nlds.jasmin.ac.uk/docs>`_ (public, ssl secured)

Updating the deployment
-----------------------

Updating instructions can be found in the READMEs of the deployment repos, but
essentially boil down to changing the git hash in the relevant Dockerfiles,
i.e.

1. Finalise and commit your changes to the `nlds github repo
   <https://github.com/cedadev/nlds>`_.
2. Take the hash from that commit (the first 8 characters are fine) and
   replace the value already at ``ARG GIT_VERSION=`` in the Dockerfile under
   ``/images`` (see the sketch below).
3. Commit the changes and let the CI pipeline do its magic.

If this doesn't work then larger changes have likely been made that require
changes to the helm chart.

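For illustration, steps 2 and 3 look something like the following (the path,
hash and commit message are purely illustrative)::

    # in nlds-consumers-deploy (or nlds-server-deploy)
    $ grep GIT_VERSION images/consumer/Dockerfile
    ARG GIT_VERSION=1a2b3c4d        # <- replace with the new 8-character hash
    $ git commit -am "Bump NLDS to 1a2b3c4d"
    $ git push                      # the CI pipeline rebuilds the image and redeploys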