Container Orchestration With Docker Swarm

NLP Cloud is a service I have contributed to recently. It is an API based on spaCy and HuggingFace transformers that offers Named Entity Recognition (NER), sentiment analysis, text classification, summarization, and much more. It uses several interesting technologies under the hood, so I thought I would write a series of articles about them. This first one is about container orchestration and how we implement it with Docker Swarm. I hope it will be useful!

Why Container Orchestration

NLP Cloud runs a large number of containers, mainly because each NLP model runs inside its own container. Pre-trained models have their own containers, and each user’s custom model has a dedicated container too. This is convenient for several reasons:

It is easy to run an NLP model on the server that has the best resources for it. Machine learning models are very resource hungry: they consume a lot of memory, and it is sometimes worth running them on a GPU (if you are using NLP transformers, for example). It is then best to deploy them onto a machine with specific hardware.
Horizontal scalability can be ensured by simply adding more replicas of the same NLP model
High availability is made easier thanks to redundancy and automatic failover
It helps lower costs: scaling horizontally on a myriad of small machines is much more cost effective than scaling vertically on a couple of big machines

Of course setting up such an architecture takes time and skills but in the end it often pays off when you’re building a complex application.

Why Docker Swarm

Docker Swarm is usually compared with Kubernetes, and Kubernetes is widely considered the winner of the container orchestration match today. But things are not so simple…

Kubernetes has tons of settings that make it perfect for very advanced use cases, but this versatility comes at a cost: Kubernetes is hard to install, configure, and maintain. It is so hard that today most companies using Kubernetes are actually using a managed version of Kubernetes, on GCP for example, and cloud providers don’t all have the same implementation of Kubernetes in their managed offering.

Let’s not forget that Google initially built Kubernetes for their internal needs, the same way that Facebook built React for their own needs. But you might not have to manage the same complexity for your project, and many projects could be delivered faster and maintained more easily by using simpler tools…

At NLP Cloud, we have a lot of containers but we do not need the complex advanced configuration capabilities of Kubernetes. We do not want to use a managed version of Kubernetes either: first for cost reasons, but also because we want to avoid vendor lock-in, and lastly for privacy reasons.

Docker Swarm also has an interesting advantage: it integrates seamlessly with Docker and Docker Compose. It makes configuration a breeze and for teams already used to working with Docker it creates no additional difficulty.

Install the Cluster

Let’s say we want to have 5 servers in our cluster:

1 manager node that will orchestrate the whole cluster. It will also host the database (just an example, the DB could just as well be on a worker).
1 worker node that will host our Python/Django backoffice
3 worker nodes that will host the replicated FastAPI Python API serving an NLP model

We are deliberately omitting the reverse proxy that will load balance requests to the right nodes, as it will be the topic of a future blog post.

Provision the Servers

Order 5 servers where you want. It can be OVH, Digital Ocean, AWS, GCP… doesn’t matter.

It’s important for you to think about the performance of each server depending on what it will be dedicated to. For example, for the node hosting a simple backoffice you might not need huge performance. For the node hosting the reverse proxy (not addressed in this tutorial) you might need more CPU than usual. And for the API nodes serving the NLP model you might want a lot of RAM, and maybe even GPU.

Install a Linux distribution on each server. I would go for the latest Ubuntu LTS version as far as I’m concerned.

On each server, install the Docker engine.

Now give each server a human-friendly hostname. It will be useful because the next time you ssh into the server you will see this hostname in your prompt, which is a good way to avoid working on the wrong server… It will also be used by Docker Swarm as the name of the node. Run the following on each server:

echo <node name> > /etc/hostname; hostname -F /etc/hostname

On the manager, log in to your Docker registry so Docker Swarm can pull your images (no need to do this on the worker nodes):

docker login

Initialize the Cluster and Attach Nodes

On the manager node, run:

docker swarm init --advertise-addr <server IP address>

--advertise-addr <server IP address> is only needed if your server has several IP addresses on the same interface so Docker knows which one to choose.

Then, in order to attach worker nodes, run the following on the manager:

docker swarm join-token worker

The output will be something like docker swarm join --token SWMTKN-1-5tl7ya98erd9qtasdfml4lqbosbhfqv3asdf4p13-dzw6ugasdfk0arn0 172.173.174.175:2377

Copy this command and run it on a worker node. Then run the same command on each of the other worker nodes (the worker token is the same for all of them).
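
With more than a couple of workers, the join step can be scripted from the manager. The sketch below is one possible way to do it, not the setup used at NLP Cloud. It assumes the workers are reachable over SSH as root, and the function names, the IP address, and the worker host names are placeholders of ours. It relies on docker swarm join-token -q worker, which prints only the token.

```shell
#!/bin/sh
# Sketch: fetch the worker token once on the manager, then join each worker.
# The manager IP and the worker host names are placeholders.

# Print the join command for a given token and manager address
# (2377 is the default Swarm cluster management port).
join_command() {
    token="$1"
    manager_ip="$2"
    printf 'docker swarm join --token %s %s:2377\n' "$token" "$manager_ip"
}

# Run the join command on every worker over SSH (assumes root SSH access).
join_workers() {
    manager_ip="$1"
    shift
    # -q prints only the token instead of the full join command.
    token="$(docker swarm join-token -q worker)" || return 1
    for worker in "$@"; do
        ssh "root@${worker}" "$(join_command "$token" "$manager_ip")"
    done
}

# Example, to be run on the manager:
# join_workers 172.173.174.175 backoffice-worker api-worker-1 api-worker-2 api-worker-3
```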

You should now be able to see all your nodes by running:

docker node ls

Give Labels to your Nodes

It’s important to label your nodes properly, as Docker Swarm will need these labels later to determine which node a container should be deployed on. If you do not specify which node you want your container to be deployed to, Docker Swarm will deploy it on any available node. This is clearly not what you want.

Let’s say that your backoffice requires few resources and is basically stateless, so it can be deployed to any cheap worker node. Your API is stateless too but, on the contrary, it is memory hungry and requires specific hardware dedicated to machine learning, so you want to deploy it only to a machine learning worker node. Last of all, your database is not stateless, so it always has to be deployed to the very same server: let’s say this server will be our manager node (but it could very well be a worker node too).

Do the below on the manager.

The manager will host the database so give it the “database” label:

docker node update --label-add type=database <manager name>

Give the “cheap” label to the low-performance worker that will host the backoffice:

docker node update --label-add type=cheap <backoffice worker name>

Last of all, give the “machine-learning” label to all the workers that will host NLP models:

docker node update --label-add type=machine-learning <api worker 1 name>
docker node update --label-add type=machine-learning <api worker 2 name>
docker node update --label-add type=machine-learning <api worker 3 name>
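
When many nodes share the same label, the repeated commands above can be wrapped in a small loop. This is only a sketch: label_nodes is a helper name of ours and the node names are placeholders.

```shell
#!/bin/sh
# Sketch: apply the same "type" label to several nodes in one call.
# Node names are placeholders.
label_nodes() {
    label="$1"
    shift
    for node in "$@"; do
        docker node update --label-add "type=${label}" "$node"
    done
}

# Example, to be run on the manager:
# label_nodes machine-learning api-worker-1 api-worker-2 api-worker-3
#
# To check the labels of one node afterwards:
# docker node inspect --format '{{ .Spec.Labels }}' api-worker-1
```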

Set Up Configuration With Docker Compose

If you have already used Docker Compose, you will most likely find the transition to Swarm fairly easy.

If you do not add anything to an existing docker-compose.yml file it will work with Docker Swarm but basically your containers will be deployed anywhere without your control, and they won’t be able to talk to each other.

Network

In order for containers to communicate, they should be on the same virtual network. For example a Python/Django application, a FastAPI API, and a PostgreSQL database should be on the same network to work together. We will manually create the main_network network later right before deploying, so let’s use it now in our docker-compose.yml:

version: "3.8"

networks:
  main_network:
    external: true

services:
  backoffice:
    image: <path to your custom Django image>
    depends_on:
      - database
    networks:
      - main_network
  api:
    image: <path to your custom FastAPI image>
    depends_on:
      - database
    networks:
      - main_network
  database:
    image: postgres:13
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=db_name
    volumes:
      - /local/path/to/postgresql/data:/var/lib/postgresql/data
    networks:
      - main_network

Deployment Details

Now you want to tell Docker Swarm which server each service will be deployed to. This is where you are going to use the labels that you created earlier.

This comes down to using the constraints directive, like this:

version: "3.8"

networks:
  main_network:
    external: true

services:
  backoffice:
    image: <path to your custom Django image>
    depends_on:
      - database
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == worker
          - node.labels.type == cheap
  api:
    image: <path to your custom FastAPI image>
    depends_on:
      - database
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == worker
          - node.labels.type == machine-learning
  database:
    image: postgres:13
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=db_name
    volumes:
      - /local/path/to/postgresql/data:/var/lib/postgresql/data
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == manager
          - node.labels.type == database

Resources Reservation and Limitation

It can be dangerous to ship your containers as is for 2 reasons:

The orchestrator might deploy them to a server that doesn’t have enough resources available (because other containers consume the whole memory available for example)
One of your containers might consume more resources than expected and eventually cause trouble for the host server. For example, if your machine learning model happens to consume too much memory, it can cause the host to trigger the OOM protection and start killing processes in order to free some RAM. By default the Docker engine is among the very last processes to be killed by the host, but if it does happen, all your containers on this host will shut down…

In order to mitigate the above, you can use the reservations and limits directives:

reservations makes sure a container is deployed only if the target server has enough resources available. If it doesn’t, the orchestrator won’t deploy it until the necessary resources are available.
limits prevents a container from consuming too many resources once it is deployed somewhere.

Let’s say we want our API container - embedding a machine learning model - to be deployed only if 5GB of RAM and half a CPU core are available. Let’s also say the API can consume up to 10GB of RAM and 80% of a CPU core. Here’s what we should do:

version: "3.8"

networks:
  main_network:
    external: true

services:
  api:
    image: <path to your custom FastAPI image>
    depends_on:
      - database
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == worker
          - node.labels.type == machine-learning
      resources:
        limits:
          cpus: '0.8'
          memory: 10G
        reservations:
          cpus: '0.5'
          memory: 5G

Replication

In order to implement horizontal scalability, you might want to replicate some of your stateless applications. You just need to use the replicas directive for this. For example let’s say we want our API to have 3 replicas, here’s how to do it:

version: "3.8"

networks:
  main_network:
    external: true

services:
  api:
    image: <path to your custom FastAPI image>
    depends_on:
      - database
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == worker
          - node.labels.type == machine-learning
      resources:
        limits:
          cpus: '0.8'
          memory: 10G
        reservations:
          cpus: '0.5'
          memory: 5G
      replicas: 3

More settings are available for finer control over your cluster orchestration. Don’t hesitate to refer to the docs for more details.
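
The replica count can also be changed on a running service without editing the file, using docker service scale. The sketch below wraps it in a helper; scale_service and the stack and service names are placeholders of ours. As far as we know, the next docker stack deploy applies the replicas value from the file again, so treat this as a temporary adjustment.

```shell
#!/bin/sh
# Sketch: change the replica count of one service in a running stack.
# Stack and service names are placeholders.
scale_service() {
    stack="$1"
    service="$2"
    replicas="$3"
    # Services deployed with "docker stack deploy" are named <stack>_<service>.
    docker service scale "${stack}_${service}=${replicas}"
}

# Example, to be run on the manager:
# scale_service mystack api 5
```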

Secrets

Docker Compose has a built-in convenient way to manage secrets by storing each secret into an external individual file. Thus these files are not part of your configuration and can even be encrypted if necessary, which is great for security.

Let’s say you want to secure the PostgreSQL DB credentials.

First create 3 secret files on your local machine:

Create a db_name.secret file and put the DB name in it
Create a db_user.secret file and put the DB user in it
Create a db_password.secret file and put the DB password in it
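
These files can be created from a shell. The sketch below is one way to do it, assuming a secrets folder next to the Compose file; write_secret is a helper name of ours and the values are placeholders. It restricts the file permissions and avoids the trailing newline that echo would add.

```shell
#!/bin/sh
# Sketch: write one secret per file, readable only by the current user.
# The directory and the values are placeholders.
write_secret() {
    dir="$1"
    name="$2"
    value="$3"
    mkdir -p "$dir"
    # umask 077 in a subshell: a newly created file gets mode 600 without
    # changing the umask of the calling shell.
    # printf '%s' writes the value without a trailing newline.
    ( umask 077 && printf '%s' "$value" > "${dir}/${name}.secret" )
}

# Example:
# write_secret ./secrets db_name db_name
# write_secret ./secrets db_user user
# write_secret ./secrets db_password password
```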

Then in your Docker Compose file you can use the secrets this way:

version: "3.8"

networks:
  main_network:
    external: true

secrets:
  db_name:
    file: "./secrets/db_name.secret"
  db_user:
    file: "./secrets/db_user.secret"
  db_password:
    file: "./secrets/db_password.secret"

services:
  database:
    image: postgres:13
    secrets:
      - "db_name"
      - "db_user"
      - "db_password"
    # Adding the _FILE suffix makes the Postgres image automatically
    # detect secrets and properly load them from files.
    environment:
      - POSTGRES_USER_FILE=/run/secrets/db_user
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
      - POSTGRES_DB_FILE=/run/secrets/db_name
    volumes:
      - /local/path/to/postgresql/data:/var/lib/postgresql/data
    deploy:
      placement:
        constraints:
          - node.role == manager
          - node.labels.type == database
    networks:
      - main_network

Secret files are automatically injected into the containers in /run/secrets by Docker Compose. Careful though: these secrets are located in files, not in environment variables. So you then need to manually open these files and read the secrets. The PostgreSQL image has a convenient feature: if you append the _FILE suffix to the environment variable, the image will automatically read the secrets from files.
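
For an image that has no such _FILE convention, the secret has to be read from its file, for example in the container entrypoint script. The sketch below shows one way to do it. read_secret and the SECRETS_DIR override are hypothetical names of ours; the override only exists to make the helper usable outside a container.

```shell
#!/bin/sh
# Sketch: read a secret from the directory where secrets are mounted.
# SECRETS_DIR is a hypothetical override, handy for testing outside a container.
read_secret() {
    name="$1"
    file="${SECRETS_DIR:-/run/secrets}/${name}"
    if [ ! -r "$file" ]; then
        echo "secret not found: ${name}" >&2
        return 1
    fi
    cat "$file"
}

# Example, in a container entrypoint script:
# DB_PASSWORD="$(read_secret db_password)" || exit 1
# export DB_PASSWORD
```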

Staging VS Production

You most likely want to have at least 2 different types of Docker Compose configurations:

1 for your local machine that will be used both to build the Docker images and as a staging environment
1 for production

You have 2 choices. The first is to leverage the Docker Compose inheritance feature: write one big docker-compose.yml base file, then add a small staging.yml file dedicated to staging and a small production.yml file dedicated to production. The second is to maintain 2 fully separate files, one per environment.
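
The inheritance approach could look like the sketch below. It assumes the Compose v2 plugin (docker compose); with the older standalone binary the command is docker-compose. The function names and the stack name are placeholders of ours.

```shell
#!/bin/sh
# Sketch: a shared base file plus one small override file per environment.
# File names follow the ones mentioned above; the stack name is a placeholder.

# Staging without Swarm: values in later -f files override earlier ones.
start_staging() {
    docker compose -f docker-compose.yml -f staging.yml up -d
}

# Production with Swarm: docker stack deploy also accepts several -c files.
deploy_production() {
    stack="$1"
    docker stack deploy --with-registry-auth \
        -c docker-compose.yml -c production.yml "$stack"
}

# Example:
# start_staging
# deploy_production mystack
```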

In the end, at NLP Cloud we realized that our staging and production configurations were so different that it was easier to maintain 2 separate big files: one for staging and one for production. The main reason is that our production environment uses Docker Swarm but our staging environment doesn’t, so sharing configuration between the two is pretty impractical.

Deploy

Now we assume that you have locally built your images and pushed them to your Docker registry. Let’s say we only have one single production.yml file for production.

Copy your production.yml file to the server using scp:

scp production.yml <server user>@<server IP>:/remote/path/to/project

Copy your secrets too (and make sure to upload them to the folder you declared in the secrets section of your Docker Compose file):

scp -r /local/path/to/secrets <server user>@<server IP>:/remote/path/to/secrets

Manually create the network that we’re using in our Docker Compose file. Please note it’s also possible to skip this step and let Docker Swarm create the network automatically if it’s declared in your Docker Compose file. But we noticed that this causes erratic behavior when recreating the stack, because Docker does not recreate the network fast enough.

docker network create --driver=overlay main_network
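
If the deployment steps are scripted and run more than once, creating a network that already exists returns an error. The sketch below creates the network only when it is missing; ensure_network is a helper name of ours.

```shell
#!/bin/sh
# Sketch: create the overlay network only if it does not exist yet,
# so the step can be rerun safely.
ensure_network() {
    network="$1"
    if docker network inspect "$network" >/dev/null 2>&1; then
        echo "network ${network} already exists"
    else
        docker network create --driver=overlay "$network"
    fi
}

# Example, to be run on the manager:
# ensure_network main_network
```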

You also need to create the volume directories manually. The only volume we have in this tutorial is for the database. So let’s create it on the node hosting the DB (i.e. the manager):

mkdir -p /local/path/to/postgresql/data

OK, everything is set, so now it is time to deploy the whole stack!

docker stack deploy --with-registry-auth -c production.yml <stack name>

The --with-registry-auth option is required if you pull images from password-protected registries.

Wait a moment as Docker Swarm is now pulling all the images and installing them on the nodes. Then check if everything went fine:

docker service ls

You should see something like the following:

ID             NAME                       MODE         REPLICAS   IMAGE
f1ze8qgf24c7   <stack name>_backoffice    replicated   1/1        <path to your custom Python/Django image>     
gxboram56dka   <stack name>_database      replicated   1/1        postgres:13      
3y1nmb6g2xoj   <stack name>_api           replicated   3/3        <path to your custom FastAPI image>      

The important thing is that REPLICAS should all be at their maximum. Otherwise it means that Docker is still pulling or installing your images, or that something went wrong.
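
This check can be automated by comparing the two numbers in the REPLICAS column. The sketch below is one way to do it; the function names, the number of attempts, and the 10 second delay are arbitrary choices of ours. It uses the --format option of docker service ls to print only the name and replica count, and it looks at every service in the cluster, not only those of one stack.

```shell
#!/bin/sh
# Sketch: succeed only when every service reports current == desired replicas.
# Reads lines of the form "<service name> <current>/<desired>" on stdin.
all_replicas_ready() {
    ready=0
    while read -r name replicas _rest; do
        [ -n "$name" ] || continue
        current="${replicas%%/*}"
        desired="${replicas##*/}"
        if [ "$current" != "$desired" ]; then
            echo "not ready: ${name} (${replicas})" >&2
            ready=1
        fi
    done
    return "$ready"
}

# Poll until all services have converged or the attempts run out.
wait_for_stack() {
    attempts="${1:-30}"
    while [ "$attempts" -gt 0 ]; do
        if docker service ls --format '{{.Name}} {{.Replicas}}' | all_replicas_ready; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 10
    done
    return 1
}

# Example, to be run on the manager after "docker stack deploy":
# wait_for_stack 30 && echo "all services are up"
```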

Manage the Cluster

Now that your cluster is up and running, here are a few useful things you might want to do to administer it:

See all applications and where they are deployed: docker stack ps <stack name>
See applications on a specific node: docker node ps <node name>
See logs of an application: docker service logs <stack name>_<service name>
Completely remove the whole stack: docker stack rm <stack name>

Every time you want to deploy a new image to the cluster, first upload it to your registry, then run the docker stack deploy command again on the manager.

Conclusion

As you can see, setting up a Docker Swarm cluster is far from complex, especially when you think about the actual complexity that has to be handled under the hood in such distributed systems.

Of course many more options are available and you will most likely want to read the documentation. Also, we did not talk about the reverse proxy/load balancing aspect, but it’s an important one. In a future tutorial we will see how to achieve this with Traefik.

At NLP Cloud our configuration is obviously much more complex than what we showed above, and we had to face several tricky challenges in order for our architecture to be both fast and easy to maintain. For example, we have so many machine learning containers that manually writing the configuration file for each container was not an option, so new auto generation mechanisms had to be implemented.

If you are interested in more in-depth details, please don’t hesitate to ask; it will be a pleasure to share.