Traefik Reverse Proxy with Docker Compose and Docker Swarm

Reading time ~8 minutes

My last article about Docker Swarm was the first of a series of articles I wanted to write about the stack behind NLP Cloud. NLP Cloud is an API based on spaCy and HuggingFace transformers that offers Named Entity Recognition (NER), sentiment analysis, text classification, summarization, and much more. One challenge is that each model runs inside its own container, and new models are added to the cluster on a regular basis, so we need a reverse proxy that is both efficient and flexible in front of all these containers.

The solution we chose is Traefik.

I thought it would be interesting to write an article about how we implemented Traefik and why we chose it over standard reverse proxies like Nginx.

Why Traefik

Traefik is still a relatively new reverse proxy solution compared to Nginx or Apache, but it has been gaining a lot of popularity. Traefik’s main advantage is that it seamlessly integrates with Docker, Docker Compose and Docker Swarm (and even Kubernetes and more): basically your whole Traefik configuration can live in your docker-compose.yml file, which is very handy, and whenever you add new services to your cluster, Traefik discovers them on the fly without requiring a restart.

So Traefik makes maintainability easier and is good from a high-availability standpoint.

Traefik is written in Go while Nginx is written in C, so I guess that makes a slight difference in performance, but nothing I could perceive, and in my opinion it is negligible compared to the advantages Traefik gives you.

Traefik does have a bit of a learning curve though, and even if its documentation is pretty good, it is still easy to make mistakes and hard to find where a problem comes from, so let me give you a couple of ready-to-use examples below.

Install Traefik

Basically you don’t have much to do here. Traefik is just another Docker image you’ll need to add to your cluster as a service in your docker-compose.yml:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4

There are several ways to integrate Traefik but, like I said above, we are going to go for the Docker Compose integration.

Basic Configuration

90% of Traefik’s configuration is done through Docker labels.

Let’s say we have 3 services: a corporate website, and two spaCy model APIs (en_core_web_sm and en_core_web_lg) served through FastAPI.

Here is a basic local staging configuration routing the requests to the correct services in your docker-compose.yml:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
    corporate:
        image: <your corporate image>
        labels:
            - traefik.http.routers.corporate.rule=Host(`localhost`)
    en_core_web_sm:
        image: <your en_core_web_sm model API image>
        labels:
            - traefik.http.routers.en_core_web_sm.rule=Host(`api.localhost`) && PathPrefix(`/en_core_web_sm`)
    en_core_web_lg:
        image: <your en_core_web_lg model API image>
        labels:
            - traefik.http.routers.en_core_web_lg.rule=Host(`api.localhost`) && PathPrefix(`/en_core_web_lg`)

You can now access your corporate website at http://localhost, your en_core_web_sm model at http://api.localhost/en_core_web_sm, and your en_core_web_lg model at http://api.localhost/en_core_web_lg.

As you can see it’s dead simple.

That was for our local staging only, so we now want to do the same for production in a Docker Swarm cluster:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
    en_core_web_sm:
        image: <your en_core_web_sm model API image>
        deploy:
            labels:
                - traefik.http.services.en_core_web_sm.loadbalancer.server.port=80
                - traefik.http.routers.en_core_web_sm.rule=Host(`api.nlpcloud.io`) && PathPrefix(`/en_core_web_sm`)
    en_core_web_lg:
        image: <your en_core_web_lg model API image>
        deploy:
            labels:
                - traefik.http.services.en_core_web_lg.loadbalancer.server.port=80
                - traefik.http.routers.en_core_web_lg.rule=Host(`api.nlpcloud.io`) && PathPrefix(`/en_core_web_lg`)

You can now access your corporate website at http://nlpcloud.io, your en_core_web_sm model at http://api.nlpcloud.io/en_core_web_sm, and your en_core_web_lg model at http://api.nlpcloud.io/en_core_web_lg.

It’s still fairly simple but the important things to notice are the following:

  • We must explicitly use the docker.swarmmode provider instead of docker
  • Labels should now be put in the deploy section
  • We need to manually declare the port of each service with the loadbalancer directive (Docker Swarm lacks port auto-discovery, so Traefik cannot infer it)
  • We have to make sure that Traefik will be deployed on a manager node of the Swarm by using constraints

You now have a fully fledged cluster thanks to Docker Swarm and Traefik. It’s likely that you have specific requirements, and no doubt the Traefik documentation will help. But let me show you a couple of features we use at NLP Cloud.

Forwarded Authentication

Let’s say your NLP API endpoints are protected and users need a token to reach them. A good solution for this use case is to leverage Traefik’s ForwardAuth.

Basically, Traefik will forward all user requests to a dedicated page you created for the occasion. This page will check the headers of the request (and maybe extract an authentication token, for example) and determine whether the user has the right to access the resource. If they do, the page should return an HTTP 2XX code.

If a 2XX code is returned, Traefik will then make the actual request to the final API endpoint. Otherwise, it will return an error.

Please note that, for performance reasons, Traefik only forwards the user request headers to your authentication page, not the request body. So it’s not possible to authorize a user request based on the body of the request.
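To make the flow concrete, here is a minimal Python sketch of the decision logic such an auth page could implement. The `Token` header scheme and the token store are hypothetical, made up for illustration only (our actual api_auth service works differently):

```python
# Hypothetical decision logic for a ForwardAuth endpoint.
# The "Token" scheme and the token store are illustrative only.

VALID_TOKENS = {"secret-token-1", "secret-token-2"}  # hypothetical token store

def check_auth(headers: dict) -> int:
    """Return the HTTP status code the auth page would answer with.

    Traefik only forwards the request headers, so the decision
    has to be based on them alone.
    """
    auth = headers.get("Authorization", "")
    if not auth.startswith("Token "):
        return 401  # missing or malformed credentials
    token = auth[len("Token "):]
    return 200 if token in VALID_TOKENS else 403
```

Any 2XX answer lets the request through to the final endpoint; any other status is returned to the client as-is.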

Here’s how to achieve it:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
    en_core_web_sm:
        image: <your en_core_web_sm model API image>
        deploy:
            labels:
                - traefik.http.services.en_core_web_sm.loadbalancer.server.port=80
                - traefik.http.routers.en_core_web_sm.rule=Host(`api.nlpcloud.io`) && PathPrefix(`/en_core_web_sm`)
                - traefik.http.middlewares.forward_auth_api_en_core_web_sm.forwardauth.address=https://api.nlpcloud.io/auth/
                - traefik.http.routers.en_core_web_sm.middlewares=forward_auth_api_en_core_web_sm
    api_auth:
        image: <your api_auth image>
        deploy:
            labels:
                - traefik.http.services.api_auth.loadbalancer.server.port=80
                - traefik.http.routers.api_auth.rule=Host(`api.nlpcloud.io`) && PathPrefix(`/auth`)

At NLP Cloud, the api_auth service is actually a Django + Django Rest Framework image in charge of authenticating the requests.

Custom Error Pages

Maybe you don’t want to show raw Traefik error pages to users. If so, it’s possible to replace error pages with your custom error pages.

Traefik does not keep any custom error page in memory; instead it fetches them from one of your services. When querying your service for the custom error page, Traefik substitutes the HTTP status code into the query path (the {status} placeholder below), so you can show a different page for each HTTP error.

Let’s say your custom error pages are served by a small static Nginx website, and we want to use them for HTTP errors 400 to 599. Here’s how you would do it:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
            labels:
                - traefik.http.middlewares.handle-http-error.errors.status=400-599
                - traefik.http.middlewares.handle-http-error.errors.service=errors_service
                - traefik.http.middlewares.handle-http-error.errors.query=/{status}.html
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
                - traefik.http.routers.corporate.middlewares=handle-http-error
    errors_service:
        image: <your static website image>
        deploy:
            labels:
                - traefik.http.services.errors_service.loadbalancer.server.port=80
                - traefik.http.routers.errors_service.rule=Host(`nlpcloud.io`) && PathPrefix(`/errors`)

With the configuration above, a 404 error would now use this page: http://nlpcloud.io/errors/404.html
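The static website behind errors_service simply needs one minimal HTML page per status code. As an illustration, here is a hypothetical Python script that could generate them, with file names matching the /{status}.html query of the errors middleware:

```python
import pathlib

def generate_error_pages(out_dir: str, start: int = 400, end: int = 599) -> int:
    """Write one minimal <status>.html page per HTTP status code.

    Returns the number of pages written. Illustrative only: real
    pages would carry your branding and a helpful message.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for status in range(start, end + 1):
        (out / f"{status}.html").write_text(
            f"<html><body><h1>Error {status}</h1></body></html>\n"
        )
    return end - start + 1
```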

HTTPS

A cool Traefik feature is that it can automatically provision and use TLS certificates with Let’s Encrypt.

They have a nice tutorial about how to set it up with Docker so I’m just pointing you to the right resource: https://doc.traefik.io/traefik/user-guides/docker-compose/acme-tls/
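For reference, here is a sketch of what it boils down to in a Swarm setup. The resolver name (le), the email, and the storage path are placeholders; refer to the guide above for the full details (HTTP to HTTPS redirection, challenge types, etc.):

```yaml
version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "443:443"
        command:
            - --providers.docker.swarmmode
            - --entrypoints.websecure.address=:443
            - --certificatesresolvers.le.acme.email=you@example.com
            - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
            - --certificatesresolvers.le.acme.tlschallenge=true
        volumes:
            - ./letsencrypt:/letsencrypt
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
                - traefik.http.routers.corporate.entrypoints=websecure
                - traefik.http.routers.corporate.tls.certresolver=le
```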

Raising Upload Size Limit

The default upload size limit is pretty low for performance reasons (I think it’s 4194304 bytes but I’m not 100% sure as it’s not in their docs).

In order to increase it, you need to use the maxRequestBodyBytes directive:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
                - traefik.http.middlewares.upload-limit.buffering.maxRequestBodyBytes=20000000
                - traefik.http.routers.corporate.middlewares=upload-limit

In the example above, we raised the upload limit to 20MB.

But don’t forget that uploading a huge file all at once is not necessarily the best option. Instead, you may want to split the file into chunks and upload each chunk independently. I might write an article about this in the future.
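The chunking itself is simple; here is a sketch of the client-side splitting (the 5 MB default is an arbitrary choice, not a Traefik constant):

```python
def iter_chunks(data: bytes, chunk_size: int = 5 * 1024 * 1024):
    """Yield successive chunks of at most chunk_size bytes.

    Each chunk can then be uploaded as its own request, staying
    well under the proxy's maxRequestBodyBytes limit.
    """
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
```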

Debugging

There are a couple of things you can enable to help you debug Traefik.

The first thing is to enable debug logging, which will show you tons of detail about what Traefik is doing.

The second thing is to enable access logs, in order to see all incoming HTTP requests.
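By default, Traefik writes access logs in Common Log Format (CLF). If you ever need to post-process them, a line can be parsed like this (a sketch that only covers the standard CLF fields; Traefik appends extra fields at the end of each line, which this regex ignores):

```python
import re

# Standard CLF fields: host, user, timestamp, request line, status, size.
CLF_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_access_log_line(line: str) -> dict:
    """Extract the CLF fields from one access log line."""
    match = CLF_RE.match(line)
    if match is None:
        raise ValueError(f"not a CLF line: {line!r}")
    return match.groupdict()
```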

Last of all, Traefik provides a cool built-in dashboard that helps debug your configuration. It is really useful as it is sometimes tricky to understand why things are not working.

In order to turn on the above features, you could do the following:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
            - --log.level=DEBUG
            - --accesslog
            - --api.dashboard=true
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
            labels:
                - traefik.http.routers.dashboard.rule=Host(`dashboard.nlpcloud.io`)
                - traefik.http.routers.dashboard.service=api@internal
                - traefik.http.middlewares.auth.basicauth.users=<your basic auth user>:<your basic auth hashed password>
                - traefik.http.routers.dashboard.middlewares=auth

In this example we enabled debugging, access logs, and the dashboard that can be accessed at http://dashboard.nlpcloud.io with basic auth.
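About the basic auth credentials: Traefik’s basicauth middleware accepts htpasswd-style entries (MD5, SHA-1, or bcrypt). You would typically generate one with the htpasswd tool; as an illustration, here is a Python sketch producing a SHA-1 entry with nothing but the standard library (bcrypt is the stronger choice in practice):

```python
import base64
import hashlib

def sha1_htpasswd_entry(user: str, password: str) -> str:
    """Build an htpasswd-style "{SHA}" entry for Traefik's basicauth.

    SHA-1 is used here only because it needs no third-party package;
    prefer bcrypt (e.g. htpasswd -B) for real deployments.
    """
    digest = hashlib.sha1(password.encode()).digest()
    return f"{user}:{{SHA}}{base64.b64encode(digest).decode()}"
```

For example, sha1_htpasswd_entry("admin", "password") yields admin:{SHA}W6ph5Mm5Pz8GgiULbPgzG37mj9g=, which can be pasted into the basicauth.users label.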

Conclusion

As you can see, Traefik is perfectly integrated with your Docker Compose configuration. If you want to change the config for a service, or add or remove services, just modify your docker-compose.yml and redeploy your Docker Swarm cluster. New changes will be taken into account, and services that were not modified don’t even have to restart, which is great for high availability.

I will keep writing articles about the stack we use at NLP Cloud. I think the next one will be about our frontend and how we are using HTMX instead of big JavaScript frameworks.

If you have any questions, don’t hesitate to ask!

Also available in French.
