Production-Ready Machine Learning NLP API with FastAPI and spaCy

FastAPI is a modern Python API framework that is increasingly used in production today. We are using FastAPI under the hood at NLP Cloud. NLP Cloud is an API based on spaCy and HuggingFace transformers that offers Named Entity Recognition (NER), sentiment analysis, text classification, summarization, and much more. FastAPI helped us quickly build a fast and robust machine learning API serving NLP models.

Let me tell you why we made such a choice, and show you how to implement an API based on FastAPI and spaCy for Named Entity Recognition (NER).

Why FastAPI?

Until recently, I’ve always used Django Rest Framework for Python APIs. But FastAPI offers several interesting features:

  • It is very fast
  • It is well documented
  • It is easy to use
  • It automatically generates API schemas for you (like OpenAPI)
  • It uses type validation with Pydantic under the hood. For a Go developer like myself who is used to static typing, it’s very cool to be able to leverage type hints like this. It makes the code clearer and less error-prone.

FastAPI’s performance makes it a great candidate for machine learning APIs. Given that we’re serving a lot of demanding NLP models based on spaCy and transformers at NLP Cloud, FastAPI is a great fit.

Set Up FastAPI

The first option you have is to install FastAPI and Uvicorn (the ASGI server that runs FastAPI) by yourself. The [all] extra conveniently pulls in Uvicorn along with FastAPI’s other optional dependencies:

pip install fastapi[all]

As you can see, FastAPI runs behind an ASGI server, which means it can natively handle asynchronous Python requests with asyncio.

Then you can run your app with something like this:

uvicorn main:app

Another option is to use one of the Docker images generously provided by Sebastián Ramírez, the creator of FastAPI. These images are maintained and work out of the box.

For example the Uvicorn + Gunicorn + FastAPI image adds Gunicorn to the stack in order to handle parallel processes. Basically Uvicorn handles multiple parallel requests within one single Python process, and Gunicorn handles multiple parallel Python processes.

The application is supposed to automatically start with docker run if you properly follow the image documentation.

These images are customizable. For example, you can tweak the number of parallel processes created by Gunicorn. It’s important to play with such parameters depending on the resources demanded by your API. If your API is serving a machine learning model that takes several GBs of memory, you might want to decrease Gunicorn’s default concurrency, otherwise your application will quickly consume too much memory.
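For example, with the Uvicorn + Gunicorn + FastAPI image, the worker count can be capped through the WEB_CONCURRENCY environment variable (the image name below is a placeholder for your own build):

```sh
# Run only 2 Gunicorn workers so a multi-GB model is not
# loaded more times than the available memory allows:
docker run -d -p 80:80 -e WEB_CONCURRENCY="2" myimage
```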

Simple FastAPI + spaCy API for NER

Let’s say you want to create an API endpoint that is doing Named Entity Recognition (NER) with spaCy. Basically, NER is about extracting entities like name, company, job title… from a sentence. More details about NER here if needed.

This endpoint will take a sentence as an input, and will return a list of entities. Each entity is made up of the position of its first character, the position just after its last character, the type of the entity, and the text of the entity itself.

The endpoint will be queried with POST requests this way:

curl "http://127.0.0.1/entities" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"John Doe is a Go Developer at Google"}'

And it will return something like this:

{
  "entities": [
    {
      "end": 8,
      "start": 0,
      "text": "John Doe",
      "type": "PERSON"
    },
    {
      "end": 26,
      "start": 14,
      "text": "Go Developer",
      "type": "POSITION"
    },
    {
      "end": 36,
      "start": 30,
      "text": "Google",
      "type": "ORG"
    }
  ]
}
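Note that the offsets index directly into the input string: slicing the sentence with each entity’s start and end recovers its text. A quick stdlib-only check, with offsets computed for the example sentence (the labels are just illustrative):

```python
text = "John Doe is a Go Developer at Google"
# (start, end, type) triples as they could appear in the API response
entities = [(0, 8, "PERSON"), (14, 26, "POSITION"), (30, 36, "ORG")]

# Slicing with the offsets gives back each entity's text
extracted = [text[start:end] for start, end, _ in entities]
print(extracted)  # → ['John Doe', 'Go Developer', 'Google']
```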

Here is how we could do it:

import spacy
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List

# Load the spaCy model once at startup, not on every request
model = spacy.load("en_core_web_lg")

app = FastAPI()

class UserRequestIn(BaseModel):
    text: str

class EntityOut(BaseModel):
    start: int
    end: int
    type: str
    text: str

class EntitiesOut(BaseModel):
    entities: List[EntityOut]

@app.post("/entities", response_model=EntitiesOut)
def read_entities(user_request_in: UserRequestIn):
    doc = model(user_request_in.text)

    return {
        "entities": [
            {
                "start": ent.start_char,
                "end": ent.end_char,
                "type": ent.label_,
                "text": ent.text,
            } for ent in doc.ents
        ]
    }

The first important thing here is that we’re loading the spaCy model. For our example we’re using a large spaCy pre-trained model for the English language. Large models take more memory and more disk space, but give better accuracy as they were trained on bigger datasets.

model = spacy.load("en_core_web_lg")

Later, we are using this spaCy model for NER by doing the following:

doc = model(user_request_in.text)
# [...]
doc.ents

The second thing, which is an amazing feature of FastAPI, is the ability to enforce data validation with Pydantic. Basically, you declare in advance the format of the user input and the format of the API response. If you’re a Go developer, you’ll find it very similar to JSON unmarshalling with structs. For example, we declare the format of a returned entity this way:

class EntityOut(BaseModel):
    start: int
    end: int
    type: str
    text: str

Note that start and end are positions in the sentence, so they are integers, and type and text are strings. If the API tries to return an entity that does not match this schema (for example, if start is not an integer), FastAPI will raise an error.

As you can see, it is possible to embed a validation class into another one. Here we are returning a list of entities, so we need to declare the following:

class EntitiesOut(BaseModel):
    entities: List[EntityOut]

Some simple types like int and str are built-in, but more complex types like List need to be explicitly imported.

The response validation is conveniently declared in the route decorator itself, through the response_model argument:

@app.post("/entities", response_model=EntitiesOut)

More Advanced Data Validation

You can do many more advanced validation things with FastAPI and Pydantic. For example, if you need the user input to have a minimum length of 10 characters, you can do the following:

from pydantic import BaseModel, constr

class UserRequestIn(BaseModel):
    text: constr(min_length=10)

Now, what if Pydantic validation passes, but you later realize that there’s something wrong with the data so you want to return an HTTP 400 code?

Simply raise an HTTPException:

from fastapi import HTTPException

raise HTTPException(status_code=400, detail="Your request is malformed")

These are just a couple of examples; you can do much more! Just have a look at the FastAPI and Pydantic docs.

Root Path

It’s very common to run such APIs behind a reverse proxy. For example, we’re using the Traefik reverse proxy in front of our services at NLPCloud.io.

A tricky thing when running behind a reverse proxy is that your sub-application (here the API) does not necessarily know about the whole URL path. And actually that’s great because it shows that your API is loosely coupled to the rest of your application.

For example here we want our API to believe that the endpoint URL is /entities, but actually the real URL might be something like /api/v1/entities. Here’s how to do it by setting a root path:

app = FastAPI(root_path="/api/v1")

You can also achieve it by passing an extra parameter to Uvicorn in case you’re starting Uvicorn manually:

uvicorn main:app --root-path /api/v1

Conclusion

As you can see, creating an API with FastAPI is dead simple, and the validation with Pydantic makes the code very expressive (and thus needing less documentation) and less error-prone.

FastAPI comes with great performance and the possibility to use asynchronous requests out of the box with asyncio, which is great for demanding machine learning models. The example above about Named Entity Recognition with spaCy and FastAPI can almost be considered production-ready (of course, the API code is only a small part of a full clustered application). So far, FastAPI has never been the bottleneck in our NLPCloud.io infrastructure.

If you have any questions, please don’t hesitate to ask!

Htmx and Django for Single Page Applications

We are not fond of big Javascript frameworks at NLP Cloud. NLP Cloud is an API based on spaCy and HuggingFace transformers that offers Named Entity Recognition (NER), sentiment analysis, text classification, summarization, and much more. Our backoffice is very simple. Users can retrieve their API token, upload their custom spaCy models, upgrade their plan, send support messages… Nothing too complex, so we didn’t feel the need for Vue.js or React.js. Instead, we used this very cool combination of htmx and Django.

Let me show you how it works and tell you more about the advantages of this solution.

What is htmx and why use it?

htmx is the successor of intercooler.js. The concept behind these 2 projects is that you can do all sorts of advanced things like AJAX, CSS transitions, websockets, etc. with HTML only (meaning without writing a single line of Javascript). And the lib is very light (only 9kB).

Another very interesting thing is that, when making asynchronous calls to your backend, htmx does not expect a JSON response but an HTML fragment. So basically, contrary to Vue.js or React.js, your frontend does not have to deal with JSON data: it simply replaces parts of the DOM with HTML fragments already rendered on the server side. This lets you leverage 100% of your good old backend framework (templates, sessions, authentication, etc.) instead of turning it into a headless framework that only returns JSON. The idea is that the overhead of an HTML fragment compared to JSON is negligible during an HTTP request.

So, to sum up, here is why htmx is interesting when building a single page application (SPA):

  • No Javascript to write
  • Excellent backend frameworks like Django, Ruby On Rails, Laravel… can be fully utilized
  • Very small library (9kB) compared to the Vue or React frameworks
  • No preprocessing needed (Webpack, Babel, etc.) which makes the development experience much nicer

Installation

Installing htmx is just about loading the script like this in your HTML <head>:

<script src="https://unpkg.com/htmx.org@1.2.1"></script>

I won’t go into the details of Django’s installation here as this article essentially focuses on htmx.

Load Content Asynchronously

The most important thing when creating an SPA is that you want everything to load asynchronously. For example, when clicking a menu entry to open a new page, you don’t want the whole webpage to reload, but only the content that changes. Here is how to do that.

Let’s say our site is made up of 2 pages:

  • The token page showing the user his API token
  • The support page basically showing the support email to the user

We also want to display a loading bar while the new page is loading.

Frontend

On the frontend side, you would create a menu with 2 entries. And clicking an entry would show the loading bar and change the content of the page without reloading the whole page.

<progress id="content-loader" class="htmx-indicator" max="100"></progress>
<aside>
    <ul>
        <li><a hx-get="/token" hx-push-url="true"
                hx-target="#content" hx-swap="innerHTML" 
                hx-indicator="#content-loader">Token</a></li>
        <li><a hx-get="/support"
                hx-push-url="true" hx-target="#content" hx-swap="innerHTML"
                hx-indicator="#content-loader">Support</a></li>
    </ul>
</aside>
<div id="content">Hello and welcome to NLP Cloud!</div>

In the example above, the loader is the <progress> element. It is hidden by default thanks to its htmx-indicator class. When a user clicks one of the 2 menu entries, the loader becomes visible thanks to hx-indicator="#content-loader".

When a user clicks the token menu entry, it performs an asynchronous GET call to the Django token url thanks to hx-get="/token". Django returns an HTML fragment that htmx puts in <div id="content"></div> thanks to hx-target="#content" hx-swap="innerHTML".

Same thing for the support menu entry.

Even if the page did not reload, we still want to update the URL in the browser in order to help the user understand where he is. That’s why we use hx-push-url="true".

As you can see, we now have an SPA that uses HTML fragments under the hood rather than JSON, with a mere 9kB lib and only a couple of directives.

Backend

Of course the above does not work without the Django backend.

Here’s your urls.py:

from django.urls import path

from . import views

urlpatterns = [
    path('', views.index, name='index'),
    path('token', views.token, name='token'),
    path('support', views.support, name='support'),
]

Now your views.py:

from django.shortcuts import render

def index(request):
    return render(request, 'backoffice/index.html')

def token(request):
    api_token = 'fake_token'

    return render(request, 'backoffice/token.html', {'token': api_token})

def support(request):
    return render(request, 'backoffice/support.html')

And last of all, in a templates/backoffice directory add the following templates.

index.html (i.e. basically the code we wrote above, but with Django url template tags):

<!DOCTYPE html>
<html>
    <head>
        <script src="https://unpkg.com/htmx.org@1.2.1"></script>
    </head>

    <body>
        <progress id="content-loader" class="htmx-indicator" max="100"></progress>
        <aside>
            <ul>
                <li><a hx-get="{% url 'index' %}"
                        hx-push-url="true" hx-target="#content" hx-swap="innerHTML"
                        hx-indicator="#content-loader">Home</a></li>
                <li><a hx-get="{% url 'token' %}" hx-push-url="true"
                        hx-target="#content" hx-swap="innerHTML" 
                        hx-indicator="#content-loader">Token</a></li>
            </ul>
        </aside>
        <div id="content">Hello and welcome to NLP Cloud!</div>
    </body>
</html>

token.html:

Here is your API token: {{ token }}

support.html:

For support questions, please contact support@nlpcloud.io

As you can see, all this is pure Django code using routing and templating as usual. No need for an API and Django Rest Framework here.

Allow Manual Page Reloading

The problem with the above is that if a user manually reloads the token or the support page, he will only end up with the HTML fragment instead of the whole HTML page.

The solution, on the Django side, is to render 2 different templates depending on whether the request is coming from htmx or not.

Here is how you could do it.

In your views.py you need to check whether the HTTP_HX_REQUEST header was passed in the request. If it was, it means this is a request from htmx and in that case you can show the HTML fragment only. If it was not, you need to render the full page.

def index(request):
    return render(request, 'backoffice/index.html')

def token(request):
    api_token = 'fake_token'

    if request.META.get("HTTP_HX_REQUEST") != 'true':
        return render(request, 'backoffice/token_full.html', {'token': api_token})

    return render(request, 'backoffice/token.html', {'token': api_token})

def support(request):
    if request.META.get("HTTP_HX_REQUEST") != 'true':
        return render(request, 'backoffice/support_full.html')

    return render(request, 'backoffice/support.html')

Now in your index.html template you want to use blocks in order for the index page to be extended by all the other pages:

<!DOCTYPE html>
<html>
    <head>
        <script src="https://unpkg.com/htmx.org@1.2.1"></script>
    </head>

    <body>
        <progress id="content-loader" class="htmx-indicator" max="100"></progress>
        <aside>
            <ul>
                <li><a hx-get="{% url 'index' %}"
                        hx-push-url="true" hx-target="#content" hx-swap="innerHTML"
                        hx-indicator="#content-loader">Home</a></li>
                <li><a hx-get="{% url 'token' %}" hx-push-url="true"
                        hx-target="#content" hx-swap="innerHTML" 
                        hx-indicator="#content-loader">Token</a></li>
            </ul>
        </aside>
        <div id="content">{% block content %}{% endblock %}</div>
    </body>
</html>

Your token.html template is the same as before but now you need to add a second template called token_full.html in case the page is manually reloaded:


{% extends "backoffice/index.html" %}

{% block content %}
    {% include "backoffice/token.html" %}
{% endblock %}

Same for support.html, add a support_full.html file:


{% extends "backoffice/index.html" %}

{% block content %}
    {% include "backoffice/support.html" %}
{% endblock %}

We are basically extending the index.html template in order to build the full page all at once on the server side.

This is a small hack, but it’s not very complex, and a middleware can even be created for the occasion to make things even simpler.
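For example, the template-picking logic can be factored into a small decorator (htmx_template is a hypothetical helper, not part of Django or htmx):

```python
from functools import wraps

def htmx_template(fragment_template, full_template):
    """Pick the HTML fragment for htmx requests, the full page otherwise."""
    def decorator(view):
        @wraps(view)
        def wrapped(request, *args, **kwargs):
            # htmx sets the HX-Request header, which Django exposes
            # as HTTP_HX_REQUEST in request.META
            is_htmx = request.META.get("HTTP_HX_REQUEST") == "true"
            template = fragment_template if is_htmx else full_template
            return view(request, template, *args, **kwargs)
        return wrapped
    return decorator
```

A view would then be written once, decorated with something like @htmx_template('backoffice/token.html', 'backoffice/token_full.html'), take the chosen template as an extra argument, and simply render it.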

What Else?

We only scratched the surface of htmx. This library (or framework?) includes tons of other useful features like:

  • You can use the HTTP verb you want for your requests. Use hx-get for GET, hx-post for POST, etc.
  • You can use polling, websockets, and server-sent events, in order to listen to events coming from the server
  • You can use only a part of the HTML fragment returned by the server (hx-select)
  • You can leverage CSS transitions
  • You can easily work with forms and file uploads
  • You can use hyperscript, a companion pseudo-Javascript language from the htmx author that can easily be embedded in HTML tags for advanced usage

Conclusion

I’m very enthusiastic about this htmx lib as you can see, and I do hope more and more people will realize they don’t necessarily need a huge JS framework for their project.

For the moment I’ve only integrated htmx into small codebases in production, but I’m pretty sure that htmx fits into large projects too. So far it’s been easy to maintain, lightweight, and its seamless integration with backend frameworks like Django is a must!

If some of you use htmx in production, I’d love to hear your feedback too!

Traefik Reverse Proxy with Docker Compose and Docker Swarm

My last article about Docker Swarm was the first of a series of articles I wanted to write about the stack used behind NLP Cloud. NLP Cloud is an API based on spaCy and HuggingFace transformers that offers Named Entity Recognition (NER), sentiment analysis, text classification, summarization, and much more. One challenge is that each model runs inside its own container, and new models are added to the cluster on a regular basis. So we need a reverse proxy that is both efficient and flexible in front of all these containers.

The solution we chose is Traefik.

I thought it would be interesting to write an article about how we implemented Traefik and why we chose it over standard reverse proxies like Nginx.

Why Traefik

Traefik is still a relatively new reverse proxy solution compared to Nginx or Apache, but it’s been gaining a lot of popularity. Traefik’s main advantage is that it seamlessly integrates with Docker, Docker Compose and Docker Swarm (and even Kubernetes and more): basically your whole Traefik configuration can be in your docker-compose.yml file which is very handy, and, whenever you add new services to your cluster, Traefik discovers them on the fly without having to restart anything.

So Traefik makes maintainability easier and is good from a high-availability standpoint.

It is developed in Go while Nginx is written in C, so I guess that makes a slight difference in terms of performance, but nothing I could perceive, and in my opinion it is negligible compared to the advantages it gives you.

Traefik has a bit of a learning curve though and, even if their documentation is pretty good, it is still easy to make mistakes and hard to find where a problem is coming from, so let me give you a couple of ready-to-use examples below.

Install Traefik

Basically you don’t have much to do here. Traefik is just another Docker image you’ll need to add to your cluster as a service in your docker-compose.yml:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4

There are several ways to integrate Traefik but, like I said above, we are going to go for the Docker Compose integration.

Basic Configuration

90% of Traefik’s configuration is done through Docker labels.

Let’s say we have 3 services:

  • A corporate website that is simply served as a static website at http://nlpcloud.io
  • An en_core_web_sm spaCy model served through a FastAPI Python API at http://api.nlpcloud.io/en_core_web_sm
  • An en_core_web_lg spaCy model served through a FastAPI Python API at http://api.nlpcloud.io/en_core_web_lg

More details about spaCy NLP models here and FastAPI here.

Here is a basic local staging configuration routing the requests to the correct services in your docker-compose.yml:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
    corporate:
        image: <your corporate image>
        labels:
            - traefik.http.routers.corporate.rule=Host(`localhost`)
    en_core_web_sm:
        image: <your en_core_web_sm model API image>
        labels:
            - traefik.http.routers.en_core_web_sm.rule=Host(`api.localhost`) && PathPrefix(`/en_core_web_sm`)
    en_core_web_lg:
        image: <your en_core_web_lg model API image>
        labels:
            - traefik.http.routers.en_core_web_lg.rule=Host(`api.localhost`) && PathPrefix(`/en_core_web_lg`)

You can now access your corporate website at http://localhost, your en_core_web_sm model at http://api.localhost/en_core_web_sm, and your en_core_web_lg model at http://api.localhost/en_core_web_lg.

As you can see it’s dead simple.

That was for local staging only, so we now want to do the same for production in a Docker Swarm cluster:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
    en_core_web_sm:
        image: <your en_core_web_sm model API image>
        deploy:
            labels:
                - traefik.http.services.en_core_web_sm.loadbalancer.server.port=80
                - traefik.http.routers.en_core_web_sm.rule=Host(`api.nlpcloud.io`) && PathPrefix(`/en_core_web_sm`)
    en_core_web_lg:
        image: <your en_core_web_lg model API image>
        deploy:
            labels:
                - traefik.http.services.en_core_web_lg.loadbalancer.server.port=80
                - traefik.http.routers.en_core_web_lg.rule=Host(`api.nlpcloud.io`) && PathPrefix(`/en_core_web_lg`)

You can now access your corporate website at http://nlpcloud.io, your en_core_web_sm model at http://api.nlpcloud.io/en_core_web_sm, and your en_core_web_lg model at http://api.nlpcloud.io/en_core_web_lg.

It’s still fairly simple but the important things to notice are the following:

  • We should explicitly use the docker.swarmmode provider instead of docker
  • Labels should now be put in the deploy section
  • We need to manually declare the port of each service with the loadbalancer directive (this has to be done manually because Docker Swarm lacks port auto-discovery)
  • We have to make sure that Traefik is deployed on a manager node of the Swarm by using constraints

You now have a fully fledged cluster thanks to Docker Swarm and Traefik. Now, it’s likely that you have specific requirements, and no doubt the Traefik documentation will help. But let me show you a couple of features we use at NLP Cloud.

Forwarded Authentication

Let’s say your NLP API endpoints are protected and users need a token to reach them. A good solution for this use case is to leverage Traefik’s ForwardAuth.

Basically, Traefik will forward all the user requests to a dedicated page you created for the occasion. This page will take care of checking the headers of the request (and maybe extract an authentication token, for example) and determine whether the user has the right to access the resource. If so, the page should return an HTTP 2XX code.

If a 2XX code is returned, Traefik will then make the actual request to the final API endpoint. Otherwise, it will return an error.

Please note that, for performance reasons, Traefik only forwards the user request headers to your authentication page, not the request body. So it’s not possible to authorize a user request based on the body of the request.
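The auth service itself can be very simple. As an illustration only (the Token scheme and the token store below are assumptions, not NLP Cloud’s actual implementation), the decision it makes boils down to mapping request headers to a status code:

```python
VALID_TOKENS = {"my-secret-token"}  # hypothetical token store

def auth_status(headers):
    """Return the HTTP status the auth endpoint sends back to Traefik:
    a 2XX lets the request through, anything else blocks it."""
    token = headers.get("Authorization", "").removeprefix("Token ")
    return 200 if token in VALID_TOKENS else 401
```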

Here’s how to achieve it:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
    en_core_web_sm:
        image: <your en_core_web_sm model API image>
        deploy:
            labels:
                - traefik.http.services.en_core_web_sm.loadbalancer.server.port=80
                - traefik.http.routers.en_core_web_sm.rule=Host(`api.nlpcloud.io`) && PathPrefix(`/en_core_web_sm`)
                - traefik.http.middlewares.forward_auth_api_en_core_web_sm.forwardauth.address=https://api.nlpcloud.io/auth/
                - traefik.http.routers.en_core_web_sm.middlewares=forward_auth_api_en_core_web_sm
    api_auth:
        image: <your api_auth image>
        deploy:
            labels:
                - traefik.http.services.api_auth.loadbalancer.server.port=80
                - traefik.http.routers.api_auth.rule=Host(`api.nlpcloud.io`) && PathPrefix(`/auth`)

At NLP Cloud, the api_auth service is actually a Django + Django Rest Framework image in charge of authenticating the requests.

Custom Error Pages

Maybe you don’t want to show raw Traefik error pages to users. If so, it’s possible to replace error pages with your custom error pages.

Traefik does not keep any custom error page in memory, but it can use error pages served by one of your services. When contacting your service in order to retrieve the custom error page, Traefik substitutes the HTTP error code into the {status} placeholder of the query, so you can show different error pages based on the initial HTTP error.

Let’s say we have a small static website served by Nginx that hosts your custom error pages. We want to use its error pages for HTTP errors from 400 to 599. Here’s how you would do it:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
            labels:
                - traefik.http.middlewares.handle-http-error.errors.status=400-599
                - traefik.http.middlewares.handle-http-error.errors.service=errors_service
                - traefik.http.middlewares.handle-http-error.errors.query=/errors/{status}.html
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
                - traefik.http.routers.corporate.middlewares=handle-http-error
    errors_service:
        image: <your static website image>
        deploy:
            labels:
                - traefik.http.routers.errors_service.rule=Host(`nlpcloud.io`) && PathPrefix(`/errors`)

For example, thanks to the configuration above, a 404 error would now use this page: http://nlpcloud.io/errors/404.html

HTTPS

A cool feature of Traefik is that it can automatically provision and use TLS certificates with Let’s Encrypt.

They have a nice tutorial about how to set it up with Docker so I’m just pointing you to the right resource: https://doc.traefik.io/traefik/user-guides/docker-compose/acme-tls/
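For reference, here is a minimal sketch of what it can look like in docker-compose.yml (the resolver name le and the email are placeholders; see the tutorial above for a complete setup):

```yaml
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "443:443"
        command:
            - --providers.docker.swarmmode
            - --entrypoints.websecure.address=:443
            # 'le' is an arbitrary certificates resolver name
            - --certificatesresolvers.le.acme.email=you@example.com
            - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
            - --certificatesresolvers.le.acme.tlschallenge=true
        volumes:
            - ./letsencrypt:/letsencrypt
            - /var/run/docker.sock:/var/run/docker.sock:ro
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
                - traefik.http.routers.corporate.entrypoints=websecure
                - traefik.http.routers.corporate.tls.certresolver=le
```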

Raising Upload Size Limit

The default upload size limit is pretty low for performance reasons (I think it’s 4194304 bytes but I’m not 100% sure as it’s not in their docs).

In order to increase it, you need to use the maxRequestBodyBytes directive:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
    corporate:
        image: <your corporate image>
        deploy:
            labels:
                - traefik.http.routers.corporate.rule=Host(`nlpcloud.io`)
                - traefik.http.middlewares.upload-limit.buffering.maxRequestBodyBytes=20000000
                - traefik.http.routers.corporate.middlewares=upload-limit

In the example above, we raised the upload limit to 20MB.

But don’t forget that uploading a huge file all at once is not necessarily the best option. Instead, you can cut the file into chunks and upload each chunk independently. I might write an article about this in the future.
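The chunking itself is straightforward; here is a stdlib-only sketch (the 5MB default is arbitrary, pick a value below your proxy limit):

```python
def iter_chunks(path, chunk_size=5 * 1024 * 1024):
    """Yield successive fixed-size chunks of a file, each of which
    can then be uploaded as an independent request."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```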

Debugging

There are a couple of things you can enable to help you debug Traefik.

The first thing is to enable the debugging mode, which will show you tons of detail about what Traefik is doing.

The second thing is to enable access logs in order to see all incoming HTTP requests.

Last of all, Traefik provides a cool built-in dashboard that helps debug your configuration. It is really useful as it is sometimes tricky to understand why things are not working.

In order to turn on the above features, you could do the following:

version: '3.8'
services:
    traefik:
        image: traefik:v2.4
        ports:
            - "80:80"
        command:
            - --providers.docker.swarmmode
            - --log.level=DEBUG
            - --accesslog
            - --api.dashboard=true
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
        deploy:
            placement:
                constraints:
                    - node.role == manager
            labels:
                - traefik.http.routers.dashboard.rule=Host(`dashboard.nlpcloud.io`)
                - traefik.http.routers.dashboard.service=api@internal
                - traefik.http.middlewares.auth.basicauth.users=<your basic auth user>:<your basic auth hashed password>
                - traefik.http.routers.dashboard.middlewares=auth

In this example we enabled debugging, access logs, and the dashboard that can be accessed at http://dashboard.nlpcloud.io with basic auth.

Conclusion

As you can see, Traefik is perfectly integrated with your Docker Compose configuration. If you want to change the config for a service, or add or remove services, just modify your docker-compose.yml and redeploy your Docker Swarm cluster. New changes will be taken into account, and services that were not modified don’t even have to restart, which is great for high availability.

I will keep writing a couple of articles about the stack we use at NLP Cloud. I think the next one will be about our frontend and how we are using htmx instead of big Javascript frameworks.

If you have any questions, don’t hesitate to ask!

Container Orchestration With Docker Swarm

NLP Cloud is a service I have contributed to recently. It is an API based on spaCy and HuggingFace transformers that offers Named Entity Recognition (NER), sentiment analysis, text classification, summarization, and much more. It uses several interesting technologies under the hood, so I thought I would write a series of articles about them. This first one is about container orchestration and how we implement it with Docker Swarm. Hope it will be useful!

Why Container Orchestration

NLP Cloud is using tons of containers, mainly because each NLP model is running inside its own container. Not only do pre-trained models have their own containers, but also each user’s custom model has a dedicated container. It is very convenient for several reasons:

  • It is easy to run an NLP model on the server that has the best resources for it. Machine learning models are very resource hungry: they consume a lot of memory, and it is sometimes interesting to run them on a GPU (in case you are using NLP transformers for example). It is then best to deploy them onto a machine with specific hardware.
  • Horizontal scalability can be ensured by simply adding more replicas of the same NLP model
  • High availability is made easier thanks to redundancy and automatic failover
  • It helps lower costs: scaling horizontally on a myriad of small machines is much more cost effective than scaling vertically on a couple of big machines

Of course setting up such an architecture takes time and skills but in the end it often pays off when you’re building a complex application.

Why Docker Swarm

Docker Swarm is usually opposed to Kubernetes and Kubernetes is supposed to be the winner of the container orchestration match today. But things are not so simple…

Kubernetes has tons of settings that make it perfect for very advanced use cases, but this versatility comes at a cost: Kubernetes is hard to install, configure, and maintain. It is actually so hard that today most companies using Kubernetes actually use a managed version of it, on GCP for example, and cloud providers don’t all implement Kubernetes the same way in their managed offers.

Let’s not forget that Google initially built Kubernetes for their internal needs, the same way that Facebook built React for their own needs. But you might not have to manage the same complexity for your project, and many projects could be delivered faster and maintained more easily by using simpler tools…

At NLP Cloud, we have a lot of containers but we do not need the complex advanced configuration capabilities of Kubernetes. We do not want to use a managed version of Kubernetes either: first for cost reasons, but also because we want to avoid vendor lock-in, and lastly for privacy reasons.

Docker Swarm also has an interesting advantage: it integrates seamlessly with Docker and Docker Compose. It makes configuration a breeze and for teams already used to working with Docker it creates no additional difficulty.

Install the Cluster

Let’s say we want to have 5 servers in our cluster:

  • 1 manager node that will orchestrate the whole cluster. It will also host the database (just an example, the DB could perfectly be on a worker too).
  • 1 worker node that will host our Python/Django backoffice
  • 3 worker nodes that will host the replicated FastAPI Python API serving an NLP model

We are deliberately omitting the reverse proxy that will load balance requests to the right nodes, as it will be the topic of an upcoming blog post.

Provision the Servers

Order 5 servers where you want. It can be OVH, Digital Ocean, AWS, GCP… doesn’t matter.

It’s important to think about the performance of each server depending on what it will be dedicated to. For example, for the node hosting a simple backoffice you might not need much power. For the node hosting the reverse proxy (not addressed in this tutorial) you might need more CPU than usual. And for the API nodes serving the NLP model you might want a lot of RAM, and maybe even a GPU.

Install a Linux distribution on each server. I would go for the latest Ubuntu LTS version as far as I’m concerned.

On each server, install the Docker engine.

Now give each server a human friendly hostname. It will be useful so that next time you SSH into the server you will see this hostname in your prompt, which is a good practice to avoid working on the wrong server. But it will also be used by Docker Swarm as the node name. Run the following on each server:

echo <node name> > /etc/hostname; hostname -F /etc/hostname

On the manager, login to your Docker registry so Docker Swarm can pull your images (no need to do this on the worker nodes):

docker login

Initialize the Cluster and Attach Nodes

On the manager node, run:

docker swarm init --advertise-addr <server IP address>

--advertise-addr <server IP address> is only needed if your server has several IP addresses on the same interface so Docker knows which one to choose.

Then, in order to attach worker nodes, run the following on the manager:

docker swarm join-token worker

The output will be something like docker swarm join --token SWMTKN-1-5tl7ya98erd9qtasdfml4lqbosbhfqv3asdf4p13-dzw6ugasdfk0arn0 172.173.174.175:2377

Copy this output and paste it to a worker node. Then repeat the join-token operation for each worker node.

You should now be able to see all your nodes by running:

docker node ls

Give Labels to your Nodes

It’s important to label your nodes properly, as you will need these labels later for Docker Swarm to determine which node a container should be deployed to. If you do not specify which node you want a container deployed to, Docker Swarm will deploy it on any available node. This is clearly not what you want.

Let’s say that your backoffice requires few resources and is basically stateless, so it can be deployed to any cheap worker node. Your API is stateless too but, on the contrary, it is memory hungry and requires hardware dedicated to machine learning, so you want to deploy it only to the machine learning worker nodes. Last of all, your database is not stateless, so it always has to be deployed to the very same server: let’s say this server will be our manager node (but it could very well be a worker node too).

Do the below on the manager.

The manager will host the database so give it the “database” label:

docker node update --label-add type=database <manager name>

Give the “cheap” label to the worker with modest specs that will host the backoffice:

docker node update --label-add type=cheap <backoffice worker name>

Last of all, give the “machine-learning” label to all the workers that will host NLP models:

docker node update --label-add type=machine-learning <api worker 1 name>
docker node update --label-add type=machine-learning <api worker 2 name>
docker node update --label-add type=machine-learning <api worker 3 name>

Set Up Configuration With Docker Compose

If you used Docker Compose already you will most likely find the transition to Swarm fairly easy.

If you do not add anything to an existing docker-compose.yml file it will work with Docker Swarm but basically your containers will be deployed anywhere without your control, and they won’t be able to talk to each other.

Network

In order for containers to communicate, they should be on the same virtual network. For example a Python/Django application, a FastAPI API, and a PostgreSQL database should be on the same network to work together. We will manually create the main_network network later right before deploying, so let’s use it now in our docker-compose.yml:

version: "3.8"

networks:
  main_network:
    external: true

services:
  backoffice:
    image: <path to your custom Django image>
    depends_on:
      - database
    networks:
      - main_network
  api:
    image: <path to your custom FastAPI image>
    depends_on:
      - database
    networks:
      - main_network
  database:
    image: postgres:13
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=db_name
    volumes:
      - /local/path/to/postgresql/data:/var/lib/postgresql/data
    networks:
      - main_network

Deployment Details

Now you want to tell Docker Swarm which server each service will be deployed to. This is where you are going to use the labels that you created earlier.

Basically all this is about using the constraints directive like this:

version: "3.8"

networks:
  main_network:
    external: true

services:
  backoffice:
    image: <path to your custom Django image>
    depends_on:
      - database
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == worker
          - node.labels.type == cheap
  api:
    image: <path to your custom FastAPI image>
    depends_on:
      - database
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == worker
          - node.labels.type == machine-learning
  database:
    image: postgres:13
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=db_name
    volumes:
      - /local/path/to/postgresql/data:/var/lib/postgresql/data
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == manager
          - node.labels.type == database

Resources Reservation and Limitation

It can be dangerous to ship your containers as is for 2 reasons:

  • The orchestrator might deploy them to a server that doesn’t have enough resources available (because other containers consume the whole memory available for example)
  • One of your containers might consume more resources than expected and eventually cause trouble on the host server. For example, if your machine learning model happens to consume too much memory, it can cause the host to trigger the OOM protection and start killing processes in order to free some RAM. By default the Docker engine is among the very last processes to be killed by the host, but if that happens it means that all your containers on this host will shut down…

In order to mitigate the above, you can use the reservations and limits directives:

  • reservations makes sure a container is deployed only if the target server has enough resources available. If it doesn’t, the orchestrator won’t deploy the container until the necessary resources are available.
  • limits prevents a container from consuming too many resources once it is deployed somewhere.

Let’s say we want our API container - embedding a machine learning model - to be deployed only if 5GB of RAM and half the CPU are available. Let’s also say the API can consume up to 10GB of RAM and 80% of the CPU. Here’s what we should do:

version: "3.8"

networks:
  main_network:
    external: true

services:
  api:
    image: <path to your custom FastAPI image>
    depends_on:
      - database
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == worker
          - node.labels.type == machine-learning
      resources:
        limits:
          cpus: '0.8'
          memory: 10G
        reservations:
          cpus: '0.5'
          memory: 5G

Replication

In order to implement horizontal scalability, you might want to replicate some of your stateless applications. You just need to use the replicas directive for this. For example let’s say we want our API to have 3 replicas, here’s how to do it:

version: "3.8"

networks:
  main_network:
    external: true

services:
  api:
    image: <path to your custom FastAPI image>
    depends_on:
      - database
    networks:
      - main_network
    deploy:
      placement: 
        constraints:
          - node.role == worker
          - node.labels.type == machine-learning
      resources:
        limits:
          cpus: '0.8'
          memory: 10G
        reservations:
          cpus: '0.5'
          memory: 5G
      replicas: 3

More

More settings are available for more control on your cluster orchestration. Don’t hesitate to refer to the docs for more details.

Secrets

Docker Compose has a built-in convenient way to manage secrets by storing each secret into an external individual file. Thus these files are not part of your configuration and can even be encrypted if necessary, which is great for security.

Let’s say you want to secure the PostgreSQL DB credentials.

First create 3 secret files on your local machines:

  • Create a db_name.secret file and put the DB name in it
  • Create a db_user.secret file and put the DB user in it
  • Create a db_password.secret file and put the DB password in it

Then in your Docker Compose file you can use the secrets this way:

version: "3.8"

networks:
  main_network:
    external: true

secrets:
  db_name:
    file: "./secrets/db_name.secret"
  db_user:
    file: "./secrets/db_user.secret"
  db_password:
    file: "./secrets/db_password.secret"

services:
  database:
    image: postgres:13
    secrets:
      - "db_name"
      - "db_user"
      - "db_password"
    # Adding the _FILE suffix makes the Postgres image automatically
    # detect secrets and properly load them from files.
    environment:
      - POSTGRES_USER_FILE=/run/secrets/db_user
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
      - POSTGRES_DB_FILE=/run/secrets/db_name
    volumes:
      - /local/path/to/postgresql/data:/var/lib/postgresql/data
    deploy:
      placement:
        constraints:
          - node.role == manager
          - node.labels.type == database
    networks:
      - main_network

Secret files are automatically injected into the containers in /run/secrets. Careful though: these secrets are located in files, not in environment variables, so you normally need to open these files and read the secrets yourself. The PostgreSQL image has a convenient feature: if you append the _FILE suffix to an environment variable, the image will automatically read the secret from the corresponding file.

Staging VS Production

You most likely want to have at least 2 different types of Docker Compose configurations:

  • 1 for your local machine, used both for building the Docker images and as a staging environment
  • 1 for production

You have 2 choices: either leverage the Docker Compose override feature, so you only write one big docker-compose.yml base file plus a small additional staging.yml file dedicated to staging and another small production.yml file dedicated to production; or maintain 2 completely separate files.
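As a sketch of the override option (the file name and the service shown are just examples reusing the API service from above), the base file keeps the shared settings and a small environment-specific file only lists the differences:

```yaml
# production.yml: a small override file merged on top of docker-compose.yml.
# Only the differences with the base file need to appear here.
services:
  api:
    deploy:
      replicas: 3
      placement:
        constraints:
          - node.labels.type == machine-learning
```

You can then merge the files at deploy time, for example with docker-compose -f docker-compose.yml -f production.yml config, or by passing several -c flags to docker stack deploy.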

In the end at NLP Cloud we ended up realizing that our staging and production configurations were so different that it was easier to just maintain 2 different big files: one for staging and one for production. The main reason is that our production environment uses Docker Swarm but our staging environment doesn’t, so playing with both is pretty impractical.

Deploy

Now we assume that you have locally built your images and pushed them to your Docker registry. Let’s say we only have one single production.yml file for production.

Copy your production.yml file to the server using scp:

scp production.yml <server user>@<server IP>:/remote/path/to/project

Copy your secrets too (and make sure to upload them to the folder you declared in the secrets section of your Docker Compose file):

scp /local/path/to/secrets <server user>@<server IP>:/remote/path/to/secrets

Manually create the network that we’re using in our Docker Compose file. Note that it’s also possible to skip this step and let Docker Swarm create the network automatically if it’s declared in your Docker Compose file, but we noticed that this creates erratic behavior when recreating the stack, because Docker does not recreate the network fast enough.

docker network create --driver=overlay main_network

You also need to create the volume directories manually. The only volume we have in this tutorial is for the database, so let’s create it on the node hosting the DB (i.e. the manager):

mkdir -p /local/path/to/postgresql/data

Ok, everything is set, so now it’s time to deploy the whole stack!

docker stack deploy --with-registry-auth -c production.yml <stack name>

The --with-registry-auth option is needed if your images are located on a password-protected registry.

Wait a moment as Docker Swarm is now pulling all the images and installing them on the nodes. Then check if everything went fine:

docker service ls

You should see something like the following:

ID             NAME                       MODE         REPLICAS   IMAGE
f1ze8qgf24c7   <stack name>_backoffice    replicated   1/1        <path to your custom Python/Django image>     
gxboram56dka   <stack name>_database      replicated   1/1        postgres:13      
3y1nmb6g2xoj   <stack name>_api           replicated   3/3        <path to your custom FastAPI image>      

The important thing is that REPLICAS should all be at their maximum. Otherwise it means that Docker is still pulling or installing your images, or that something went wrong.

Manage the Cluster

Now that your cluster is up and running, here are a couple of useful commands to administer it:

  • See all applications and where they are deployed: docker stack ps <stack name>
  • See applications on a specific node: docker node ps <node name>
  • See logs of an application: docker service logs <stack name>_<service name>
  • Completely remove the whole stack: docker stack rm <stack name>

Every time you want to deploy a new image to the cluster, first upload it to your registry, then run the docker stack deploy command again on the manager.

Conclusion

As you can see, setting up a Docker Swarm cluster is far from complex, especially when you think about the actual complexity that has to be handled under the hood in such distributed systems.

Of course many more options are available and you will most likely want to read the documentation. Also, we did not talk about the reverse proxy/load balancing aspect, but it’s an important one. In an upcoming tutorial we will see how to achieve this with Traefik.

At NLP Cloud our configuration is obviously much more complex than what we showed above, and we had to face several tricky challenges in order for our architecture to be both fast and easy to maintain. For example, we have so many machine learning containers that manually writing the configuration file for each container was not an option, so new auto generation mechanisms had to be implemented.

If you are interested in more in-depth details, please don’t hesitate to ask; it will be a pleasure to share.

Crawling large volumes of web pages

Crawling and scraping data from the web is a funny thing. It’s fairly easy to get started and it gives you immediate results. However, scaling from a basic crawler (a quick Python script, for example) to a full-speed, large-volume crawler is hard to achieve. I’ll walk you through a couple of typical challenges one faces when building such a web crawler.

Concurrency

Concurrency is absolutely central to more and more modern applications, and it’s especially true for applications that rely heavily on network access, like web crawlers. Indeed, as every HTTP request you trigger takes a long time to return, you’d better launch requests in parallel rather than sequentially. Basically it means that if you’re crawling 10 web pages taking 1 second each, the whole crawl will roughly take 1 second rather than 10 seconds.

So concurrency is critical to web crawlers, but how do you achieve it?

The naive approach, which works well for a small application, is to write logic that triggers jobs in parallel, waits for all the results, and processes them. Typically in Python you would spawn several parallel processes, and in Go (which is better suited for this kind of thing) you would create goroutines. But handling this manually can quickly become a hassle: as your RAM and CPU resources are limited, there’s no way you can crawl millions of web pages in parallel. So how do you handle job queues, and how do you handle retries in case some jobs fail (and they will, for sure) or in case your server stops for some reason?

The most robust approach is to leverage a messaging system like RabbitMQ. Every new URL parsed by your application is enqueued in RabbitMQ, and every new page your application needs to crawl is dequeued from RabbitMQ. The number of concurrent requests you want to reach is just a simple setting in RabbitMQ.

Of course, even when using a messaging system, the choice of the underlying programming language remains important: triggering 100 parallel jobs in Go will cost far fewer resources than in Python, for example (which is partly why I really like Go!).

Scalability

At some point, no matter how lightweight and optimized your web crawler is, you’ll be limited by hardware resources.

The first solution is to upgrade your server (which is called “vertical scalability”). It’s easy but once you’re reaching a certain level of RAM and CPU, it’s cheaper to favor “horizontal scalability”.

Horizontal scalability is about adding several modest servers to your infrastructure, rather than turning one single server into a supercomputer. Achieving this is harder though, because your servers might have to share state, and your application might need refactoring. The good news is that a web crawler can fairly easily become “stateless”: several instances of the application can run in parallel, and shared information will most likely live in your messaging system and/or your database. It’s then easy to increase or decrease the number of servers based on the speed you want to achieve. Each server should handle a certain number of concurrent requests consumed from the messaging server; it’s up to you to define how many each server can handle depending on its RAM/CPU resources.

Container orchestrators like Kubernetes make horizontal scalability easier: you can scale up to more instances simply by clicking a button, and you can even let Kubernetes auto-scale instances for you (always set limits though, in order to control costs).

If you want a deeper understanding of scalability challenges, you should read Martin Kleppmann’s amazing book: Designing Data-Intensive Applications.

Designing Data-Intensive Applications book

Report Errors Wisely

Tons of ugly things can happen during a crawl: connectivity issues (on the client side or on the server side), network congestion, target pages that are too big, memory limits reached…

It’s crucial that you handle these errors gracefully and that you report them wisely in order not to get overwhelmed by errors.

A good practice is to centralize all errors in Sentry. Some errors should never be sent to Sentry because they are not critical and you don’t want to be alerted about them. For example, you want to know when an instance hits memory issues, but you don’t want to know when a URL cannot be downloaded because the target website timed out (this kind of error is business as usual for a crawler). It’s up to you to fine-tune which errors are worth reporting urgently and which ones are not.

File Descriptors and RAM Usage

When dealing with web crawlers, it’s worth being familiar with file descriptors. Every HTTP request you launch opens a file descriptor, and a file descriptor consumes memory.

On Linux systems, the maximum number of file descriptors open in parallel is capped by the OS in order to avoid breaking the system. Once this limit is reached, you won’t be able to open any new web pages concurrently.

You might want to increase this limit but proceed carefully as it could lead to excessive RAM usage.

Avoid Some Traps

Here are 2 typical tricks that drastically help improve performance when crawling large volumes of data:

  • Abort on excessive page size: some pages are too big and should be skipped, not only for stability reasons (you don’t want to fill your disk with them) but also for efficiency reasons.
  • Fine-tune timeouts wisely: a web request can time out for several reasons, and it’s important to understand the underlying concepts in order to set different levels of timeouts. See this great Cloudflare article for more details. In Go you can set a timeout when creating a net/http client, but a more idiomatic (and maybe more modern) approach is to use contexts for that purpose.

DNS

When crawling millions of web pages, the default DNS server you’re using is likely to end up rejecting your requests. It then becomes interesting to use a more robust DNS server like Google’s or Cloudflare’s, or even to rotate resolution requests among several DNS servers.

Refresh Data

Crawling data once is often of little interest. Data should be refreshed asynchronously on a regular basis using periodic tasks, or synchronously upon a user request.

A recent application I worked on refreshed data asynchronously. Every time we crawled a domain, we stored the current date in the database, and then every day a periodic task looked for all the domains in the DB that needed to be refreshed. As Cron was too limited for our needs, we used this more advanced Cron-like tool for Go applications: https://github.com/robfig/cron.

Being Fair

Crawling the web should be done respectfully. It basically means 2 things:

  • don’t crawl a web page if its robots.txt file disallows it
  • don’t hammer a single web server with tons of requests: set a very low concurrency level when crawling several pages from a single domain, and pause for a moment between 2 requests

Conclusion

Setting up a large volume crawler is a fascinating journey which requires both coding and infrastructure considerations.

In this post I’m only scratching the surface but hopefully some of these concepts will help you in your next project! If you have comments or questions, please don’t hesitate to reach out to me, I’ll be pleased.
