Crawling large volumes of web pages

Crawling and scraping data from the web is a funny thing: getting started is fairly easy and gives you immediate results. However, scaling from a basic crawler (a quick Python script, for example) to a full-speed, large-volume crawler is hard. I'll try to walk you through a couple of typical challenges you face when building such a web crawler.

Concurrency

Concurrency is absolutely central to more and more modern applications, and it's especially true for applications that rely heavily on network access, like web crawlers. Indeed, as every HTTP request you trigger takes a long time to return, you'd better launch requests in parallel rather than sequentially. Basically it means that if you're crawling 10 web pages taking 1 second each, the whole batch will take roughly 1 second overall rather than 10 seconds.

So concurrency is critical to web crawlers, but how do you achieve it?

The naive approach, which works well for a small application, is to write logic that triggers jobs in parallel, waits for all the results, and processes them. Typically, in Python you would spawn several parallel processes, and in Go (which is better suited for this kind of thing) you would create goroutines. But handling this manually can quickly become a hassle: since your RAM and CPU resources are limited, there's no way you can crawl millions of web pages in parallel. So how do you handle job queues, and how do you handle retries when some jobs fail (and they will for sure) or when your server stops for some reason?

The most robust approach is to leverage a messaging system like RabbitMQ. Every new URL parsed by your application is enqueued in RabbitMQ, and every new page your application needs to crawl is dequeued from RabbitMQ. The number of concurrent requests you want to reach then becomes a simple RabbitMQ setting (the consumer prefetch count).

Of course, even when using a messaging system, the choice of the underlying programming language remains important: triggering 100 parallel jobs in Go will cost you far fewer resources than in Python, for example (which is partly why I really like Go!).

Scalability

At some point, no matter how lightweight and optimized your web crawler is, you’ll be limited by hardware resources.

The first solution is to upgrade your server (this is called "vertical scalability"). It's easy, but once you reach a certain level of RAM and CPU it becomes cheaper to favor "horizontal scalability".

Horizontal scalability is about adding several modest servers to your infrastructure, rather than turning one single server into a supercomputer. It's harder to achieve though, because your servers might have to communicate about a shared state, and your application might need refactoring. The good news is that a web crawler can fairly easily become "stateless": several instances of your application can run in parallel, and any shared information will most likely live in your messaging system and/or your database. It's then easy to increase or decrease the number of servers based on the speed you want to achieve. Each server should handle a certain number of concurrent requests consumed from the messaging server; it's up to you to define how many each server can handle depending on its RAM/CPU resources.

Containers orchestrators like Kubernetes make horizontal scalability easier. It’s easy to scale up to more instances simply by clicking a button, and you can even let Kubernetes auto-scale your instances for you (always set limits though in order to control the costs).

If you want a deeper understanding of these scalability challenges, you should read the amazing book by Martin Kleppmann: Designing Data-Intensive Applications.


Report Errors Wisely

Tons of ugly things can happen during a crawl: connectivity issues (on the client side and on the server side), network congestion, a target page that is too big, a memory limit being reached, and so on.

It’s crucial that you handle these errors gracefully and that you report them wisely in order not to get overwhelmed by errors.

A good practice is to centralize all errors into a tool like Sentry. Some errors should never be sent to Sentry because they're not critical and you don't want to be alerted for them. For example, you want to know when an instance is reaching memory limits, but you don't need to know when a URL cannot be downloaded because a website timed out (this kind of error is business as usual for a crawler). It's up to you to fine-tune which errors are worth reporting urgently and which ones are not.

File Descriptors and RAM Usage

When dealing with web crawlers, it's worth being familiar with file descriptors. Every HTTP request you launch opens a file descriptor, and every file descriptor consumes memory.

On Linux systems, the maximum number of simultaneously open file descriptors is capped by the OS in order to avoid breaking the system. Once this limit is reached, you won't be able to open any new connection, so no new web page can be crawled concurrently.

You might want to increase this limit but proceed carefully as it could lead to excessive RAM usage.

Avoid Some Traps

Here are two typical tricks that drastically help improve performance when crawling large volumes of data:

  • Abort on excessive page size: some pages are too big and should be ignored, not only for stability reasons (you don't want to fill your disk with them) but also for efficiency reasons.
  • Fine-tune timeouts wisely: a web request might time out for several reasons, and it's important to understand the underlying concepts in order to adopt different levels of timeouts. See this great Cloudflare article for more details. In Go you can set a global timeout when creating a net/http client, but a more idiomatic (and maybe more modern) approach is to use contexts for that purpose.

DNS

When crawling millions of web pages, the default DNS server you're using is likely to end up rejecting your requests. At that point it becomes interesting to use a more robust DNS server like Google's or Cloudflare's, or even to rotate resolution requests among several DNS servers.
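In Go this can be done with a custom net.Resolver. Here is a sketch; the round-robin rotation is a simplistic illustration, and the lookups in main obviously require network access:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolverFor returns a resolver that bypasses the system default and
// queries the given DNS server directly (e.g. Google's 8.8.8.8 or
// Cloudflare's 1.1.1.1).
func resolverFor(dnsServer string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so Dial is honored
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, dnsServer+":53")
		},
	}
}

func main() {
	// Naively rotate across several public resolvers to spread the load.
	servers := []string{"8.8.8.8", "1.1.1.1"}
	for i, host := range []string{"example.com", "example.org"} {
		r := resolverFor(servers[i%len(servers)])
		addrs, err := r.LookupHost(context.Background(), host)
		fmt.Println(host, addrs, err)
	}
}
```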

Refresh Data

Crawling data once is often of little interest. Data should be refreshed asynchronously on a regular basis using periodic tasks, or synchronously upon a user request.

A recent application I worked on refreshed data asynchronously. Every time we crawled a domain, we stored the current date in the database, and then every day a periodic task looked for all the domains in the DB that needed to be refreshed. As cron was too limited for our needs, we used this more advanced cron-like tool for Go applications: https://github.com/robfig/cron.

Being Fair

Crawling the web should be done respectfully. It basically means two things:

  • don’t crawl a web page if its robots.txt file disallows it
  • don’t hammer a single web server with tons of requests: set a very low concurrency level when you’re crawling several pages from a single domain, and pause for a moment between two consecutive requests

Conclusion

Setting up a large volume crawler is a fascinating journey which requires both coding and infrastructure considerations.

In this post I’m only scratching the surface, but hopefully some of these concepts will help you in your next project! If you have comments or questions, please don’t hesitate to reach out to me, I’ll be glad to help.

Also available in French
Build a PWA with push notifications thanks to Vue.js and Django

Setting up a Progressive Web App (PWA) is dead simple with Vue.js, especially since Vue CLI v3. However, implementing push notifications can be pretty tricky.

Vue.js will be used for the frontend, Python/Django and Django Rest Framework for the backend, and Google Firebase Messaging as the messaging intermediary. The latter is necessary as it will be the third party in charge of pushing the notifications to the device. I know it’s pretty disappointing to be forced to add such an extra layer to the stack, but there is no way around a third-party push service. Firebase is not the only option though: there are alternatives, like Pusher for example.

Firebase will have to be integrated into several parts of your code:

  • in the frontend for the browser to listen to Firebase for new notifications
  • in the frontend again on the page where you want to ask the user for his permission to enable notifications and, if he agrees, get a notification token from Firebase and send it to the backend to store it in DB. If a user uses several browsers (e.g. Chromium mobile on his smartphone, and Firefox desktop on his PC), several tokens will be associated with him in DB, and notifications will be received in several locations at the same time.
  • in the backend to receive the notification token from frontend and store it in DB
  • in the backend to send push notifications to a user by sending a message to the Firebase API. Firebase will take care of retrieving your message and routing it to the right associated browser.

Please keep in mind that the PWA standard is still evolving and not yet equally implemented in all browsers/platforms. For example push notifications are not yet implemented on iOS as of this writing!

Vue.js PWA

Install the Vue.js CLI thanks to the following npm command (install NPM first if needed):

npm i -g @vue/cli

Create a new PWA project:

vue create <My Project Name>

Select the “Manually select features” option and then select “Progressive Web App (PWA) support”:


Select all the other options you need and wait for Vue CLI to create the project. Note that Vue CLI automatically creates a registerServiceWorker.js file in the src directory and imports it at the top of your main.js. This file takes care of generating a service-worker.js at the root of your website during the production build; the latter is needed for the browser to detect your website as a PWA.

In your public directory, create a manifest.json file which describes your PWA: the name of your app, app icons for various screen sizes, colors, etc. The important fields are start_url, the URL opened by default when launching the PWA on your smartphone, and gcm_sender_id, the ID that all web apps using Firebase must use (so don’t change it). You can specify much more information in this file: just have a look at the docs, and use this helper if you like. It should look like the following:

{
  "name": "My App Name",
  "short_name": "My App Short Name",
  "icons": [{
      "src": "./img/icons/android-chrome-192x192.png",
      "sizes": "192x192",
      "type": "image/png"
    },
    {
      "src": "./img/icons/android-chrome-512x512.png",
      "sizes": "512x512",
      "type": "image/png"
    },
    {
      "src": "./img/icons/apple-touch-icon-60x60.png",
      "sizes": "60x60",
      "type": "image/png"
    },
    {
      "src": "./img/icons/apple-touch-icon-76x76.png",
      "sizes": "76x76",
      "type": "image/png"
    },
    {
      "src": "./img/icons/apple-touch-icon-120x120.png",
      "sizes": "120x120",
      "type": "image/png"
    },
    {
      "src": "./img/icons/apple-touch-icon-152x152.png",
      "sizes": "152x152",
      "type": "image/png"
    },
    {
      "src": "./img/icons/apple-touch-icon-180x180.png",
      "sizes": "180x180",
      "type": "image/png"
    },
    {
      "src": "./img/icons/apple-touch-icon.png",
      "sizes": "180x180",
      "type": "image/png"
    },
    {
      "src": "./img/icons/favicon-16x16.png",
      "sizes": "16x16",
      "type": "image/png"
    },
    {
      "src": "./img/icons/favicon-32x32.png",
      "sizes": "32x32",
      "type": "image/png"
    },
    {
      "src": "./img/icons/msapplication-icon-144x144.png",
      "sizes": "144x144",
      "type": "image/png"
    },
    {
      "src": "./img/icons/mstile-150x150.png",
      "sizes": "150x150",
      "type": "image/png"
    }
  ],
  "start_url": ".",
  "display": "standalone",
  "background_color": "#000000",
  "theme_color": "#210499",
  "gcm_sender_id": "103953800507"
}

Please note that your site should be served over HTTPS in order for the browser to read the manifest.json and behave like a PWA.

If everything goes fine, the PWA should now be easily installable on your smartphone. Visit your website with a modern mobile browser like Chrome. If the browser detects the manifest.json, it should automatically offer to install the PWA as a phone application (still not supported by all browsers as of this writing).

Firebase Set Up

In order for your PWA to support push notifications, you should pair with an external service like Firebase Cloud Messaging (FCM). Please note that FCM is a subset of Firebase but you don’t need any of the other Firebase features (like DB, hosting…).

So please create a Firebase account, go to your Firebase console, create a project for your website, and retrieve the following information from your project settings (careful: the information is spread across multiple tabs, so it’s not obvious to gather everything at once):

  • Project ID
  • Web API Key
  • Messaging Sender ID
  • Server Key
  • Public VAPID Key (create a Web Push certificate and retrieve the generated key)

Django Backend

I’m assuming that you’re using Django Rest Framework here.

In Django, use the FCM Django third party app to make your FCM integration easier (this app will take care of automatically saving and deleting notification tokens in DB, and will provide you with a helper to easily send notifications to FCM).

Install the app with pip install fcm-django, add it to your Django apps, and set it up (feel free to adapt the below settings, the only required one is FCM_SERVER_KEY for FCM authentication):

INSTALLED_APPS = (
        ...
        "fcm_django"
)

FCM_DJANGO_SETTINGS = {
        # authentication to Firebase
        "FCM_SERVER_KEY": "<Server Key>",
        # true if you want to have only one active device per registered user at a time
        # default: False
        "ONE_DEVICE_PER_USER": False,
        # devices to which notifications cannot be sent,
        # are deleted upon receiving error response from FCM
        # default: False
        "DELETE_INACTIVE_DEVICES": True,
}

Add a route in urls.py to the FCM Django endpoint that will take care of receiving the notification token and storing it in the DB:

from fcm_django.api.rest_framework import FCMDeviceAuthorizedViewSet

urlpatterns = [
  path('register-notif-token/',
    FCMDeviceAuthorizedViewSet.as_view({'post': 'create'}), name='create_fcm_device'),
]

Now whenever you want to send a push notification to a user do the following (likely to be in your views.py):

from fcm_django.models import FCMDevice

user = <Retrieve the user>
fcm_devices = FCMDevice.objects.filter(user=user)
fcm_devices.send_message(
  title="<My Title>", body="<My Body>", time_to_live=604800,
  click_action="<URL of the page that opens when clicking the notification>")

It’s up to you to adapt the query on the database to define precisely whom you want to send push notifs to. Here I’m sending push notifs to all the browsers of a user, but I could also decide to send notifs to a specific browser (called “device” in the FCM Django terminology).

There are more parameters available in the send_message method, feel free to have a look at the docs but also at the docs of the underlying Python project this library is based on.

Setting the time_to_live was necessary in my case: Firebase says a default time to live is set, but it appeared there wasn’t one when I tested it (bug?). As a result, when a notification was sent while the user’s device was turned off, it was never received once the device was turned back on.

Implementing Push Notifications in Vue.js

Create a firebase-messaging-sw.js file in your public directory and put the following inside:

importScripts('https://www.gstatic.com/firebasejs/5.5.6/firebase-app.js');
importScripts('https://www.gstatic.com/firebasejs/5.5.6/firebase-messaging.js');

var config = {
    apiKey: "<Web API Key>",
    authDomain: "<Project ID>.firebaseapp.com",
    databaseURL: "https://<Project ID>.firebaseio.com",
    projectId: "<Project ID>",
    storageBucket: "<Project ID>.appspot.com",
    messagingSenderId: "<Messaging Sender ID>"
};

firebase.initializeApp(config);

const messaging = firebase.messaging();

You now have a valid service worker which will run in the background, listening for new incoming push notifications from Firebase.

It’s time now to ask the user for his permission to send him notifications and, if he agrees, get a notification token from FCM and store it in the backend DB. Your backend will use this token to send push notifications through FCM. It’s up to you to decide on which page of your app you want to ask for the user’s permission. For example, you could implement this on the home page of your application once the user is logged in. You could do something like this:

import axios from 'axios'
import firebase from 'firebase/app'
import 'firebase/messaging'

export default {
  methods: {
    saveNotificationToken(token) {
      const registerNotifTokenURL = '/register-notif-token/'
      const payload = {
        registration_id: token,
        type: 'web'
      }
      axios.post(registerNotifTokenURL, payload)
        .then((response) => {
          console.log('Successfully saved notification token!')
          console.log(response.data)
        })
        .catch((error) => {
          console.log('Error: could not save notification token')
          if (error.response) {
            console.log(error.response.status)
            // Most of the time a "this field must be unique" error will be returned,
            // meaning that the token already exists in db, which is good.
            if (error.response.data.registration_id) {
              for (let err of error.response.data.registration_id) {
                console.log(err)
              }
            } else {
              console.log('No reason returned by backend')
            }
            // If the request could not be sent because of a network error for example
          } else if (error.request) {
            console.log('A network error occurred.')
            // For any other kind of error
          } else {
            console.log(error.message)
          }
        })
      },
    },
  mounted() {
    var config = {
      apiKey: "<Web API Key>",
      authDomain: "<Project ID>.firebaseapp.com",
      databaseURL: "https://<Project ID>.firebaseio.com",
      projectId: "<Project ID>",
      storageBucket: "<Project ID>.appspot.com",
      messagingSenderId: "<Messaging Sender ID>"
    }
    firebase.initializeApp(config)

    const messaging = firebase.messaging()

    messaging.usePublicVapidKey("<Public Vapid Key>")

    messaging.requestPermission().then(() => {
      console.log('Notification permission granted.')
      messaging.getToken().then((token) => {
        console.log('New token created: ', token)
        this.saveNotificationToken(token)
      })
    }).catch((err) => {
      console.log('Unable to get permission to notify.', err)
    })

    // Arrow functions are needed here so that `this` still refers
    // to the Vue component inside the callbacks.
    messaging.onTokenRefresh(() => {
      messaging.getToken().then((newToken) => {
        console.log('Token refreshed: ', newToken)
        this.saveNotificationToken(newToken)
      }).catch((err) => {
        console.log('Unable to retrieve refreshed token ', err)
      })
    })
  }
}

Conclusion

Setting up push notifications within a PWA is definitely NOT straightforward! Many parts of your application are involved, and you need to understand how the third party you chose (here Firebase) works.

Please keep in mind that PWAs are still pretty new and supported features are constantly evolving. More importantly, don’t rely on push notifications alone for critical information, as they’re less reliable than other channels like SMS or email…

Also, don’t forget to use push notifications carefully as notification flooding can be very annoying!

I hope you liked this how-to. Please don’t hesitate to send me feedback or add some ideas in the comments!

Also available in French
Leveraging Django Rest Framework and generic views for rapid API development

As a seasoned API developer, you end up doing very repetitive tasks, so you might be looking for tools that make your development faster. As a novice, you might be looking for a way to implement best practices and REST standards easily, out of the box, without too much hesitation.

In both cases, Django Rest Framework (DRF) is a great solution. It is a standard, widely used, and fully featured API framework that will not only save you a lot of time but also show you the right way to develop RESTful APIs. In particular, DRF proposes generic views, that is to say pre-built endpoints for your API. Let’s see how to leverage this feature for rapid API development.

I put the below code in a little working Django project right here.

Concept

DRF’s Generic views are perfect for simple APIs that basically do CRUD (create, read, update, delete) on the database without too much data processing. For example, let’s say you have a product table that contains all your store products and you want to expose these products as is to customers through an API, then it’s a perfect use case for the ListAPIView (see below).

From now on I’m assuming that you have installed Python, Django, and DRF, and that you know the basics about Django.

Basic Example 1: Reading Data

Let’s create an API endpoint showing all the products to customers. In your views.py do the following:

from rest_framework import generics
from .serializers import ProductsSerializer
from .models import Product

class GetProducts(generics.ListAPIView):
    """Return all products."""
    queryset = Product.objects.all()
    serializer_class = ProductsSerializer

ProductsSerializer is the serializer that converts your data from the database into API-friendly data. This serializer should go in serializers.py and will be in charge of retrieving data from your Product model and transforming it:

from rest_framework import serializers
from .models import Product

class ProductsSerializer(serializers.ModelSerializer):
    """Serialize products."""

    class Meta:
        model = Product
        fields = "__all__"

Now in your urls.py create the route to this endpoint:

from django.urls import path
from .views import GetProducts

urlpatterns = [
    path('get-products/', GetProducts.as_view(), name='get_products'),
]

As you can see, this is dead simple as DRF does many things for you under the hood! You now have an endpoint (/get-products/) that you can consume with GET HTTP requests, and that outputs all products in an API format (usually JSON, but it depends on your settings).

Basic Example 2: Deleting Data

Now let’s create an endpoint dedicated to deleting a product, for authenticated users only. It’s even simpler as it does not require serializing data (once the product is deleted, no data can be returned to the user anymore).

In views.py:

from rest_framework import generics, permissions
from .models import Product

class DeleteProduct(generics.DestroyAPIView):
    """Remove a product."""
    queryset = Product.objects.all()
    permission_classes = (permissions.IsAuthenticated,) # Limit to authenticated users only

In urls.py:

from django.urls import path
from .views import DeleteProduct

urlpatterns = [
    path('delete-product/<int:pk>/', DeleteProduct.as_view(), name='delete_product'),
]

Now you have a /delete-product/<id>/ endpoint that you can use to delete one product at a time using DELETE HTTP requests, and that only accepts authenticated requests (the authentication mechanism depends on your settings).

Customizing Generic Views’ Behavior

Each generic view can be customized by writing a get_queryset() method. For example let’s say you only want to show products that have an active flag set to True in db. You could do this:

from rest_framework import generics, permissions
from .serializers import ProductsSerializer
from .models import Product

class GetProducts(generics.ListAPIView):
    """Return all active products."""
    permission_classes = (permissions.IsAuthenticated,)
    serializer_class = ProductsSerializer

    def get_queryset(self):
        """Filter active products."""
        return Product.objects.filter(active=True)

get_queryset() is a common method available in every generic view. Some generic views also have their own methods to control the endpoint’s behavior more precisely. For example, let’s say that you don’t really want to delete products but just mark them as inactive. You could override the destroy() method:

from django.shortcuts import get_object_or_404
from rest_framework.response import Response
from rest_framework import generics, permissions, status
from .models import Product

class DeleteProduct(generics.DestroyAPIView):
    """Flag a product as inactive instead of deleting it."""
    queryset = Product.objects.all()
    permission_classes = (permissions.IsAuthenticated,)

    def destroy(self, request, pk):
        """
        By default, DestroyAPIView deletes the product from the db.
        Here we only want to flag it as inactive.
        """
        product = get_object_or_404(self.get_queryset(), pk=pk)
        product.active = False
        product.save()
        return Response(status=status.HTTP_204_NO_CONTENT)

In the above example we first look for the product that the user wants to delete. If we can’t find it, we return a 404 code to the user. If the product is successfully marked as inactive, we return a 204 code, meaning that the product was successfully deleted.

Generic views are perfect for simple use cases, but it’s sometimes wiser to use the classic APIView for edge cases. For example, let’s say you want not only to return products to the user but also to enrich the data with information that is not in the Product model (e.g. orders related to this product, the product manufacturer, etc.). In that case, if you wanted to use generic views, you would have to define new fields in the serializer with additional get_<field>() methods, which can easily make your serializer very ugly…

Conclusion

As you can see, DRF’s generic views make API endpoint development very easy thanks to a bit of magic under the hood. However, keep in mind that generic views cannot apply to every use case, as sometimes tweaking generic views is harder than developing things yourself from scratch!

I hope you liked this little how-to. I’d love to hear your feedback!

Also available in French
Security of a Go (Golang) website

Web frameworks are not so frequent in the Go world compared to other languages. That gives you more freedom, but web frameworks are also a great way to force people, especially beginners, to implement basic security practices.

If you’re developing Go websites or APIs without frameworks, like me, here are some security points you should keep in mind.

CSRF

Cross-Site Request Forgery (CSRF) attacks target the password-protected pages of your website which use forms. Authenticated users (via a session cookie in their browser) might post information to a protected form without knowing it when visiting a malicious website. In order to avoid this, every form should have a hidden field containing a CSRF token that the server will use to check the authenticity of the request.

Let’s use the Gorilla Toolkit for this. First, integrate the CSRF middleware. You can either do it for the whole website:

package main

import (
    "net/http"

    "github.com/gorilla/csrf"
    "github.com/gorilla/mux"
)

func main() {
    r := mux.NewRouter()

    http.ListenAndServe(":8000",
        csrf.Protect([]byte("32-byte-long-auth-key"))(r))
}

Or for some pages only:

package main

import (
    "net/http"

    "github.com/gorilla/csrf"
    "github.com/gorilla/mux"
)

func main() {
    r := mux.NewRouter()
    csrfMiddleware := csrf.Protect([]byte("32-byte-long-auth-key"))

    protectedPageRouter := r.PathPrefix("/protected-page").Subrouter()
    protectedPageRouter.Use(csrfMiddleware)
    protectedPageRouter.HandleFunc("", protectedPage).Methods("POST")

    http.ListenAndServe(":8080", r)
}

Then pass the CSRF token to your template:

func protectedPage(w http.ResponseWriter, r *http.Request) {
    var tmplData = ContactTmplData{CSRFField: csrf.TemplateField(r)}
    tmpl.Execute(w, tmplData)
}

Last of all put the hidden field {{.CSRFField}} into your template.

CORS

Cross-Origin Resource Sharing (CORS) attacks consist in sending information to a malicious website from a legitimate website. In order to mitigate this, websites should prevent users from sending asynchronous requests (XHR) to another domain. Good news: this behavior (the same-origin policy) is implemented by default in every browser! Bad news: it can lead to false positives, so if you want to consume an API located on another domain or another port from your web page, requests will be blocked by the browser. It’s often a tricky behavior for API rookies…

You should whitelist some domains in order to avoid the above problem. You can do this using the github.com/rs/cors package:

package main

import (
    "net/http"

    "github.com/rs/cors"
)

func main() {
    c := cors.New(cors.Options{
        AllowedOrigins: []string{"http://my-whitelisted-domain"},
    })
    // Wrap your router (here the default mux) with the CORS middleware
    handler := c.Handler(http.DefaultServeMux)

    http.ListenAndServe(":8080", handler)
}

HTTPS

Switching your website to HTTPS is an essential security point. Here I’m assuming that you’re using the Go built-in server. If not (maybe because you’re using Nginx or Apache), you can skip this section.

Get an A on SSLLabs.com

In order to get the best security grade on SSL Labs (meaning the certificate is perfectly configured, which avoids security warnings on some web clients), you should disable SSL and only allow TLS 1.0 and above. That’s why I’m using the crypto/tls library. In order to serve HTTPS requests we use the server’s ListenAndServeTLS method:

package main

import (
    "crypto/tls"
    "net/http"
)

func main() {
    // r is your router; tlsCertPath and tlsKeyPath point to your
    // certificate and private key files.
    config := &tls.Config{MinVersion: tls.VersionTLS10}
    server := &http.Server{Addr: ":443", Handler: r, TLSConfig: config}
    server.ListenAndServeTLS(tlsCertPath, tlsKeyPath)
}

Redirect HTTP to HTTPS

It’s good practice to force HTTP requests to HTTPS. You should use a dedicated goroutine:

package main

import (
    "crypto/tls"
    "net/http"
)

// httpsRedirect redirects http requests to https
func httpsRedirect(w http.ResponseWriter, r *http.Request) {
    http.Redirect(
        w, r,
        "https://"+r.Host+r.URL.String(),
        http.StatusMovedPermanently,
    )
}

func main() {
    go http.ListenAndServe(":80", http.HandlerFunc(httpsRedirect))

    config := &tls.Config{MinVersion: tls.VersionTLS10}
    server := &http.Server{Addr: ":443", Handler: r, TLSConfig: config}
    server.ListenAndServeTLS(tlsCertPath, tlsKeyPath)
}

Let’s Encrypt certificate renewal

Let’s Encrypt has become the most common way of provisioning TLS certificates (because it’s free, of course), but not necessarily the easiest. Once Let’s Encrypt has been installed and the first certs have been provisioned, the question is how to renew them automatically with Certbot. Certbot does not integrate with the Go HTTP server, so you have to use Certbot’s standard version (this one for Ubuntu 18.04, for example), and then briefly (a couple of seconds) turn off the production server during the certificate renewal (in order to avoid conflicts on ports 80 and 443). This can be done by modifying the renewal command launched by Certbot’s cron job (in /etc/cron.d/certbot). On Ubuntu, Certbot also uses a systemd timer (as a first choice, rather than cron), so it’s better to modify the /lib/systemd/system/certbot.service config file:

[Unit]
Description=Certbot
Documentation=file:///usr/share/doc/python-certbot-doc/html/index.html
Documentation=https://letsencrypt.readthedocs.io/en/latest/
[Service]
Type=oneshot
# Proper command to stop server before renewal and restart server afterwards
ExecStart=/usr/bin/certbot -q renew --pre-hook "command to stop go server" --post-hook "command to start go server"
PrivateTmp=true

One interesting alternative is to use a Go library dedicated to certificate renewal inside your program: x/crypto/acme/autocert. Personally I’m not sure I like this solution because, even though it avoids the downtime of my solution, it means your code is strongly coupled to a specific type of certificate renewal (ACME).

XSS

Cross-site scripting (XSS) attacks consist in executing some Javascript code in the user’s browser when it should not happen. For example, in the case of a forum, if a user posts a message containing some Javascript and you display this message to every user without any filtering, the script will execute in every visitor’s browser. To mitigate this, strings should be “escaped” before being displayed, so they are rendered harmless.

Good news: when you’re using templates via the html/template lib, Go ensures strings are automatically escaped. Be careful though: text/template does not escape strings!

SQL Injections

SQL injection attacks consist in submitting malicious data containing SQL through an HTML form. When this data is inserted into the database or retrieved from it, the malicious code can cause damage.

Once again, good news: this attack is easily mitigated if you use the standard SQL libraries correctly. For example, if you’re using database/sql, values are automatically escaped when you use the $ or ? placeholders. Here’s an SQL request properly written for PostgreSQL:

db.Exec("INSERT INTO users(name, email) VALUES($1, $2)",
  "Julien Salinas", "julien@salinas.com")

Personally I’m using the PostgreSQL optimized ORM github.com/go-pg/pg. A properly formed request in that case would be:

// Fields must be exported so go-pg can map them to columns
user := &User{
    Name:  "Julien Salinas",
    Email: "julien@salinas.com",
}
err = db.Insert(user)

Directory Listing

Directory listing means that anyone can list all the files of a given directory on the website. It might disclose documents to users who should not see them without knowing the exact URL. Directory listing is enabled by default if you’re using the standard lib http.FileServer. I’m explaining in this article how to disable this behavior.

Conclusion

That was a short overview of the major security points for your Go website.

It might be a good idea to use a tool that lets you easily set various essential security-related settings. In my opinion github.com/unrolled/secure is a great lib in this regard. It allows you to easily set up HTTPS redirects, handle CORS, and also filter authorized hosts and many other complex things.

I hope these basic tips helped some of you!

Developing and deploying a whole website in Go (Golang)

In my opinion Go (Golang) is a great choice for web development:

  • it makes non-blocking requests and concurrency easy
  • it makes code testing and deployment easy as it does not require any special runtime environment or dependencies (making containerization pretty useless here)
  • it does not require any additional frontend HTTP server like Apache or Nginx as it already ships with a very good one in its standard library
  • it does not force you to use a web framework as everything needed for web development is ready to use in the std lib

A couple of years back the lack of libraries and tutorials around Go could have been a problem, but today it is not anymore. Let me show you the steps to build a website in Go and deploy it to your Linux server from A to Z.

The Basics

Let’s say you are developing a basic HTML page called love-mountains. As you might already know, the rendering of love-mountains is done in a handler function, and you launch a web server with a route pointing to that function. It is good practice to use HTML templates in web development, so let’s render the page through a basic template here. It is also good practice to load parameters, like the path to the templates directory, from environment variables for better flexibility.

Here is your Go code:

package main

import (
    "html/template"
    "net/http"
    "os"
    "path/filepath"
)

// Get path to template directory from env variable
var templatesDir = os.Getenv("TEMPLATES_DIR")

// loveMountains renders the love-mountains page after passing some data to the HTML template
func loveMountains(w http.ResponseWriter, r *http.Request) {
    // Build path to template
    tmplPath := filepath.Join(templatesDir, "love-mountains.html")
    // Load template from disk
    tmpl := template.Must(template.ParseFiles(tmplPath))
    // Inject data into template
    data := "La Chartreuse"
    tmpl.Execute(w, data)
}

func main() {
    // Create route to love-mountains web page
    http.HandleFunc("/love-mountains", loveMountains)
    // Launch web server on port 80
    http.ListenAndServe(":80", nil)
}

Retrieving dynamic data in a template is easily achieved with {{.}}. Here is your love-mountains.html template:

<h1>I Love Mountains</h1>
<p>The mountain I prefer is {{.}}</p>

HTTPS

Nowadays, implementing HTTPS on your website has become almost compulsory. How can you switch your Go website to HTTPS?

Linking TLS Certificates

Firstly, issue your certificate and private key in .pem format. You can issue them by yourself with openssl (but you will end up with a self-signed certificate that triggers a warning in the browser) or you can order your cert from a trusted third-party like Let’s Encrypt. Personally, I am using Let’s Encrypt and Certbot to issue certificates and renew them automatically on my servers. More info about how to use Certbot here.

Then you should tell Go where your cert and private keys are located. I am loading the paths to these files from environment variables.

We are now using the ListenAndServeTLS function instead of the mere ListenAndServe:

[...]

// Load TLS cert info from env variables
var tlsCertPath = os.Getenv("TLS_CERT_PATH")
var tlsKeyPath = os.Getenv("TLS_KEY_PATH")

[...]

func main() {
    [...]
    // Serve HTTPS on port 443
    http.ListenAndServeTLS(":443", tlsCertPath, tlsKeyPath, nil)
}

Forcing HTTPS Redirection

For the moment we have a website listening on both ports 80 and 443. It would be nice to automatically redirect users from port 80 to port 443 with a 301 redirect. We need to spawn a new goroutine dedicated to redirecting http:// to https:// (the principle is similar to what you would do in a frontend server like Nginx). Here is how to do it:

[...]

// httpsRedirect redirects HTTP requests to HTTPS
func httpsRedirect(w http.ResponseWriter, r *http.Request) {
    http.Redirect(
        w, r,
        "https://"+r.Host+r.URL.String(),
        http.StatusMovedPermanently,
    )
}

func main() {
    [...]
    // Catch potential HTTP requests and redirect them to HTTPS
    go http.ListenAndServe(":80", http.HandlerFunc(httpsRedirect))
    // Serve HTTPS on port 443
    http.ListenAndServeTLS(":443", tlsCertPath, tlsKeyPath, nil)
}

Static Assets

Serving static assets (images, videos, Javascript files, CSS files, and so on) stored on disk is fairly easy, but disabling directory listing is a bit hacky.

Serving Files from Disk

In Go, the most secure way to serve files from disk is to use http.FileServer. For example, let’s say we store static files in a static folder on disk and want to serve them at https://my-website/static. Here is how to do it:

[...]
http.Handle("/", http.FileServer(http.Dir("static")))
[...]

Preventing Directory Listing

By default, http.FileServer performs a full directory listing, meaning that https://my-website/static will display the list of all your static assets. We don’t want that, for security and intellectual property reasons.

Disabling directory listing requires creating a custom file system. Let’s create a struct that implements the http.FileSystem interface; it needs an Open() method to satisfy the interface. Open() first checks whether the path exists, and if so, whether it is a file or a directory. If it is a directory, we return a “file does not exist” error, which is eventually converted into a 404 HTTP error for the user. This way, the user cannot tell whether they reached an existing directory or not.

Once again, let’s retrieve the path to static assets directory from an environment variable.

[...]

// Get path to static assets directory from env variable
var staticAssetsDir = os.Getenv("STATIC_ASSETS_DIR")

// neuteredFileSystem is used to prevent directory listing of static assets
type neuteredFileSystem struct {
    fs http.FileSystem
}

func (nfs neuteredFileSystem) Open(path string) (http.File, error) {
    // Check if path exists
    f, err := nfs.fs.Open(path)
    if err != nil {
        return nil, err
    }

    // If the path exists, check whether it is a file or a directory.
    // If it is a directory, return a "file does not exist" error, so
    // the user gets a 404 both for paths that do not exist and for
    // directories that do exist.
    s, err := f.Stat()
    if err != nil {
        return nil, err
    }
    if s.IsDir() {
        // Close the handle before returning to avoid leaking it
        f.Close()
        return nil, os.ErrNotExist
    }

    // If file exists and the path is not a directory, let's return the file
    return f, nil
}

func main() {
    [...]
    // Serve static files while preventing directory listing
    mux := http.NewServeMux()
    fs := http.FileServer(neuteredFileSystem{http.Dir(staticAssetsDir)})
    mux.Handle("/", fs)
    [...]
}

Full Example

Eventually, your whole website would look like the following:

package main

import (
    "html/template"
    "log"
    "net/http"
    "os"
    "path/filepath"
)

var staticAssetsDir = os.Getenv("STATIC_ASSETS_DIR")
var templatesDir = os.Getenv("TEMPLATES_DIR")
var tlsCertPath = os.Getenv("TLS_CERT_PATH")
var tlsKeyPath = os.Getenv("TLS_KEY_PATH")

// neuteredFileSystem is used to prevent directory listing of static assets
type neuteredFileSystem struct {
    fs http.FileSystem
}

func (nfs neuteredFileSystem) Open(path string) (http.File, error) {
    // Check if path exists
    f, err := nfs.fs.Open(path)
    if err != nil {
        return nil, err
    }

    // If the path exists, check whether it is a file or a directory.
    // If it is a directory, return a "file does not exist" error, so
    // the user gets a 404 both for paths that do not exist and for
    // directories that do exist.
    s, err := f.Stat()
    if err != nil {
        return nil, err
    }
    if s.IsDir() {
        // Close the handle before returning to avoid leaking it
        f.Close()
        return nil, os.ErrNotExist
    }

    // If file exists and the path is not a directory, let's return the file
    return f, nil
}

// loveMountains renders the love-mountains page after passing some data to the HTML template
func loveMountains(w http.ResponseWriter, r *http.Request) {
    // Build path to template and load it from disk
    tmpl := template.Must(template.ParseFiles(filepath.Join(templatesDir, "love-mountains.html")))
    // Inject data into template
    data := "Any dynamic data"
    tmpl.Execute(w, data)
}

// httpsRedirect redirects http requests to https
func httpsRedirect(w http.ResponseWriter, r *http.Request) {
    http.Redirect(
        w, r,
        "https://"+r.Host+r.URL.String(),
        http.StatusMovedPermanently,
    )
}

func main() {
    // http to https redirection
    go http.ListenAndServe(":80", http.HandlerFunc(httpsRedirect))

    // Serve static files while preventing directory listing
    mux := http.NewServeMux()
    fs := http.FileServer(neuteredFileSystem{http.Dir(staticAssetsDir)})
    mux.Handle("/", fs)

    // Serve one page site dynamic pages
    mux.HandleFunc("/love-mountains", loveMountains)

    // Launch TLS server
    log.Fatal(http.ListenAndServeTLS(":443", tlsCertPath, tlsKeyPath, mux))
}

Plus your love-mountains.html template:

<h1>I Love Mountains</h1>
<p>The mountain I prefer is {{.}}</p>

Testing, Deploying and Daemonizing with Systemd

Having a solid and easy test/deploy process is very important from an efficiency standpoint, and Go really helps in this regard. Go compiles everything into a single executable, including all dependencies (except templates, but these are not real dependencies, and keeping them apart is better for flexibility anyway). Go also ships with its own production-grade HTTP server, so there is no need to install Nginx or Apache. It is thus fairly easy to test your application locally and make sure it is equivalent to your production website on the server (not talking about data persistence here of course). No need to add a container system like Docker to your build/deploy workflow!

Testing

To test your application locally, compile your Go binary and launch it with the proper environment variables like this:

TEMPLATES_DIR=/local/path/to/templates/dir \
STATIC_ASSETS_DIR=/local/path/to/static/dir \
TLS_CERT_PATH=/local/path/to/cert.pem \
TLS_KEY_PATH=/local/path/to/privkey.pem \
./my_go_website

That’s it! Your website is now reachable at https://127.0.0.1 in your browser.

Deploying

Deployment is just about copying your Go binary to the server (plus your templates, static assets, and certs, if needed). A simple tool like scp is perfect for that. You could also use rsync for more advanced needs.

Daemonizing your App with Systemd

You could launch your website on the server by just issuing the above command, but it is much better to launch it as a service (daemon), so your Linux system automatically starts it on boot (in case of a server restart) and restarts it in case your app crashes. On modern Linux distros, the best way to do so is with systemd, the default tool for managing system services. Nothing to install!

Let’s assume you put your Go binary in /var/www on your server. Create a new file describing your service in the systemd directory: /etc/systemd/system/my_go_website.service. Now put the following content inside:

[Unit]
Description=my_go_website
After=network.target auditd.service

[Service]
EnvironmentFile=/var/www/env
ExecStart=/var/www/my_go_website
KillMode=process
Restart=always
RestartPreventExitStatus=255
Type=simple

[Install]
WantedBy=multi-user.target

The EnvironmentFile directive points to an env file where you can put all your environment variables. systemd takes care of loading it and passing env vars to your program. I put it in /var/www but feel free to put it somewhere else. Here is what your env file would look like:

TEMPLATES_DIR=/remote/path/to/templates/dir
STATIC_ASSETS_DIR=/remote/path/to/static/dir
TLS_CERT_PATH=/remote/path/to/cert.pem
TLS_KEY_PATH=/remote/path/to/privkey.pem

Feel free to read more about systemd for more details about the config above.

Now:

  • enable your app so it launches on boot: systemctl enable my_go_website
  • start your website right now: systemctl start my_go_website
  • restart with: systemctl restart my_go_website
  • stop with: systemctl stop my_go_website

Replacing Javascript with WebAssembly (Wasm)

Here is a bonus section in case you are feeling adventurous!

As of Go version 1.11, you can compile Go to WebAssembly (Wasm). More details here. This is very cool, as Wasm can work as a substitute for Javascript: in theory, you can replace Javascript with Go through Wasm.

Wasm is supported in modern browsers but this is still pretty experimental. Personally I would only do this as a proof of concept for the moment, but in the mid term it might become a great way to develop your whole stack in Go. Let’s wait and see!

Conclusion

Now you know how to develop a whole website in Go and deploy it on a Linux server. No frontend server to install, no dependency hell, and great performance… Pretty straightforward, isn’t it?

If you want to learn how to build a Single Page App (SPA) with Go and Vue.js, have a look at my other post here.
