November 26, 2017 | Julien Salinas | Reading time ~6 minutes

REST API fetching: Go vs Python

APIs are everywhere today. Imagine you want to find business prospect information based on an email address: there is an API for this. Need to geocode an ugly postal address? There is an API for that. Would you like to make a payment? There are multiple APIs for that too, of course. As a developer, I regularly consume external APIs using either Python or Go. The two approaches are quite different, so let’s compare them here on a slightly unusual case: JSON data sent through a POST request body.

A real life example

Recently, I used the NameAPI.org API, which splits a full name into first name and last name and determines the gender of the person.

In order to use their API you have to send JSON data in the request body through POST. Moreover, the request Content-Type should be set to application/json. This is a slightly tricky case, since POST data is most often sent as form data, in which case the Content-Type is application/x-www-form-urlencoded (or multipart/form-data when files are uploaded). Here the body is a raw JSON document instead, and the header has to say so.

Here is the JSON data we want to send:

{
  "inputPerson" : {
    "type" : "NaturalInputPerson",
    "personName" : {
      "nameFields" : [ {
        "string" : "Petra",
        "fieldType" : "GIVENNAME"
      }, {
        "string" : "Meyer",
        "fieldType" : "SURNAME"
      } ]
    },
    "gender" : "UNKNOWN"
  }
}

We could send an equivalent request (passing the full name as a single FULLNAME field) pretty simply using cURL:

curl -H "Content-Type: application/json" \
-X POST \
-d '{"inputPerson":{"type":"NaturalInputPerson","personName":{"nameFields":[{"string":"Petra Meyer","fieldType":"FULLNAME"}]}}}' \
"http://rc50-api.nameapi.org/rest/v5.0/parser/personnameparser?apiKey=<API-KEY>"

And here is NameAPI.org’s response (JSON):

{
"matches" : [ {
  "parsedPerson" : {
    "personType" : "NATURAL",
    "personRole" : "PRIMARY",
    "mailingPersonRoles" : [ "ADDRESSEE" ],
    "gender" : {
      "gender" : "MALE",
      "confidence" : 0.9111111111111111
    },
    "addressingGivenName" : "Petra",
    "addressingSurname" : "Meyer",
    "outputPersonName" : {
      "terms" : [ {
        "string" : "Petra",
        "termType" : "GIVENNAME"
      },{
        "string" : "Meyer",
        "termType" : "SURNAME"
      } ]
    }
  },
  "parserDisputes" : [ ],
  "likeliness" : 0.926699401733102,
  "confidence" : 0.7536487758945387
} ]
}

Now let’s see how to do this in Go and Python!

Go implementation

Code

/*
Fetch the NameAPI.org REST API and turn JSON response into a Go struct.

Sent data must be JSON encoded into the request body.
The request's 'Content-Type' header must be set to 'application/json'.
*/

package main

import (
    "encoding/json"
    "io/ioutil"
    "log"
    "net/http"
    "strings"
)

// url of the NameAPI.org endpoint:
const (
    url = "http://rc50-api.nameapi.org/rest/v5.0/parser/personnameparser?" +
        "apiKey=<API-KEY>"
)

func main() {

    // JSON string to be sent to NameAPI.org:
    jsonString := `{
        "inputPerson": {
            "type": "NaturalInputPerson",
            "personName": {
                "nameFields": [
                    {
                        "string": "Petra",
                        "fieldType": "GIVENNAME"
                    }, {
                        "string": "Meyer",
                        "fieldType": "SURNAME"
                    }
                ]
            },
            "gender": "UNKNOWN"
        }
    }`
    // Wrap the JSON string in an io.Reader (expected by NewRequest):
    jsonBody := strings.NewReader(jsonString)

    // Build the request explicitly so that we can set its headers,
    // here Content-Type to 'application/json':
    client := &http.Client{}
    req, err := http.NewRequest("POST", url, jsonBody)
    if err != nil {
        log.Println("Could not build request:", err)
        return
    }
    req.Header.Add("Content-Type", "application/json")
    resp, err := client.Do(req)
    if err != nil {
        log.Println("Network error:", err)
        return
    }
    // Always close the body once we are done with the response:
    defer resp.Body.Close()
    if resp.StatusCode != 200 {
        log.Println("Bad HTTP status code:", resp.StatusCode)
        return
    }

    // Create structs dedicated to receiving the fetched
    // JSON content (only the fields we need):
    type Level5 struct {
        String   string `json:"string"`
        TermType string `json:"termType"`
    }
    type Level41 struct {
        Gender     string  `json:"gender"`
        Confidence float64 `json:"confidence"`
    }
    type Level42 struct {
        Terms []Level5 `json:"terms"`
    }
    type Level3 struct {
        Gender           Level41 `json:"gender"`
        OutputPersonName Level42 `json:"outputPersonName"`
    }
    type Level2 struct {
        ParsedPerson Level3 `json:"parsedPerson"`
    }
    type RespContent struct {
        Matches []Level2 `json:"matches"`
    }

    // Decode fetched JSON and put it into respContent:
    respContentBytes, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Println("Could not read response body:", err)
        return
    }
    var respContent RespContent
    err = json.Unmarshal(respContentBytes, &respContent)
    if err != nil {
        log.Println("Could not decode JSON:", err)
        return
    }
    log.Println(respContent)

}

Explanations

As you can see, we’re facing 2 painful problems with Go:

the http lib is more verbose when it comes to sending JSON data in the request body and setting headers. Go’s documentation is not very clear on this. Here we create an http.Client, build the request with the NewRequest() function, set the header with req.Header.Add("Content-Type", "application/json") and trigger the request with client.Do(req). Note that for this specific case the shorter http.Post would also work, since it takes the content type as its second argument; the NewRequest approach becomes mandatory as soon as you need to set any other header.
decoding the returned JSON into Go data (called unmarshalling in Go) is pretty long and boring. It’s due to the fact that, Go being a statically typed language, we need to know in advance what the returned JSON will look like. Thus we need to create a dedicated struct that maps the JSON’s structure and receives the data. For a nested JSON like the one returned by NameAPI.org, mixing arrays and maps, this is quite fiddly. Fortunately, our struct does not need to map the whole JSON but only the fields we need. Another approach, if we have no idea what the returned JSON will look like, would be to decode it into generic types such as map[string]interface{} and check the types at runtime. Here is a good article on this.
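To give an idea of what decoding without a dedicated struct looks like, here is a minimal sketch. It assumes a trimmed-down sample modeled on the response shown earlier; the firstGender helper is a hypothetical name, not part of any library. Every level has to be type-asserted by hand, which is why the struct approach is usually preferable when the shape is known.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
)

// firstGender (hypothetical helper) digs the gender of the first match
// out of a NameAPI-style response without declaring any struct.
func firstGender(raw []byte) (string, error) {
    // Decode into a generic map: values are interface{} until asserted.
    var doc map[string]interface{}
    if err := json.Unmarshal(raw, &doc); err != nil {
        return "", err
    }
    matches, ok := doc["matches"].([]interface{})
    if !ok || len(matches) == 0 {
        return "", errors.New("no matches in response")
    }
    match, ok := matches[0].(map[string]interface{})
    if !ok {
        return "", errors.New("unexpected type for match")
    }
    person, ok := match["parsedPerson"].(map[string]interface{})
    if !ok {
        return "", errors.New("no parsedPerson in match")
    }
    gender, ok := person["gender"].(map[string]interface{})
    if !ok {
        return "", errors.New("no gender in parsedPerson")
    }
    value, ok := gender["gender"].(string)
    if !ok {
        return "", errors.New("gender is not a string")
    }
    return value, nil
}

func main() {
    // Sample data (assumption): a shortened version of the real response.
    sample := []byte(`{"matches":[{"parsedPerson":{"gender":{"gender":"MALE","confidence":0.91}}}]}`)
    gender, err := firstGender(sample)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(gender)
}
```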

The jsonString input is already a string here. But for a proper comparison with Python, it should have been a struct that we would have turned into a string. I just did not want to make this script too long for the blog.
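For reference, building the request body from a struct instead of a raw string could look roughly like the sketch below. The type names (NameField, PersonName, InputPerson, Payload) and the buildPayload helper are my own choices for illustration; only the JSON field names come from the payload shown earlier.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// Types mirroring the request JSON (names are illustrative assumptions).
type NameField struct {
    String    string `json:"string"`
    FieldType string `json:"fieldType"`
}

type PersonName struct {
    NameFields []NameField `json:"nameFields"`
}

type InputPerson struct {
    Type       string     `json:"type"`
    PersonName PersonName `json:"personName"`
    Gender     string     `json:"gender"`
}

type Payload struct {
    InputPerson InputPerson `json:"inputPerson"`
}

// buildPayload (hypothetical helper) returns the JSON request body
// for a given name and surname.
func buildPayload(givenName, surname string) (string, error) {
    payload := Payload{
        InputPerson: InputPerson{
            Type: "NaturalInputPerson",
            PersonName: PersonName{
                NameFields: []NameField{
                    {String: givenName, FieldType: "GIVENNAME"},
                    {String: surname, FieldType: "SURNAME"},
                },
            },
            Gender: "UNKNOWN",
        },
    }
    // Marshal turns the struct into JSON bytes, following the tags.
    body, err := json.Marshal(payload)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    body, err := buildPayload("Petra", "Meyer")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(body)
}
```

The resulting string can then be wrapped with strings.NewReader and passed to http.NewRequest exactly like jsonString in the script above.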

Python implementation

Code

"""
Fetch the NameAPI.org REST API and turn JSON response into Python dict.

Sent data must be JSON encoded into the request body.
The request's 'Content-Type' header must be set to 'application/json'.
"""

import requests

# url of the NameAPI.org endpoint:
url = (
    "http://rc50-api.nameapi.org/rest/v5.0/parser/personnameparser?"
    "apiKey=<API-KEY>"
)

# Dict of data to be sent to NameAPI.org:
payload = {
    "inputPerson": {
        "type": "NaturalInputPerson",
        "personName": {
            "nameFields": [
                {
                    "string": "Petra",
                    "fieldType": "GIVENNAME"
                }, {
                    "string": "Meyer",
                    "fieldType": "SURNAME"
                }
            ]
        },
        "gender": "UNKNOWN"
    }
}

# Proceed, only if no error:
try:
    # Send request to NameAPI.org by doing the following:
    # - make a POST HTTP request
    # - encode the Python payload dict to JSON
    # - pass the JSON to request body
    # - set header's 'Content-Type' to 'application/json' (a dict passed
    #   as 'data' would be sent form-encoded instead)
    resp = requests.post(url, json=payload)
    resp.raise_for_status()
    # Decode JSON response into a Python dict:
    resp_dict = resp.json()
    print(resp_dict)
except requests.exceptions.HTTPError as e:
    print("Bad HTTP status code:", e)
except requests.exceptions.RequestException as e:
    print("Network error:", e)

Explanations

The Python Requests library is an amazing library that saves us a lot of time here compared to Go! In one line, resp = requests.post(url, json=payload), almost everything is done under the hood:

build a POST HTTP request
encode the Python payload dictionary to JSON
pass the JSON to the request body
set the 'Content-Type' header to 'application/json' thanks to the json keyword argument (a dict passed through the data argument would be sent as 'application/x-www-form-urlencoded' instead)
send the request

Decoding of returned JSON is also a one-liner: resp_dict = resp.json(). No need to create a complicated data structure in advance here!

Conclusion

Python is clearly the winner. Python’s simplicity combined with its huge set of libraries saves us a lot of development time!

We’re not dealing with performance here of course. If you’re looking for a high-performance API fetcher using concurrency, Go could be a great choice. But simplicity and performance are not good friends as you can see…

Feel free to comment, I would be glad to hear your opinion on this!

Existe aussi en français | También existe en Español

November 19, 2017 | Julien Salinas | Reading time ~5 minutes

How to speed up web scraping with Go (Golang) and concurrency?

I’ve been developing Python web scrapers for years now. Python’s simplicity is great for quick prototyping, and many amazing libraries can help you build a scraper and a result parser (Requests, Beautiful Soup, Scrapy, …). Yet once you start looking into your scraper’s performance, Python can be somewhat limited and Go is a great alternative!

Why Go?

When you’re trying to speed up information fetching from the Web (for HTML scraping or even for mere API consumption), there are 2 possible ways to optimize:

speed up the web resource download (e.g. download http://example.com/hello.html)
speed up the parsing of the information you retrieved (e.g. get all urls available in hello.html)

Parsing can be improved either by reworking your code, or using a more efficient parser like lxml, or allocating more resources to your scraper. Still, parsing optimization is often negligible compared to the real bottleneck, namely network access (i.e. web page downloading).

Consequently, the solution is to download the web resources in parallel. This is where Go is a great help!

Concurrent programming is a very complicated field, and Go makes it pretty easy. Go is a modern language which was created with concurrency in mind. Python, on the other hand, is an older language, and writing a concurrent web scraper in Python can be tricky, even if Python has improved a lot in this regard recently.

Go has other advantages, but let’s talk about them in another article!

Install Go

I already wrote a short tutorial about how to install Go on Ubuntu.

If you need to install Go on another platform, feel free to read the official docs.

A simple concurrent scraper

Our scraper will basically try to download each web page from a list we give it, and check that it gets a 200 HTTP status code (meaning the server returned the page without an error). We’re not dealing with parsing the HTML results here, since the goal is to focus on the critical point: improving network access performance. That part is yours to write!

Final code

/*
Open a series of urls.

Check status code for each url and store urls I could not
open in a dedicated array.
Fetch urls concurrently using goroutines.
*/

package main

import (
    "fmt"
    "net/http"
)

// -------------------------------------

// Custom user agent.
const (
    userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) " +
        "AppleWebKit/537.36 (KHTML, like Gecko) " +
        "Chrome/53.0.2785.143 " +
        "Safari/537.36"
)

// -------------------------------------

// fetchUrl opens a url with GET method and sets a custom user agent.
// If url cannot be opened, then log it to a dedicated channel.
func fetchUrl(url string, chFailedUrls chan string, chIsFinished chan bool) {

    // Inform the channel chIsFinished that url fetching is done (no
    // matter whether successful or not). Defer triggers only once
    // we leave fetchUrl():
    defer func() {
        chIsFinished <- true
    }()

    // Open url.
    // Need to use http.Client in order to set a custom user agent:
    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        chFailedUrls <- url
        return
    }
    req.Header.Set("User-Agent", userAgent)
    resp, err := client.Do(req)

    // If url could not be opened, we inform the channel chFailedUrls:
    if err != nil {
        chFailedUrls <- url
        return
    }
    // Close the body so the underlying connection can be reused:
    defer resp.Body.Close()
    if resp.StatusCode != 200 {
        chFailedUrls <- url
    }

}

func main() {

    // Create a random urls list just as an example:
    urlsList := [10]string{
        "http://example1.com",
        "http://example2.com",
        "http://example3.com",
        "http://example4.com",
        "http://example5.com",
        "http://example10.com",
        "http://example20.com",
        "http://example30.com",
        "http://example40.com",
        "http://example50.com",
    }

    // Create 2 channels, 1 to track urls we could not open
    // and 1 to inform url fetching is done:
    chFailedUrls := make(chan string)
    chIsFinished := make(chan bool)

    // Open all urls concurrently using the 'go' keyword:
    for _, url := range urlsList {
        go fetchUrl(url, chFailedUrls, chIsFinished)
    }

    // Receive messages from every concurrent goroutine. If
    // an url fails, we log it to failedUrls array:
    failedUrls := make([]string, 0)
    for i := 0; i < len(urlsList); {
        select {
        case url := <-chFailedUrls:
            failedUrls = append(failedUrls, url)
        case <-chIsFinished:
            i++
        }
    }

    // Print all urls we could not open:
    fmt.Println("Could not fetch these urls: ", failedUrls)

}

Explanations

This code is a bit longer than what we could write in a language like Python, but as you can see it is still very reasonable. Go is a statically typed language, so we need a few more lines dedicated to variable declarations. But measure how long the script takes to run, and you’ll see how rewarding it is!
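To get a feel for the difference without hitting real servers, here is a small sketch that compares a sequential loop with goroutines. It is only an illustration: fakeFetch is a made-up stand-in that sleeps instead of downloading, and the helper names and delays are arbitrary assumptions.

```go
package main

import (
    "fmt"
    "time"
)

// fakeFetch stands in for a real download: it only sleeps for delayMs.
func fakeFetch(delayMs int) {
    time.Sleep(time.Duration(delayMs) * time.Millisecond)
}

// sequentialMs runs n fake downloads one after another and returns
// the elapsed time in milliseconds.
func sequentialMs(n, delayMs int) int64 {
    start := time.Now()
    for i := 0; i < n; i++ {
        fakeFetch(delayMs)
    }
    return int64(time.Since(start) / time.Millisecond)
}

// concurrentMs runs n fake downloads in n goroutines, waits for all
// of them through a channel, and returns the elapsed time in milliseconds.
func concurrentMs(n, delayMs int) int64 {
    start := time.Now()
    done := make(chan bool)
    for i := 0; i < n; i++ {
        go func() {
            fakeFetch(delayMs)
            done <- true
        }()
    }
    // Receive exactly one message per goroutine:
    for i := 0; i < n; i++ {
        <-done
    }
    return int64(time.Since(start) / time.Millisecond)
}

func main() {
    fmt.Println("sequential (ms):", sequentialMs(10, 100))
    fmt.Println("concurrent (ms):", concurrentMs(10, 100))
}
```

With 10 simulated downloads of 100 ms each, the sequential version needs at least one second, while the concurrent one should finish in roughly the time of a single download.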

We chose 10 random urls as an example.

Here, the magical keywords enabling us to use concurrency are go, chan, and select:

go creates a new goroutine, which means fetchUrl will be executed within a new concurrent goroutine each time.
chan is the type representing a channel. Channels help us communicate among goroutines (main being a goroutine itself as well).
select ... case is a switch ... case dedicated to receiving messages sent through channels. The program loops here until every goroutine has sent its chIsFinished message (along the way, a goroutine may also send a message saying that url fetching failed).

We could have made this scraper without any channel, that is to say create goroutines and not expect a message from them in return (for instance if every goroutine ends up storing information in a database). Be careful in such a case: main does not wait for other goroutines, and when main returns the whole program exits, even if some goroutines are still working. So main still needs a way to wait for them, for example a sync.WaitGroup. In real life, though, it is almost always necessary to use channels in order to make our goroutines talk to each other.
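When goroutines have nothing to send back, main still has to wait for them before returning, and one common way to do so is sync.WaitGroup. Here is a minimal sketch of that idea; fetchAll and its fetch parameter are hypothetical names, and fetch is a stand-in for the real HTTP call so that the example runs without network access.

```go
package main

import (
    "fmt"
    "sort"
    "sync"
)

// fetchAll (hypothetical helper) calls fetch on every url in its own
// goroutine and returns the sorted list of urls for which fetch
// returned false. No channel is involved.
func fetchAll(urls []string, fetch func(string) bool) []string {
    var wg sync.WaitGroup
    var mu sync.Mutex // protects failed, which is shared by all goroutines
    failed := make([]string, 0)

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            if !fetch(u) {
                mu.Lock()
                failed = append(failed, u)
                mu.Unlock()
            }
        }(url)
    }

    // Block until every goroutine has called Done:
    wg.Wait()
    sort.Strings(failed)
    return failed
}

func main() {
    urls := []string{
        "http://example1.com",
        "http://example2.com",
        "http://example3.com",
    }
    // Pretend that only example2 fails (assumption for the demo):
    failed := fetchAll(urls, func(u string) bool {
        return u != "http://example2.com"
    })
    fmt.Println("Could not fetch these urls:", failed)
}
```

Without the call to wg.Wait(), main could return before any goroutine has finished, and the program would exit with the work left undone.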

Don’t forget to limit speed

Here speed is our goal. This is not a problem because every url we scrape is different. However, if you need to scrape the same urls multiple times (as in API consumption, for example), you’ll probably have to stay under a certain number of requests per second. In this case, you’ll have to set up a counter or a rate limiter (maybe we’ll talk about it in another article!).
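As a rough idea of what such a limit could look like, the sketch below uses time.Ticker from the standard library to start at most a given number of goroutines per second. The runThrottled helper and its parameters are assumptions made for the example (perSecond must be greater than zero), and work stands in for the real request.

```go
package main

import (
    "fmt"
    "time"
)

// runThrottled (hypothetical helper) starts one goroutine per task,
// but never more than perSecond new goroutines per second.
// It returns the number of completed tasks and the elapsed time in
// milliseconds. perSecond must be greater than zero.
func runThrottled(tasks, perSecond int, work func(int)) (int, int64) {
    ticker := time.NewTicker(time.Second / time.Duration(perSecond))
    defer ticker.Stop()

    start := time.Now()
    done := make(chan bool)
    for i := 0; i < tasks; i++ {
        // Wait for the next free slot before starting a request:
        <-ticker.C
        go func(n int) {
            work(n)
            done <- true
        }(i)
    }

    // Wait for every goroutine to finish:
    completed := 0
    for completed < tasks {
        <-done
        completed++
    }
    return completed, int64(time.Since(start) / time.Millisecond)
}

func main() {
    // 5 fake requests, at most 10 per second:
    n, ms := runThrottled(5, 10, func(i int) {
        fmt.Println("request", i)
    })
    fmt.Println(n, "requests done in", ms, "ms")
}
```

This only limits how fast requests are started, not how many run at the same time; a real scraper may need both.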

Happy scraping!

Existe aussi en français | También existe en Español

November 17, 2017 | Julien Salinas | Reading time ~1 minute

Install Go (Golang) 1.9 on Ubuntu 17.10

Here is a little memo for those who wish to install Go (1.9) on their Ubuntu machine (17.10). As a reminder, Go is a compiled language, meaning there is no need to install Go on the machine where your application will run.

Update repositories and security patches, just in case:

sudo apt-get update
sudo apt-get -y upgrade

Download and install Go:

curl -O https://storage.googleapis.com/golang/go1.9.linux-amd64.tar.gz  # Download archive. Change the archive's name if you need another version of Go or another system architecture
tar -xvf go1.9.linux-amd64.tar.gz  # Extract archive
sudo mv go /usr/local  # Move binaries to /usr/local
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.profile  # Update our bash profile so Go is in the PATH
source ~/.profile  # Update profile

Go is now installed! Create your project and initialize environment variables:

mkdir $HOME/my_project
mkdir $HOME/my_project/src
mkdir $HOME/my_project/src/my_app
export GOPATH=$HOME/my_project

Then create your app:

vim $HOME/my_project/src/my_app/my_app.go

And put the following code in it (a simple Hello World):

package main

import "fmt"

func main() {
    fmt.Printf("hello world\n")
}

Compile app:

go install my_app

If everything went well, an executable has been created inside a new bin folder. Run it:

$HOME/my_project/bin/my_app

Finally you should get:

hello world

In order to understand the differences between go install, go build, and go run read this. And if you cannot or do not want to install Go on your machine, have a look at this Docker image.

Enjoy!

Existe aussi en français | También existe en Español

November 14, 2017 | Julien Salinas | Reading time ~1 minute

Why create a blog as a developer?

Here we are, I’m taking the plunge. I’ve been considering creating a blog for a while. I really wanted it to be multilingual, which made me ask myself a couple of technical questions about language handling. As a developer I think this blog will be beneficial.

Many developers assume that they don’t have enough experience to publish on the web. This is very humble but not always correct! IT development is such a big world that we can always find people who are less experienced than we are in a given field. Those people are relieved that others faced the same issues before them and explained them simply.

Promote local languages

I am stunned by how difficult it is to find local content in the IT world. I know English has become the lingua franca in this regard, but many people are not fluent enough to read English-language blogs efficiently. It’s a shame that skilled and motivated future devs face this barrier (English cannot be learnt overnight…). Defending local languages is important!

OK I must admit that internationalizing a web app is a lot of additional work, and existing localization tools are hard to use IMHO (but let’s talk about that in a future dedicated post), so I understand that everyone wants to avoid it.

Structure ideas

It appears that writing about the technical problems you faced in a blog helps structure your mind and understand concepts better. Let’s see about that! It is certainly my main motivation when writing documentation: docs help me remember new concepts much more easily. We should see the blog as another good documentation tool!

Here we go. A quick quote about the above:

Once you get things properly you are able to put it clearly to others.

– Nicolas Boileau
