CTOs, developers: how to assess the quality of an external API?

Nowadays, finding an external API to improve your service is getting easier and easier: more and more companies offer APIs. The problem is that many developers and CTOs start the API integration right away, while it should be the very last step! Before that, you need to figure out whether the quality of the API meets some minimum requirements. Let me tell you how I do it; I hope it will help other CTOs and developers.

Quality of data

A lot of APIs expose data so you can enrich your own system (not always the case of course; Stripe is not an enrichment API, for example). It is essential that you check the quality of that data. It will take you a long time, and I know you do not like testing! Neither do I, but you cannot avoid building a serious test scenario here. If you realize only two weeks after finishing your API integration that the data quality was not good enough, trust me, you'll regret it…
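
To make it more concrete, here is a minimal sketch of what such a test scenario could look like in Go: query the API for records whose correct values you already know and count how often it gets them right. The endpoint, the email input and the company output are all hypothetical; adapt them to the API you are evaluating.

/*
Minimal data-quality check: query the API for records we already know
the answers to, and measure how often it gets them right.
The endpoint and the fields are hypothetical.
*/
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
)

// Known-good samples: input email -> expected company name.
var samples = map[string]string{
    "john@acme.example":   "ACME Corp",
    "mary@globex.example": "Globex",
}

func main() {
    correct := 0
    for email, expected := range samples {
        resp, err := http.Get("https://api.vendor.example/v1/enrich?email=" + url.QueryEscape(email))
        if err != nil {
            fmt.Println("network error:", err)
            continue
        }
        var result struct {
            Company string `json:"company"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
            fmt.Println("decoding error:", err)
        }
        resp.Body.Close()
        if result.Company == expected {
            correct++
        }
    }
    fmt.Printf("correct answers: %d/%d\n", correct, len(samples))
}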

Documentation

I recently came across an API that exposed great data (much better than its competitors', in my opinion), but its documentation was… awful! Actually, it almost did not exist. On top of that, it did not always respect basic REST conventions. How can you possibly integrate an external API if error codes are not properly documented? The only solution is to test again and again in order to understand how things work under the hood. Reverse engineering might be fun, but it takes a lot of time. Remember you have no GitHub repo to explore here, since the source code is not available… Bad documentation means a lot of lost time for the devs, and most likely bad surprises in the mid term.

Libraries

Can you consume the API with client libraries in your favorite language? As a Python and Go developer, I'm always glad to see APIs offering a Python lib (I know I can forget about Go for the moment). It can save you quite a lot of time, but first make sure the lib is mature enough and covers all the API features (it is not always the case).

Reputation of the vendor

Reputation can help you figure out whether you'll have bad surprises with your API in the future. By bad surprises I mean service interruptions, feature regressions, or even the end of the service… You can partly tackle that by asking yourself the following questions:

  • is this API popular on the internet (in general, if you find little information about it, run away)? Are there a lot of articles/tutorials talking about it? Are those articles positive?
  • are some well-known companies using it?
  • if the company developed libs, are they popular on GitHub? Are GitHub issues solved regularly?
  • were there recent updates to the API, or was the last update released a long time ago?

Technical support

Make sure someone answers you quickly by email when you have an issue, and that the answer is relevant. If you're based in Europe and the API is run by an American company, check whether the time difference is not too much of a problem.

Respect standards

In my humble opinion, you should only choose RESTful APIs today. If the API you're in love with does not respect REST principles, be suspicious. Keep in mind, though, that what "the REST standard" means is not perfectly clear, and each API implements its own rules (HTTP codes, POST request encoding, …). Still, have a close look at the docs and check that you do not see anything too original. Originality will slow you down…

Price

Of course price is very important. But be careful: API prices are not always easy to understand. Are you going to be charged per month for an unlimited number of requests? Charged per request? If so, are you going to be charged twice for two identical requests (in the case of an enrichment API), or will the second request be free? Are you going to be charged for a request returning no result (HTTP 404)? Make sure you understand all the implications of the pricing.

Quality of Service (QoS)

QoS is highly important. Basically you want the API to be fast and have as little downtime as possible. Unfortunately this is not an easy point to test: QoS may vary a lot over time, and many APIs offer two levels of QoS depending on whether you're using the free version or a paid one… Sometimes you can also choose between different subscriptions with different response-time guarantees.
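
You can still get a rough feel for it by probing one endpoint repeatedly and looking at the spread of response times. A minimal sketch in Go, assuming a hypothetical endpoint:

/*
Rough QoS probe: call the same endpoint several times and print each
response time. The endpoint is hypothetical.
*/
package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    const endpoint = "https://api.vendor.example/v1/ping"

    for i := 1; i <= 10; i++ {
        start := time.Now()
        resp, err := http.Get(endpoint)
        elapsed := time.Since(start)
        if err != nil {
            fmt.Printf("call %d: error: %v\n", i, err)
            continue
        }
        resp.Body.Close()
        fmt.Printf("call %d: %v (HTTP %d)\n", i, elapsed, resp.StatusCode)
        // Space the calls out a little; QoS also varies over time:
        time.Sleep(time.Second)
    }
}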

Parallel queries support

Depending on how you're planning to integrate the API, you might want to speed things up by making multiple parallel queries instead of calling it sequentially. Personally, I'm using Golang for that most of the time. If so, be careful: many vendors do not support parallel queries, and when they do, they almost always set a limit. In that case, make sure to ask them what this limit is (it is not always mentioned in the docs) and adapt your client accordingly.
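
Here is a sketch of how I usually cap the number of parallel queries in Go, with a buffered channel used as a semaphore. The endpoint is hypothetical and the limit of 5 is arbitrary; use whatever figure the vendor gives you.

/*
Query an API in parallel without exceeding the vendor's limit on
concurrent requests, using a buffered channel as a semaphore.
The endpoint is hypothetical and the limit of 5 is arbitrary.
*/
package main

import (
    "fmt"
    "net/http"
    "sync"
)

const maxParallel = 5 // the limit given by the vendor

func main() {
    ids := []string{"a", "b", "c", "d", "e", "f", "g", "h"}

    sem := make(chan struct{}, maxParallel)
    var wg sync.WaitGroup

    for _, id := range ids {
        wg.Add(1)
        go func(id string) {
            defer wg.Done()
            sem <- struct{}{}        // take a slot (blocks if maxParallel requests are in flight)
            defer func() { <-sem }() // release the slot

            resp, err := http.Get("https://api.vendor.example/v1/items/" + id)
            if err != nil {
                fmt.Println(id, "error:", err)
                return
            }
            resp.Body.Close()
            fmt.Println(id, "->", resp.StatusCode)
        }(id)
    }
    wg.Wait()
}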

This post will be a good memo for me; I hope it will be for you too!

REST API fetching: Go vs Python

APIs are everywhere today. Imagine you want to find business prospect information based on an email address: there is an API for that. Need to geocode an ugly postal address? There is an API for that. Want to accept a payment? There are multiple APIs for that too, of course. As a developer, I regularly fetch external APIs using either Python or Go. The two approaches are quite different; let's compare them on an edge case: JSON data sent in the body of a POST request.

A real life example

Recently, I used the NameAPI.org API, which is dedicated to splitting a full name into a first name and a last name, and to determining the person's gender.

In order to use their API, you have to send JSON data in the body of a POST request, with the Content-Type header set to application/json rather than multipart/form-data. This is a slightly tricky case: POST data is usually sent as form-encoded key/value pairs (application/x-www-form-urlencoded or multipart/form-data), so when we put raw JSON in the request body (handy for complex nested data like this one), most HTTP libraries need to be told explicitly which Content-Type to use.

Here is the JSON data we want to send:

{
  "inputPerson" : {
    "type" : "NaturalInputPerson",
    "personName" : {
      "nameFields" : [ {
        "string" : "Petra",
        "fieldType" : "GIVENNAME"
      }, {
        "string" : "Meyer",
        "fieldType" : "SURNAME"
      } ]
    },
    "gender" : "UNKNOWN"
  }
}

We could do this pretty simply using cURL:

curl -H "Content-Type: application/json" \
-X POST \
-d '{"inputPerson":{"type":"NaturalInputPerson","personName":{"nameFields":[{"string":"Petra","fieldType":"GIVENNAME"},{"string":"Meyer","fieldType":"SURNAME"}]},"gender":"UNKNOWN"}}' \
"http://rc50-api.nameapi.org/rest/v5.0/parser/personnameparser?apiKey=<API-KEY>"

And here is NameAPI.org's response (JSON):

{
  "matches" : [ {
    "parsedPerson" : {
      "personType" : "NATURAL",
      "personRole" : "PRIMARY",
      "mailingPersonRoles" : [ "ADDRESSEE" ],
      "gender" : {
        "gender" : "MALE",
        "confidence" : 0.9111111111111111
      },
      "addressingGivenName" : "Petra",
      "addressingSurname" : "Meyer",
      "outputPersonName" : {
        "terms" : [ {
          "string" : "Petra",
          "termType" : "GIVENNAME"
        }, {
          "string" : "Meyer",
          "termType" : "SURNAME"
        } ]
      }
    },
    "parserDisputes" : [ ],
    "likeliness" : 0.926699401733102,
    "confidence" : 0.7536487758945387
  } ]
}

Now let’s see how to do this in Go and Python!

Go implementation

Code

/*
Fetch the NameAPI.org REST API and turn the JSON response into a Go struct.

The data sent has to be JSON encoded in the request body,
and the Content-Type request header must be set to 'application/json'.
*/

package main

import (
    "encoding/json"
    "io/ioutil"
    "log"
    "net/http"
    "strings"
)

// url of the NameAPI.org endpoint:
const (
    url = "http://rc50-api.nameapi.org/rest/v5.0/parser/personnameparser?" +
        "apiKey=<API-KEY>"
)

func main() {

    // JSON string to be sent to NameAPI.org:
    jsonString := `{
        "inputPerson": {
            "type": "NaturalInputPerson",
            "personName": {
                "nameFields": [
                    {
                        "string": "Petra",
                        "fieldType": "GIVENNAME"
                    }, {
                        "string": "Meyer",
                        "fieldType": "SURNAME"
                    }
                ]
            },
            "gender": "UNKNOWN"
        }
    }`
    // Convert JSON string to NewReader (expected by NewRequest)
    jsonBody := strings.NewReader(jsonString)

    // Need to create a client in order to modify headers
    // and set content-type to 'application/json':
    client := &http.Client{}
    req, err := http.NewRequest("POST", url, jsonBody)
    if err != nil {
        log.Println(err)
    }
    req.Header.Add("Content-Type", "application/json")
    resp, err := client.Do(req)

    // Proceed only if no error:
    switch {
    default:
        // Close the response body once we are done with it:
        defer resp.Body.Close()
        // Create a struct dedicated to receiving the fetched
        // JSON content:
        type Level5 struct {
            String   string `json:"string"`
            TermType string `json:"termType"`
        }
        type Level41 struct {
            Gender     string  `json:"gender"`
            Confidence float64 `json:"confidence"`
        }
        type Level42 struct {
            Terms []Level5 `json:"terms"`
        }
        type Level3 struct {
            Gender           Level41 `json:"gender"`
            OutputPersonName Level42 `json:"outputPersonName"`
        }
        type Level2 struct {
            ParsedPerson Level3 `json:"parsedPerson"`
        }
        type RespContent struct {
            Matches []Level2 `json:"matches"`
        }

        // Decode fetched JSON and put it into respContent:
        respContentBytes, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            log.Println(err)
        }
        var respContent RespContent
        err = json.Unmarshal(respContentBytes, &respContent)
        if err != nil {
            log.Println(err)
        }
        log.Println(respContent)
    case err != nil:
        log.Println("Network error:", err)
    case resp.StatusCode != 200:
        log.Println("Bad HTTP status code:", resp.StatusCode)
    }

}

Explanations

As you can see we’re facing 2 painful problems with Go:

  • the net/http package is quite tricky when it comes to encoding JSON data into the request body and setting the Content-Type header, and Go's documentation is not very clear on this. Instead of the straightforward http.Post helper (which, to be fair, does accept a content type argument), the usual pattern is to create an http.Client, build the request with http.NewRequest(), and send it with client.Do(req); the header is then set with req.Header.Add("Content-Type", "application/json")
  • decoding the returned JSON into Go data (unmarshalling, in Go parlance) is pretty long and boring. Since Go is a statically typed language, we need to know in advance what the returned JSON will look like, and therefore create a dedicated struct that maps the JSON's structure and receives the data. In the case of a nested JSON like the one returned by NameAPI.org, mixing arrays and objects, it gets quite tricky. Fortunately, our struct does not need to map the whole JSON, only the fields we actually need. Another approach, if we have no idea what the final JSON will look like, is to decode into generic types and inspect them at runtime (see the sketch right after this list). Here is a good article on this.
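
For the record, here is what that generic approach could look like: decode the body into a map[string]interface{} and inspect it at runtime. A minimal sketch, using a shortened, hard-coded version of the response:

/*
Decode a JSON document whose structure we do not know in advance into
generic Go types, then inspect it at runtime. The sample document is a
shortened version of the NameAPI.org response.
*/
package main

import (
    "encoding/json"
    "fmt"
    "log"
)

func main() {
    raw := []byte(`{"matches":[{"likeliness":0.92,"parserDisputes":[]}]}`)

    var generic map[string]interface{}
    if err := json.Unmarshal(raw, &generic); err != nil {
        log.Fatal(err)
    }

    // Every nested JSON object becomes a map[string]interface{},
    // every array a []interface{}, and every number a float64:
    matches := generic["matches"].([]interface{})
    first := matches[0].(map[string]interface{})
    fmt.Println("likeliness:", first["likeliness"].(float64))
}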

The jsonString input is already a string here, but for a fair comparison with Python it should have been a Go struct that we then turn into JSON. I just did not want to make the script too long for the blog.
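
For what it's worth, here is roughly what that would look like: a struct mirroring the fields used in this post, marshalled with json.Marshal (a sketch, not part of the original script):

/*
Build the request body from a Go struct instead of a raw string,
using json.Marshal. Only the fields used in this post are mapped.
*/
package main

import (
    "encoding/json"
    "fmt"
    "log"
)

type NameField struct {
    String    string `json:"string"`
    FieldType string `json:"fieldType"`
}

type PersonName struct {
    NameFields []NameField `json:"nameFields"`
}

type InputPerson struct {
    Type       string     `json:"type"`
    PersonName PersonName `json:"personName"`
    Gender     string     `json:"gender"`
}

type Payload struct {
    InputPerson InputPerson `json:"inputPerson"`
}

func main() {
    payload := Payload{
        InputPerson: InputPerson{
            Type: "NaturalInputPerson",
            PersonName: PersonName{
                NameFields: []NameField{
                    {String: "Petra", FieldType: "GIVENNAME"},
                    {String: "Meyer", FieldType: "SURNAME"},
                },
            },
            Gender: "UNKNOWN",
        },
    }

    jsonBytes, err := json.Marshal(payload)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(jsonBytes))
    // In the script above, bytes.NewReader(jsonBytes) would then replace
    // strings.NewReader(jsonString) as the request body.
}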

Python implementation

Code

"""
Fetch the NameAPI.org REST API and turn the JSON response into a Python dict.

The data sent has to be JSON encoded in the request body,
and the Content-Type request header must be set to 'application/json'.
"""

import requests

# url of the NameAPI.org endpoint:
url = (
    "http://rc50-api.nameapi.org/rest/v5.0/parser/personnameparser?"
    "apiKey=<API-KEY>"
)

# Dict of data to be sent to NameAPI.org:
payload = {
    "inputPerson": {
        "type": "NaturalInputPerson",
        "personName": {
            "nameFields": [
                {
                    "string": "Petra",
                    "fieldType": "GIVENNAME"
                }, {
                    "string": "Meyer",
                    "fieldType": "SURNAME"
                }
            ]
        },
        "gender": "UNKNOWN"
    }
}

# Proceed, only if no error:
try:
    # Send request to NameAPI.org by doing the following:
    # - make a POST HTTP request
    # - encode the Python payload dict to JSON
    # - pass the JSON to request body
    # - set header's 'Content-Type' to 'application/json' instead of
    #   default 'multipart/form-data'
    resp = requests.post(url, json=payload)
    resp.raise_for_status()
    # Decode JSON response into a Python dict:
    resp_dict = resp.json()
    print(resp_dict)
except requests.exceptions.HTTPError as e:
    print("Bad HTTP status code:", e)
except requests.exceptions.RequestException as e:
    print("Network error:", e)

Explanations

The Python Requests library is amazing and saves us a lot of time here compared to Go! In one line, resp = requests.post(url, json=payload), almost everything is done under the hood:

  • build a POST HTTP request
  • encode the Python payload dictionary to JSON
  • pass the JSON to the request body
  • set the 'Content-Type' header to 'application/json' instead of the default form encoding, thanks to the json keyword argument
  • send the request

Decoding the returned JSON is also a one-liner: resp_dict = resp.json(). No need to create a complicated data structure in advance here!

Conclusion

Python is clearly the winner here. Python's simplicity combined with its huge ecosystem of libraries saves us a lot of development time!

We're not dealing with performance here, of course. If you're looking for a high-performance API fetcher using concurrency, Go could be a great choice. But simplicity and performance are not the best of friends, as you can see…

Feel free to comment, I would be glad to hear your opinion on this!

How to speed up web scraping with Go (Golang) and concurrency?

I've been developing Python web scrapers for years. Python's simplicity is great for quick prototyping, and many amazing libraries can help you build a scraper and a result parser (Requests, Beautiful Soup, Scrapy, …). Yet once you start looking at your scraper's performance, Python can be somewhat limited, and Go is a great alternative!

Why Go ?

When you're trying to speed up information fetching from the web (for HTML scraping or even for mere API consumption), two kinds of optimization are possible:

  • speed up the web resource download (e.g. download http://example.com/hello.html)
  • speed up the parsing of the information you retrieved (e.g. get all urls available in hello.html)

Parsing can be improved by reworking your code, using a more efficient parser like lxml, or allocating more resources to your scraper. Still, parsing optimizations are often negligible compared to the real bottleneck, namely network access (i.e. downloading the web pages).

Consequently, the solution is to download the web resources in parallel. This is where Go is a great help!

Concurrent programming is a complicated field, and Go makes it pretty easy: it is a modern language that was created with concurrency in mind. Python, on the other hand, is an older language, and writing a concurrent web scraper in Python can be tricky, even if Python has improved a lot in this regard recently.

Go has other advantages, but let's keep them for another article!

Install Go

I already wrote a short tutorial about how to install Go on Ubuntu.

If you need to install Go on another platform, feel free to read the official docs.

A simple concurrent scraper

Our scraper will simply try to download a list of web pages we give it, and check that it gets a 200 HTTP status code (meaning the server returned the page without an error). We're not dealing with HTML parsing here, since the goal is to focus on the critical point: network access performance. The parsing part is up to you!

Final code


/*
Open a series of urls.

Check status code for each url and store urls I could not
open in a dedicated array.
Fetch urls concurrently using goroutines.
*/

package main

import (
    "fmt"
    "net/http"
)

// -------------------------------------

// Custom user agent.
const (
    userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) " +
        "AppleWebKit/537.36 (KHTML, like Gecko) " +
        "Chrome/53.0.2785.143 " +
        "Safari/537.36"
)

// -------------------------------------

// fetchUrl opens a url with GET method and sets a custom user agent.
// If url cannot be opened, then log it to a dedicated channel.
func fetchUrl(url string, chFailedUrls chan string, chIsFinished chan bool) {

    // Open url.
    // Need to use http.Client in order to set a custom user agent:
    client := &http.Client{}
    req, _ := http.NewRequest("GET", url, nil)
    req.Header.Set("User-Agent", userAgent)
    resp, err := client.Do(req)
    // Close the response body (if any) once we leave fetchUrl():
    if resp != nil {
        defer resp.Body.Close()
    }

    // Inform the channel chIsFinished that url fetching is done (no
    // matter whether successful or not). Defer triggers only once
    // we leave fetchUrl():
    defer func() {
        chIsFinished <- true
    }()

    // If url could not be opened, we inform the channel chFailedUrls:
    if err != nil || resp.StatusCode != 200 {
        chFailedUrls <- url
        return
    }

}

func main() {

    // Create a random urls list just as an example:
    urlsList := [10]string{
        "http://example1.com",
        "http://example2.com",
        "http://example3.com",
        "http://example4.com",
        "http://example5.com",
        "http://example10.com",
        "http://example20.com",
        "http://example30.com",
        "http://example40.com",
        "http://example50.com",
    }

    // Create 2 channels, 1 to track urls we could not open
    // and 1 to inform url fetching is done:
    chFailedUrls := make(chan string)
    chIsFinished := make(chan bool)

    // Open all urls concurrently using the 'go' keyword:
    for _, url := range urlsList {
        go fetchUrl(url, chFailedUrls, chIsFinished)
    }

    // Receive messages from every concurrent goroutine. If
    // an url fails, we log it to failedUrls array:
    failedUrls := make([]string, 0)
    for i := 0; i < len(urlsList); {
        select {
        case url := <-chFailedUrls:
            failedUrls = append(failedUrls, url)
        case <-chIsFinished:
            i++
        }
    }

    // Print all urls we could not open:
    fmt.Println("Could not fetch these urls: ", failedUrls)

}


Explanations

This code is a bit longer than what we could write in a language like Python, but as you can see it is still very reasonable. Go is a statically typed language, so we need a couple more lines for variable declarations. But measure how long the script takes to run, and you'll see how rewarding it is!

We chose 10 random urls as an example.

Here, the magical keywords enabling us to use concurrency are go, chan, and select:

  • go creates a new goroutine, which means fetchUrl will be executed within a new concurrent goroutine each time.
  • chan is the type representing a channel. Channels help us communicate among goroutines (main being a goroutine itself as well).
  • select ... case is a switch ... case dedicated to receiving messages sent through channels. The program stays in this loop until every goroutine has sent a message (either to say that fetching its url is done, or to say that it failed).

We could have written this scraper without any channel, that is to say by creating goroutines and not expecting any message from them in return (for instance if every goroutine ends up storing its result in a database). But beware: when main returns, the whole program exits, killing any goroutine that is still working, so we would still need some way to wait for them (a sync.WaitGroup, for example; see the sketch below). And in real life it is almost always necessary to use channels anyway, so our goroutines can talk to each other.
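
Here is what that waiting could look like with a sync.WaitGroup instead of channels; a minimal sketch, independent of the scraper above:

/*
Wait for all goroutines to finish without using channels, thanks to
sync.WaitGroup. A minimal sketch.
*/
package main

import (
    "fmt"
    "net/http"
    "sync"
)

func main() {
    urls := []string{"http://example1.com", "http://example2.com"}

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go func(url string) {
            defer wg.Done()
            resp, err := http.Get(url)
            if err != nil {
                fmt.Println(url, "failed:", err)
                return
            }
            resp.Body.Close()
            fmt.Println(url, "->", resp.StatusCode)
        }(url)
    }
    wg.Wait() // main blocks here until every goroutine has called Done()
}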

Don't forget to limit your speed!

Here, speed is our goal, and it is not a concern because we're scraping different urls. However, if you need to scrape the same urls multiple times (when consuming an API, for example), you'll probably have to stay under a certain number of requests per second. In that case, you'll have to throttle your requests (maybe we'll talk about it in more detail in another article!).
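
A simple way to do this in Go is to wait on a time.Ticker before firing each request. A sketch, with an arbitrary limit of 5 requests per second:

/*
Cap the request rate at roughly 5 requests per second by waiting on a
ticker before launching each fetch. The limit is arbitrary.
*/
package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

func main() {
    urls := []string{"http://example1.com", "http://example2.com", "http://example3.com"}

    ticker := time.NewTicker(time.Second / 5) // one tick every 200 ms
    defer ticker.Stop()

    var wg sync.WaitGroup
    for _, url := range urls {
        <-ticker.C // wait for the next tick before firing the request
        wg.Add(1)
        go func(url string) {
            defer wg.Done()
            resp, err := http.Get(url)
            if err != nil {
                fmt.Println(url, "failed:", err)
                return
            }
            resp.Body.Close()
            fmt.Println(url, "->", resp.StatusCode)
        }(url)
    }
    wg.Wait()
}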

Happy scraping!

Install Go (Golang) 1.9 on Ubuntu 17.10

Here is a little memento for those who wish to install Go (1.9) on their Ubuntu machine (17.10). As a reminder, Go is a compiled language, meaning no need to install Go on the machine where your application will run.

Update repositories and security patches, just in case:

sudo apt-get update
sudo apt-get -y upgrade

Download and install Go:

sudo curl -O https://storage.googleapis.com/golang/go1.9.linux-amd64.tar.gz  # Download archive. Change the archive's name if you need another version of Go or another system architecture
sudo tar -xvf go1.9.linux-amd64.tar.gz  # Extract archive
sudo mv go /usr/local  # Move binaries to /usr/local
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.profile  # Update our bash profile so Go is in the PATH
source ~/.profile  # Update profile

Go is now installed! Create your project and initialize the environment variables:

mkdir $HOME/my_project
mkdir $HOME/my_project/src
mkdir $HOME/my_project/src/my_app
export GOPATH=$HOME/my_project

Then create your app:

vim $HOME/my_project/src/my_app/my_app.go

And put the following code in it (display a mere Hello World):

package main

import "fmt"

func main() {
    fmt.Printf("hello world\n")
}

Compile app:

go install my_app

If everything went well, an executable has been created inside a new bin folder. Run it:

$HOME/my_project/bin/my_app

Finally you should get:

hello world

In order to understand the differences between go install, go build, and go run read this. And if you cannot or do not want to install Go on your machine, have a look at this Docker image.

Enjoy!

Why create a blog as a developer?

Here we are, I'm taking the plunge. I've been considering creating a blog for a while. I really wanted it to be multilingual, which raised a couple of technical questions about language handling. As a developer, I think this blog will be beneficial.

Share

Many developers assume they don't have enough experience to publish on the web. This is very humble of them, but not always correct! IT development is such a big world that there is always someone less experienced than you in any given field, and those people are relieved to find that others faced the same issues before them and explained them simply.

Promote local languages

I am stunned by how difficult it is to find IT content in local languages. I know English has become the lingua franca in this field, but many people are not fluent enough to parse English-speaking blogs efficiently. It's a shame that skilled and motivated future devs face this barrier (English cannot be learnt overnight…). Defending local languages is important!

OK, I must admit that internationalizing a web app is a lot of additional work, and existing localization tools are hard to use IMHO (but let's talk about that in a future dedicated post), so I understand why everyone wants to avoid it.

Structure ideas

It appears that writing up the technical problems you faced helps you structure your thoughts and understand concepts better. Let's see about that! It is certainly true for documentation: writing docs helps me remember new concepts much more easily, and that is my main motivation for writing them. We should see the blog as another good documentation tool!

Here we go. A quick quote on the subject:

Once you understand something properly, you can explain it clearly to others.

– Nicolas Boileau
