APIs and scraping

Thu, Nov 19, 2015

What is an API?

An API is an Application Programming Interface. An API allows us to ask for data from a website or web service. In more advanced cases, we can also post information to a web service, such as posting a Tweet or updating Facebook from your own application or website.

The API is basically a set of instructions for how to communicate with a service, either to get data or to post data. It uses special URLs called endpoints.

API Keys

Most APIs require you to register an account with the service. This lets them track how you are using their data and how often, so they can make sure you’re within the limits of their Terms of Service (TOS). When you register with a website, they will often issue you an API Key that keeps track of how you’re using the service.

For example, if you wanted to use the Google Maps API, you would first register for a key with Google. Then in your website, you could call Google Maps with their JavaScript library:

<!-- Example using Google Maps API -->
<script src="https://maps.googleapis.com/maps/api/js?key=YOUR_API_KEY"></script>

Parts of an API

There are a few terms to be familiar with when working with APIs.

  • URI This is the same as URL. There is a subtle technical difference, but for our purposes, they are the same thing.
  • Query String This is the part of the URI that comes after the question mark. It holds parameter/value pairs, each separated by an ampersand &. It will look like this: ?key=3223&count=4&sort=newest.
  • URL encoding Replacing special characters of a URI, like a colon or slash, with percent-encoded equivalents (for example, %3A for a colon) so they can be safely included in a query string.
  • Endpoint The part of the URI that will specify which type of data you are going to retrieve from the service.
  • Token or Key A unique scrambled set of characters that serves as an identifier. Tokens are similar to keys, but are typically more temporary and are used in place of a password to gain access to a user’s data.
  • OAuth Authentication An open standard for authorization, a way for users to authorize an application without giving it their password. It uses tokens instead, and allows a user to revoke access at any time.
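
The terms above can be seen in a short sketch that takes a URI apart and builds a query string. It relies on the standard URL and URLSearchParams objects and on encodeURIComponent. The api.example.com address and the parameter values are made up for illustration.

```javascript
// Hypothetical URI, used only to illustrate the parts described above.
var uri = new URL("https://api.example.com/v1/articles?key=3223&count=4&sort=newest");

console.log(uri.pathname); // the endpoint portion: /v1/articles
console.log(uri.search);   // the query string: ?key=3223&count=4&sort=newest

// Each parameter/value pair can be read by name (values are always strings).
var count = uri.searchParams.get("count");

// URL encoding: reserved characters such as ":" and "/" are replaced
// so that a whole URL can be passed as a single parameter value.
var encoded = encodeURIComponent("http://example.com/feed?x=1");
console.log(encoded); // http%3A%2F%2Fexample.com%2Ffeed%3Fx%3D1

// Building a query string from parameter/value pairs.
var params = new URLSearchParams({ key: "3223", count: "4", sort: "newest" });
var queryString = "?" + params.toString();
console.log(queryString); // ?key=3223&count=4&sort=newest
```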

Parts of an API URI

Example using NYTimes API

In this example, we will use the New York Times API to call up some articles. You will need to register for a NYTimes.com account if you don’t have one already.

NYTimes API

First, look at the documentation for the NYTimes Article Search API. There you will see the following example of the URI you can use to make a request.

http://api.nytimes.com/svc/search/v2/articlesearch.response-format?[q=search term&fq=filter-field:(filter-term)&additional-params=values]&api-key=####

Following this example, let’s search for the term “Barack Obama”, sorting the oldest articles first.

#paste the following into a browser with your own API key
http://api.nytimes.com/svc/search/v2/articlesearch.json?q=barack+obama&page=0&sort=oldest&api-key=####

This will return a JSON file with multiple results, including articles from 1990, when Obama was first elected president of the Harvard Law Review.
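
A request URL like the one above can also be assembled in code rather than typed by hand. The following is only a sketch: buildArticleSearchUrl is a hypothetical helper, not part of any NYTimes library, and YOUR_API_KEY is a placeholder.

```javascript
// Hypothetical helper: assembles an Article Search request URL from its parts.
function buildArticleSearchUrl(searchTerm, apiKey, options) {
    options = options || {};
    var base = "http://api.nytimes.com/svc/search/v2/articlesearch.json";
    var pairs = [
        // encodeURIComponent writes a space as %20; the "+" used above is an
        // alternative way of writing a space in a query string.
        "q=" + encodeURIComponent(searchTerm),
        "page=" + (options.page || 0),
        "sort=" + (options.sort || "newest"),
        "api-key=" + encodeURIComponent(apiKey)
    ];
    return base + "?" + pairs.join("&");
}

var url = buildArticleSearchUrl("barack obama", "YOUR_API_KEY", { sort: "oldest" });
console.log(url);
```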

Note: You shouldn't run scripts in a webpage that exposes your API key. These URLs are typically used in server side languages that are beyond the scope of this tutorial.

Understanding AJAX

AJAX stands for Asynchronous JavaScript and XML. While that may sound daunting, in its simplest form it really means a way to transfer information between a web page that has already been loaded in the browser and the server. It is the technology that allows things like Gmail to exist. In the case of Gmail, you load the website once, but then as you work your way through the web application, additional requests for information (calling up additional e-mails, filing away old ones) are made using AJAX.

For our purposes, it usually means communicating with another website, and requesting data.
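
The “asynchronous” part means the page does not stop and wait for the server to answer. The sketch below imitates this with a timer standing in for the network. fakeRequest and the /inbox/1 address are made up for illustration, and no real request is sent.

```javascript
var log = [];

// Hypothetical stand-in for an AJAX call: the "response" arrives later,
// and the callback runs only once it has arrived.
function fakeRequest(url, callback) {
    log.push("request sent to " + url);
    setTimeout(function () {
        callback({ subject: "Hello" }); // pretend this came from the server
    }, 100);
}

fakeRequest("/inbox/1", function (data) {
    log.push("response received: " + data.subject);
    console.log(log.join("\n"));
});

// This line runs before the callback above, because the request is asynchronous.
log.push("page keeps working");
```

The log ends up in the order: request sent, page keeps working, response received.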

Making an AJAX request

We can use jQuery to make requests for JSON data from other websites. Specifically, we will use the ajax function to request data.

<script src="//code.jquery.com/jquery-2.1.1.min.js"></script>
<script>

    $.ajax({
        url      : "",     //the url for a feed
        dataType : "json", //json or jsonp
    }).done(function(d){

        //if success, d will hold the data result

    })

</script>

You can also specify a callback in order to use JSONP. jQuery will handle everything automatically. Just add callback=? to the end of the URL, and jQuery will replace the question mark with the appropriate function name.

This can be used to load in most JSON feeds from other websites.

CORS, or Cross-origin Resource Sharing security issues, and using JSONP

Browsers have a security feature, known as the same-origin policy, that prevents data from being loaded directly from other domains into a web page. Say the website example.com wants to load JSON data from Google.com. The request could be blocked because the two websites are on different domains. While this feature protects browsers and users, it also creates issues when trying to make AJAX requests to other domains for legitimate purposes.

One way to solve this is by using JSONP (which stands for JSON with padding). JSONP wraps the JSON in a function call. Libraries like jQuery can use this to get around cross-domain issues. (It is also possible to manually create your own function of the same name and include the feed using a standard <script> tag, but libraries like jQuery are generally more reliable.)

//normal JSON
{
    "name" : "Tyson",
    "age"  : 24
}

//JSONP, where JSON data is wrapped in a function call
someCallback({
    "name" : "Tyson",
    "age"  : 24

})
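
To see why the padding matters, it helps to remember that a JSONP response is loaded as a script, so the browser runs it and the named function receives the data. The sketch below imitates that step outside a browser. runAsScript is a hypothetical stand-in for the <script> tag, and the response text is hard-coded rather than fetched.

```javascript
// The text a server might send back for a JSONP request.
var responseText = 'someCallback({ "name": "Tyson", "age": 24 })';

var received = null;

// Your own function with the same name as the padding.
function someCallback(data) {
    received = data;
}

// Hypothetical stand-in for a <script> tag: runs the response text as code,
// with the callback made available under the given name.
function runAsScript(code, callbackName, callback) {
    var run = new Function(callbackName, code);
    run(callback);
}

runAsScript(responseText, "someCallback", someCallback);
console.log(received.name + " is " + received.age);
```

Because the response is executed as code, JSONP is best reserved for services you trust.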

To use JSONP, you would set the dataType property in jQuery to "jsonp", and possibly add a jsonpCallback property, which should give the name of the function that will pad the response.

//revised ajax request using JSONP
$.ajax({
    url           : "",            //the url for a feed
    dataType      : "jsonp",       //json or jsonp
    jsonpCallback : "callbackName" //name of the function call
}).done(function(d){

    //if success, d will hold the data result

})

Examples: Starter HTML

Below we have a starter HTML page to begin making requests. We will use jQuery to process all of our AJAX requests.

Exercise: Making an AJAX request to a JSON file

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Example AJAX Request</title>
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style type="text/css">
    /* any custom css styles will go here */

    </style>
</head>
<body>

    <div id="results"></div>

    <script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>
    <script>
    //All your ajax code goes here


    </script>
</body>
</html>

Example: Getting an RSS feed from a website

Google has a special service which will convert any RSS feed to JSON. This is helpful if you want to use AJAX for pulling in your data from an RSS feed, which tends to be more readily available than a JSON feed.

https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=1000&q={RSS URL}

The num=1000 can be changed depending on how many items you want the results limited to. The q={RSS URL} should be the URL of an RSS feed (replace the curly brackets as well).

<script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>
<script>

    //replace this with your own RSS feed
    var rssFeed = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";

    $.ajax({
        url       : "https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=1000&callback=?&q=" + encodeURIComponent(rssFeed), 
        dataType  : "json", //json or jsonp
    })
    .done(function(d){
        
        //we only need the feed part
        d = d.responseData.feed;

        //prints out the data to the console
        console.log(d);

        d.entries.forEach(function(entry, i){

            //entry.title
            //entry.link
            //entry.contentSnippet
            //entry.publishedDate
            //entry.content

            //example
            $("body")
                .append("<h2>" + entry.title + "</h2>")
                .append("<p>" + entry.contentSnippet + "</p>");
        })

    })
    .fail(function(x,error){
        console.log("Request Failed: " + error);
    });

</script>

In the above example, you should do something with the results of the .done function. One possibility is to print them out to the web page using jQuery methods.

.done(function(d){
    
    //replace the data with the feed element
    d = d.responseData.feed;

    $("body").append("<h1>" + d.title + "</h1>");

    for(var i=0; i < d.entries.length; i++){

        $("body").append("<p>" + d.entries[i].contentSnippet + "</p>");

    }

})
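
Feed content comes from another site, so titles and snippets may contain characters such as < or & that would be treated as markup when appended to the page. One option is to build the HTML string separately and escape the text first. This is only a sketch: escapeHtml and entriesToHtml are hypothetical helpers, and the sample entry is invented to match the title and contentSnippet fields used above.

```javascript
// Hypothetical helper: replaces characters that have special meaning in HTML.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Hypothetical helper: turns a list of feed entries into one HTML string.
function entriesToHtml(entries) {
    return entries.map(function (entry) {
        return "<h2>" + escapeHtml(entry.title) + "</h2>" +
               "<p>" + escapeHtml(entry.contentSnippet) + "</p>";
    }).join("\n");
}

// Invented sample data, shaped like the entries in the feed above.
var sampleEntries = [
    { title: "Markets & Money", contentSnippet: "Stocks rose <sharply> today." }
];

var html = entriesToHtml(sampleEntries);
console.log(html);
```

In the page, the resulting string could then be added in one step, for example with $("#results").html(html).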

Let’s use AJAX to make a Flickr Photo Gallery using a photo pool called “color,” which is a curated list of popular photos on Flickr that constantly rotates. Flickr’s feed documentation explains the base URI and endpoints.

First, take a look at the results from the Flickr “Color” Photo Pool feed.

#paste the following into your browser
https://api.flickr.com/services/feeds/groups_pool.gne?id=28747776@N00&format=json

This produces the following results.

//results
jsonFlickrFeed({
        "title": "Interestingness | Must be in Top 10 Pool",
        "link": "https://www.flickr.com/groups/colors/pool/"
...

This feed is already set up for JSONP, as all of the data is wrapped in a function call named jsonFlickrFeed.

Let’s make a photo gallery from this feed:

<script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>
<script>

    $.ajax({
        url           : "https://api.flickr.com/services/feeds/groups_pool.gne?id=28747776@N00&format=json", 
        dataType      : "jsonp", //json or jsonp
        jsonpCallback : "jsonFlickrFeed",
    }).done(function(data){
        
        //forEach cycles through each of the items
        data.items.forEach(function(d){

            //append an image tag
            $("body").append("<img src=\"" + d.media.m + "\" />");
        });

    })

</script>