The Python Weekly is a weekly newsletter curated by Rahul Chaudhary and sent out every Thursday. It has an archive, though unfortunately I only found the first 81 issues there, even though by July 30, 2015 the newsletter had already reached issue 202.
Nevertheless, I think it is an interesting exercise to collect link statistics from these issues.
Specifically, I wanted to know which hosts are featured the most in the Python Weekly.
The result is that there were a total of 2399 links to 910 different hosts in the first 81 issues of the Python Weekly. The top featured hosts were:
- github.com 420
- www.meetup.com 134
- jobs.pythonweekly.com 106
- www.amazon.com 66
- www.youtube.com 45
- meetup.bostonpython.com 31
- bitbucket.org 27
- launchbit.com 21
- code.google.com 20
- www.slideshare.net 20
- vimeo.com 17
- meetup.dcpython.org 15
- www.djangoproject.com 14
- cloud.google.com 14
- eli.thegreenplace.net 14
- www.ibm.com 13
- morepypy.blogspot.com 12
- blog.pythonisito.com 12
- www.djangonyc.org 11
- speakerdeck.com 11
- jakevdp.github.com 11
- inventwithpython.com 10
- charlesleifer.com 10
- emptysquare.net 10
- www.mightydeals.com 10
- pycon.blogspot.com 10
- www.blog.pythonlibrary.org 9
- www.jeffknupp.com 8
- radiofreepython.com 8
- agiliq.com 8
- pydanny.com 8
- googleappengine.blogspot.com 8
- net.tutsplus.com 8
- tech.yipit.com 8
- www.mutualmobile.com 8
- www.python.org 8
The code behind the stats
While the numbers themselves might be interesting, the main point of this exercise was to write and explain the code behind them. Let's see the full script first, and then go over it:
examples/python/python_weekly_stats.py
from __future__ import print_function
import urllib2, os, pickle, re, sys
from HTMLParser import HTMLParser


class ArchiveHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        if 'href' in attr:
            self.urls.append(attr["href"])
        else:
            print("Missing href from a with attributes: ", attr)


class Collector(object):
    def __init__(self):
        self.cache_file = 'python_weekly.pickle'
        self.data = {'issue': 0, 'hosts': {}}
        self.read_cache()

    def read_cache(self):
        if os.path.exists(self.cache_file):
            fh = open(self.cache_file, 'r')
            self.data = pickle.load(fh)
            fh.close()

    def save_cache(self):
        fh = open(self.cache_file, 'w')
        pickle.dump(self.data, fh)
        fh.close()

    def collect(self):
        base_url = 'http://www.pythonweekly.com/archive/'
        while True:
            self.data['issue'] += 1
            url = base_url + str(self.data['issue']) + '.html'
            try:
                f = urllib2.urlopen(url)
                html = f.read()
                f.close()
            except urllib2.HTTPError as e:
                print(e, 'while fetching', url)
                break
            parser = ArchiveHTMLParser()
            parser.urls = []
            parser.feed(html)
            for u in parser.urls:
                if u == '#':
                    continue
                if not re.search(r'^https?://', u):
                    print("Unhandled url: " + u)
                    continue
                match = re.search(r'^https?://([^/]+)', u)
                if match:
                    host = match.group(1)
                    if host not in self.data['hosts']:
                        self.data['hosts'][host] = 0
                    self.data['hosts'][host] += 1
            self.save_cache()

    def report(self):
        hosts, total = 0, 0
        for k in sorted(self.data['hosts'].keys(), key=lambda x: self.data['hosts'][x], reverse=True):
            hosts += 1
            total += int(self.data['hosts'][k])
            #print(k, self.data['hosts'][k])
            print('<li><a href="http://{0}/">{0}</a> {1}</li>'.format(k, self.data['hosts'][k]))
        print("A total of {} links to {} hosts in {} issues.".format(total, hosts, self.data['issue']))

    def run(self):
        if len(sys.argv) == 2:
            if sys.argv[1] == 'collect':
                self.collect()
            if sys.argv[1] == 'report':
                self.report()
        else:
            self.collect()
            self.report()


Collector().run()
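Based on the `run` method at the end, the script can be invoked as `python python_weekly_stats.py collect` to fetch and cache the data, as `python python_weekly_stats.py report` to print the statistics from the cache, or without any argument to do both in one go.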
The code has two major parts. One of them fetches the archive pages, parses their HTML, and enters the extracted link data into a dictionary. The other part generates the report from this dictionary.
In order to avoid repeatedly downloading all the pages while working on the report, the data collected by the first part is serialized using the standard pickle module.
This technique would also be very useful if the archive were still growing every week and we wanted to update our stats weekly. In that case, having the previously collected and processed data stored in a local file would allow us to retrieve only the new issues. That would be more considerate towards the web site and would not waste our bandwidth either.
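As a minimal sketch of this caching idea, separate from the actual script and using a hypothetical file name (and binary file modes, which also work unchanged on Python 3):

import os
import pickle

cache_file = 'stats.pickle'          # hypothetical file name, just for this sketch
data = {'issue': 0, 'hosts': {}}     # same shape as the data dictionary of the script

# Load the previously collected data if a cache file already exists.
if os.path.exists(cache_file):
    with open(cache_file, 'rb') as fh:
        data = pickle.load(fh)

# ... collect new data here, continuing from issue number data['issue'] + 1 ...

# Save the (possibly updated) data for the next run.
with open(cache_file, 'wb') as fh:
    pickle.dump(data, fh)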
The two classes in the script are the `Collector` class and the `ArchiveHTMLParser` class. The latter is a subclass of `HTMLParser` that parses the retrieved HTML files, the individual archive pages. The `Collector` class has a method called `collect` that does the data collection (using the `ArchiveHTMLParser` class), and another method called `report` that generates the report.
Construction and loading the serialized data
In the body of the script we create an instance of the `Collector` class and immediately call its `run` method, which processes the command line arguments.
When the `Collector` instance is created, Python calls the `__init__` constructor, where we set the name of the file in which we keep the serialized data and set up a dictionary representing the data in case the file does not exist yet. Then we call the `read_cache` method, which loads the data from the pickled file if that file already exists. When we run the script for the first time this method does not do anything, and we rely on the `data` attribute created in the constructor.
class Collector(object):
    def __init__(self):
        self.cache_file = 'python_weekly.pickle'
        self.data = {'issue': 0, 'hosts': {}}
        self.read_cache()

    def read_cache(self):
        if os.path.exists(self.cache_file):
            fh = open(self.cache_file, 'r')
            self.data = pickle.load(fh)
            fh.close()
Saving the serialized data
A separate `save_cache` method was added to serialize the data using `pickle` and save it to a file.
    def save_cache(self):
        fh = open(self.cache_file, 'w')
        pickle.dump(self.data, fh)
        fh.close()
The Data
The dictionary `data` itself has two keys. `issue` holds the number of the most recently processed issue. It defaults to 0 and is updated every time we fetch a new page. The other key is `hosts`, which itself holds a dictionary in which the key/value pairs are the hostnames and the number of their occurrences.
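For illustration only (the counts here are invented, not the real statistics), the dictionary might look like this after a few issues have been processed:

data = {
    'issue': 3,                # number of the most recently processed issue
    'hosts': {
        'github.com': 12,      # hostname -> number of links seen so far
        'www.youtube.com': 3,
        'docs.python.org': 2,
    },
}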
Data collection
The `collect` method has an infinite loop in it. On every iteration we increase the number of the issue we are processing, and we leave the loop using `break` when the retrieval of the HTML page fails, assuming that this means we have reached the end of the archive.
The first step in the collection is to create the URL of the current issue from the `base_url`, the number of the current issue, and an .html extension. We then use urllib2 to fetch the page. If the fetching fails we leave the infinite loop using `break`.
If fetching the HTML was successful, we create a parser instance and feed the html to it. The parser is very simple. It extracts the links and puts the URLs into an attribute called `urls`. That way we can access the list of URLs after the parsing has finished.
Once the parsing is finished we iterate over the URLs held in the `parser.urls` attribute. We skip the self links that contain only a "#", and check whether the current URL starts with either http:// or https://. We used a regular expression for this. Using another regular expression we extract the fully qualified hostname from the URL, removing the http:// from the beginning and the path that follows the host and domain name.
    def collect(self):
        base_url = 'http://www.pythonweekly.com/archive/'
        while True:
            self.data['issue'] += 1
            url = base_url + str(self.data['issue']) + '.html'
            try:
                f = urllib2.urlopen(url)
                html = f.read()
                f.close()
            except urllib2.HTTPError as e:
                print(e, 'while fetching', url)
                break
            parser = ArchiveHTMLParser()
            parser.urls = []
            parser.feed(html)
            for u in parser.urls:
                if u == '#':
                    continue
                if not re.search(r'^https?://', u):
                    print("Unhandled url: " + u)
                    continue
                match = re.search(r'^https?://([^/]+)', u)
                if match:
                    host = match.group(1)
                    if host not in self.data['hosts']:
                        self.data['hosts'][host] = 0
                    self.data['hosts'][host] += 1
            self.save_cache()
If you are not familiar with regexes, then this regex might need some explanation:
match = re.search(r'^https?://([^/]+)', u)
- `^` means match at the beginning of the string.
- The `?` in `https?` means the character `s` is optional in the match.
- `://` just matches `://`, nothing special there.
- `[^/]` means match any character that is not a slash. `[^/]+` matches as many non-slash characters as possible, but at least one (`+` means 1 or more).
- The parentheses `()` tell the regex engine to capture whatever was matched by the regex inside the parentheses and make that accessible via `match.group(1)`.
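To make this concrete, here is a small, standalone example (the sample URLs are made up) showing what the regex captures and what it rejects:

import re

for u in ['https://github.com/some/repo', 'http://www.meetup.com/', 'mailto:someone@example.com']:
    match = re.search(r'^https?://([^/]+)', u)
    if match:
        print(match.group(1))         # prints github.com, then www.meetup.com
    else:
        print('Unhandled url: ' + u)  # the mailto: link does not match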
On every iteration, after retrieving a page and adding its links to the `data` dictionary, we call the `save_cache` method to serialize the data and save it to disk.
Parsing HTML with HTMLParser
The `HTMLParser` subclass is quite simple, and it was already explained in the article about extracting HTML links using Python. As the parser goes over the HTML page, it calls the `handle_starttag` method on every start tag. We only handle the `a` tags and save their `href` attribute.
class ArchiveHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        if 'href' in attr:
            self.urls.append(attr["href"])
        else:
            print("Missing href from a with attributes: ", attr)
Generate Report
The last part of the `Collector` class is the `report` method, which uses the `data` attribute to list the hosts and their hit counts, sorted by the count in decreasing order.
    def report(self):
        hosts, total = 0, 0
        for k in sorted(self.data['hosts'].keys(), key=lambda x: self.data['hosts'][x], reverse=True):
            hosts += 1
            total += int(self.data['hosts'][k])
            #print(k, self.data['hosts'][k])
            print('<li><a href="http://{0}/">{0}</a> {1}</li>'.format(k, self.data['hosts'][k]))
        print("A total of {} links to {} hosts in {} issues.".format(total, hosts, self.data['issue']))
Conclusion
As I wrote in the opening, these numbers only represent the first 81 issues of the Python Weekly. It could be interesting to track down the other, more than 140 issues and create the stats based on those as well.
It might also be interesting to merge the different Blogspot country domains. (If you visit a Blogspot site from Italy it will redirect you to blogspot.it, and if the author of the weekly does not notice, the localized link gets included in the newsletter. It is not a problem, but it skews the stats.)
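One possible way to merge them, sketched here as an untested idea rather than something the script does, would be to normalize every Blogspot hostname to its .com form before counting:

import re

def normalize_host(host):
    # Collapse country-specific Blogspot domains (e.g. morepypy.blogspot.it or
    # morepypy.blogspot.co.uk) onto the .com form so they are counted together.
    match = re.search(r'^(.+)\.blogspot\.[a-z.]+$', host)
    if match:
        return match.group(1) + '.blogspot.com'
    return host

print(normalize_host('morepypy.blogspot.it'))     # morepypy.blogspot.com
print(normalize_host('morepypy.blogspot.co.uk'))  # morepypy.blogspot.com
print(normalize_host('github.com'))               # github.com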