Now that we know how to fetch an HTML page with Python using urllib, we can take another step and try to extract all the links from the HTML file. For this we are going to use the HTMLParser module.
examples/python/print_links_html_parser.py
from __future__ import print_function
import urllib2, sys
from HTMLParser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        print(attr)


def extract():
    if len(sys.argv) != 2:
        print('Usage: {} URL'.format(sys.argv[0]))
        return
    url = sys.argv[1]

    try:
        f = urllib2.urlopen(url)
        html = f.read()
        f.close()
    except urllib2.HTTPError as e:
        print(e, 'while fetching', url)
        return

    parser = MyHTMLParser()
    parser.feed(html)

extract()
The extract function first expects a URL on the command line, and then, using that URL and the urllib2 library, it fetches the HTML served on that URL. Then we create an HTMLParser instance and call the feed method, passing the HTML to it. More precisely, we are subclassing HTMLParser and we create an instance of that subclass.
The way HTMLParser works is that it goes over all the elements of the HTML, and every time it encounters an opening tag it calls the handle_starttag method with two parameters (besides the object itself): the name of the tag and the attributes as a list of tuples. When it encounters an end tag it calls handle_endtag with the name of the tag. When it encounters text inside a tag (for example the anchor text of a link), it calls the handle_data method with the text.
If we subclass HTMLParser and implement some or all of the above methods, then when we call the feed method, it will call the methods we have overridden in the subclass.
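To make the callback sequence concrete, here is a minimal sketch that overrides all three methods and records each call. The class name EventLogger, the events list, and the sample HTML are made up for this illustration and are not part of the script above. The import is written so that it works with both the Python 2 module name used in this article and the Python 3 name (html.parser).

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name, as in the article


class EventLogger(HTMLParser):
    """Hypothetical subclass that records every callback the parser makes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


# Made-up sample input, so the sketch does not need network access
sample_html = '<p>Visit <a href="https://example.com/">Example</a></p>'

logger = EventLogger()
logger.feed(sample_html)
logger.close()

for event in logger.events:
    print(event)
```

The events come out in document order: the opening p, the text before the link, the opening a together with its href attribute, the link text, and then the two closing tags. Methods we do not override keep their default behavior, which is to do nothing.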
So we have created a subclass called MyHTMLParser and we have implemented the handle_starttag method in it. In this task we are only interested in the URLs of the links, and those are the href attributes in the opening part of the a tags.
Inside the method we check the tag, and if it is not an a then we simply return: we don't need to do anything with such tags.
If it is an a tag, we convert the attributes to a dictionary and then print them out.
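Since only the href is of interest, one possible variation is to print just that value instead of the whole dictionary. The following is a sketch under a few assumptions: the helper href_of, the class HrefPrinter, and the sample HTML are hypothetical names and data introduced here, and the import again accepts both the Python 2 and the Python 3 module name.

```python
from __future__ import print_function

try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name, as in the article


def href_of(attrs):
    """Hypothetical helper: return the href value from a list of
    (name, value) tuples, or None if there is no href attribute."""
    return dict(attrs).get('href')


class HrefPrinter(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = href_of(attrs)
        # An a tag may have no href at all (for example a named anchor)
        if href is not None:
            print(href)


# Made-up sample input: one real link, one a tag without href, one other tag
sample_html = (
    '<p><a href="/about" class="nav">About</a> '
    '<a name="top">no href here</a> '
    '<img src="logo.png"></p>'
)

parser = HrefPrinter()
parser.feed(sample_html)
parser.close()
```

With this input only /about is printed. Using get on the dictionary avoids a KeyError for a tags that carry no href attribute.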
In the next article we'll see how we can collect this information for later use.