
Web client - web scraping

get HTML page using urllib

urllib is a rather low-level library. It comes standard with Python, so it can be useful for simple tasks.

import urllib.request

# fh is like a filehandle
with urllib.request.urlopen('https://python.org/') as fh:
    html = fh.read()

print(html)

This code will print the content of the HTML page downloaded from the given web site.

If there is an error the request will raise an exception.

For example, if we ask for a page that does not exist on the server, we'll get a urllib.error.HTTPError with the text HTTP Error 404: Not Found.

If we ask for a URL that does not even resolve, we'll get a urllib.error.URLError exception with the text Name or service not known.

We should probably wrap the code in a try - except statement.

urllib GET with error handling

The urllib module is split into several submodules. One of them is urllib.error, which defines the exceptions raised by urllib.request.

import urllib.request
import urllib.error
import sys

if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} URL")

url = sys.argv[1]

try:
    with urllib.request.urlopen(url) as fh:
        html = fh.read()
except urllib.error.HTTPError as err:
    if err.code == 404:
        print("not found")
    elif err.code == 500:
        print("internal error")
    else:
        print(f"Unhandled error: {err.code}")

    print(f"HTTPError: {err}")
except urllib.error.URLError as err:
    print(f"URLError: {err}")
    print(err.reason)
else:
    print(f"Success: Downloaded {len(html)} characters.")

We are using the httpbin service locally (see the Run httpbin locally section further down for how to start it).

Asking for the main page we get a proper response with content:

$ python3 urllib_error_handling.py http://localhost/
Success: Downloaded 9593 characters.

no name resolution

If we try to access a site, but there is no DNS name resolution, we get a URLError and we can check its reason field to see what the problem was.

$ python3 urllib_error_handling.py http://notlocalhost
URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
[Errno -3] Temporary failure in name resolution

Connection refused

If the name resolution works, but the web server does not respond (in our case we don't run any service on port 8081 locally), then again we get a URLError, but with a different reason.

$ python3 urllib_error_handling.py http://localhost:8081
URLError: <urlopen error [Errno 111] Connection refused>
[Errno 111] Connection refused

404 NOT FOUND

We can ask the httpbin service to return specific status codes. That's a very nice way to test how our code behaves when it receives an "unexpected" status code.

$ python3 urllib_error_handling.py http://localhost/status/404
not found
HTTPError: HTTP Error 404: NOT FOUND

So we received an HTTPError and we could even find out what the exact status code was.

500 INTERNAL SERVER ERROR

We can even ask the httpbin service to return a 500 error.

$ python3 urllib_error_handling.py http://localhost/status/500
internal error
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR

401 UNAUTHORIZED

We can ask httpbin to return a 401 status code. Our example does not have any special treatment for that so we just print "Unhandled error".

$ python3 urllib_error_handling.py http://localhost/status/401
Unhandled error: 401
HTTPError: HTTP Error 401: UNAUTHORIZED

301 redirection

However, redirections, such as the 301 status code, do not trigger an exception. So if we want to handle these as well, we'll have to use another approach; one possibility is sketched after the output below.

$ python3 urllib_error_handling.py http://localhost/status/301
Success: Downloaded 228 characters.
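
One possible approach is to install an opener whose redirect handler refuses to follow redirects; urlopen then raises an HTTPError for 3xx responses as well. This is only a minimal sketch and it assumes the local httpbin service used above:

import urllib.request
import urllib.error

class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Returning None instead of a new Request makes urlopen raise HTTPError for 3xx responses
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirectHandler())
urllib.request.install_opener(opener)

try:
    with urllib.request.urlopen('http://localhost/status/301') as fh:
        html = fh.read()
except urllib.error.HTTPError as err:
    print(f"HTTPError: {err.code}")   # err.code is 301 here instead of the redirect being followed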

Download image using urllib

Usually you will want to save the downloaded image to the local disk.

import urllib.request

url = 'https://www.python.org/images/python-logo.gif'
with urllib.request.urlopen(url) as fh:
    with open('logo.gif', 'wb') as out:
        out.write(fh.read())

get HTML page using requests

  • requests

requests is the de-facto standard in Python for dealing with web pages as a web client.

import requests

res = requests.get('https://python.org/')
print(type(res))
print(res.status_code)
print(res.headers)
print(res.headers['content-type'])
# print(res.content)

Download image using requests

import requests

url = 'https://www.python.org/images/python-logo.gif'
filename = 'logo.gif'
res = requests.get(url)
print(res.status_code)
with open(filename, 'wb') as out:
    out.write(res.content)

Download image as a stream using requests

OK, this is not such a good example for streaming.

import requests
import shutil

url = 'https://bloximages.newyork1.vip.townnews.com/wpsdlocal6.com/content/tncms/assets/v3/editorial/7/22/722f8401-e134-5758-9f4b-a542ed88a101/5d41b45d92106.image.jpg'
filename = "source.jpg"
res = requests.get(url, stream=True)
print(res.status_code)
with open(filename, 'wb') as fh:
    res.raw.decode_content = True  # without this, gzip/deflate encoded content would be copied as-is
    shutil.copyfileobj(res.raw, fh)
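
A more typical streaming pattern (just a sketch; the URL, filename, and chunk size are arbitrary) reads the body in chunks with iter_content instead of copying the raw socket object:

import requests

url = 'https://www.python.org/images/python-logo.gif'
filename = 'logo_streamed.gif'

res = requests.get(url, stream=True)
print(res.status_code)
with open(filename, 'wb') as fh:
    # iter_content yields the body piece by piece, so large files are never fully held in memory
    for chunk in res.iter_content(chunk_size=8192):
        fh.write(chunk)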

Download zip file using requests

import requests
import shutil

url = "https://code-maven.com/public/developer_survey_2019.zip"
filename = "developer_survey_2019.zip"

res = requests.get(url, stream=True)
print(res.status_code)
if res.status_code == 200:
    with open(filename, 'wb') as fh:
        res.raw.decode_content = True  # decode gzip/deflate encoded content while copying
        shutil.copyfileobj(res.raw, fh)

Extract zip file

  • zipfile
  • unzip
  • zip

This is unrelated, but once you have downloaded a zip file you will need to be able to extract its content. This example shows how to unzip a file already on your disk.

import zipfile

path = "developer_survey_2019.zip"
zf = zipfile.ZipFile(path)
zf.extractall()
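
As a variation (a sketch, not one of the original examples): because zipfile.ZipFile accepts any file-like object, a zip file downloaded with requests can also be opened straight from memory, without saving it to disk first.

import io
import zipfile
import requests

url = "https://code-maven.com/public/developer_survey_2019.zip"

res = requests.get(url)
if res.status_code == 200:
    # Wrap the downloaded bytes in a file-like object and extract directly
    zf = zipfile.ZipFile(io.BytesIO(res.content))
    print(zf.namelist())   # list the files inside the archive
    zf.extractall()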

Beautiful Soup to parse HTML

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Main_Page'
res = requests.get(url)
if res.status_code != 200:
    exit(f"Error in getting the page. Status code: {res.status_code}")
html = res.content

soup = BeautifulSoup(html, features='lxml')
print(soup.title.text)

for link in soup.find_all("a", limit=3):
    print(link)
    print(link.text)
    print(link.attrs.get('href'))
    print()

print('-----------------------------------------')

forms = soup.select("#searchform")
if forms:   # select always returns a list, so check that it is not empty
    print(forms)
    form = forms[0] # We searched by an ID so we expect 0 or 1 matches in the list
    print()
    print('Action: ', form.attrs.get('action'))

    # Search inside that element we found earlier
    for inp in form.find_all('input'):
        print('id: ', inp.attrs.get('id'))
print('-----------------------------------------')

tfa = soup.select("#mp-tfa")
if tfa:   # select returns a list; skip the block if nothing matched
    #print(tfa)
    paras = tfa[0].select("p")
    if paras:
        #print(paras)
        links = paras[0].find_all("a", limit=1)
        if links:
            print(links[0].text)
            print(links[0].attrs.get('href'))

requests - JSON - API

Downloading HTML pages and parsing them to extract data can be a lot of fun, but it is also very unstable. Page layouts will change. The code will break easily. In many cases there is a better way. Use the API provided by the site.

httpbin.org

You can't always depend on the public service and you don't want to overwhelm it with lots of requests. Luckily the whole system is packed into a Docker image, so you can run it locally:

Run httpbin locally

Start as:

docker run --rm -p 80:80 --name httpbin kennethreitz/httpbin

Then run the code:

python client.py

To stop the Docker container open another terminal and execute

docker container stop -t0 httpbin

requests get from httpbin - JSON

import requests

res = requests.get('https://httpbin.org/get')
print(type(res))
print(res.status_code)
print()
print(res.headers)
print()
#print(res.content)
print()
print(res.json())
data = res.json()
print(type(data))

requests get IP from httpbin - JSON

import requests

res = requests.get('http://httpbin.org/ip')
print(res.headers['content-type'])
print(res.text)
print()

data = res.json()
print(data)
print()
print(data['origin'])

requests get JSON User-Agent

When our browser sends a request it identifies itself.

import requests

res = requests.get('http://httpbin.org/user-agent')
#print(res.headers['content-type'])
#print(res.text)
data = res.json()
print(data)
print(data['user-agent'])

requests change User-Agent

import requests

res = requests.get('http://httpbin.org/user-agent',
    headers = {'User-agent': 'Internet Explorer/2.0'})
# print(res.headers['content-type'])
# print(res.text)

data = res.json()
print(data)
print(data['user-agent'])

requests get header

httpbin makes it easy to see what kind of headers your browser sends, not only the User-Agent.

import requests

res = requests.get('https://httpbin.org/headers')
print(res.text)

# {
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.3.0 CPython/2.7.12 Darwin/16.3.0"
#   }
# }

print()
data = res.json()
print(data)
#print(data['headers'])

requests change header

  • requests

The requests module also sends a set of default headers, but you can tell it to send other fields and values as well. This example shows how to set some additional headers.

import requests

res = requests.get('http://httpbin.org/headers',
        headers = {
            'User-agent'  : 'Internet Explorer/2.0',
            'SOAPAction'  : 'http://www.corp.net/some/path/CustMsagDown.Check',
            'Content-type': 'text/xml'
        }
    )
print(res.text)

# {
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Type": "text/xml",
#     "Host": "httpbin.org",
#     "Soapaction": "http://www.corp.net/some/path/CustMsagDown.Check",
#     "User-Agent": "Internet Explorer/2.0"
#   }
# }

requests post

  • requests
  • POST

We can also send POST requests to an address with any payload (content).

import requests

payload = '''
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:cus="http://www.corp.net/Request.XSD">
    <soapenv:Header/>
    <soapenv:Body>
       <cus:CustMsagDown.Check>
           <cus:MainCustNum>327</cus:MainCustNum>
           <cus:SourceSystem></cus:SourceSystem>
       </cus:CustMsagDown.Check>
    </soapenv:Body>
</soapenv:Envelope>
'''

res = requests.post('http://httpbin.org/post',
    headers = {
        'User-agent'  : 'Internet Explorer/2.0',
        'SOAPAction'  : 'http://www.corp.net/some/path/CustMsagDown.Check',
        'Content-type': 'text/xml'
    },
    data = payload,
)
print(res.headers['content-type'])
print(res.text)

Interactive Requests

import requests

r = requests.get('http://httpbin.org/')

import code
code.interact(local=locals())
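
Once the interactive prompt opens you can inspect the response object by hand, for example r.status_code, r.headers, or r.text, and leave the session with Ctrl-D.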

Download the weather - scraping

Download the weather - API call with requests

import configparser
import requests
import sys
import os

def get_api_key():
    config_file = 'config.ini'
    if not os.path.exists(config_file):
        exit(f"File {config_file} must exists with an [openweathermap] section and an api=  field")
    config = configparser.ConfigParser()
    config.read(config_file)
    return config['openweathermap']['api']

def get_weather(api_key, location):
    url = "https://api.openweathermap.org/data/2.5/weather?q={}&units=metric&appid={}".format(location, api_key)
    r = requests.get(url)
    return r.json()

def main():
    if len(sys.argv) != 2:
        exit("Usage: {} LOCATION".format(sys.argv[0]))
    location = sys.argv[1]

    api_key = get_api_key()
    weather = get_weather(api_key, location)

    print(weather)
    print()
    print(weather['main']['temp'])


if __name__ == '__main__':
    main()

Download the weather - API call with the openweathermap-simplified library

pip install openweathermap-simplified

from openweathermap import get_daily_forecast, APIError

try:
    #forecast = get_daily_forecast(49.24966, -123.11934)  # Vancouver, BC, Canada
    forecast = get_daily_forecast(31.8969, 34.8186)  # Rehovot, Israel (latitude, longitude)
    #forecast = get_daily_forecast('Rehovot') 
    print(forecast)
except APIError as err:
    # Deal with missing/incorrect API key or failed requests
    print(err)

Tweet

import configparser
import twitter
import os

config = configparser.ConfigParser()
config.read(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'api.cfg'))
api = twitter.Api( **config['twitter'] )

status = api.PostUpdate('My first Tweet using Python')
print(status.text)

bit.ly

import configparser
import os
import requests

def shorten(uri):
    config = configparser.ConfigParser()
    #config.read(os.path.join(os.path.expanduser('~'), 'api.cfg'))
    config.read(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'api.cfg'))

    query_params = {
        'access_token': config['bitly']['access_token'],
        'longUrl': uri
    }

    endpoint = 'https://api-ssl.bitly.com/v3/shorten'
    response = requests.get(endpoint, params=query_params, verify=False)

    data = response.json()

    if not data['status_code'] == 200:
        exit("Unexpected status_code: {} in bitly response. {}".format(data['status_code'], response.text))
    return data['data']['url']

print(shorten("http://code-maven.com/"))

API config file

{% embed include file="src/examples/web-client/api.cfg" %}
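
The embedded file itself is not shown here. Based on how the Tweet and bit.ly examples read it, an INI-style api.cfg along these lines is expected; the key names in the [twitter] section follow the python-twitter Api constructor and all values are placeholders:

[twitter]
consumer_key = YOUR_CONSUMER_KEY
consumer_secret = YOUR_CONSUMER_SECRET
access_token_key = YOUR_ACCESS_TOKEN_KEY
access_token_secret = YOUR_ACCESS_TOKEN_SECRET

[bitly]
access_token = YOUR_BITLY_ACCESS_TOKEN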

Exercise: Combine web server and client

Write a web application that gets a site and a text as input (e.g. http://cnn.com and 'Korea') and checks whether the given word appears on that site.

Extended version: Only get the URL as input and create statistics of the most frequent words on the given page.