Web client - web scraping
get HTML page using urllib
urllib is a rather low-level library. It comes standard with Python, so it can be useful for simple tasks.
import urllib.request

# fh is like a filehandle
with urllib.request.urlopen('https://python.org/') as fh:
    html = fh.read()
    print(html)
This code will print the content of the HTML page downloaded from the given web site.
If there is an error, the request will raise an exception.
For example, if we ask for a page that does not exist on the server we'll get a urllib.error.HTTPError with the text HTTP Error 404: Not Found.
If we ask for a URL that does not even resolve we'll get a urllib.error.URLError exception with the text Name or service not known.
We should probably wrap the code in a try-except statement.
urllib GET with error handling
The urllib module is split into several submodules. One of them is urllib.error, which defines the exceptions raised by urllib.request.
import urllib.request
import urllib.error
import sys

if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} URL")
url = sys.argv[1]

try:
    with urllib.request.urlopen(url) as fh:
        html = fh.read()
except urllib.error.HTTPError as err:
    if err.code == 404:
        print("not found")
    elif err.code == 500:
        print("internal error")
    else:
        print(f"Unhandled error: {err.code}")
    print(f"HTTPError: {err}")
except urllib.error.URLError as err:
    print(f"URLError: {err}")
    print(err.reason)
else:
    print(f"Success: Downloaded {len(html)} characters.")
We are using the httpbin service locally.
Asking for the main page, we get a proper response with content:
$ python3 urllib_error_handling.py http://localhost/
Success: Downloaded 9593 characters.
no name resolution
If we try to access a site but there is no DNS name resolution, we get a URLError and we can check the reason field to see what the problem was.
$ python3 urllib_error_handling.py http://notlocalhost
URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
[Errno -3] Temporary failure in name resolution
Connection refused
If the name resolution works, but the web server does not respond (in our case we don't run any service on port 8081 locally), then again we get a URLError, but with a different reason.
$ python3 urllib_error_handling.py http://localhost:8081
URLError: <urlopen error [Errno 111] Connection refused>
[Errno 111] Connection refused
404 NOT FOUND
We can ask the httpbin service to return specific status codes. That's a very nice way to test how our code behaves when it receives an "unexpected" status code.
$ python3 urllib_error_handling.py http://localhost/status/404
not found
HTTPError: HTTP Error 404: NOT FOUND
So we received an HTTPError and we could even find out what was the exact status code.
500 INTERNAL SERVER ERROR
We can even ask the httpbin service to return a 500 error.
$ python3 urllib_error_handling.py http://localhost/status/500
internal error
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
401 UNAUTHORIZED
We can ask httpbin to return a 401 status code. Our example does not have any special treatment for that so we just print "Unhandled error".
$ python3 urllib_error_handling.py http://localhost/status/401
Unhandled error: 401
HTTPError: HTTP Error 401: UNAUTHORIZED
301 redirection
However, redirections, such as the 301 status code, do not trigger an exception: by default urlopen follows the redirect and returns the final page. So if we want to handle these as well, we'll have to use another approach.
$ python3 urllib_error_handling.py http://localhost/status/301
Success: Downloaded 228 characters.
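One possible approach, sketched here under the assumption that we keep using urllib, is to install a redirect handler that refuses to follow redirects. The 3xx response then surfaces as an HTTPError that we can inspect:

import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None tells urllib not to follow the redirect,
    # so the 3xx response is raised as an HTTPError instead.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)
try:
    opener.open('http://localhost/status/301')
except urllib.error.HTTPError as err:
    print(f"Redirect status: {err.code}")
    print(f"Location: {err.headers.get('Location')}")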
Download image using urllib
Usually you will want to save the downloaded image to the local disk.
import urllib.request

url = 'https://www.python.org/images/python-logo.gif'

with urllib.request.urlopen(url) as fh:
    with open('logo.gif', 'wb') as out:
        out.write(fh.read())
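For a simple one-off download we could also use the legacy urllib.request.urlretrieve function, which saves the response body straight to a file. A minimal sketch using the same logo URL:

import urllib.request

url = 'https://www.python.org/images/python-logo.gif'
# urlretrieve downloads the URL and writes it to the given filename
urllib.request.urlretrieve(url, 'logo.gif')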
get HTML page using requests
- requests
requests is the de-facto standard in Python for dealing with web pages as a web client.
import requests
res = requests.get('https://python.org/')
print(type(res))
print(res.status_code)
print(res.headers)
print(res.headers['content-type'])
# print(res.content)
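Note that requests does not raise an exception for 4xx or 5xx responses; we either check res.status_code ourselves or call raise_for_status(). A minimal sketch, assuming the local httpbin instance mentioned earlier is running:

import requests

res = requests.get('http://localhost/status/404')
print(res.status_code)
try:
    res.raise_for_status()   # raises requests.exceptions.HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as err:
    print(f"HTTPError: {err}")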
Download image using requests
import requests
url = 'https://www.python.org/images/python-logo.gif'
filename = 'logo.gif'
res = requests.get(url)
print(res.status_code)
with open(filename, 'wb') as out:
out.write(res.content)
Download image as a stream using requests
OK, this is not such a good example for streaming.
import requests
import shutil

url = 'https://bloximages.newyork1.vip.townnews.com/wpsdlocal6.com/content/tncms/assets/v3/editorial/7/22/722f8401-e134-5758-9f4b-a542ed88a101/5d41b45d92106.image.jpg'
filename = "source.jpg"

res = requests.get(url, stream=True)
print(res.status_code)
with open(filename, 'wb') as fh:
    # make sure gzip/deflate encoded content is decoded while copying
    res.raw.decode_content = True
    shutil.copyfileobj(res.raw, fh)
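A more typical way to stream with requests is iter_content, which reads the body in fixed-size chunks so a large file never has to fit in memory at once. A small sketch, reusing the Python logo URL from earlier:

import requests

url = 'https://www.python.org/images/python-logo.gif'
filename = 'logo.gif'

res = requests.get(url, stream=True)
print(res.status_code)
with open(filename, 'wb') as fh:
    # write the body chunk by chunk instead of loading it all at once
    for chunk in res.iter_content(chunk_size=8192):
        fh.write(chunk)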
Download zip file using requests
import requests
import shutil

url = "https://code-maven.com/public/developer_survey_2019.zip"
filename = "developer_survey_2019.zip"

res = requests.get(url, stream=True)
print(res.status_code)
if res.status_code == 200:
    with open(filename, 'wb') as fh:
        res.raw.decode_content = True
        shutil.copyfileobj(res.raw, fh)
Extract zip file
- zipfile
- unzip
- zip
This is unrelated, but once you have downloaded a zip file you will need to be able to extract its content. This example shows how to unzip a file already on your disk.
import zipfile
path = "developer_survey_2019.zip"
zf = zipfile.ZipFile(path)
zf.extractall()
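If you would like to inspect the archive first, or extract it into a specific directory, zipfile supports that as well. A small sketch (the target directory name is just an example):

import zipfile

path = "developer_survey_2019.zip"
with zipfile.ZipFile(path) as zf:
    print(zf.namelist())            # list the files inside the archive
    zf.extractall("survey_data")    # extract into a subdirectory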
Beautiful Soup to parse HTML
- bs4
- BeautifulSoup
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Main_Page'
res = requests.get(url)
if res.status_code != 200:
    exit(f"Error in getting the page. Status code: {res.status_code}")

html = res.content
soup = BeautifulSoup(html, features='lxml')
print(soup.title.text)

for link in soup.find_all("a", limit=3):
    print(link)
    print(link.text)
    print(link.attrs.get('href'))
    print()

print('-----------------------------------------')

forms = soup.select("#searchform")
if forms:
    print(forms)
    form = forms[0]  # We searched by an ID, so we expect 0 or 1 matches in the list
    print()
    print('Action: ', form.attrs.get('action'))
    # Search inside that element we found earlier
    for inp in form.find_all('input'):
        print('id: ', inp.attrs.get('id'))

print('-----------------------------------------')

tfa = soup.select("#mp-tfa")
if tfa:
    #print(tfa)
    paras = tfa[0].select("p")
    if paras:
        #print(paras)
        links = paras[0].find_all("a", limit=1)
        if links:
            print(links[0].text)
            print(links[0].attrs.get('href'))
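The same methods work on any piece of HTML, so it is easy to experiment without going out to the network. A small self-contained sketch (the HTML snippet is made up for the example):

from bs4 import BeautifulSoup

html = '''
<html>
  <body>
    <h1 id="title">Demo page</h1>
    <a href="https://python.org/">Python</a>
    <a href="https://code-maven.com/">Code Maven</a>
  </body>
</html>
'''

soup = BeautifulSoup(html, features='lxml')
print(soup.h1.text)                          # Demo page
for link in soup.find_all('a'):
    print(link.text, link.attrs.get('href'))
print(soup.select('#title')[0].text)         # select() uses CSS selectors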
requests - JSON - API
Downloading HTML pages and parsing them to extract data can be a lot of fun, but it is also very unstable. Page layouts will change. The code will break easily. In many cases there is a better way. Use the API provided by the site.
httpbin.org
- httpbin.org, a website to practice various URL requests
- source code of httpbin.
You can't always depend on the public service and you don't want to overwhelm it with lots of requests. Luckily the whole system is packed into a Docker image and you can run it locally:
Run httpbin locally
Start as:
docker run --rm -p 80:80 --name httpbin kennethreitz/httpbin
Then run the code:
python client.py
To stop the Docker container, open another terminal and execute:
docker container stop -t0 httpbin
requests get from httpbin - JSON
import requests
res = requests.get('https://httpbin.org/get')
print(type(res))
print(res.status_code)
print()
print(res.headers)
print()
#print(res.content)
print()
print(res.json())
data = res.json()
print(type(data))
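If the response body is not valid JSON, res.json() raises an exception (a subclass of ValueError), so in robust code we might wrap the call. A minimal sketch using the /html endpoint of httpbin, which returns HTML rather than JSON:

import requests

res = requests.get('https://httpbin.org/html')
try:
    data = res.json()
except ValueError as err:
    print(f"Not a JSON response: {err}")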
requests get IP from httpbin - JSON
import requests
res = requests.get('http://httpbin.org/ip')
print(res.headers['content-type'])
print(res.text)
print()
data = res.json()
print(data)
print()
print(data['origin'])
requests get JSON User-Agent
When our browser sends a request it identifies itself.
import requests
res = requests.get('http://httpbin.org/user-agent')
#print(res.headers['content-type'])
#print(res.text)
data = res.json()
print(data)
print(data['user-agent'])
requests change User-Agent
import requests
res = requests.get('http://httpbin.org/user-agent',
    headers = {'User-agent': 'Internet Explorer/2.0'})
# print(res.headers['content-type'])
# print(res.text)
data = res.json()
print(data)
print(data['user-agent'])
requests get header
httpbin makes it easy to see what kind of headers your browser sends, not only the User-Agent.
import requests
res = requests.get('https://httpbin.org/headers')
print(res.text)
# {
# "headers": {
# "Accept": "*/*",
# "Accept-Encoding": "gzip, deflate",
# "Host": "httpbin.org",
# "User-Agent": "python-requests/2.3.0 CPython/2.7.12 Darwin/16.3.0"
# }
# }
print()
data = res.json()
print(data)
#print(data['headers'])
requests change header
- requests
The requests module too sends a set of default headers, but you can tell it to send other fields and values as well. This example shows how to set some additional headers.
import requests

res = requests.get('http://httpbin.org/headers',
    headers = {
        'User-agent'  : 'Internet Explorer/2.0',
        'SOAPAction'  : 'http://www.corp.net/some/path/CustMsagDown.Check',
        'Content-type': 'text/xml'
    }
)
print(res.text)
# {
# "headers": {
# "Accept": "*/*",
# "Accept-Encoding": "gzip, deflate",
# "Content-Type": "text/xml",
# "Host": "httpbin.org",
# "Soapaction": "http://www.corp.net/some/path/CustMsagDown.Check",
# "User-Agent": "Internet Explorer/2.0"
# }
# }
requests post
- requests
- POST
We can also send POST requests to an address with any payload (content).
import requests

payload = '''
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:cus="http://www.corp.net/Request.XSD">
   <soapenv:Header/>
   <soapenv:Body>
      <cus:CustMsagDown.Check>
         <cus:MainCustNum>327</cus:MainCustNum>
         <cus:SourceSystem></cus:SourceSystem>
      </cus:CustMsagDown.Check>
   </soapenv:Body>
</soapenv:Envelope>
'''

res = requests.post('http://httpbin.org/post',
    headers = {
        'User-agent'  : 'Internet Explorer/2.0',
        'SOAPAction'  : 'http://www.corp.net/some/path/CustMsagDown.Check',
        'Content-type': 'text/xml'
    },
    data = payload,
)
print(res.headers['content-type'])
print(res.text)
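For JSON-based APIs requests can serialize the payload for us: passing json= sets the request body and the Content-Type header. A small sketch against the httpbin /post endpoint, which echoes the parsed payload back under the "json" key:

import requests

res = requests.post('http://httpbin.org/post', json={'name': 'foo', 'count': 42})
print(res.status_code)
data = res.json()
print(data['json'])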
Interactive Requests
import requests
r = requests.get('http://httpbin.org/')
import code
code.interact(local=locals())
Download the weather - scraping
- Open Weather map
- type in the name of your city
Download the weather - API call with requests
import configparser
import requests
import sys
import os

def get_api_key():
    config_file = 'config.ini'
    if not os.path.exists(config_file):
        exit(f"File {config_file} must exist with an [openweathermap] section and an api= field")
    config = configparser.ConfigParser()
    config.read(config_file)
    return config['openweathermap']['api']

def get_weather(api_key, location):
    url = "https://api.openweathermap.org/data/2.5/weather?q={}&units=metric&appid={}".format(location, api_key)
    r = requests.get(url)
    return r.json()

def main():
    if len(sys.argv) != 2:
        exit("Usage: {} LOCATION".format(sys.argv[0]))
    location = sys.argv[1]

    api_key = get_api_key()
    weather = get_weather(api_key, location)
    print(weather)
    print()
    print(weather['main']['temp'])

if __name__ == '__main__':
    main()
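The script expects a config.ini file next to it. Based on the code above, a minimal file looks like this (the key shown is, of course, a fake placeholder):

[openweathermap]
api=0123456789abcdef0123456789abcdef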
Download the weather - API call with the openweathermap-simplified library
pip install openweathermap-simplified
from openweathermap import get_daily_forecast, APIError

try:
    #forecast = get_daily_forecast(49.24966, -123.11934) # Vancouver, BC, Canada
    forecast = get_daily_forecast(34.8186, 31.8969) # Rehovot, Israel
    #forecast = get_daily_forecast('Rehovot')
    print(forecast)
except APIError as err:
    # Deal with missing/incorrect API key or failed requests
    print(err)
Tweet
import configparser
import twitter
import os

config = configparser.ConfigParser()
config.read(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'api.cfg'))

api = twitter.Api( **config['twitter'] )

status = api.PostUpdate('My first Tweet using Python')
print(status.text)
bit.ly
import configparser
import os
import requests

def shorten(uri):
    config = configparser.ConfigParser()
    #config.read(os.path.join(os.path.expanduser('~'), 'api.cfg'))
    config.read(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'api.cfg'))

    query_params = {
        'access_token': config['bitly']['access_token'],
        'longUrl': uri
    }

    endpoint = 'https://api-ssl.bitly.com/v3/shorten'
    response = requests.get(endpoint, params=query_params, verify=False)

    data = response.json()
    if data['status_code'] != 200:
        exit("Unexpected status_code: {} in bitly response. {}".format(data['status_code'], response.text))
    return data['data']['url']

print(shorten("http://code-maven.com/"))
API config file
The Tweet and bit.ly examples above read their credentials from an api.cfg file; the sample file lives at src/examples/web-client/api.cfg in the examples.
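Based on how the two scripts read it, api.cfg is an INI-style file with a [twitter] section whose keys are passed directly to twitter.Api, and a [bitly] section with an access_token. The key names below are an assumption for illustration, not the original file:

[twitter]
; keys expected by twitter.Api (assumed names, fill in your own credentials)
consumer_key=...
consumer_secret=...
access_token_key=...
access_token_secret=...

[bitly]
access_token=...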
Exercise: Combine web server and client
Write a web application that gets a site and a text as input (e.g. http://cnn.com and 'Korea') and checks whether the given word appears on that site.
Extended version: get only the URL as input and create statistics of the most frequent words on the given page.