Requests is the most straightforward HTTP library in Python; it supports the full range of RESTful API methods: GET, POST, PUT, and DELETE. It lets the user send requests to an HTTP server and get a response back as HTML or JSON. It also allows the user to send POST requests to the server to modify or add content.
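As a minimal sketch of the two most common methods (httpbin.org is a public request-testing service, used here purely as an example endpoint):

```python
import requests

# A simple GET request; the response object exposes the status code,
# the raw body, and a JSON helper
response = requests.get("http://httpbin.org/get")
print(response.status_code)   # 200 on success
print(response.json())        # httpbin echoes the request back as JSON

# A POST request that sends form data to the server
response = requests.post("http://httpbin.org/post", data={"key": "value"})
print(response.status_code)
```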
It's a good idea to maintain and reuse a web scraping session with Requests so that cookies and other parameters persist between requests. This can also improve performance, because the session reuses the underlying TCP connection to a host.
import requests

with requests.Session() as session:
    # Set cookies
    session.get('http://httpbin.org/cookies/set?key=value')
    # Get cookies
    response = session.get('http://httpbin.org/cookies')
    print(response.text)
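As a further sketch of how a session persists state (here without any network traffic), cookies can also be set directly on the session's cookie jar; every request made through the session will then carry them:

```python
import requests

session = requests.Session()
# Cookies stored on the session persist across all requests it makes
session.cookies.set("key", "value")
print(session.cookies.get("key"))  # value
```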
Overview and installation
pip install requests

BeautifulSoup

from bs4 import BeautifulSoup
data = """
<ul>
<li class ="item"> l1 </li>
<li class ="item"> l2 </li>
<li class ="item"> l3 </li>
</ul>
"""
soup = BeautifulSoup(data , 'html.parser')
for l in soup.select("li.item"):
print(l.get_text())
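Besides CSS selectors, BeautifulSoup also exposes find and find_all for locating tags. A small sketch with the same parser:

```python
from bs4 import BeautifulSoup

html = '<div><a href="https://example.com">Example</a><a href="https://example.org">Other</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag; attributes read like a dict
first_link = soup.find("a")
print(first_link["href"])       # https://example.com
print(first_link.get_text())    # Example

# find_all returns every matching tag as a list
print(len(soup.find_all("a")))  # 2
```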
Overview and installation
pip install beautifulsoup4

Selenium

Overview and installation
pip install selenium
However, you need additional drivers for Selenium to be able to interface with your chosen web browser.
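As a sketch of what driving a browser looks like (this assumes Selenium 4 with a working Chrome/chromedriver setup; the URL is just a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # run without opening a browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    # Selenium renders the page in a real browser, so JavaScript-generated
    # content is available, unlike with plain HTTP libraries
    print(driver.title)
finally:
    driver.quit()
```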
Urllib

Urllib is a Python library that lets the developer open and parse information over the HTTP or FTP protocols. Urllib offers several modules for working with and opening URLs, namely:
urllib.request: opens and reads URLs.
urllib.error: catches the exceptions raised by urllib.request.
urllib.parse: parses URLs.
urllib.robotparser: parses robots.txt files.
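For example, urllib.robotparser can check whether a crawler is allowed to fetch a URL. A small sketch that parses robots.txt rules supplied directly as lines (the rules and URLs are made up for illustration):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the lines of a robots.txt file
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(useragent, url) applies the parsed rules
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```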
HTTP Get
import urllib.request as req

# urlopen returns a file-like object
response = req.urlopen("https://stackoverflow.com/documentation/")
print(response.code)
print(response.read())

HTTP Post with parameters
import urllib.parse

query_param = {"username": "stackoverflow", "password": "me.em"}
query_encode = urllib.parse.urlencode(query_param).encode('utf8')
response_param = req.urlopen("https://stackoverflow.com/users/login", query_encode)
print(response_param.code)

Overview and installation
Urllib is part of the Python standard library, so there is nothing to install.

Scrapy

Try to avoid using Scrapy if you have a small project or you want to scrape just one or a few web pages. In that case, Scrapy will overcomplicate things and won't add any benefit.
I haven't gotten my hands dirty with Scrapy yet.


