rowidanagah/Web_Scrapping_Hands_on

Broad Summary

Requests

Requests is the most straightforward HTTP library; it supports the full set of RESTful methods: GET, POST, PUT, and DELETE. It lets the user send requests to an HTTP server and get the response back as HTML or JSON, and it also lets the user send POST requests to the server to modify or add content.

It's a good idea to maintain a web scraping session with requests in order to persist cookies and other parameters. A session also reuses the underlying TCP connection to a host, which can improve performance.

import requests

with requests.Session() as session:
	# Set Cookies
	session.get('http://httpbin.org/cookies/set?key=value')
	# Get Cookies
	response = session.get('http://httpbin.org/cookies')
	print(response.text)

Overview and installation

pip install requests

Beautiful Soup

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. It has limited support for CSS selectors, but covers the most commonly used ones.

from bs4 import BeautifulSoup
data = """
	<ul>
		<li class ="item"> l1 </li>
		<li class ="item"> l2 </li>
		<li class ="item"> l3 </li>
	</ul>
   """
soup = BeautifulSoup(data , 'html.parser')

for l in soup.select("li.item"):
	print(l.get_text())

Overview and installation

pip install beautifulsoup4

Selenium

Selenium is a Python library originally made for automated testing of web applications. It launches and controls a real web browser, simulating an actual user, which helps with websites that don't like to be scraped. Selenium can also modify browser cookies, fill in forms, and take screenshots of the browser window.

Overview and installation

pip install selenium

However, you need an additional driver for it to interface with your chosen web browser.

Urllib

Urllib is a Python standard-library package that allows the developer to open and parse resources over the HTTP or FTP protocols. Urllib offers several modules for dealing with and opening URLs, namely:

  • urllib.request: opens and reads URLs.
  • urllib.error: catches the exceptions raised by urllib.request.
  • urllib.parse: parses URLs.
  • urllib.robotparser: parses robots.txt files.
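
As a quick illustration of urllib.parse, here is a minimal sketch that splits a URL into its components and decodes its query string (the URL itself is just a made-up example):

```python
from urllib.parse import urlparse, parse_qs

# Split a URL into scheme, host, path, and query
parts = urlparse("https://example.com/search?q=scraping&page=2")
print(parts.scheme)           # https
print(parts.netloc)           # example.com
print(parts.path)             # /search
print(parse_qs(parts.query))  # {'q': ['scraping'], 'page': ['2']}
```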

HTTP Get

import urllib.request as req

# urlopen returns a file-like object
response = req.urlopen("https://stackoverflow.com/documentation/")
print(response.code)
print(response.read())

HTTP Post with parameters

import urllib.parse

query_param = {"username": "stackoverflow", "password": "me.em"}
query_encode = urllib.parse.urlencode(query_param).encode('utf8')
# Passing a data argument makes urlopen issue a POST request
response_param = req.urlopen("https://stackoverflow.com/users/login", data=query_encode)
print(response_param.code)
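
urllib.robotparser can check robots.txt rules as well. A minimal sketch with a hypothetical rule set, fed in directly as lines so no network access is needed (normally you would use set_url() and read() against a live site):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the robots.txt content as a list of lines
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```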

Overview and installation

Urllib is part of the Python standard library, so no installation is required.

Scrapy

Try to avoid using Scrapy if you have a small project or you only want to scrape one or a few webpages. In that case, Scrapy will overcomplicate things without adding any benefit.

I haven't gotten my hands dirty with Scrapy yet.

Resources

About

This is a simple hands-on web scraping tutorial that demonstrates various web scraping techniques and applies them to a number of websites.
