rowidanagah/Web_Scrapping_Hands_on

Broad Summary

Requests

Requests is the most straightforward HTTP library; it supports the full set of RESTful methods: GET, POST, PUT, and DELETE. It lets the user send requests to an HTTP server and get the response back as HTML or JSON, and it also lets the user send POST requests to the server to modify or add content.

It's a good idea to maintain a web scraping session with requests in order to persist cookies and other parameters. A session also reuses the underlying TCP connection to a host, which can improve performance.

import requests

with requests.Session() as session:
	# Set Cookies
	session.get('http://httpbin.org/cookies/set?key=value')
	# Get Cookies
	response = session.get('http://httpbin.org/cookies')
	print(response.text)

Overview and installation

pip install requests

Beautiful Soup

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. It has limited support for CSS selectors, but covers the most commonly used ones.

from bs4 import BeautifulSoup
data = """
	<ul>
		<li class ="item"> l1 </li>
		<li class ="item"> l2 </li>
		<li class ="item"> l3 </li>
	</ul>
   """
soup = BeautifulSoup(data , 'html.parser')

for l in soup.select("li.item"):
	print(l.get_text())

Overview and installation

pip install beautifulsoup4

Selenium

Selenium is a Python library originally made for automated testing of web applications. It launches and controls a real web browser, simulating an actual user, which helps with websites that don't like to be scraped. Selenium can also modify browser cookies, fill in forms, and take screenshots of the browser window.

Overview and installation

pip install selenium

However, you need an additional driver for it to interface with your chosen web browser.

Urllib

Urllib is a Python standard-library package that allows the developer to open and parse resources over the HTTP or FTP protocols. Urllib offers several modules for dealing with and opening URLs, namely:

  • urllib.request: opens and reads URLs.
  • urllib.error: catches the exceptions raised by urllib.request.
  • urllib.parse: parses URLs.
  • urllib.robotparser: parses robots.txt files.
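
As a quick illustration of urllib.parse, here is a minimal sketch that splits a URL into its components and decodes its query string (the URL itself is just a made-up example):

```python
from urllib.parse import urlparse, parse_qs

# Split a URL into scheme, host, path, and query
parts = urlparse("https://example.com/search?q=scraping&page=2")
print(parts.scheme)           # https
print(parts.netloc)           # example.com
print(parts.path)             # /search
print(parse_qs(parts.query))  # {'q': ['scraping'], 'page': ['2']}
```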

HTTP Get

import urllib.request as req

# urlopen returns a file-like object
response = req.urlopen("https://stackoverflow.com/documentation/")
print(response.code)
print(response.read())

HTTP Post with parameters

import urllib.parse

query_param = {"username": "stackoverflow", "password": "me.em"}
query_encode = urllib.parse.urlencode(query_param).encode('utf8')
# Passing a data argument makes urlopen issue a POST request
response_param = req.urlopen("https://stackoverflow.com/users/login", data=query_encode)
print(response_param.code)
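
urllib.robotparser can check robots.txt rules as well. A minimal sketch with a hypothetical rule set, fed in directly as lines so no network access is needed (normally you would use set_url() and read() against a live site):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the robots.txt content as a list of lines
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```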

Overview and installation

Urllib is part of the Python standard library, so no installation is required.

Scrapy

Try to avoid using Scrapy if you have a small project or you only want to scrape one or a few webpages. In that case, Scrapy will overcomplicate things without adding any benefit.

I haven't gotten my hands dirty with Scrapy yet.

Resources

About

This is a simple hands-on web scraping tutorial that demonstrates various web scraping techniques and applies them to a number of websites.
