Web Scraping using Python

Venturing into Machine Learning, I quickly realized the need for good datasets. Web scraping can come in very handy when datasets aren't easily available.

It is important to use these web scraping bots in moderation and in accordance with the terms and conditions of the websites being scraped.
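One moderation check you can automate is the site's robots.txt, which lists the paths a site asks bots not to fetch. Python's standard library can parse it. A minimal sketch, here parsing a made-up robots.txt inline (real use would point `set_url()` at the site's robots.txt and call `read()`):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# A toy robots.txt parsed from a string; in real use:
#   rp.set_url("http://example.com/robots.txt"); rp.read()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# Check whether a given path is allowed for any user agent
print(rp.can_fetch("*", "http://example.com/news"))       # True
print(rp.can_fetch("*", "http://example.com/private/x"))  # False
```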

Static websites

Here’s how to use requests and BeautifulSoup to get started scraping websites with Python.

Install these packages if needed:

pip install requests
pip install bs4
pip install lxml

Here’s a code snippet that fetches five headlines from BBC News:

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse the HTML with the lxml parser
result = requests.get("http://www.bbc.com/news")
soup = BeautifulSoup(result.content, "lxml")

# The headlines live in <h3> elements; take the first five
headlines = soup.find_all("h3")[:5]
for headline in headlines:
    print(headline.text)

Here are the docs for BeautifulSoup.
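Besides find_all, BeautifulSoup offers select(), which takes CSS selectors — handy when the elements you want are marked by a class rather than a bare tag. A minimal sketch against an inline HTML snippet (the markup and the headline class are invented for illustration):

```python
from bs4 import BeautifulSoup

# Toy markup standing in for a news page; the "headline" class is made up
html = """
<div>
  <h3 class="headline">First story</h3>
  <h3 class="headline">Second story</h3>
  <h3>Unrelated h3</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml needed
# CSS selector: only <h3> elements carrying the "headline" class
headlines = [h.text for h in soup.select("h3.headline")]
print(headlines)  # ['First story', 'Second story']
```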

Dynamic websites

These days, with many websites fetching data after the initial page load, the above method won’t cut it anymore. requests can’t execute JavaScript the way a browser does. Enter Selenium. A WebDriver is needed for Selenium to work. I use ChromeDriver, which can be downloaded here.

Install selenium if needed:

pip install selenium

Here’s a code snippet that also fetches five headlines from BBC News:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Raw string avoids backslash escapes in the Windows path
driver = webdriver.Chrome(service=Service(r"C:\path\to\chromedriver\chromedriver.exe"))
driver.implicitly_wait(5)  # wait up to 5 seconds for dynamic data to load
driver.get("http://www.bbc.com/news")

# The headlines live in <h3> elements; take the first five
headlines = driver.find_elements(By.TAG_NAME, "h3")[:5]
for headline in headlines:
    print(headline.text)
driver.quit()  # end the session and close the browser

Here are the unofficial docs for Selenium.