Python Web Scraper (Very Simple Example)

Welcome to a tutorial on how to create a simple web scraper in Python. Need to fetch some data from another web page?

A simple web scraper in Python generally consists of 2 parts:

  • Using requests to fetch the webpage.
  • Using an HTML parser such as BeautifulSoup to find and extract data from the page.

That should cover the basics, but just how does it work exactly? Read on for the example!

ⓘ I have included a zip file with all the source code at the start of this tutorial, so you don’t have to copy-paste everything, or in case you just want to dive straight in.

 

 


DOWNLOAD & NOTES

Firstly, here is the download link to the example code as promised.

 

QUICK NOTES

  • Create a project folder, e.g. D:\scrape, and unzip the code inside this folder.
  • Navigate to the project folder in the command line with cd D:\scrape, then create a virtual environment so you do not mess up your other projects.
    • virtualenv venv
    • Windows – venv\Scripts\activate
    • Mac/Linux – source venv/bin/activate
  • Get all the packages – pip install flask requests beautifulsoup4
  • Run python S1_http.py to start the dummy HTTP server.
  • Run python S2_scrape.py (in another command line window) for the scraper example.

If you spot a bug, feel free to comment below. I try to answer short questions too, but it is one person versus the entire world… If you need answers urgently, please check out my list of websites to get help with programming.

 

EXAMPLE CODE DOWNLOAD

Click here to download all the example source code. I have released it under the MIT license, so feel free to build on top of it or use it in your own project.

 

 

PYTHON WEB SCRAPER

All right, let us now get into the details of the Python web scraper.

 

STEP 1) DUMMY PRODUCT PAGE

templates/S1_dummy.html
<div id="product">
  <img src="static/box.png" id="pImg">
  <div id="pName">Empty Box</div>
  <div id="pPrice">$12.34</div>
  <div id="pDesc">It's an... empty box. Hooray?</div>
  <input type="button" value="Add To Cart" id="pAdd">
</div>

First, we need a web page to work with. Here’s a simple dummy product page. We will use the web scraper to extract the product information from this page.
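The download includes S1_http.py, the dummy HTTP server that serves this page, but it is not listed in this guide. As a rough idea of what such a server could look like, here is a minimal sketch. This is not the actual file from the zip. It assumes Flask, the page saved as templates/S1_dummy.html, and port 80 so that http://localhost works without a port number. The function name product_page is made up for illustration.

```python
# Hypothetical stand-in for S1_http.py, NOT the file from the download
from flask import Flask, render_template

# By default, Flask loads HTML from the "templates" folder
# and serves files such as static/box.png from the "static" folder
app = Flask(__name__)

@app.route("/")
def product_page():
    # Assumes the dummy page is saved as templates/S1_dummy.html
    return render_template("S1_dummy.html")

if __name__ == "__main__":
    # Port 80 is an assumption so that http://localhost needs no port number.
    # It may need admin rights or already be in use on some systems.
    app.run(host="localhost", port=80)
```

If port 80 is not available, pick another port such as 8080 and change the scraper URL to http://localhost:8080 to match.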

 

 

STEP 2) PYTHON WEB SCRAPER

S2_scrape.py
# (A) LOAD REQUIRED MODULES
import requests
from bs4 import BeautifulSoup

# (B) GET HTML
# Fetch the dummy product page and keep the response body as a string
html = requests.get("http://localhost").text
# print(html)

# (C) HTML PARSER
# Look up each element by its id, then read the text or the attribute
soup = BeautifulSoup(html, "html.parser")
name = soup.find("div", {"id": "pName"}).text
desc = soup.find("div", {"id": "pDesc"}).text
price = soup.find("div", {"id": "pPrice"}).text
image = soup.find("img", {"id": "pImg"})["src"]
print(name)
print(desc)
print(price)
print(image)

Yep. That is pretty much all it takes to scrape a website in Python.

  1. Load the required modules.
  2. Use requests.get(URL).text to get the web page as text.
  3. We could do “hardcore string searches” or use regular expressions to extract information from the “HTML string”, but the smarter way is to use an HTML parser, which makes data extraction a lot easier.
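The three steps above work fine on the dummy page, but a real website can be slow, return an error, or be missing an element. Here is a minimal sketch of a more defensive version, assuming the same page and IDs as above. The helper names fetch_html, text_by_id, and extract_product are made up for this illustration, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url):
    """Fetch a page and return its HTML, raising on network or HTTP errors."""
    res = requests.get(url, timeout=10)  # do not hang forever on a slow server
    res.raise_for_status()               # raise on 4xx/5xx responses
    return res.text

def text_by_id(soup, element_id):
    """Return the stripped text of the element with this id, or None."""
    tag = soup.find(id=element_id)
    return tag.get_text(strip=True) if tag else None

def extract_product(html):
    """Parse the product page into a dictionary; missing parts become None."""
    soup = BeautifulSoup(html, "html.parser")
    image = soup.find("img", id="pImg")
    return {
        "name": text_by_id(soup, "pName"),
        "desc": text_by_id(soup, "pDesc"),
        "price": text_by_id(soup, "pPrice"),
        "image": image.get("src") if image else None,
    }

if __name__ == "__main__":
    try:
        print(extract_product(fetch_html("http://localhost")))
    except requests.RequestException as err:
        print(f"Could not fetch the page: {err}")
```

Splitting the fetch and the parse into separate functions also means the parser can be tried on a saved HTML string, without the dummy server running.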

P.S. The text and image source are not the only information that can be extracted. Follow up with the BeautifulSoup documentation if you need more, links are below.
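To give an idea of what else is possible, here is a short sketch that reads other attributes, uses CSS selectors, and collects several elements at once. It parses a hardcoded copy of the dummy product page so that it runs without the server. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hardcoded copy of the dummy product page, so no server is needed
html = """
<div id="product">
  <img src="static/box.png" id="pImg">
  <div id="pName">Empty Box</div>
  <div id="pPrice">$12.34</div>
  <input type="button" value="Add To Cart" id="pAdd">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors work too; select_one() returns the first match or None
button = soup.select_one("#pAdd")
button_label = button["value"]   # a single attribute
button_attrs = button.attrs      # all attributes as a dictionary

# find_all() returns every matching tag, here every <div> inside #product
product = soup.find(id="product")
div_ids = [div.get("id") for div in product.find_all("div")]

# The price is plain text, strip the "$" before turning it into a number
price = float(soup.select_one("#pPrice").text.lstrip("$"))

print(button_label)
print(div_ids)
print(price)
```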

 

 

EXTRA BITS & LINKS

That’s all for the tutorial, and here is a small section on some extras and links that may be useful to you.

 

LINKS & REFERENCES

 

THE END

Thank you for reading, and we have come to the end. I hope this guide has helped you to better understand web scraping in Python. If you have anything to share, please feel free to comment below. Good luck and happy coding!
