Very Simple PHP Web Scraper (Quick Example)

Welcome to a tutorial on how to create a simple web scraper in PHP. So you need to extract some information from a website periodically? That calls for a web scraper, and it is actually pretty simple.

We can build a web scraper in PHP by using CURL to fetch the HTML, then DOMDocument to parse and extract the information.

That covers the quick basics, but just how does this work? Read on for an example!

ⓘ I have included a zip file with all the source code at the start of this tutorial, so you don’t have to copy-paste everything, or in case you just want to dive straight in.

 

 


DOWNLOAD & NOTES

Firstly, here is the download link to the example code as promised.

 

QUICK NOTES

If you spot a bug, feel free to comment below. I try to answer short questions too, but it is one person versus the entire world… If you need answers urgently, please check out my list of websites to get help with programming.

 

EXAMPLE CODE DOWNLOAD

Click here to download all the example source code. I have released it under the MIT license, so feel free to build on top of it or use it in your own project.

 

 

PHP WEB SCRAPER

All right, let us now get into the example of creating a web scraper in PHP.

 

PART 1) DUMMY WEB PAGE

1-dummy.html
<div id="product">
  <img src="box.png" id="pImg">
  <div id="pName">Empty Box</div>
  <div id="pPrice">$12.34</div>
  <div id="pDesc">It's an... empty box. Hooray?</div>
  <input type="button" value="Add To Cart" id="pAdd">
</div>

First, let us start by creating a simple dummy product page. We will use a PHP web scraper to extract the product information from this page.

 

PART 2) WEB SCRAPER WITH CURL & DOMDOCUMENT

2-scrape.php
<?php
// (A) CURL FETCH HTML FROM WEBPAGE
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://localhost/1-dummy.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
// curl_exec() returns false on failure, so stop here instead of parsing nothing
if ($html === false) { exit("Fetch failed: " . curl_error($ch)); }
curl_close($ch);

// (B) CREATE DOM DOCUMENT
// Keep libxml warnings about imperfect HTML out of the output
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

// (C) GET DATA
echo $dom->getElementById("pName")->nodeValue . "\r\n";
echo $dom->getElementById("pDesc")->nodeValue . "\r\n";
echo $dom->getElementById("pPrice")->nodeValue . "\r\n";
echo $dom->getElementById("pImg")->getAttribute("src");

  1. As mentioned in the introduction, we use CURL to fetch the webpage into $html.
  2. At this point, $html is just one long string of “raw HTML”. We could extract information with string functions like strpos() and preg_match(), but the (probably) smarter way is to parse it into a new DOMDocument() object.
  3. DOMDocument provides a few “Javascript-like” functions such as getElementById(), which make it easier to get HTML elements and extract their text and attributes.
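
For pages that are not as tidy as the dummy one, the parsing steps above can be wrapped into a small helper. This is only a sketch under a few assumptions: the function name extractProduct() and the array keys are made up for illustration, and the HTML is passed in as a string so it can be tried without a web server.

```php
<?php
// Hypothetical helper (not part of the download): parse a product page
// held in a string and return its fields as an associative array.
function extractProduct(string $html): array {
  // Real-world HTML often triggers libxml warnings; collect them quietly,
  // then restore whatever error setting was active before.
  $previous = libxml_use_internal_errors(true);
  $dom = new DOMDocument();
  $dom->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  // getElementById() returns null when the ID is missing, so guard each lookup.
  $text = function (string $id) use ($dom): ?string {
    $node = $dom->getElementById($id);
    return $node === null ? null : trim($node->nodeValue);
  };
  $img = $dom->getElementById("pImg");

  return [
    "name"  => $text("pName"),
    "desc"  => $text("pDesc"),
    "price" => $text("pPrice"),
    "image" => $img === null ? null : $img->getAttribute("src"),
  ];
}

// Markup in the same shape as 1-dummy.html, inlined for a quick try.
$html = '<div id="product">
  <img src="box.png" id="pImg">
  <div id="pName">Empty Box</div>
  <div id="pPrice">$12.34</div>
  <div id="pDesc">It\'s an... empty box.</div>
</div>';
print_r(extractProduct($html));
```

A missing element simply shows up as null in the result, instead of stopping the script with an error on ->nodeValue.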

 

 

EXTRA) MORE ON GETTING HTML ELEMENTS

3-get-element.php
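<?php
// (A) LOAD THE DUMMY PAGE INTO A DOM DOCUMENT
$dom = new DOMDocument();
$dom->loadHTMLFile(__DIR__ . "/1-dummy.html");
// (B) FIND ELEMENTS BY CSS CLASS - MATCHES NOTHING UNTIL AN ELEMENT HAS class="pName"
// https://stackoverflow.com/questions/6366351/getting-dom-elements-by-classname
$finder = new DOMXPath($dom);
$classname = "pName";
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach ($nodes as $node) { echo $node->nodeValue . "\r\n"; }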

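Lastly, if you need to “get elements by tag name”, DOMDocument has getElementsByTagName() built in. But “get elements by CSS class” is where things get funky; there is seemingly no easy way but to use DOMXPath. Take note that pName is an ID in the dummy page, so add class="pName" to an element if you want to see a match. I will leave links below if you need to learn more.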
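
Here is a minimal sketch of both lookups side by side. The sample markup and its class attributes are made up for this illustration (the dummy page itself has no classes), and getByClass() is a hypothetical helper name, not a built-in function.

```php
<?php
// Sample markup; the class attributes are made up for this illustration.
$html = '<div id="product">
  <div id="pName" class="title big">Empty Box</div>
  <span class="tag">Storage</span>
  <span class="tag sale">Discount</span>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// (A) GET ELEMENTS BY TAG NAME - BUILT INTO DOMDOCUMENT
function getByTag(DOMDocument $dom, string $tag): array {
  $found = [];
  foreach ($dom->getElementsByTagName($tag) as $node) {
    $found[] = $node->nodeValue;
  }
  return $found;
}

// (B) GET ELEMENTS BY CSS CLASS - NEEDS AN XPATH QUERY
// Padding the class list with spaces makes the class match as a whole word,
// so "tag" will not match an element with class="tagline".
// The class name goes straight into the query, so only pass trusted values.
function getByClass(DOMDocument $dom, string $classname): array {
  $xpath = new DOMXPath($dom);
  $nodes = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]"
  );
  $found = [];
  foreach ($nodes as $node) {
    $found[] = $node->nodeValue;
  }
  return $found;
}

print_r(getByTag($dom, "span"));   // both <span> elements
print_r(getByClass($dom, "sale")); // only the element with the "sale" class
```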

 

EXTRA BITS & LINKS

That’s all for the tutorial, and here is a small section on some extras and links that may be useful to you.

 

A FEW NOTES

  • Information on public websites can be read by anyone, and scraping it is generally not illegal in itself. What you do with the data is another matter… For example, if you copy an entire article into your own website without permission, that can be subject to copyright laws.
  • Also, take note that some websites are protected by firewalls or bot detection. Attempts to fetch data using bots may be blocked.
  • Yes, we can “attach cookies” in CURL calls with options such as CURLOPT_COOKIE. See the links below.
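
To illustrate the last two points, here is a rough sketch of a CURL fetch that sends cookies, identifies itself, and checks the HTTP status. The function names scraperOptions() and fetchPage(), the user agent string, the cookie value, and the cookie file path are all assumptions made up for this example.

```php
<?php
// Hypothetical helper: collect the CURL options for one request.
function scraperOptions(string $url, string $cookieFile): array {
  return [
    CURLOPT_URL            => $url,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,            // follow redirects
    CURLOPT_TIMEOUT        => 10,              // give up after 10 seconds
    CURLOPT_USERAGENT      => "MyScraper/1.0", // made-up name
    CURLOPT_COOKIE         => "currency=USD",  // made-up cookie, sent as-is
    CURLOPT_COOKIEFILE     => $cookieFile,     // read previously saved cookies
    CURLOPT_COOKIEJAR      => $cookieFile,     // save received cookies on close
  ];
}

// Hypothetical helper: fetch a page or throw when the request fails.
function fetchPage(string $url, string $cookieFile): string {
  $ch = curl_init();
  curl_setopt_array($ch, scraperOptions($url, $cookieFile));
  $html = curl_exec($ch);
  $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  $error = curl_error($ch);
  curl_close($ch);

  // curl_exec() returns false on network errors.
  if ($html === false) { throw new RuntimeException("CURL error: $error"); }
  // A blocked bot often gets a 4xx or 5xx status instead of the page.
  if ($status >= 400) { throw new RuntimeException("HTTP status $status"); }
  return $html;
}

// Usage, assuming the dummy page from part 1 is served on localhost:
// echo fetchPage("http://localhost/1-dummy.html", __DIR__ . "/cookies.txt");
```

Checking the status code matters because a blocked request can still return a page of HTML, just not the one you wanted.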

 

 

LINKS & REFERENCES

 


THE END

Thank you for reading, and we have come to the end. I hope that it has helped you to better understand web scraping in PHP, and if you have anything to add to this guide, please feel free to comment below. Good luck and happy coding!
