Very Simple NodeJS Web Scraper (Quick Example)

Welcome to a tutorial on how to create a simple web scraper in NodeJS. So you need to get data from a website automatically? That calls for a web scraper, and it is actually easy to create one in NodeJS.

A simple web scraper in NodeJS consists of 2 parts: using fetch to get the raw HTML from the website, then using an HTML parser such as JSDOM to extract information.

Yes, but just how does that work? Read on for an example!

ⓘ I have included a zip file with all the source code at the start of this tutorial, so you don’t have to copy-paste everything, or in case you just want to dive straight in.

 

 


DOWNLOAD & NOTES

Firstly, here is the download link to the example code as promised.

 

QUICK NOTES

  • A copy of JSDOM is not included in the zip file. Run npm i jsdom to get the latest version.
  • Run 1B-server.js to start the dummy HTTP server.
  • Run 2-scrape.js for the web scraper demo.
If you spot a bug, feel free to comment below. I try to answer short questions too, but it is one person versus the entire world… If you need answers urgently, please check out my list of websites to get help with programming.

 

EXAMPLE CODE DOWNLOAD

Click here to download all the example source code. I have released it under the MIT license, so feel free to build on top of it or use it in your own project.

 

 

NODEJS WEB SCRAPER

All right, let us now get into the example of creating a web scraper in NodeJS.

 

PART 1) DUMMY PRODUCT PAGE

1A-dummy.html
<div id="product">
  <img src="box.png" id="pImg">
  <div id="pName">Empty Box</div>
  <div id="pPrice">$12.34</div>
  <div id="pDesc">It's an... empty box. Hooray?</div>
  <input type="button" value="Add To Cart" id="pAdd">
</div>

First, we need a dummy page to work with, and here is a simple product page – We will be extracting the product information from this page in the NodeJS web scraper below.
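The dummy HTTP server (1B-server.js in the zip) is not reproduced in this tutorial. If you don’t have the zip at hand, something along these lines should be enough to serve the dummy page. Take it as a rough sketch and not the original file: it assumes 1A-dummy.html and box.png sit in the folder you point it at, and the names MIME, resolveFile, and startServer are made up for this illustration.

```javascript
// Stand-in for 1B-server.js (a sketch, NOT the original file from the zip).
const http = require("http");
const fs = require("fs");
const path = require("path");

// Content types for the two kinds of file used in this demo.
const MIME = { ".html": "text/html", ".png": "image/png" };

// Turn a request URL into a file path inside `root`.
// Returns null when the URL is malformed or points outside `root`.
function resolveFile(root, reqUrl) {
  let pathname;
  try {
    pathname = decodeURIComponent(new URL(reqUrl, "http://localhost").pathname);
  } catch (err) {
    return null;
  }
  if (pathname.includes("\0")) { return null; }
  if (pathname === "/") { pathname = "/1A-dummy.html"; }
  const base = path.resolve(root);
  const file = path.join(base, pathname);
  return file.startsWith(base + path.sep) ? file : null;
}

// Create the HTTP server and start listening on `port`.
function startServer(root, port) {
  return http.createServer((req, res) => {
    const file = resolveFile(root, req.url);
    if (file === null) {
      res.writeHead(400);
      res.end("Bad request");
      return;
    }
    fs.readFile(file, (err, data) => {
      if (err) {
        res.writeHead(404);
        res.end("Not found");
        return;
      }
      const type = MIME[path.extname(file)] || "application/octet-stream";
      res.writeHead(200, { "Content-Type": type });
      res.end(data);
    });
  }).listen(port);
}
```

Calling startServer(__dirname, 80) from a script in the same folder as the dummy page should make it reachable at http://localhost/1A-dummy.html. Port 80 may need admin rights on some systems; if you pick another port, remember to put it in the scraper’s URL as well.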

 

 

PART 2) NODEJS WEB SCRAPER

2-scrape.js
// (A) LOAD JSDOM
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
 
// (B) FETCH
fetch("http://localhost/1A-dummy.html")
.then(res => res.text())
.then(txt => {
  // (B1) PARSE HTML
  const dom = new JSDOM(txt);
  const doc = dom.window.document;
 
  // (B2) EXTRACT INFORMATION
  console.log(doc.getElementById("pName").innerHTML);
  console.log(doc.getElementById("pPrice").innerHTML);
  console.log(doc.getElementById("pDesc").innerHTML);
  console.log(doc.getElementById("pImg").src);
})
.catch(err => console.log(err));

Yep, that’s all for the “very difficult” web scraper. As in the introduction:

  • We use fetch(URL).then(res => res.text()) to get the web page as a string.
  • Technically, we can use substring() and match() to extract data from the “raw HTML string”… But it’s smarter to use an HTML parser here – const dom = new JSDOM(txt).
  • Finally, you should already know all the “get HTML element” functions – getElementById(), querySelector(), querySelectorAll()… If you don’t, it’s time to catch up with the basics. Links below.

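P.S. Take note that fetch() is only enabled by default in Node 18 and later; Node 17.5 introduced it behind the --experimental-fetch flag. If you are still using an older version, it’s time to update, or consider other options such as the node-fetch module or the built-in http module.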
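If you are stuck on a Node version where the global fetch() is missing, the built-in http module can do the same job. Below is a minimal sketch of that idea. getText is a hypothetical helper name, not part of the download, and the fallback only handles plain http:// URLs without redirects.

```javascript
const http = require("http");

// Download a page as text. Uses the global fetch() when this Node version has it,
// otherwise falls back to the built-in http module (plain http:// URLs only).
function getText(url) {
  if (typeof fetch === "function") {
    return fetch(url).then(res => {
      // fetch() does not reject on HTTP errors such as 404, so check manually.
      if (!res.ok) { throw new Error("HTTP " + res.status); }
      return res.text();
    });
  }
  return new Promise((resolve, reject) => {
    http.get(url, res => {
      if (res.statusCode < 200 || res.statusCode > 299) {
        res.resume(); // discard the body so the socket is released
        reject(new Error("HTTP " + res.statusCode));
        return;
      }
      res.setEncoding("utf8");
      let body = "";
      res.on("data", chunk => { body += chunk; });
      res.on("end", () => resolve(body));
    }).on("error", reject);
  });
}
```

With this in place, getText(URL).then(txt => { ... }) can stand in for fetch(URL).then(res => res.text()) in the scraper; the rest of the JSDOM code stays the same.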

 

 

EXTRA BITS & LINKS

That’s all for the tutorial, and here is a small section on some extras and links that may be useful to you.

 

LINKS & REFERENCES

 


 

THE END

Thank you for reading, and we have come to the end. I hope this guide has helped you to better understand web scraping in NodeJS, and if you have anything to share about this guide, please feel free to comment below. Good luck and happy coding!
