What is Parsing, and Why is it Needed in Web Scraping?

Image by Danni Liu adapted from nullplus/ Canva

A year ago, I had no idea what parse meant. I first heard the term in a web scraping video. The author said something along the lines of: "…we need to parse the response from the get request using a parser. There are many types of parsers …."

And this was me:

what the fungi

Written all over my face was what the fungi is parsing!?!?! Please explain! Guess what? He didn't explain!

To save you from having fungi all over your face, I've tasked myself with explaining what parsing is and why it is required in web scraping. 😁

I'm going to cover the following:

  1. What is parsing?
  2. Why is a parser needed in web scraping?
  3. Common types of parsers used in web scraping

What is parsing?

Parsing is the act of separating information into its components. In English, it involves breaking down a sentence into its grammatical parts, such as subject, verb and object. The term parse comes from the Latin pars, as in pars orationis, meaning "part of speech".

In computing, parse means the same. A parser is a program or library that analyzes data and breaks it down into smaller, more manageable pieces. Not just that: it also transforms raw, loosely structured data (such as a long string of text) into a structured form that is easier to analyze.
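
Here's a tiny sketch of what "breaking data into its components" can look like in Python. It isn't HTML yet; I'm using the URL parser from Python's standard library because it's the simplest parser I know. The URL itself is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# A made-up URL, used only for illustration.
raw = "https://example.com/search?q=fungi&page=2"

# urlsplit parses the raw string into named components.
parts = urlsplit(raw)
print(parts.scheme)  # https
print(parts.netloc)  # example.com
print(parts.path)    # /search

# parse_qs goes one step further and parses the query string into a dict.
query = parse_qs(parts.query)
print(query)         # {'q': ['fungi'], 'page': ['2']}
```

One long string goes in, and labelled parts come out. An HTML parser does the same kind of job, just on a much messier input.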

Why is a parser needed in web scraping?

Parsers are important for web scraping because they allow you to process the HTML source code of a web page and extract the data you are interested in.

Let me elaborate if this sounds a bit murky. Recall the How the Web Works section in my blog on Understanding HTML Basics for Web Scraping. I explained that when we enter a URL in the address bar of a browser and hit enter, our browser sends an HTTP request to the server. The server then responds with an HTTP response containing a lot of information, including the HTML document. Our browser reads this document by constructing a model of the elements in it, known as the Document Object Model (DOM), and renders that as the web page we then see.

Now when web scraping, we need to write code in an editor to do what I've just described. An editor is software that allows you to write and edit code; an example of an editor is Visual Studio Code (VS Code).
Let me break it down for you. We need to write code to request the HTML document from the server. What the server returns is horrific for humans to read. Recall that the browser reads the HTML document by creating a DOM that represents the information in a logical tree structure. In the editor environment, the parser assumes this role. It builds a data structure, such as a tree, that represents the content and structure of the page. This allows us to navigate through the elements of the page and extract the data we are interested in.
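
To make the "builds a tree, then we navigate it" idea less abstract, here is a rough sketch using only Python's built-in html.parser module. The Node and TreeBuilder classes and the little HTML snippet are my own inventions for illustration; real parsing libraries do this far more robustly. I'm also assuming the HTML is well formed (every tag is closed) and hard-coding it as a string rather than fetching it from a server.

```python
from html.parser import HTMLParser


class Node:
    """One element in our toy tree (hypothetical helper, not a library class)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return every descendant with the given tag, in document order."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Turns the parser's start/end/data events into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # A new element becomes a child of the element we are currently inside.
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Assumes well-formed HTML: a closing tag just moves us up one level.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


# A hard-coded sample page, standing in for what a server would return.
html = """
<html>
  <body>
    <h1>Edible fungi</h1>
    <ul>
      <li><a href="/shiitake">Shiitake</a></li>
      <li><a href="/enoki">Enoki</a></li>
    </ul>
  </body>
</html>
"""

builder = TreeBuilder()
builder.feed(html)
builder.close()

# Navigate the tree and extract the data we care about.
heading = builder.root.find_all("h1")[0].text
links = [(a.text, a.attrs["href"]) for a in builder.root.find_all("a")]
print(heading)  # Edible fungi
print(links)    # [('Shiitake', '/shiitake'), ('Enoki', '/enoki')]
```

The raw string is just characters. Once it's a tree, "give me every link" becomes a one-liner.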

Common Types of Parsers Used in Web Scraping

There are many types of parsers. I don't know all of them. But the three commonly used for web scraping are:

  • lxml
  • html5lib
  • Python built-in parsers

lxml

lxml is a fast and flexible Python library for parsing and processing XML and HTML documents. It's pretty forgiving (lenient, as they say) of syntax errors and other problems.

html5lib

html5lib parses the page the same way a web browser does. This library is designed to be very lenient of syntax errors and other problems, and it can parse a wide range of HTML documents that other parsers might reject. Its disadvantage is that it is slow.

Python built-in parsers

The Python standard library includes a built-in HTML parser, html.parser. Its speed is decent: it's not as fast as lxml and not as lenient as html5lib.

From what I've read, it doesn't matter all that much which of the three you choose. If a document has no major errors, the outcome is more or less the same. If it does, however, each parser treats the syntax errors differently. You can read about how they handle them here if you're interested. ⚠ Warning, it's very dry. Great if you want to catch some 💤. I've only skimmed through the document and not read it in detail. It's one of those reference sources. You just need to know it exists and refer to it when required.
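
If you'd like a small taste of what a "syntax error" looks like to a parser, here is a sketch using only the built-in html.parser. The EventLogger class is a hypothetical helper I made up to record what the parser sees, and "<a></p>" is a deliberately broken snippet: it opens a link and then closes a paragraph that was never opened.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records each start and end tag the parser reports (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed("<a></p>")  # broken on purpose: </p> has no matching <p>
logger.close()
print(logger.events)  # [('start', 'a'), ('end', 'p')]
```

The built-in parser simply reports the stray closing tag as it finds it and leaves the decision about what to do with it to whoever is building the tree. That decision is exactly where the different parsers can end up with different results for the same broken page.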

What I've shared is probably very top-level, but I think it's adequate for web scraping.