What is Beautiful Soup?

Image by Danni Liu adapted from Sarsmis/ Canva

I'm located down under in Australia, and it's scorching hot this time of the year, so it's not a time for soup! But the soup I'm writing about is different. It can't give you that warm, fuzzy feeling in the cockles of your heart on a cold winter's day the way the liquid dish can, but boy, it has a magical ability that should not be underestimated! This so-called Beautiful Soup is a Python package (a bundle of reusable code that performs specific tasks) commonly used in web scraping.

I don't know if you have ever noticed that the technology industry has some pretty eccentric names. I find many of them very cute and endearing. There are a lot of things named after animals and food. Here are some of the ones I like. They are primarily in the food category 😋.

Naming conventions:

  • camelCase 🐪
  • kebab-case 🍢
  • snake_case 🐍

Python packages:

  • Pickle 🥒: a module for preserving (serializing) Python objects, hence pickle 😂
  • Pandas 🐼: a package for data manipulation and analysis
  • Turtle 🐢: a module for drawing pictures and shapes

Coding styles:

  • Spaghetti code 🍝: code characterized by messy, unreadable lines
  • Lasagna code: code that has too many layers

Android versions named after sweets:

  • Cupcake 🧁
  • Donut 🍩
  • KitKat 🍫
  • Lollipop 🍭

Rubber duck 🦆 debugging:

 It is a debugging technique where the programmer debugs code by explaining the problem out loud to someone, and if no one is around, to a rubber duck.

Aren't they adorable? I've strayed, sorry~ Back to Beautiful Soup. I'm no expert, but I'll share what I think will help you learn Beautiful Soup based on my experience. In this blog, I'll cover the following:

  • What is Beautiful Soup
  • What concepts to know to use Beautiful Soup
  • Approach to learning Beautiful Soup

What is Beautiful Soup

Beautiful Soup is a Python library (a term often used interchangeably with package). It provides functions that let us parse and navigate HTML and XML documents so we can extract data from web pages. Other tools, like Scrapy and Selenium, are also used for web scraping. I know next to nothing about Scrapy and Selenium, but from what I could gather, Beautiful Soup is one of the most popular choices and seems to be the most beginner-friendly of the three.
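
To make "parse and navigate" a little more concrete, here is a minimal sketch. The HTML snippet is made up for illustration, and I'm assuming Python's built-in html.parser as the parser so that nothing besides Beautiful Soup needs to be installed.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded web page
html = """
<html>
  <head><title>Soup of the Day</title></head>
  <body>
    <h1>Menu</h1>
    <p class="dish">Pumpkin soup</p>
    <p class="dish">Minestrone</p>
  </body>
</html>
"""

# Parse the text into a tree of Python objects that we can search
soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> tag
page_title = soup.title.string

# Text of every <p> tag whose class is "dish"
dishes = [p.get_text() for p in soup.find_all("p", class_="dish")]

print(page_title)
print(dishes)
```

The same two steps, building the soup and then searching it, are what most scraping scripts come down to.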

What Concepts to Know to Use Beautiful Soup?

In writing this section, I assume you already have a basic understanding of the Python programming language, as Beautiful Soup is a Python library. Familiarity with Python's syntax, basic data types (string, list, dictionary, etc.), and concepts such as loops and conditional statements (if, elif, else) is an essential prerequisite. It will be extremely challenging if you don't yet understand these fundamentals. You can learn Python basics on W3Schools, which is a free resource. If you prefer learning through watching videos, then try this free course: Learn Python Full Course for Beginners by freeCodeCamp.
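
As a quick self-check, here is a small sketch that uses the fundamentals listed above: strings, a list, dictionaries, a loop, and if / elif / else. The `packages` data and the `describe` function are made up for illustration. If every line makes sense to you, you are probably ready for Beautiful Soup.

```python
# A list of dictionaries: two of the basic data types mentioned above
packages = [
    {"name": "Beautiful Soup", "purpose": "parsing HTML"},
    {"name": "Pandas", "purpose": "data analysis"},
    {"name": "Turtle", "purpose": "drawing"},
]


def describe(package):
    """Return a short label chosen with if / elif / else."""
    if package["purpose"] == "parsing HTML":
        return package["name"] + " helps with web scraping"
    elif package["purpose"] == "data analysis":
        return package["name"] + " helps after the scraping is done"
    else:
        return package["name"] + " is for something else"


# Loop over the list and collect one label per package
labels = []
for package in packages:
    labels.append(describe(package))

print(labels)
```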

Apart from having a basic understanding of Python, here is a list of concepts you should also know to use Beautiful Soup for web scraping. I've covered many of them in my earlier blogs.

  1. HTML
  2. HTML DOM
  3. CSS and CSS Selectors
  4. HTTP Requests and Responses

HTML

Beautiful Soup is a library for parsing and navigating HTML and XML documents. Therefore, it is essential to understand the structure and syntax of HTML and the meaning of jargon like tags, elements, and attributes. Check out my blog on Understanding HTML Basics for Web Scraping if this is new to you or if you need a refresher.
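
Here is a small sketch of how that jargon shows up in Beautiful Soup. The link below is invented for illustration. The whole `<a ...>Recipes</a>` is the element, `a` is the tag name, and `href` and `class` are its attributes.

```python
from bs4 import BeautifulSoup

# One made-up element: an <a> tag with two attributes and some text
html = '<a href="https://example.com/recipes" class="nav-link">Recipes</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a                 # the first <a> element in the document

tag_name = link.name          # the tag name, here "a"
attributes = link.attrs       # all attributes as a dictionary
href = link["href"]           # one attribute, looked up like a dictionary key
text = link.get_text()        # the text between the opening and closing tags

print(tag_name, attributes, href, text)
```

Note that `class` comes back as a list, because an HTML element can carry several classes at once.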

HTML DOM

In my blog on Understanding HTML Basics for Web Scraping, I also covered the HTML DOM. Recall that DOM stands for Document Object Model and is a tree-like representation of an HTML document that browsers create. Beautiful Soup builds a similar tree when it parses a page, so understanding how the DOM is structured and how to navigate that tree can help you extract specific data from a webpage.
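
The tree idea can be sketched in a few lines. The HTML is made up, and I've written it on one line so that there are no whitespace-only text nodes to worry about. Each element has a parent, and may have siblings and children.

```python
from bs4 import BeautifulSoup

# A tiny made-up document: html > body > ul > three li items
html = "<html><body><ul><li>Cupcake</li><li>Donut</li><li>KitKat</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")                      # the first <li> in the tree

# Going up: the element that directly contains this <li>
parent_name = first_item.parent.name

# Going sideways: the next <li> at the same level
next_item = first_item.find_next_sibling("li")

# Going down: only the direct <li> children of the <ul>
child_texts = [li.get_text() for li in soup.ul.find_all("li", recursive=False)]

# All the way up: every ancestor, nearest first
ancestors = [tag.name for tag in first_item.parents]

print(parent_name, next_item.get_text(), child_texts, ancestors)
```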

CSS and CSS Selectors

CSS is a styling language, and CSS selectors are used to select elements in an HTML document based on their tag name, class name, and other attributes, which is very useful for extracting specific data from a webpage. Check out my blog on Understanding CSS Basics for Web Scraping if this is new to you or if you need a refresher.
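
Beautiful Soup accepts CSS selectors through its `select` and `select_one` methods. Here is a minimal sketch on a made-up snippet. The id `menu` and the class names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML with an id, two classes, and a link
html = """
<div id="menu">
  <p class="dish special">Pumpkin soup</p>
  <p class="dish">Minestrone</p>
  <a href="/order">Order now</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Select by tag name: every <p>
by_tag = [p.get_text() for p in soup.select("p")]

# Select by class name: anything with class "special"
by_class = [p.get_text() for p in soup.select(".special")]

# Select by id plus child relationship: an <a> directly inside #menu
by_id_and_child = soup.select_one("#menu > a")["href"]

# Select by attribute value
by_attribute = soup.select_one('a[href="/order"]').get_text()

print(by_tag, by_class, by_id_and_child, by_attribute)
```

`select` always returns a list, while `select_one` returns the first match, or None if nothing matches.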

HTTP Requests and Responses

Web scraping often involves sending HTTP requests to a server and receiving HTML documents in the responses. Familiarity with the basic structure of HTTP requests and responses, and with how to send and receive them using Python, is a must when working with Beautiful Soup. We do this using the Python Requests library, which I'll cover in another blog. Check out the section How the Web Works in my blog on Understanding HTML Basics for Web Scraping if you want to understand HTTP requests and responses or need a refresher.
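
To show the request and response cycle without touching a real website, here is a sketch that starts a toy server on your own machine and then sends it one request. Note the swap: it uses only Python's standard library (http.server and http.client) rather than Requests, so it runs with nothing extra installed. `PAGE` and `DemoHandler` are names I made up for illustration.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# The HTML document our toy server will send back
PAGE = b"<html><body><h1>Hello, soup!</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """A toy server that answers every GET request with the same page."""

    def do_GET(self):
        self.send_response(200)                                   # status line
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)                                    # response body

    def log_message(self, format, *args):
        pass  # keep the demo quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The request: ask the server for the page at "/"
connection = HTTPConnection("127.0.0.1", port, timeout=5)
connection.request("GET", "/")

# The response: a status code, headers, and a body
response = connection.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
body = response.read().decode("utf-8")

connection.close()
server.shutdown()
server.server_close()

print(status, content_type)
print(body)
```

With the Requests library, the client half shrinks to roughly `requests.get(url)`, with `.status_code`, `.headers`, and `.text` playing the same roles. The `body` string is what you would then hand to Beautiful Soup.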

Approach to Learning Beautiful Soup

Here is the approach I've taken to learn how to use Beautiful Soup:

  1. Install the Beautiful Soup library in your Python environment and import it in your code editor of choice. If you don't have an editor or don't want to install one just yet, consider using Google Colaboratory (Colab), a web-based editor. If you use Colab, you don't need to install Beautiful Soup, as it comes preinstalled. However, you still need to import it. See the Installing Beautiful Soup section of the Beautiful Soup documentation for instructions.

  2. Study this Beautiful Soup documentation. Follow the examples covered in the document in your editor of choice or Colab. The sections I've ticked are the sections I recommend you study.
    [Image: Beautiful Soup documentation, with the recommended sections ticked]

  3. Watch the video Comprehensive Python Beautiful Soup Web Scraping Tutorial by Keith Galli.
    It'll reinforce the content covered in the Beautiful Soup documentation. The video is 1 hour and 13 minutes long. I suggest you take a break every 20 minutes or so. Please don't watch it in one sitting. Recall that we can improve learning by introducing more breaks because of the primacy and recency effects, as explained in my blog on How to Learn Faster and Retain More.

  4. Keith Galli has another tutorial video, Solving real world data science tasks with Python Beautiful Soup! (movie dataset creation). It's a lengthy video, 3.5 hours long, and it covers three tasks. I recommend you follow along and complete at least one of the three. I completed the first task. By then, I think you'll have a good handle on Beautiful Soup.

  5. The final recommended step is to work on a personal project. Scrape something that you're interested in or that you need for another project. For example, I scraped a list of rare diseases for my dashboard project.
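
As a rough idea of what a small first project could look like, here is a sketch that pulls a list out of a page and turns it into CSV text. The HTML, the `versions` id, and the URLs are invented for illustration, so a real page will need its own selectors. If you're not on Colab, `pip install beautifulsoup4` is the usual way to install the library first.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up HTML standing in for a page you have already downloaded
html = """
<ul id="versions">
  <li><a href="/versions/cupcake">Cupcake</a></li>
  <li><a href="/versions/donut">Donut</a></li>
  <li>Unreleased (no link yet)</li>
  <li><a href="/versions/kitkat">KitKat</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect one dictionary per list item that contains a link
rows = []
for item in soup.select("#versions li"):
    link = item.find("a")
    if link is None:          # skip items without a link
        continue
    rows.append({"name": link.get_text(strip=True), "url": link["href"]})

# Write the rows as CSV into memory (swap the buffer for a file in a real project)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "url"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Checking for a missing link before using it is a habit worth forming early, because real pages are rarely as tidy as tutorial examples.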

I know there are a lot of things to learn. If you are feeling overwhelmed, that's OK and completely valid. Recognize that it's normal, then turn your attention to identifying one small step you can take to make progress. Do that routinely, preferably each day, and in no time, you'll be able to do web scraping.