Understanding CSS Basics for Web Scraping

Image by Ron Lach/ Canva

Many web scraping resources I came across stress the need to understand the basics of HTML. From my research, that advice rarely seems to extend to CSS. With the benefit of hindsight, I think it's helpful to learn the fundamentals of CSS beyond just knowing what it is.

HTML and CSS are like inseparable besties. When web scraping, we use query languages like XPath and CSS Selector (more about that in a future blog) to describe to the computer how it should find the content we seek. We tell it which HTML/CSS characteristics or element relationships to look for. If it helps, you can think of it as using Search in a Word document.

It's likely that when you embark on your first web scraping project, you'll be following web scraping tutorials. I recall being perplexed and struggling to follow the first couple of videos I watched. I had difficulty telling what was HTML and what was CSS, and the explanations in those videos are often scant. Without some foundational understanding, it's also hard to comprehend the documentation for the libraries used in web scraping.

I don't know about you, but when I want to learn something, I like to spend time understanding it quite deeply. I don't like being told how something is without understanding why, or the purpose it serves. When I know the why and the purpose, I can recall more of what I've learned.

So, in this blog, let's look at the following:

  • What is CSS?
  • CSS Syntax
  • CSS selectors
  • How to add CSS to HTML?

What is CSS?

In my last blog, I explained that CSS stands for Cascading Style Sheets. It is a styling language that dresses up and beautifies HTML. With CSS, you can make text bold, change table sizes and background colours, add borders, and edit pretty much any aspect of a webpage to make it attractive.

Now, a little styling can be achieved using HTML elements (tags) like <font>, <small>, <big> and <center>, and attributes like color and border. But developers steer clear of them unless they want to inflict nightmares on themselves and other poor developers. Let me help you understand why it's a nightmare. Imagine you have typed a 365-page draft of a book without using the Styles feature in Microsoft Word. Then your editor asks you to change the font of all the subheadings. You would have to comb through 365 pages, locating and updating the subheadings one by one. Needless to say, that is PAINFUL! 😫

Using CSS, we can make the change in one location, and the change will cascade through to all the subheadings. That, my friend, is the beauty of CSS!

Oh, by the way, HTML was never intended to contain tags for formatting a web page. With the potential grief that can be inflicted using HTML to style webpages, many of the elements and attributes I mentioned above are no longer supported in HTML5 (the latest version).

CSS Syntax

A CSS rule is made up of a selector and one or more declarations, each consisting of a property and its value.

The selector points to the HTML element you want to style. Each property and its value, known as a property-value pair, form a declaration. A colon separates the property from its value.

You can have multiple declarations. Each declaration is separated from the next with a semicolon, and curly braces enclose the whole declaration block.

[Image: CSS syntax]

Let me explain the example in the image:

  • p is a selector in CSS. It points to the HTML element we want to style, which is the <p> (paragraph) element.
  • background-color is a property, and yellow is the property value.
  • opacity is a property, and 0.3 is the property value. Opacity can take a value from 0 to 1, where 0 means fully transparent and 1 means fully opaque.
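
Written out as text, the rule described above might look like the sketch below. I've wrapped the CSS in a <style> element so the snippet works as a standalone HTML file, and the paragraph text is made up for illustration.

```html
<style>
  /* selector { property: value; property: value; } */
  p {
    background-color: yellow; /* declaration 1: property, colon, value, semicolon */
    opacity: 0.3;             /* declaration 2: 0 is fully transparent, 1 is fully opaque */
  }
</style>

<p>Every paragraph on the page picks up the rule above.</p>
```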

CSS Selectors

Before venturing into CSS selectors, I want to clarify that the CSS selectors referred to here differ from the CSS Selector mentioned in the introduction, although the two are inextricably related. The CSS Selector in the introduction is a query language that tells the computer how to search for what we want to scrape. The CSS selectors we're about to delve into are the ones illustrated in the CSS syntax above: they define the HTML elements you want to style with CSS. To avoid confusion, I will refer to the CSS Selector from the introduction as the CSS locator.

CSS selectors are divided into five categories. CSS locators follow the same conventions as these selectors, so knowing them will speed up your learning of the CSS locator.

The five CSS selector categories are:

  1. Simple selectors: select elements based on name, id or class
  2. Combinator selectors: select elements based on a specific relationship between them
  3. Pseudo-class selectors: select elements based on a certain state
  4. Pseudo-element selectors: select and style a part of an element
  5. Attribute selectors: select elements based on an attribute or attribute value

We won't spend time on all five. From what I could gather, understanding #1 Simple selectors, #2 Combinator selectors and #5 Attribute selectors is more than sufficient for web scraping.

Simple Selectors

There are several simple CSS selectors:

  • CSS element selector
  • CSS id selector
  • CSS class selector
  • CSS universal selector

CSS Element Selector

The element selector selects HTML elements based on the element name.
Here, all <h1> elements on the page will be centre-aligned with yellow text.
[Image: element selector example]
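
As a rough sketch of the element selector described above (the heading and paragraph text are placeholders of my own):

```html
<style>
  /* Element selector: matches every <h1> on the page */
  h1 {
    text-align: center;
    color: yellow;
  }
</style>

<h1>First heading: centred and yellow</h1>
<p>Not affected, because this is not an h1.</p>
<h1>Second heading: centred and yellow too</h1>
```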

CSS id Selector

The id selector uses the HTML id attribute to identify a specific element. An element's id should be unique within a page.
To select an element with a specific id, put a hash (#) character before the element's id.
In this example, the colour is applied to the HTML element with id = special_h1.
[Image: id selector example]
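
Here is a minimal sketch of an id selector using the special_h1 id mentioned above. The colour blue is my own assumption, since any colour would demonstrate the same thing.

```html
<style>
  /* id selector: a hash followed by the id value */
  #special_h1 {
    color: blue; /* illustrative colour, chosen for this sketch */
  }
</style>

<h1 id="special_h1">Only this heading is blue</h1>
<h1>This heading keeps its default colour</h1>
```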

CSS Class Selector

The class selector selects elements with a specific class attribute. Unlike id, class attributes aren't unique.
You need a period (.) character before the class name to select elements with a specific class.
In the example here, HTML elements with class = right will have the CSS of right-aligned, red-coloured text applied.
[Image: class selector example]

We can also specify that we only want a specific HTML element with a particular class name. In this example, only <p> elements with class = right will be right-aligned with red text.
[Image: class selector limited to <p> elements]

Note that an HTML element can belong to more than one class.
Here, the <p> element will be styled according to both class = right and class = small:
[Image: element with two classes]
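
The three class selector cases above can be sketched together as follows. To make the difference between .right and p.right visible in one snippet, I've given p.right an extra italic style and picked a 12px font size for .small; both are assumptions for illustration only.

```html
<style>
  .right  { text-align: right; color: red; } /* any element with class "right" */
  p.right { font-style: italic; }            /* only <p> elements with class "right" */
  .small  { font-size: 12px; }               /* any element with class "small" */
</style>

<h2 class="right">Right-aligned and red</h2>
<p class="right">Right-aligned, red and italic</p>
<p class="right small">Right-aligned, red, italic and small</p>
<p>Unstyled paragraph</p>
```

Notice that multiple classes on one element are separated by a space inside the class attribute.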

CSS Universal Selector

The universal selector, represented by the * character, selects all HTML elements on the page.

The CSS rule here will be applied to every HTML element on the page.
[Image: universal selector example]
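
A small sketch of the universal selector; the two declarations are arbitrary choices of mine, picked only to make the effect easy to spot.

```html
<style>
  /* Universal selector: matches every element on the page */
  * {
    text-align: center;
    color: green;
  }
</style>

<h1>Centred and green</h1>
<p>Also centred and green</p>
```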

Combinator Selectors

There are four types of combinator:

  • Descendant selector (space)
  • Child selector (>)
  • Adjacent sibling selector (+)
  • General sibling selector (~)

Descendant Selector ( )

The descendant selector matches all elements that are descendants of the element specified.

All <h2> elements inside <div> elements are selected in this example. I'd like to draw your attention to the space between the two elements in the syntax because it's easily overlooked.
[Image: descendant selector example]

Child Selector (>)

The child selector selects all elements that are direct children of the element specified. In this example, all <h2> elements that are children of a <div> element will be selected; an <h2> nested more deeply inside the <div> will not.
[Image: child selector example]
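
The descendant and child selectors are easy to mix up, so here is a sketch that puts them side by side. The <section> wrapper and the styles are my own additions, there only to create a grandchild and to make each match visible.

```html
<style>
  div h2   { color: red; }                  /* descendant: any <h2> nested anywhere inside a <div> */
  div > h2 { text-decoration: underline; }  /* child: only <h2> elements directly inside a <div> */
</style>

<div>
  <h2>Direct child: red and underlined</h2>
  <section>
    <h2>Grandchild: red only</h2>
  </section>
</div>
<h2>Outside any div: unstyled</h2>
```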

Adjacent Sibling Selector (+)

The adjacent sibling selector selects an element that comes directly after another specific element. Sibling means the two elements must share the same parent, and adjacent means immediately following.

In this example, every <h2> element placed immediately after a <div> element is selected.
[Image: adjacent sibling selector example]

General Sibling Selector (~)

The general sibling selector selects all elements that follow the specified element and share its parent, whether or not they come immediately after it. In this example, all <h2> elements that follow a <div> element as siblings are selected.
[Image: general sibling selector example]
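
To contrast the two sibling selectors, here is a sketch with both applied to the same markup. The <section> parent, the text and the styles are assumptions of mine for illustration.

```html
<style>
  div + h2 { color: red; }         /* adjacent: an <h2> immediately after a <div>, same parent */
  div ~ h2 { font-style: italic; } /* general: any <h2> that follows a <div>, same parent */
</style>

<section>
  <h2>Before the div: unstyled</h2>
  <div>Some content</div>
  <h2>Immediately after the div: red and italic</h2>
  <p>A paragraph in between</p>
  <h2>Later sibling: italic only</h2>
</section>
```

Neither selector looks backwards, which is why the first <h2> stays unstyled.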

Attribute Selectors

Attribute selectors allow you to select an HTML element given a specific attribute or attribute value.

CSS [attribute] Selector

The [attribute] selector selects elements with an attribute specified.
The following example selects all <a> elements with a target attribute:
[Image: attribute selector example]
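
A minimal sketch of the [attribute] selector on <a> elements. The example.com links, the _blank value and the yellow highlight are placeholders I've chosen.

```html
<style>
  /* Matches any <a> that has a target attribute, whatever its value */
  a[target] {
    background-color: yellow;
  }
</style>

<a href="https://example.com" target="_blank">Has a target attribute: highlighted</a>
<a href="https://example.com">No target attribute: not highlighted</a>
```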

CSS [attribute="value"] Selector

The [attribute="value"] selector selects elements whose attribute has a specific value.

In this example, all the <a> elements with a target value of "empty" will be selected.
[Image: attribute value selector example]
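
Here is a sketch of an exact-value match. I've used the target values _blank and _top, which are my own choice for this sketch rather than the value from the original example.

```html
<style>
  /* Matches only <a> elements whose target is exactly "_blank" */
  a[target="_blank"] {
    background-color: yellow;
  }
</style>

<a href="https://example.com" target="_blank">Exact match: highlighted</a>
<a href="https://example.com" target="_top">Different value: not highlighted</a>
<a href="https://example.com">No target attribute: not highlighted</a>
```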

CSS [attribute~="value"] and CSS [attribute|="value"] Selectors

The [attribute~="value"] selector selects elements whose attribute value contains a specific word in a space-separated list of words.

In this example, all the elements whose title attribute contains the word "bird" in a space-separated list will be selected.

Elements that will be returned have titles like title = "yellow bird", "dark bird", "bower bird", and "noisy bird". These won't be returned: "yellow-bird", "dark-bird", and "bird-poo".

For hyphenated values, use [attribute|="value"] instead. It matches a value that is exactly "value" or that begins with "value" followed by a hyphen, so [title|="bird"] returns "bird-poo" but not "yellow-bird" or "dark-bird".
[Image: attribute word selector example]
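
The difference between ~= and |= is subtle, so here is a sketch using the bird titles from above. The border styles are assumptions of mine, used only to show which rule matched.

```html
<style>
  [title~="bird"] { border: 2px solid green; } /* "bird" as a whole word in a space-separated list */
  [title|="bird"] { border: 2px solid blue; }  /* exactly "bird", or a value starting with "bird-" */
</style>

<p title="yellow bird">Matched by ~= only: green border</p>
<p title="yellow-bird">Matched by neither selector</p>
<p title="bird-poo">Matched by |= only: blue border</p>
<p title="bird">Matched by both; the later rule wins, so blue border</p>
```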

CSS [attribute^="value"], CSS [attribute$="value"] and CSS [attribute*="value"] Selectors

The [attribute^="value"] selector selects elements whose attribute value begins with the specified value.

In this example, all the elements with a class attribute value that starts with 'cl' will be selected.

If you want elements whose attribute value ends with a specific value, use [attribute$="value"].

If you want elements whose attribute value contains a specific value anywhere, use [attribute*="value"].
[Image: attribute begins, ends and contains selector examples]
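
Here is a sketch of the three substring selectors applied to class attributes. The class names and the "ed" and "ose" fragments are made up for this sketch; only 'cl' comes from the example above.

```html
<style>
  [class^="cl"]  { color: red; }                  /* value starts with "cl" */
  [class$="ed"]  { font-weight: bold; }           /* value ends with "ed" */
  [class*="ose"] { text-decoration: underline; }  /* value contains "ose" anywhere */
</style>

<p class="closed">Starts with "cl", ends with "ed", contains "ose": red, bold and underlined</p>
<p class="clear">Starts with "cl" only: red</p>
<p class="opened">Ends with "ed" only: bold</p>
<p class="loosen">Contains "ose" only: underlined</p>
```

These selectors compare against the whole attribute string, so with several classes on one element, ^= only looks at the start of the first class name and $= only at the end of the last.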

How to add CSS to HTML?

There are three ways to insert a style sheet:

  • External CSS
  • Internal CSS
  • Inline CSS

External CSS

With an external style sheet, your styling is coded in a separate CSS file, so you can change the look of an entire website just by tweaking that one file!

To add external CSS to an HTML page, include a reference to the external style sheet file in a <link> element inside the head section.
[Image: external CSS example]
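
A minimal sketch of a page that links to an external style sheet. The file name styles.css is hypothetical; use whatever your CSS file is called.

```html
<!DOCTYPE html>
<html>
<head>
  <!-- styles.css is a hypothetical file name, assumed to sit next to this HTML file -->
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <h1>Styled by the rules in styles.css</h1>
</body>
</html>
```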

Internal CSS

An internal style sheet is used when one single HTML page has a unique style. The internal style is written inside the <style> element in the head section.
[Image: internal CSS example]
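
An internal style sheet might look like the sketch below. The colours are arbitrary choices of mine.

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Internal CSS: the rules live in a <style> element and apply to this page only -->
  <style>
    body { background-color: linen; }
    h1   { color: maroon; }
  </style>
</head>
<body>
  <h1>Styled by the rules in the head section</h1>
</body>
</html>
```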

Inline CSS

Inline CSS can be used when applying a unique style for a single element. To use inline styles, add the style attribute to the relevant element. The style attribute can contain any CSS property.
[Image: inline CSS example]
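
And a short sketch of inline CSS, again with colours and text of my own choosing. Note that there is no selector and no curly braces here, because the declarations apply only to the element carrying the style attribute.

```html
<!-- Inline CSS: declarations go straight into the style attribute -->
<h1 style="color: blue; text-align: center;">Blue, centred heading</h1>
<p style="color: red;">Red paragraph</p>
<p>Unstyled paragraph</p>
```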

Rightio, I've covered a good amount of what I think will help with web scraping. In your web scraping adventure, if you come across any CSS knowledge that I've not covered but is good to know for scraping, please share it with me.