What is the difference between structured, semi-structured and unstructured data?

Image by Danni Liu, adapted from Alesmunt/ Canva

There are many ways of classifying data. Many people are familiar with organising it into either quantitative (made up of numbers) or qualitative (made up of non-numbers) data. A less familiar classification, though common in a business setting, is the distinction between master and transactional data. Master data refers to data about business entities that provide context for a business transaction, such as customer, product, employee, and supplier data. Transactional data refers to information recorded from transactions, such as product orders.

There are many more classification types out there. One that is important to know in the age of big data, where the amount of unstructured data keeps growing, is the classification centred on the degree of organisation: structured, semi-structured and unstructured.

On a side note, is everyone clear on what big data is? Don't be embarrassed if you're not. I know of many people who throw around the term but don't know what it means. I'll admit, I was one of them. 😅
Big data is a term that refers to mountains of varied data arriving at a fast rate. When someone brings it up, think of the 3 Vs: Variety, Volume and Velocity.

Variety refers to the many forms of data.
Volume refers to the high amount of data to be processed.
Velocity refers to the fast rate at which data is received and acted on.

Ok, now that is out of the way, let's go back to the structured, semi-structured and unstructured data categorisation. For a long time, business intelligence was only done on structured data. Today, we generate a much greater variety of data. Mobile devices and the Internet of Things (IoT) create valuable data that can be leveraged to make better business decisions, seize opportunities and solve challenges. This data doesn't come in a fully structured form. So, in this blog, I will explain the difference between structured, semi-structured and unstructured data.

As I explain this, please remember that the distinction between the three categories isn't always black and white. Think of them as a continuous scale ranging from white to black and many shades of grey in between.

Structured Data

Structured data is data that is highly organised. A well-organised Excel spreadsheet is an example of structured data. Structured data is generally tabular data represented by rows and columns in a database and is often managed using SQL. As mentioned earlier, traditional systems and reporting rely on this form of data. Its high degree of organisation makes structured data easy to store, analyse and search. We can easily use it for data visualisation, analytics, and machine learning.
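To make "easy to search and analyse" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns and the sample rows are made up for illustration; the point is only that when every row shares the same columns, a single SQL query can summarise the data.

```python
import sqlite3

# In-memory database; the table, columns and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Ana", 20.0), ("Ben", 35.5), ("Ana", 12.5)],
)

# Every row has the same columns, so one query can group and total the orders.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

Nothing here had to guess where the customer name or the amount lives. The table definition already says so, which is what makes structured data so convenient for reporting.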

Photo by AndreyPopov/ Canva

Unstructured Data

Unstructured data is more prevalent than structured data. A few years back, International Data Corporation (IDC) projected that 80% of data would be unstructured, according to Timothy King's article "80 Percent of Your Data Will be Unstructured in Five Years".

Unstructured data cannot be contained in a row-column database; it's data in absolute raw form. It is anything that's not in a specific format. Examples of unstructured data are paragraphs from a book, social media posts, chats, emails, websites, presentations, digital photos, videos, and audio.

The lack of structure makes this form of data challenging to process, search, manage and analyse, which is why companies widely discarded it until more powerful technologies became available.
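Here is a small sketch of why that is, using only Python's standard library. The three posts are invented sample data. Since there are no columns to query, we have to impose some structure ourselves (lowercasing, splitting into words, counting) before we can ask even a simple question like "how often is the battery mentioned?".

```python
import re
from collections import Counter

# Hypothetical social media posts: free text with no fields or columns.
posts = [
    "Loving the new phone, battery lasts all day!",
    "Battery died after two hours. Not happy.",
    "Great camera, shame about the battery...",
]

# Impose a crude structure: lowercase each post and pull out the words.
words = []
for post in posts:
    words.extend(re.findall(r"[a-z']+", post.lower()))

# Only now can we count how often a term appears across the posts.
counts = Counter(words)
print(counts["battery"])
```

Even after this step, the count says nothing about whether people are happy or unhappy with the battery. Answering that needs heavier tools such as text analytics, which is part of why this data was hard to use for so long.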

Unstructured data contains a wealth of information, and harnessing it provides more opportunities to turn data into a competitive advantage.

Animations by Danni Liu

Semi-structured Data

Semi-structured data is a mix between structured and unstructured. The degree of organisation is typically achieved through some form of tags or metadata.

Some examples of semi-structured data are digital photos and emails. Hold on, didn't you say earlier that they are unstructured data? Glad you're paying attention. Yes, I did say that. The content of photos and emails is unstructured. However, digital photos taken with smartphones are stamped with the date and time, a geotag and a device ID. These give digital photos certain structural attributes. In the case of emails, the structural properties are the sender's name and email address, the recipient(s), the time sent, etc. HTML files and JSON files are additional examples of semi-structured data.
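The email case can be sketched with Python's built-in email module. The message below is invented for illustration. The headers behave like named fields we can look up directly, while the body is just free text.

```python
from email import message_from_string

# A made-up email: header lines, a blank line, then the free-text body.
raw_email = """\
From: Ana Example <ana@example.com>
To: ben@example.com
Subject: Lunch on Friday?
Date: Mon, 01 Jan 2024 09:30:00 +0000

Hi Ben, are you free for lunch on Friday? I found a new place near the office.
"""

msg = message_from_string(raw_email)

# Structured part: each header can be looked up by name.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# Unstructured part: the body is plain text with no fields inside it.
body = msg.get_payload()

print(headers["From"])
print(body)
```

Finding every email from a given sender is a simple lookup on the From header. Finding every email that contains a lunch invitation means reading the body, which brings us back to the unstructured problem.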

Image by Antonio Batinic

The distinction between the data types is important because it affects how the data is stored and organised, and how easily it can be processed and analysed.

Structured data generally takes up less space than semi-structured and unstructured data. Structured data usually lives in spreadsheets and relational databases, whereas unstructured data is usually stored in data lakes and NoSQL databases. The varied formats of semi-structured and unstructured data make them challenging to process and analyse, while structured data can be processed and analysed easily.

There you go, now you know the difference between structured, semi-structured and unstructured data!