OK, I bet you took one look at that title and thought “What the heck is wrong with him? Of course, unstructured data matters!” And sure, if you go with the common wisdom, it certainly does. But let’s look a bit deeper at the topic.
What is Unstructured Data?
To understand what we mean by unstructured data, we first need to define structured data, which is data that is consistent, well-formatted, and typically used for business transactions, like numbers, short character strings, and dates. Structured data requires a predefined schema and is the data that is traditionally used by business systems and stored in relational databases. Structured data is commonly found in business applications like customer relationship management (CRM), inventory management, financial systems, and more. It is suitable for generating structured reports, performing calculations, and supporting day-to-day operational tasks.
Unstructured data is typically defined as anything else: data that doesn’t conform to a specific data model and is not organized in a predefined manner like traditional structured data (e.g., databases with rows and columns). It can encompass a wide variety of data types, including text, images, audio, and video. Unstructured data is often used for analytical and exploratory purposes. It is valuable for gaining insights from sources like social media posts, customer feedback, and images, using techniques such as sentiment analysis and image analysis.
Here are some common types of unstructured data:
- Text: This includes documents, articles, emails, social media posts, chat logs, and any other textual content. Natural language processing (NLP) techniques are commonly used to analyze and extract information from text data.
- Images: Unstructured image data includes photographs, scanned documents, diagrams, charts, and more. Computer vision techniques are used to analyze and extract information from images, such as object recognition or character recognition in scanned documents.
- Audio: Audio data encompasses recordings, speeches, music, and other sound files. Speech recognition and audio analysis technologies are used to process and extract information from audio data.
- Video: Video data includes movies, television shows, surveillance footage, and more. Video analysis involves techniques such as video summarization, object tracking, and facial recognition.
- Social Media Data: Social media platforms generate vast amounts of unstructured data, including text, images, videos, and user interactions (likes, shares, comments). Sentiment analysis and social network analysis are common methods for handling social media data.
- Sensor Data: Sensors in various applications generate unstructured data, such as data from IoT (Internet of Things) devices, environmental sensors, and scientific instruments. Analyzing this data often involves time series analysis and signal processing techniques.
- Geospatial Data: Geospatial data includes location-based information, such as GPS coordinates, maps, and geographic information system (GIS) data. Geospatial analysis is used to process and interpret this type of unstructured data.
- Log Files: Log files from servers, applications, and systems contain unstructured data that can provide insights into system performance, errors, and user behavior. Log analysis tools are used to make sense of this data.
- Emails: Email messages, including their attachments, contain unstructured text and potentially unstructured data in various formats (e.g., spreadsheets, documents) that may be relevant for analysis.
- Web Content: The web is a rich source of unstructured data, including web pages, blogs, forums, and wikis. Web scraping and text mining techniques are used to extract valuable information from web content.
- Free-Form Surveys: Responses from open-ended survey questions can be unstructured data, as they often involve varying and non-standardized textual answers.
- Handwritten Documents: Scanned handwritten documents, such as notes, forms, or historical manuscripts, require handwriting recognition technology to convert them into machine-readable text.
These are just a few examples of unstructured data types. There are certainly other forms of unstructured data specific to certain industries or applications.
Furthermore, unstructured data continues to grow. Analysts at IDC estimate that unstructured data accounts for about 90% of all digital information [1]. Many industry trends contribute to this skyrocketing growth, including digital transformation, social media, the Internet of Things, and the continuing expansion of the mobile market.
But is it Really Unstructured?
Basically, any data that falls outside of the scope of traditional, structured data is referred to as unstructured data. But is most of this data really unstructured? Consider some common examples of unstructured data such as word processing documents and spreadsheets. Have you ever tried to read a spreadsheet document using a text editor? Or read a file that is not formatted for Microsoft Word in that word processor? In both cases, the attempt either fails or the “data” is displayed as a bunch of gibberish characters.
Why does this happen? Well, there is a structure to this data. If the structure is not correct, then it cannot be read accurately. For example, the word processor expects to read and save data using a specific structure. If that structure is not there, the data cannot be accessed. So, calling such data unstructured is not really an accurate description, is it? It is structured, albeit in a different way than traditional, structured data.
It is a common practice for multimedia data—such as video, audio, and image files—to be processed and analyzed. And people refer to this as unstructured data. But it is anything but unstructured! For images, there are various structures such as JPG, TIF, GIF, and PNG. The same is true for audio with formats such as WMA, AAC, FLAC, and MP3. For video, we have MPG, MOV, WMV, and RM formats. It is not possible to access any of this multimedia data using software that does not understand the structure. So how can it be called unstructured?
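To make the point concrete, here is a minimal Python sketch of how software recognizes an image file by its structure. It checks the well-known leading "magic" bytes of a few formats and then reads the width and height from the header chunk of a PNG. The function names (`sniff_image_format`, `png_dimensions`, `make_png_header`) are made up for this illustration, and the sample bytes are a hand-built header fragment, not a complete image.

```python
import struct
import zlib
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_image_format(data: bytes) -> str:
    """Guess an image format from its leading signature bytes."""
    if data.startswith(PNG_SIGNATURE):
        return "PNG"
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"
    return "unknown"


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Read width and height from the IHDR chunk that follows the PNG signature."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # After the 8-byte signature: 4-byte chunk length, 4-byte chunk type.
    _length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    # The IHDR data begins with two big-endian 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def make_png_header(width: int, height: int) -> bytes:
    """Build only the signature plus IHDR chunk (hypothetical helper for the demo)."""
    # Width, height, bit depth 8, color type 2 (RGB), then three zero flags.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + ihdr)
    return (PNG_SIGNATURE + struct.pack(">I", len(ihdr))
            + b"IHDR" + ihdr + struct.pack(">I", crc))


if __name__ == "__main__":
    sample = make_png_header(640, 480)
    print(sniff_image_format(sample), png_dimensions(sample))
    print(sniff_image_format(b"just some text"))
```

Every value here sits at a fixed offset with a fixed meaning. Hand the same functions a few bytes of plain text and they have nothing to work with, which is exactly what a defined structure implies.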
The other types of data that are commonly referred to as unstructured, such as log files, social media data, emails, and so on, are not truly unstructured either. This data has a structure, and if you don’t know that structure, you cannot make sense of the data. If you don’t believe me, just try to read the log files of your favorite DBMS without referring to the manual and see how much of it you understand!
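Here is a small Python sketch of what "knowing the structure" of a log file means in practice. The log layout (timestamp, level, component in brackets, message) is an invented example and does not match any particular DBMS; the pattern `LOG_LINE` and the function `parse_log_line` are likewise hypothetical names used only for illustration.

```python
import re
from typing import Dict, Optional

# Assumed layout: "YYYY-MM-DD HH:MM:SS LEVEL [component] message"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<component>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of a log line, or None if it does not fit the layout."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = "2023-10-05 14:32:07 ERROR [lockmgr] Deadlock detected on table ORDERS"
    print(parse_log_line(line))
    print(parse_log_line("free-form text with no recognizable layout"))
```

Once the layout is known, each line turns into fields that can be filtered, counted, or loaded into a table. Without that knowledge, the same line is just a string of characters.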
While there is a benefit to having a term that everybody understands—such as unstructured data—there is also a detriment when the term is inaccurate. It is easy to summarily dismiss unstructured data because, well, it doesn’t have a structure—so how can it be of any value?
Nevertheless, more and more types of data are being ingested and processed for analytics, AI, and other types of uses. And this data has been termed unstructured, although it would be more accurately called “differently structured” data. But that doesn’t have much chance of catching on. So, whenever you hear the term unstructured data, try to translate that in your head to “differently structured data” and work to figure out how you can use that data to the benefit of your organization.
How is Unstructured Data Being Used?
It is undeniable that what we call unstructured data can deliver significant benefits. Advances in technology, data analytics, and AI techniques have enabled organizations to extract valuable insights from unstructured data.
Natural Language Processing (NLP) techniques are used to analyze and extract information from unstructured text data, such as emails, social media posts, customer reviews, and documents. This is valuable for sentiment analysis, chatbots, and information retrieval.
Text mining involves processing unstructured text data to discover patterns, trends, and valuable information. It can be applied to various domains, including healthcare (analyzing medical records), finance (news sentiment analysis), and e-commerce (product reviews).
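As a very rough illustration of text mining, the Python sketch below counts the most frequent terms across a handful of product reviews. It is a toy under stated assumptions: the reviews are invented, the stop-word list is a tiny hand-picked set, and the function name `top_terms` is hypothetical. Real text mining work typically relies on far richer language processing than simple word counts.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# A deliberately tiny stop-word list, chosen just for this example.
STOP_WORDS = {"the", "a", "an", "is", "and", "after", "it", "but", "of"}


def top_terms(texts: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Count words across all texts, ignoring stop words, and return the n most common."""
    counts: Counter = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    reviews = [
        "The battery life is great",
        "Battery died after a week",
        "Great screen, poor battery",
    ]
    print(top_terms(reviews, n=2))
```

Even this crude count surfaces a pattern: the battery dominates what these (made-up) customers are talking about.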
Speech recognition can be used on unstructured audio data, such as recorded phone conversations or voice notes. This enables applications like voice assistants and transcription services.
Deep learning and computer vision techniques are being used to analyze unstructured image and video data. Applications include facial recognition, object detection, autonomous vehicles, and medical image analysis.
Companies conduct social media analysis, using unstructured data from social media platforms to understand customer sentiment, track brand mentions, and identify emerging trends. This helps in reputation management, marketing, and product development.
Unstructured data from customer surveys, emails, and support tickets can provide insights into customer satisfaction, pain points, and areas for improvement. This data can inform customer service strategies and product development.
Internet of Things (IoT) devices generate vast amounts of unstructured data. Analyzing this data can help in predictive maintenance, optimizing operations, and improving product performance.
Streaming platforms and e-commerce websites use unstructured data, such as user behavior and preferences, to recommend content or products to users. This enhances user engagement and sales.
Other common use cases for unstructured data include fraud detection, healthcare data analysis, legal document analysis, news/media monitoring, and more.
Overall, unstructured data analysis helps organizations gain insights, make data-driven decisions, and improve their operations, products, and services. It has become increasingly important in the era of big data, as it allows organizations to harness valuable information from diverse sources.
The Bottom Line
So, yes, to answer the question posed in the title of this article, unstructured data matters, and it will matter more and more as organizations embrace modern analytics and AI technologies to better understand such data.
References
[1] Source: IDC White Paper, “Untapped Value: What Every Executive Needs to Know About Unstructured Data,” Doc. US51128223, August 2023.