My team of data and analytics researchers and consultants falls into two main specialties: one for structured data, what we consider “traditional” data or information that can be organized into a spreadsheet, and the other for unstructured data, i.e., content and records in formats such as text documents and messages that do not fit neatly into a standard model of tables and cells.
And this is the way the technology world has aligned itself:
- Structured data is searchable and, because it’s organized into a schema, is stored in data warehouses with relational databases, where it can be extracted and analyzed for insights using business intelligence (BI) tools. The most familiar examples of structured data are customer data (names, addresses) and transaction data (financial). We call this practice “Data,” which includes data management, analytics and business intelligence, with artificial intelligence as the rising star.
- Unstructured data is not so searchable, is stored in non-relational databases incorporated into websites, email and business applications and, without organizing models, does not lend itself easily to extracting business insights. We call this practice enterprise content management (ECM).
As a team, we’re all working on data. But we think and operate in separate orbits because structured and unstructured data live in different systems and provide different insights. When we receive requests for research, we have a triage process that directs the case to either “Data” or “ECM.” (Except for Igor — the man can do anything.)
The Growth of Unstructured Data
But the world is changing. There is so much more unstructured content out there — over 80% of corporate data is unstructured and that’s growing by 50% a year — and it’s diverse, coming from emerging data sources in the form of social media, email, the internet of things (IoT) and videos. We can learn so much from all the knowledge that’s locked away in that information, but we have not been able to extract it using analytics tools like we can for structured data.
Until now.
The untapped potential intelligence within unstructured content, together with the complexity and volume of emerging data sources, has driven new practices (big data), new technologies (data lakes to store it all) and new roles (data scientists). This means companies need to invest in specialized skills and software and find creative new ways of doing things.
Unlocking Unstructured Data Intelligence
To unlock the intelligence within unstructured data stores, we need to focus on:
- Master data: Use standard enterprise descriptors to zoom in on and aggregate the right data. Master data contains the organization’s basic information (e.g., customers, products, locations) needed to conduct business, and good master data provides common definitions across the organization. This common terminology makes it possible to join data from different sources, whether unstructured or structured.
- Metadata: Each data object and document has a label (metadata) for identifying and describing it. Apply the common terminology from master data to the metadata tag that labels each file. Some metadata is applied automatically with the file (e.g., Word document properties) and some can be wrapped around the file to add meaning.
- Transform unstructured data using machine learning: Master data and metadata allow us to find and connect data. But how do we get the essential information out of the unstructured file, the real meaning organizations are looking for? We need to transform the file, that is, add structure to unstructured data. Using machine learning, technologies such as Microsoft’s Project Cortex and newcomer SortSpoke can scan and extract key terms automatically. Some tools not only extract key fields from the file (via text and image recognition), but also use the extracted fields to classify the file, in effect building their own taxonomy and refining it over time as they gather more and more descriptors.
- Analytics tools for unstructured data: These tools particularly benefit from machine learning, which enables them to work at high volumes and speeds and “learn” patterns from user activity. But the product landscape for unstructured data analytics is still developing and the industry has yet to settle on strong frontrunners.
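To make the first three steps concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the customer records, the documents, the invoice number format and the function names are all hypothetical, and simple keyword and pattern matching stands in for the machine learning extraction that the tools named above perform. It shows only how master data terms become metadata tags, and how those tags let unstructured files join structured records.

```python
import re

# Hypothetical master data: one canonical ID per customer, plus the
# name variants that tend to appear in free text.
MASTER_CUSTOMERS = {
    "C001": {"name": "Acme Corp", "aliases": ["acme", "acme corp", "acme corporation"]},
    "C002": {"name": "Globex Inc", "aliases": ["globex", "globex inc"]},
}

# Structured data: transaction rows already keyed by the master customer ID.
TRANSACTIONS = [
    {"customer_id": "C001", "amount": 1200.0},
    {"customer_id": "C001", "amount": 300.0},
    {"customer_id": "C002", "amount": 950.0},
]

# Unstructured data: free-text files with no schema.
DOCUMENTS = {
    "email_17.txt": "Acme Corporation called again about the late delivery. "
                    "Invoice INV-1042 is disputed.",
    "note_03.txt": "Met with Globex; they are happy with the renewal.",
}

# Assumed invoice number format, used as a stand-in for learned field extraction.
INVOICE_PATTERN = re.compile(r"\bINV-\d+\b")


def tag_document(text):
    """Build metadata for one document using master data terminology."""
    lowered = text.lower()
    customer_ids = sorted(
        cid
        for cid, record in MASTER_CUSTOMERS.items()
        if any(
            re.search(r"\b" + re.escape(alias) + r"\b", lowered)
            for alias in record["aliases"]
        )
    )
    return {"customer_ids": customer_ids, "invoices": INVOICE_PATTERN.findall(text)}


def customer_view(customer_id):
    """Join structured transactions and tagged documents on the master ID."""
    metadata = {name: tag_document(text) for name, text in DOCUMENTS.items()}
    return {
        "customer": MASTER_CUSTOMERS[customer_id]["name"],
        "total_spend": sum(
            row["amount"] for row in TRANSACTIONS if row["customer_id"] == customer_id
        ),
        "documents": sorted(
            name for name, tags in metadata.items() if customer_id in tags["customer_ids"]
        ),
    }
```

Calling `customer_view("C001")` in this sketch returns the customer's total spend alongside the one email that mentions the customer. The join works only because both sides share the same master ID, which is the role common definitions play in the list above.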
Business leaders don’t see the world of data as structured and unstructured; they just want the information they need to take care of their business and their customers. It’s the IT service providers who have had to divide up this world because of the limitations of the technologies and the different ways in which they’re managed. Even when our capability for unstructured analytics catches up and we can bridge the knowledge from both forms of data, we will still be managing structured and unstructured data differently.
A More Universal View of Information Systems
It’s exciting to think of the stories that will emerge by unlocking this rich new knowledge. But business leads and their data teams will need to make important decisions:
- How to use and manage the different data types.
- How to establish master data management so that common definitions apply across the enterprise.
- How to use metadata to tag and connect data for an integrated view of the business.
- When to invest in new skills and new analytics technologies, and which ones to choose.
All of this requires a more universal view of our information systems. The data orbits will still be there, but we will have the jet fuel to navigate them more easily, just like Igor.