Data is the fuel that powers AI. It’s commonly referred to as the ‘new oil’ of the digital economy, an image that conveys how data will come to underpin and fuel many different applications in the future.
Data can come in different forms, shapes and sizes. One way of speaking about data is in terms of its modality, such as text or images.
A dataset is usually unimodal, meaning it contains only a single modality. A dataset consisting of legal documents would be a unimodal textual dataset. Datasets can also be multimodal and contain multiple modalities. An image captioning dataset is multimodal since it contains both text, in the form of captions, and images.
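To make the distinction concrete, here is a minimal sketch in Python. The `Dataset` class, the `describe` helper and the dataset names are hypothetical and exist only for illustration; the only idea taken from the text is that a dataset with one modality is unimodal and a dataset with several is multimodal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a dataset and the modalities it contains."""
    name: str
    modalities: frozenset


def describe(dataset: Dataset) -> str:
    """Return 'unimodal' for one modality and 'multimodal' for several."""
    if not dataset.modalities:
        raise ValueError("a dataset needs at least one modality")
    return "unimodal" if len(dataset.modalities) == 1 else "multimodal"


# Illustrative examples mirroring the two datasets mentioned in the text.
legal_documents = Dataset("legal-documents", frozenset({"text"}))
image_captions = Dataset("image-captioning", frozenset({"text", "image"}))

print(legal_documents.name, "is", describe(legal_documents))
print(image_captions.name, "is", describe(image_captions))
```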
Like oil, before data can be used there are a series of steps that need to be taken to ensure that the data is prepared, refined and ready to be consumed. To help appreciate this better, data can be broken down into three different states:
- Unstructured Data
- Semi-structured Data
- Structured Data
Unstructured data is how data is typically found in the wild. It’s oil that is still in the ground or vegetables that are still in the garden. Some work needs to be put in before it can be used. A typical example of unstructured data is a set of contracts that live within a document management system.
Semi-structured data is data that has undergone some refinement but still has some rough edges. It’s oil that has been extracted from the ground but still needs to be purified or vegetables that have been picked from the garden but still need to be chopped and peeled before they can be used. An example of semi-structured data is a set of contracts with some associated metadata about the parties - it has some structure but work is still needed before the data can properly be used.
Structured data is data that is now ready to be used to fuel applications. It’s oil that has been purified and is now ready for consumption. It’s vegetables that have now been peeled, chopped and diced and are ready to be used in cooking. An example of structured data is a set of contracts with all of the relevant fields extracted into a table. Structured data usually comes in the form of a CSV file or a database table and is the starting point for creating higher-level business insights in the form of analytics, predictions or statistics.
Just like with oil, there is a value chain with data. At every step in the chain, the value of the data increases as it goes from unstructured to semi-structured to structured. Extracting fields from contracts captures precisely this value chain: the goal is to start from unstructured data in the form of contracts and arrive at structured data in the form of a table of the extracted fields. That table enables contract analytics that can inform future negotiations and help with compliance.
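The three states can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not a real extraction pipeline: the contract text, the file names, the field names and the regular expressions are all made up for the example, and real contracts would need far more robust extraction than pattern matching on tidy sentences.

```python
import csv
import io
import re

# Unstructured: raw contract text (hypothetical examples, not real data).
RAW_CONTRACTS = {
    "contract_001.txt": "This Agreement is made on 2021-03-15 between "
                        "Acme Ltd and Beta LLC. The term is 24 months.",
    "contract_002.txt": "This Agreement is made on 2022-11-01 between "
                        "Gamma Inc and Delta GmbH. The term is 12 months.",
}


def add_party_metadata(name, text):
    """Semi-structured: keep the raw text but attach metadata about the parties."""
    match = re.search(r"between (.+?) and (.+?)\.", text)
    parties = list(match.groups()) if match else []
    return {"file": name, "text": text, "parties": parties}


def to_structured_row(doc):
    """Structured: one flat row per contract with every field extracted."""
    date = re.search(r"\d{4}-\d{2}-\d{2}", doc["text"])
    term = re.search(r"term is (\d+) months", doc["text"])
    parties = doc["parties"] + [None, None]  # pad in case parties were not found
    return {
        "file": doc["file"],
        "effective_date": date.group(0) if date else None,
        "party_a": parties[0],
        "party_b": parties[1],
        "term_months": int(term.group(1)) if term else None,
    }


def build_table(raw_contracts):
    """Walk the value chain: unstructured -> semi-structured -> structured."""
    semi_structured = [add_party_metadata(n, t) for n, t in raw_contracts.items()]
    return [to_structured_row(doc) for doc in semi_structured]


def to_csv(rows):
    """Serialize the structured rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_table(RAW_CONTRACTS)
print(to_csv(rows))
```

Once the fields sit in rows like these, simple analytics become straightforward, for example averaging `term_months` across all contracts.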
While most data in firms and organizations is unstructured, there should be active efforts to work through the value chain to create structured data that can be used to unlock business value and deliver insights.
There is an adage in machine learning: "Garbage in, Garbage out" (GIGO). Any AI model is only as good as the data it's trained on. If you're cooking with old and worn-out ingredients, the dish won't be anywhere near as good as you want it to be. Before venturing into the shiny world of AI, it's important to have a data strategy that enables long-term benefit and value from a firm's data.