Data Storage
We have now looked at how to find data and how to collect it. Following the principles of the data life cycle, we have reached the point where the data must be stored. Data collection involves not just retrieving and recording data, but also compiling and storing it somewhere.
We could collect data simply by using a program like Microsoft Excel, where we can input, structure and organise the data and make it usable. In larger projects, however, it often becomes cumbersome, if not outright infeasible, to collect data manually and work in a spreadsheet.
Of course, there is a lot one can achieve with Excel and similar programs, but when we talk about data-driven projects, we sometimes need more powerful tools. We also want to automate as much of the collection and clean-up of the data as possible.
For larger data-driven projects, we often use sophisticated storage systems that can serve multiple purposes. Let's take a look at some examples here:
Databases
You will learn more about databases in the next chapter, but let's quickly explain what they are here.
A database is an organised collection of data that has been collected and structured according to specific rules. A database shares some similarities with a spreadsheet. Broadly speaking, both are structured collections of data in tables, with rows and columns. However, they operate in different ways.
A relational database—which is the most common form of database—is best explained with an example:
You want to create a list of your business's customers and their addresses. If you were to do this in a spreadsheet, you would quickly end up duplicating the same data in different cells. Several customers could live on the same street, and many would share the same postcode.
With a relational database, however, you avoid entering the street name and postcode for each individual customer. Instead, customers, addresses and postcodes are stored separately, and relationships are established that link them together. Each thing we collect data about, and the characteristics that describe it, is thus entered only once. In this way, data is centralised and duplication is kept under control.
When working with databases, we use a specific type of software known as a database management system (DBMS). Here we can modify and explore the data, and grant other applications access to it. For example, such a database could be linked to the “My Account” section of an online store, where customers view and update their own profiles.
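To make this concrete, here is a minimal sketch of the customer example using Python's built-in sqlite3 module. The table layout, names and data are our own illustration rather than a prescribed design: each postcode and each address is entered only once, and a join query reassembles a customer's full address.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a throwaway in-memory database
cur = con.cursor()

# Each postcode and each address is stored exactly once; foreign keys
# express the relationships that link the tables together.
cur.executescript("""
CREATE TABLE postcode (code TEXT PRIMARY KEY, city TEXT);
CREATE TABLE address  (id INTEGER PRIMARY KEY, street TEXT,
                       postcode TEXT REFERENCES postcode(code));
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT,
                       address_id INTEGER REFERENCES address(id));
""")

cur.execute("INSERT INTO postcode VALUES ('0150', 'Oslo')")
cur.execute("INSERT INTO address (street, postcode) VALUES ('Storgata 1', '0150')")
cur.execute("INSERT INTO customer (name, address_id) VALUES ('Kari Nordmann', 1)")

# A join reassembles the full picture, e.g. for a "My Account" page.
for row in cur.execute("""
    SELECT customer.name, address.street, postcode.code, postcode.city
    FROM customer
    JOIN address ON customer.address_id = address.id
    JOIN postcode ON address.postcode = postcode.code
"""):
    print(row)  # ('Kari Nordmann', 'Storgata 1', '0150', 'Oslo')
```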
Data warehouses
A data warehouse is a database system that is specifically designed for search and data analysis—rather than data collection and data entry. The aim is to collect structured data from many sources and facilitate its use in analyses.
Combining and storing data in one place is one way to avoid silos and make use of data across different sources. A relational database could be one such source that a data warehouse retrieves data from.
To build a data warehouse, one would use specific tools for what is called “ETL” (Extract, Transform, Load).
Data from multiple sources cannot be transferred directly, because the warehouse requires data in a consistent structure. The data must be extracted, processed and transformed into the correct formats and units before being loaded into the warehouse: hence extraction (data is retrieved from the source silo), transformation (data is converted to the correct units and formats) and loading (data is written into the warehouse).
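As a rough illustration of the three steps, here is a small Python sketch. The source format (a CSV export with temperatures in Fahrenheit), the field names and the target schema are all hypothetical; a real warehouse load would use dedicated ETL tooling, but the shape of the work is the same.

```python
import csv
import io

def extract(csv_text):
    """Extract: pull raw rows out of a source system's CSV export."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: convert to the warehouse's units (Celsius) and schema."""
    return [
        {"sensor": r["sensor_id"],
         "temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1)}
        for r in rows
    ]

def load(warehouse, rows):
    """Load: write the uniform, cleaned rows into the warehouse."""
    warehouse.extend(rows)

warehouse = []  # stands in for the warehouse's storage
source = "sensor_id,temp_f\nA1,68.0\nA2,71.6"
load(warehouse, transform(extract(source)))
print(warehouse)  # [{'sensor': 'A1', 'temp_c': 20.0}, {'sensor': 'A2', 'temp_c': 22.0}]
```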
When a business collects and structures all relevant data in such a warehouse, it has a good starting point for reporting, analyses and data-driven decision-making.
The data lake
A data lake is a data storage system for collecting large volumes and varieties of data, both structured and unstructured.
Imagine that you have lots of different data from many sources. The data lake gives you a place to collect and store all of it for later analysis or other uses. A data lake may also offer processing capabilities.
Data in the data lake does not need to be fully structured and cleaned, but is often raw and unprocessed. Unstructured data in the lake will therefore often need to be processed and cleaned before it can be used in analysis, unlike data in a data warehouse, which has already been processed and structured. This is a significant difference between the data lake and the data warehouse.
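A small sketch may help show the difference: records go into the lake raw and untouched, and the parsing and cleaning happen on the way out, when an analysis needs them. The record format and the cleaning rules here are purely illustrative.

```python
import json

lake = []  # stands in for the lake's raw storage

def ingest(raw_line):
    """Into the lake: stored as-is, no structure or cleaning required."""
    lake.append(raw_line)

def clean_for_analysis():
    """Out of the lake: parse, normalise fields, drop broken records."""
    rows = []
    for line in lake:
        try:
            record = json.loads(line)
            rows.append({"id": record["id"], "value": float(record["value"])})
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # raw data is often messy; skip what cannot be parsed
    return rows

ingest('{"id": 1, "value": "42.5"}')
ingest('not valid json at all')  # accepted into the lake regardless
print(clean_for_analysis())      # [{'id': 1, 'value': 42.5}]
```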
The data lake can be used as temporary storage for, among other things, data warehouse data (a so-called “staging area”, a place where data is gathered before use).
Working in the cloud
Databases and other storage structures can be both local and cloud-based. However, when we talk about working data-driven, we are increasingly talking about working in the cloud. This is particularly true for data warehouses and data lakes.
By working in the cloud, where data can be streamed both up from the sources and down to the users, it becomes easier to handle large amounts of data, communication between different pieces of software improves, and we gain access to real-time information from all relevant sources in one place.
With such a platform as a backbone, we can also build an application layer that makes it possible to utilise the data, whether that be an app, a digital twin, or something entirely different.
Example
The wind farm
Imagine a power company that wants to collect data about their wind farms and the operation of the turbines. They aim to minimise wear and tear, carry out smarter maintenance and optimise operations. They therefore install a series of sensors that regularly collect data from the various pieces of equipment.
With this type of large, industrial data, it would be impractical to download the data and start sorting it in Excel. Instead, the power company chooses to work in the cloud, on a custom data platform built on, for example, the cloud computing services of Amazon, Google or Microsoft.
Here, instead of static datasets, one can work with live data streams, where data arrives in real time from various sources. The power company collaborates with a technology partner to create a tailored solution in which incoming data is cleaned and processed according to specific rules. This data is then used in statistical models and visualisations that help the company extract information and insights from the data.
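As a sketch of what such rule-based cleaning might look like, here is a toy version in Python. The reading format, the turbine names and the plausible temperature range are invented for illustration; in practice the stream would come from the platform's streaming service rather than a generator.

```python
def sensor_stream():
    """Stands in for the live stream of readings from the turbines."""
    yield {"turbine": "T1", "ts": 1, "temp_c": 54.2}
    yield {"turbine": "T1", "ts": 2, "temp_c": -999.0}  # sensor glitch
    yield {"turbine": "T2", "ts": 2, "temp_c": 61.8}

def cleaned(stream, lo=-40.0, hi=120.0):
    """One cleaning rule: drop readings outside a plausible range."""
    for reading in stream:
        if lo <= reading["temp_c"] <= hi:
            yield reading

for reading in cleaned(sensor_stream()):
    print(reading)  # the glitched reading never reaches the models
```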
How we sort and process the data, and then actually put it to use and extract value, will be discussed in later chapters.