Data Quality
In the last chapter, we learned how to find, collect, and store data. Great job!
But before we can use this data as a basis for important decisions, we must answer an essential question: Can we trust the data we’ve collected?
We’ve established that data doesn’t just magically appear out of thin air. It always stems from a situation or activity, either in a digital system or in the real world: something must first exist for data to be generated. There’s no smoke without fire.
In essence, data quality is about how accurately the data reflects the situation or activity it represents.
For example, if you want to measure the percentage of Norwegians who use the Internet daily, how would you go about it? If you survey 100 random students at a school, you’ll likely find that most of them have already been online that day, perhaps even before brushing their teeth.
But does that mean all Norwegians are online every day? Of course not. A group of 100 students doesn’t represent the entire population. In terms of addressing our question, this data is of poor quality. It simply does not match reality.
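To see why the sampling matters, here is a minimal Python sketch with invented numbers (the 80% and 55% daily-use rates, and the population sizes, are assumptions for illustration, not real statistics). It simulates surveying only students versus drawing a random sample from the whole population.

```python
import random

random.seed(42)

# Hypothetical population (numbers invented for illustration):
# assume 80% of students are online daily, but only 55% of everyone else.
population = (
    [("student", random.random() < 0.80) for _ in range(50_000)]
    + [("other", random.random() < 0.55) for _ in range(950_000)]
)

def share_online(sample):
    """Fraction of people in the sample who were online today."""
    return sum(online for _, online in sample) / len(sample)

# Biased sample: 100 people, all of them students.
students = [p for p in population if p[0] == "student"]
biased_sample = random.sample(students, 100)

# Representative sample: 100 people drawn at random from everyone.
random_sample = random.sample(population, 100)

print(f"Biased sample estimate:         {share_online(biased_sample):.0%}")
print(f"Representative sample estimate: {share_online(random_sample):.0%}")
print(f"True population share:          {share_online(population):.0%}")
```

The biased sample lands near the student rate and far from the true population share, no matter how carefully those 100 respondents are surveyed.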
This concept applies to other types of data as well: An audio recording with a low sample rate and bit depth won’t represent the actual musical performance as well as a high-quality recording would.
Is the data complete and accurate?
The outcome of a data-driven project will always be at the mercy of the quality of the input data. This might seem obvious, but it’s vital to all data-related work, and is often summarised by the term GIGO: “garbage in, garbage out”.
If the input is of poor quality, you’ll probably end up with poor-quality results as well.
Data quality is about making sure the data is accurate, complete, current, and relevant so that it truly reflects reality. This ensures that the data can be used effectively for analysis and for gaining insights. We also have to make sure that the data isn’t altered or manipulated, whether on purpose or by accident, in a way that might affect the final outcome.
That’s why it’s so important to clean up and ensure the quality of our data.
This is what we look for in the data
Here are some aspects to consider when tidying up data:
Be aware of bias
The term bias refers to a data set that, in one way or another, gives a skewed representation of reality.
Sometimes we introduce bias intentionally. For instance, if we’re using an AI system to assist with hiring, we might specify that female applicants should be prioritised because we want to meet a gender quota. However, bias often creeps in unintentionally, and this can lead to significant issues.
Remember that data-driven systems can only work with the data they’re given and the code they’re programmed with. As we discussed earlier, if you put garbage in, you’ll get garbage out.
This could go wrong in several ways, for example if…
- We select data, consciously or unconsciously, that supports the answer we’re looking for, instead of providing a comprehensive and accurate picture of the situation. If you want to prove that Macs are better than PCs, you might focus on data showing that Windows systems are more prone to viruses than Mac systems. But this alone doesn’t make your claim true.
- The data is based on a situation that is not neutral. Say you want to create a model to find potential employees, and you train it on data about your current employees. If most of them are men, the system might falsely treat being male as a desirable quality in a candidate. Amazon experienced this when they had to scrap an automated hiring system because of systematic gender discrimination.
- The data is incomplete, contains duplicates, uses inconsistent formats, or has errors. Any of these flaws can make the final analysis, visualisation, report, or recommendation unreliable, as the sketch after this list illustrates.
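As a concrete illustration of that last point, here is a minimal Python sketch (using pandas and invented records) showing how all three flaws can be detected before any analysis begins:

```python
import pandas as pd

# Hypothetical records: note the exact duplicate row, the missing age,
# and the date that doesn't follow the ISO format used by the others.
records = pd.DataFrame({
    "name": ["Kari", "Ola", "Ola", "Nina"],
    "age":  [17, 18, 18, None],
    "date": ["2023-09-01", "01.09.2023", "01.09.2023", "2023-09-02"],
})

# Incomplete: count missing values per column.
print(records.isna().sum())

# Duplicates: count and drop rows that are exact copies.
print(records.duplicated().sum(), "duplicate row(s) found")
records = records.drop_duplicates()

# Inconsistent formats: parse dates against the expected ISO format;
# values that don't match become NaT (missing) so they can be reviewed
# instead of silently corrupting the analysis.
records["date"] = pd.to_datetime(records["date"], format="%Y-%m-%d", errors="coerce")
print(records)
```

Whether you then correct, flag, or drop the offending rows depends on the context; the point is to catch the flaws early rather than discover them in the final report.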
In short, we must keep our data in order and ensure it represents reality as accurately as possible. In the next section, you’ll learn more about how to organise and clean up a data set.
Insight
Data quality and governance
Chapter 1 highlighted the potential pitfalls of working with data-driven systems. If we rely on such systems for decision-making or for automating decisions, it can go terribly wrong if the data is flawed, or if the algorithms reflect our own prejudices and narrow-mindedness. Such mistakes can have real-world consequences, possibly infringing on privacy and fundamental human rights.
Hence, transparency, trust, and good management and oversight are important when dealing with such systems. This is what is often referred to as governance.
Governance means working efficiently towards our goals, but without turning a blind eye to other important factors such as ethics, legal considerations, security, and sustainability.
In terms of data quality, it’s about knowing what data you’re working with and establishing a strong process to ensure its quality and availability. This includes identifying data that should be used with caution. It’s essential to tell the difference between data that’s already been checked and approved, and data that hasn’t. Additionally, new data arriving during an active process may differ from older data that wasn’t originally intended for this use but could now prove useful in a different context.
We can rarely claim to have 100% perfect data. But if there are errors, we need to assess the potential impact. For instance, if poor-quality data is used in a credit approval process, individuals could unjustly be denied loans. However, if the same data is used in a report indicating that e-commerce rose by 16% last month, when the actual figure was 15%, the consequences are less severe. The report might not lead to harmful outcomes. Incorrect information about a loan applicant, on the other hand, has very negative and direct implications for the person involved.