How We Collect Data

There are several methods for collecting data. In principle, you can observe, take notes, download data from various sources, and manually enter the relevant values into a spreadsheet. This can be a good approach if, for example, you need to create a simple report.

However, in larger data-driven projects, where the aim is to harness the full potential of data, working this way demands a great deal of time and patience, not to mention firm confidence in your own numerical abilities.

Therefore, we usually employ various techniques and technologies, and often combinations of these, to automate the collection of information needed to solve our problem. We can use software to collect and sort digital traces. And we can use sensors to register data from the physical world.

At the same time, we can pick up and “borrow” data from various sources to use in our own systems and programs by connecting to the sources via an API (which will be explained soon).

Let's take a closer look at three methods and data sources: digital traces, APIs, and sensors.

1. Digital traces

As we have discussed many times throughout the course, the use of digital devices is itself a source of data. It takes the form of, say, log files and other analytics data, diagnostic data and functional data registered by the devices and computer systems we interact with.

When you scroll through a website, fill out a form, pause a TV series or tap your payment card, the interaction is recorded and stored. We also leave digital traces through the sensors in heart rate monitors, through fitness apps and even through our calendars. These are the traces that create your digital shadow, which we talked about in Chapter 1.

Let's look at a very specific example, namely what it can look like when we visit a website and leave some footprints behind:

On a website and in many mobile apps, parts of the underlying code, which is typically hidden from the user, consist of so-called HTML tags. Many of these tags allow us to specify structure and meaning for the content on the website. This is actually very simple, so let's be specific:

The shortest possible crash course in HTML

A markup language helps explain to the computer how text and documents should be structured and presented.

In the previous chapter, we briefly touched on HTML (Hypertext Markup Language), a markup language used to create websites. Here, we’ll show you exactly what it looks like.
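As a rough illustration of what HTML can look like, here is a minimal sketch in Python. The page content is made up for this example. Python's built-in HTMLParser walks through the markup and lists each opening tag, which shows how tags wrap the content they describe.

```python
from html.parser import HTMLParser

# A tiny, made-up web page. Each tag, such as <h1> or <p>, marks up
# the structure and meaning of the content it wraps.
SAMPLE_PAGE = """
<html>
  <head>
    <title>My first page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <p>This is a paragraph with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Collects the name of every opening tag in the order it appears."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def list_tags(html_text):
    """Returns the opening tags found in a piece of HTML."""
    parser = TagLister()
    parser.feed(html_text)
    return parser.tags


print(list_tags(SAMPLE_PAGE))
# → ['html', 'head', 'title', 'body', 'h1', 'p', 'a']
```

Here <title>, <h1> and <p> describe what kind of content they contain, while <a> turns a piece of text into a link.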

Just as these tags define a website’s content and structure, other tags allow you to give the website certain properties.

The <script> tag makes it possible to run code written in the JavaScript programming language. Such tags, together with the JavaScript code, let us specify what data should be recorded and what should be done with it. For example, you can include instructions to send data to a third-party analysis tool or to place a cookie on the user's computer.

For example, tools like Google Analytics (see fact box) and Meta’s Business Manager can give you a unique JavaScript snippet, wrapped in an HTML tag, which you can insert on your website to collect and transfer data about the website's visitors and their actions on the website (page views and clicks, to name a few). Such use of HTML tags and JavaScript snippets, specifically to collect and send data to external analysis tools, is also known as page tagging.
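To make the idea of page tagging a little more concrete, the sketch below shows one way an interaction could be packed into a request and unpacked again on the receiving side. It is written in Python for readability, although a real tag would run as JavaScript in the browser. The endpoint address and the parameter names (vid, event, page) are assumptions made up for this example and do not reflect the format of any particular analysis tool.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical collection endpoint; real analysis tools use their own.
COLLECT_URL = "https://analytics.example.com/collect"


def build_hit_url(visitor_id, event, page):
    """What a page tag might do: pack an interaction into a request URL."""
    params = {"vid": visitor_id, "event": event, "page": page}
    return COLLECT_URL + "?" + urlencode(params)


def read_hit(url):
    """What the analysis tool might do: unpack the interaction again."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each key occurs only once here.
    return {key: values[0] for key, values in query.items()}


hit_url = build_hit_url("visitor-123", "click", "/products/shoes")
print(hit_url)
print(read_hit(hit_url))
```

The point is only that each recorded interaction becomes a small, structured message that can be sent to, and stored by, an external system.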

It is possible to collect a lot of data in this way—but there are also limitations on what kind of digital traces you are allowed to collect. We'll talk more about this later in the chapter.

Example

Google Analytics

Google Analytics (GA) is a powerful website analysis tool that can provide you with insight into what's happening on your website and how it's being used. You can see how users behave on the site, how and where they are navigating, which websites they come from, what else they have looked at, and how long they stay.

The tool can, for example, be used to find out if someone who shopped in your online store found the product via search, was referred from social media, typed in the address or clicked on an ad.

To understand how GA collects data, let’s look at it within the context of the data lifecycle from Chapter 1:

Activity and data collection: When you connect GA to your website, the tool generates tags that are inserted into the website's code and then capture activity on the page and collect data. Such activity can be a purchase, a link click, scroll depth (how far down a page you scroll) or filling out a form.
Storage: The data is stored and processed on Google's servers.
Analysis, visualisation, reporting: GA will automatically set up a number of reports and visualisations you can explore. These can both present real-time insight and show patterns or development over time. You can also create custom reports and events/automations.
Action, measures: Simply looking at GA's reports is often enough to give you data-driven support for making decisions. You can also use GA's APIs to connect the data to your own systems, and possibly combine it with internal data.
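The analysis step in the list above can be sketched in a few lines of Python. The events and their fields are made up for this example and are not GA's actual data format. The sketch counts how many purchases came from each traffic source, which is the kind of question the online store example raises.

```python
from collections import Counter

# Made-up events, standing in for what a tag might have recorded.
events = [
    {"event": "purchase", "source": "search"},
    {"event": "purchase", "source": "social"},
    {"event": "page_view", "source": "ad"},
    {"event": "purchase", "source": "search"},
    {"event": "purchase", "source": "direct"},
]


def purchases_by_source(recorded_events):
    """Counts purchases per traffic source (the analysis step)."""
    return Counter(
        e["source"] for e in recorded_events if e["event"] == "purchase"
    )


report = purchases_by_source(events)
for source, count in report.most_common():
    print(f"{source}: {count}")
```

A real analysis tool does much more than this, but the principle is the same: raw recorded events are grouped and counted until a pattern becomes visible.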

2. APIs

How does Google Maps know when the next bus departure is? Google hasn't collected and stored all the world’s bus schedules on its servers. Instead, the service fetches updated information directly from the public transport companies' own computer systems.

A special type of connection between different computer systems is used here, called an API—or Application Programming Interface. Think of it as a bridge where data can move from one computer system to another according to specified rules.

More specifically, in the context of data exchange, we can define an API as a set of rules that specify how one computer system can exchange information with another.

In this context, there are two roles: a server, which contains the rules for the API and makes it available, and a client, which actually uses the API. Communication can be one-way, with data flowing from the server to the client, or two-way, where the client can also send data to the server.
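The two roles can be tried out on a single computer. The sketch below uses only Python's standard library: it starts a small server that offers departure data, then acts as a client and requests that data. The /departures address and the data itself are made up for this example, so this is an illustration of the server and client roles rather than how any real transport API works.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up data held by the server (e.g. a transport company's system).
DEPARTURES = {"stop": "Central Station", "next_departures": ["12:05", "12:20"]}


class DepartureHandler(BaseHTTPRequestHandler):
    """Server side: answers requests according to the API's rules."""

    def do_GET(self):
        if self.path == "/departures":
            body = json.dumps(DEPARTURES).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404, "Unknown resource")

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_departures():
    """Starts a local server, lets a client call it once, then stops it."""
    server = HTTPServer(("127.0.0.1", 0), DepartureHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        # Client side: request data from the server through the API.
        connection = HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/departures")
        response = connection.getresponse()
        data = json.loads(response.read().decode("utf-8"))
        connection.close()
        return data
    finally:
        server.shutdown()
        server.server_close()


print(fetch_departures())
```

This is one-way communication: the client asks, and the server answers. In a two-way setup, the client would also be allowed to send data for the server to store.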

An API allows you to plug other people's data into your own system, where you can then make use of it. But it can also be used internally, to get your own systems to communicate with each other and exchange data. Data shared through various APIs can be utilised collectively in programs offering new functionality—a so-called “mash-up”—or it can be used to gather data from multiple sources into an analysis program or dashboard (which you will learn more about in later chapters).

To put it simply, APIs are useful for two reasons. First, they let you keep data in one place while making it usable in many different places. Second, they can be used to gather data from many different sources into a single place.

The API of a restaurant visit

Here’s an example that may make it easier to understand what APIs do. We can compare an API with a restaurant visit, where the “API” determines the rules for how you, as the client, can order food from the kitchen, which plays the role of the server.
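The restaurant comparison can also be written as a small Python sketch. Everything in it is made up for illustration: the menu, the prices and the function names. The guest never enters the kitchen, but places orders through a function that enforces the rules.

```python
# Hypothetical menu: the only dishes the kitchen agrees to prepare.
MENU = {"soup": 89, "pizza": 159, "salad": 119}


def kitchen_prepare(dish):
    """The kitchen (server): does the actual work behind the scenes."""
    return f"One {dish}, freshly made"


def order(dish):
    """The ordering rules (API): what guests may ask for, and how."""
    if dish not in MENU:
        return {"status": "rejected", "reason": f"'{dish}' is not on the menu"}
    return {"status": "ok", "food": kitchen_prepare(dish), "price": MENU[dish]}


# The guest (client) only talks to the API, never to the kitchen directly.
print(order("pizza"))
print(order("sushi"))
```

An order that follows the rules gets a predictable answer, and an order that breaks them gets a clear rejection. That predictability is what makes it possible for different systems to rely on each other.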

APIs are often the engine behind websites and applications. When you, for example, search for a flight on Finn.no or in the Finn app, it is APIs that retrieve data about flight departures from many different airlines, instead of Finn storing all the information itself, in the same way that Google Maps doesn't store all the world's timetables.

Also, when you click a button on a website or in an app—for example to save a new contact, or to bring up the list of your contacts in a messaging app—APIs are at work.

Fact

REST API

APIs on the World Wide Web usually follow an architectural style called Representational State Transfer, or REST. It places some constraints on the software architecture, which makes it easier to predict how the API will behave and to get different systems to work together.

REST APIs are typically built on HTTP, with its established rules for requests and responses, so that a client can request data (rather than web pages) from a server. The returned data is usually formatted as JSON or XML.
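To show what the two formats look like, the sketch below writes the same made-up contact record first as JSON and then as XML, using Python's standard library. The record and its field names are assumptions for this example only.

```python
import json
import xml.etree.ElementTree as ET

# A made-up record that an API might return.
contact = {"name": "Ada", "phone": "12345678"}

# JSON: keys and values, written much like a Python dictionary.
as_json = json.dumps(contact)

# XML: the same values, wrapped in named tags (similar in style to HTML).
root = ET.Element("contact")
for key, value in contact.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)  # → {"name": "Ada", "phone": "12345678"}
print(as_xml)   # → <contact><name>Ada</name><phone>12345678</phone></contact>
```

Both formats carry the same information. Which one you receive depends on what the API offers and what the client asks for.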

3. Sensors

While an API allows you to retrieve data from other digital sources, out in the physical world, we use sensors for data collection.

The word has the same origin as “senses”, and that is exactly what a sensor is: something that allows electronics to “sense” and register their physical surroundings.

Sensors often work in conjunction with technology that allows us to store and/or read the data that is registered. An example is the classic thermometer. Here the sensor technology is the mercury that reacts to the heat, while the technology that allows us to read data is the glass and the numbered scale around it.

Today, we have digital sensor technology that allows us to measure everything from distance to pressure, humidity, movements and light, and much more. In conjunction with other technology, sensors enable everything from reverse vending machines for recycling to computer vision (computer systems that can see and understand their surroundings using machine learning), or air purifiers that automatically react to, keep statistics on, and improve the air quality in a room.

As we've talked about before, there are a number of sensors in your mobile phone, for example, and sensors are important in all IoT devices, smart homes, and smart cities.

Sensors typically collect data that is processed by a computer, either locally or in the cloud. The data then goes through the data lifecycle, from storage to steps such as processing, analysis, visualisation and/or reporting. Often, the device with the sensor will receive new instructions or carry out an action when the data has been processed. This can be based on software installed on the device itself or on communication sent from computers over a network (locally or over the Internet).
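The loop from sensor reading to action can be sketched as follows, using the air purifier as an example. The readings, the threshold and the fan speeds are all made up for illustration, and a real device would read values from actual hardware rather than from a list.

```python
from statistics import mean

# Hypothetical threshold (micrograms of fine particles per cubic metre).
PM25_LIMIT = 25.0


def decide_fan_speed(readings):
    """Processes recent sensor readings and picks an action for the device."""
    average = mean(readings)
    if average > 2 * PM25_LIMIT:
        return "high"
    if average > PM25_LIMIT:
        return "low"
    return "off"


# Made-up readings standing in for values from a real air-quality sensor.
history = []
for reading in [8.0, 12.0, 30.0, 70.0, 90.0]:
    history.append(reading)            # storage
    recent = history[-3:]              # processing: keep the last 3 values
    action = decide_fan_speed(recent)  # analysis leads to an action
    print(f"reading={reading:5.1f}  average={mean(recent):5.1f}  fan={action}")
```

Averaging over the last few readings, instead of reacting to each one, is a simple way to keep the device from switching on and off because of a single odd measurement.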