A Day in the Life of a Data Scientist
Do you remember back in the first chapter when we discussed data science? This interdisciplinary field blends statistics, programming, data analysis, and business understanding to organise data and extract data-driven insights.
Data science can be incredibly valuable wherever there are systems and processes from which we can collect data and which could potentially be improved or optimised. And that’s pretty much everywhere! However, working with data in this way can be daunting and time-consuming, so it’s important to know where to direct your efforts.
The process we’re discussing in this section, cleaning and preparing data, is often the most time-consuming work for a data scientist.
You’ve already had a taste of what that entails. Now, let’s take it up a notch and get a sneak peek at how these processes are managed at a more advanced level than what we can handle with Excel.
The value of asking the right questions
A data scientist’s main job is to lay the foundation for data-driven decisions, by getting various datasets and data streams to reveal all their secrets.
They do this by creating models, revealing relationships within data, and producing various types of visualisations. However, the bulk of their work typically lies in data preparation.
Firstly, one must understand the fundamental problem that needs solving. If the aim is to optimise maintenance in a wind park, the data scientist might start by asking questions like:
- What are the costs associated with maintaining the wind park, and what causes these costs?
- How much energy production is lost during wind turbine downtime, how expensive is that, and who is impacted?
- How often is equipment—such as wind turbines and blades—being replaced, and what determines when a piece of equipment needs to be replaced?
- Is there a record of how the wind turbines are used and how much they wear down over time? If yes, how is this data being collected and tracked?
- Are there logs kept of how often, and why, wind turbines break down or need repair?
- Could it be beneficial to schedule maintenance at specific times with low production to minimise the negative consequences of downtime?
A data scientist must not only know programming and statistics, but also possess business understanding and “domain knowledge”, meaning expertise in the specific area they’re working in. They’ll work closely with field professionals, and the more thoroughly they understand the problems and the better the questions they ask, the better the results and the greater the value they can extract from their work.
Not only can the answers to these questions provide relevant data, such as cost weighed against efficiency; they can also uncover new questions, new insights, or highlight a need for more or better data.
Collection and preparation
As we’ve mentioned, the data science work process is not linear. We’ve said we work “backwards,” but perhaps more accurately, it’s iterative. Simply put: we go multiple rounds, use our stated end goal as a reference, and make improvements and progress with each round.
Once the goal is defined, we start collecting data. The data you choose to gather is also the first step in preparing data, as it sets the stage for the rest of the process.
Insight
Data formats and data preparation
As we learned in connection with the data lifecycle, data must be collected and stored in some specific format. It can, for example, be structured in a spreadsheet, entered into a database, or written into a text document in CSV format (Comma Separated Values, a standard for formatting data as plain text).
When we use databases and for example CSV files, we can use programming languages like SQL, R, and Python to work with the data, either to extract insights or feed it into an application. You’ll learn more about this later in this chapter and the next.
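As a small illustration of what this looks like in practice, here is how one might read CSV-formatted data with Python’s standard library. The turbine readings below are made up for the example:

```python
import csv
import io

# Hypothetical CSV data: turbine id, timestamp, power output in kW.
raw = """turbine_id,timestamp,output_kw
T1,2024-01-01T00:00,1520
T2,2024-01-01T00:00,1480
T1,2024-01-01T01:00,1610
"""

# csv.DictReader maps each row to a dict keyed by the header line,
# so columns can be accessed by name rather than position.
rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    print(row["turbine_id"], row["output_kw"])
```

In real work the data would come from a file or database rather than a string, but the principle is the same: once the plain text is parsed, each row becomes a structure a program can filter, transform, and analyse.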
The next step is getting a comprehensive overview of this data, and structuring and cleaning it up. We’ve already covered what that involves: Is the data correct, or are some measurements wrong? Is the data incomplete? What connections, similarities, differences, and associations exist between different data sources? This is called profiling the data, and it’s necessary for the cleaning and tidying process that follows.
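Profiling often starts with very simple checks. A sketch, using invented sensor readings: how many values are missing, and do any fall outside what is physically plausible?

```python
# Hypothetical power readings in kW; None marks a missing measurement.
readings = [1520, 1480, None, 1610, -40, 1555, None]

# Completeness: how many measurements are missing entirely?
missing = sum(1 for r in readings if r is None)

# Correctness: assume a turbine cannot produce negative power,
# so any negative reading must be a sensor or recording error.
suspect = [r for r in readings if r is not None and r < 0]

print(f"{missing} missing values, {len(suspect)} suspect values")
```

Answers to these questions then drive the cleaning step: missing values might be filled in or dropped, and suspect values investigated or corrected.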
We also need to perform necessary transformation of the data, which is about ensuring that the data is comparable. For example, if we’re working with sensor readings, we need to ensure they all use the same units and times of measurement.
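The unit problem mentioned above can be sketched in a few lines. Suppose (hypothetically) two sensors report wind speed in different units; before comparing them, we convert everything to one unit:

```python
# Two hypothetical sensors reporting wind speed in different units.
sensor_a_ms = [8.2, 9.5, 7.1]      # metres per second
sensor_b_kmh = [30.6, 33.5, 27.0]  # kilometres per hour

# Transformation: convert sensor B to m/s so the series are
# directly comparable. 1 km/h = 1/3.6 m/s.
sensor_b_ms = [round(v / 3.6, 2) for v in sensor_b_kmh]
print(sensor_b_ms)
```

The same idea applies to timestamps, currencies, or any other quantity recorded inconsistently across sources.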
Exploratory data analysis
Creating models could be considered the core activity for a data scientist. But before we get to that, there’s one more thing to do: exploratory data analysis.
Insight
Statistics and models
What sets a data scientist apart from a typical Excel user is not just the complexity and size of the data, but also the tools they use. These include advanced statistics, programming, and machine learning.
Once the data is cleaned and structured in a database or similar, the data scientist can start exploring the data and creating statistical and machine learning models. Recall from chapter 2 that such models are programs trained with machine learning algorithms to process data for specific purposes.
Such models are used, among other things, to help us predict the future. They learn from historical data to assess the likelihood of future events, such as who will win an election, what the weather will be like next Thursday, or when a wind turbine blade should be replaced to prevent downtime and accidents.
This step, which lies between data clean-up and model development, involves exploring the possibilities of what can be done with the available data.
For instance, you might use descriptive statistics to look at average values and dispersion in the data, or create diagrams and charts to examine averages, fluctuations, and deviations. In our wind turbine example, this could be related to downtime and repairs.
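Descriptive statistics like these are a few lines of code. Using invented monthly downtime figures for one turbine, Python’s standard `statistics` module gives us the average and the dispersion:

```python
import statistics

# Hypothetical monthly downtime, in hours, for one wind turbine.
downtime_hours = [4, 7, 3, 12, 5, 6, 4, 9]

mean = statistics.mean(downtime_hours)     # central tendency
spread = statistics.stdev(downtime_hours)  # dispersion around the mean

print(f"mean: {mean:.1f} h, standard deviation: {spread:.1f} h")
```

A high standard deviation relative to the mean would tell us downtime varies a lot from month to month, which is itself a clue worth exploring further.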
Getting familiar with the data in this manner, gaining a deep understanding of what you’re dealing with, will provide a strong foundation for choosing what data to focus on and how to best develop and adjust a model.
It’s only after this deep dive into the data that the real work of model development starts. This stage involves crafting various models, assessing them, and making comparisons. In the end, the model that best aligns with our initial goal and delivers superior results will be chosen.
A successful model in this case could predict when a wind turbine will need maintenance before it actually breaks down. Or, it could indicate what operational adjustments should be made to minimise wear and tear. This approach can prevent downtime and reduce costs.
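To make the idea of such a predictive model concrete, here is a deliberately simple sketch with made-up numbers: fit a straight line (ordinary least squares) to weekly vibration readings and estimate when the trend will cross an assumed maintenance threshold. A real model would be far richer, but the principle, learning a pattern from historical data and extrapolating it, is the same:

```python
# Hypothetical weekly vibration readings from one turbine.
weeks = [0, 1, 2, 3, 4, 5]
vibration = [1.0, 1.2, 1.3, 1.5, 1.6, 1.8]

n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(vibration) / n

# Ordinary least squares: slope and intercept of the best-fit line.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, vibration))
    / sum((x - mean_x) ** 2 for x in weeks)
)
intercept = mean_y - slope * mean_x

# Assumed vibration level at which maintenance becomes necessary.
THRESHOLD = 2.5

# Extrapolate: at which week does the fitted line reach the threshold?
week_due = (THRESHOLD - intercept) / slope
print(f"maintenance predicted around week {week_due:.1f}")
```

Scheduling maintenance shortly before that predicted week, rather than after a breakdown, is exactly the kind of cost-saving decision the text describes.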
Finally, the model must be implemented, or “put into production”. This involves regularly collecting new data, updating results, and maintaining the model. The accuracy of the results also needs to be continually assessed and adjusted if necessary.
As you can see, data science involves a lot of statistics, mathematics, and programming, but efforts are being made to make these processes more accessible through user-friendly tools. This is closely related to data literacy: Over time, more non-specialists will be able to perform tasks that are currently handled by a data scientist.
The next chapter will delve deeper into working with statistics, analysis, models, and visualisations, in essence, extracting insights and value from data. But first, let’s learn about databases.