Tools for Data Analysis
Excel, you can handle just fine. But when you hear terms like artificial intelligence, data mining, algorithms and models, it might all seem a bit distant and too advanced.
In reality, it isn’t all that mysterious. At the core of these fancy buzzwords, we find solid and practical statistics, mathematics and programming. We find rows and columns of data. Probability theory, averages, medians and linear regression. Custom programs where we can enter formulas and commands and get answers and results.
Someone who has the exploration and value creation from data as their expertise—a data scientist—has all of these skills and techniques in their toolbox.
Let’s take these tools out of the cluttered toolbox and lay them neatly on the workbench, so to speak, to get a bit of an overview.
Remember, these categories can blend together. For instance, data mining might involve coding with R or Python, while both data mining and Business Intelligence could involve machine learning, and so on.
Toolbox
Let's take the following tools out of the messy toolbox and neatly lay them out on the floor, so to speak, to get a bit of an overview:
Data and coding
As you know by now: A bit of basic knowledge about programming can be a huge asset when it comes to analysing data, even for those who do not speak fluent code. Let’s build on what you’ve already learned about programming and show you can apply it for real life usage.
Data analysis is where the programming language R shines.
The syntax in R aimed at data processing, rather than for example object creation. R code resembles mathematical equations and expressions more than many other languages.
You do not need expensive enterprise programs and extensive cloud solutions to work with these things; you can get very far with structuring data for example in a CSV file (which you remember stands for Comma Separated Values) and add a few lines of code on top of it with a free code editor.
R is primarily used for statistical analysis and data mining, and is popular in both academic environments and in industry. If you’re familiar with maths and have used Excel formulas, then R will feel pretty intuitive to you. It’s designed to be user-friendly, especially for those already versed in stats and spreadsheet tools.
Work smarter with reusing and googling
Developers are typically far above average when it comes to googling skills—and in building solutions on top of others’ work. After all, there’s rarely any point in figuring things out on your own when others have done it before, or to start from scratch when there are free and easily accessible resources you can utilise.
That being said, just like in other areas, it’s essential to be critical of your sources when working with code. This also relates to having a basic understanding of what’s happening in the code and of best practices. For example, not every piece of code is going to be equally efficient or fit perfectly into your project.
As you learn more, you’ll get better at asking the right questions and spotting the best solution for your problem. Being able to ask those smart questions and judge your sources is often even more valuable than just being good at coding, especially when it comes to data and statistics.
Keeping that in mind, here are some handy tips to remember:
Example: Similarities between Excel and R
If you’re already well-versed in Excel, you’ll have no significant issues learning basic R, as we’re essentially doing much of the same thing. Let’s take a specific example:
We have a dataset where we have collected temperatures from a certain location, and we want to find the highest and lowest value.
We’ll keep it extremely simple and say that the dataset consists of only three values: 5, 10, and 15. Of course, you don’t need either Excel or R to figure out which number is the highest and lowest of these three. The point is that you could have entered millions of numbers in this dataset and used the same method. Let’s see how we can do this—first in Excel, and then in R:
Largest and smallest in Excel
All formulas in Excel start with an equal sign (=). This is similar in principle to using programming languages; we use certain characters or keywords in specific places for the machine to understand what we mean.
In this case, we will use the formulas MAX (largest value) and MIN (smallest value). Since the dataset is so small, we can write the values directly into the formula. We select the cell in which we want the result to be displayed, and write:
=MAX(5,10,15)
When we press Enter, we will now get the value 15 in the selected cell, since it is the largest number in the selection. Similarly, we can write the following to get the smallest value:
=MIN(5,10,15)
Note: As most guides and documentation on the Internet are in English, it might be a good idea to set English as the preferred language in the program, making it easier to follow these step by step.
Largest and smallest in R
Exactly the same thing we just did in Excel, we can also do with R. We use a code editor, like Jupyter Notebook or similar (see fact box below). Here we can simply write:
max(5,10,15)
or:
min(5,10,15)
Then we run the code, and get 15 or 5 as a result, respectively. And as you can see: it’s virtually identical!
Insight:
Examples of code-based tools related to data analysis
- Jupyter Notebook: A free, web-based tool based on open source code and widely used for data analysis. It supports various programming languages, including R and Python. The data behind the Large Synoptic Survey Telescope, the first telescope in the world to show how the universe changes in real time, was processed in Jupyter Notebook!
- Gephi: Another free program, also based on open source code. Gephi is often used to analyse and generate visual representations of data, especially of complex networks.
RapidMiner is also a program that allows us to analyse data, but this costs money. It is largely template-based, which means you don’t necessarily have to code yourself.
If we actually had years of temperature measurements to deal with here, and not just three values, we would do this in a slightly different way. More specifically: we wouldn’t manually write the values as part of the formula. Inside the parentheses, we can instead define which data we want to run the formula on. This applies to both Excel and R.
For example, if we put “Temperature” as the heading in the first column in Excel and input the values below, they would go into cells A2 to A4 (first column, row 2 to 4). Then we could write the formula =MAX(A2:A4), and get the same result as before. If we had thousands of different temperatures, and they continued all the way down to row 17933, the formula would be =MAX(A2:A17933).
Or even better: we can use dynamic ranges. If a new value is added every day, we don’t want to go into the formula and expand to 17994, then 17995, and so on. We keep all the data in one column and write =MAX(A:A) to include all cells with data in A. Brilliant!
Are you following?
If you had to find these top and bottom values manually in a dataset with millions of values, it could easily take you weeks, months, or even years, depending on the quantity. In both Excel and R, the same exercise only takes milliseconds for a modern computer.
But why use R if Excel can already do the calculations in the same way? Well, they both have their strengths and limitations. Therefore, it might be advantageous to be familiar with both.
R is generally a more powerful tool for statistical analysis and data mining than Excel, and also much more flexible in terms of the possibilities you have and the data you work with. You can visualise in exactly the same way in R as in Excel, and also achieve much more customisable and more powerful analysis. But the language requires more knowledge to master.
Excel simply makes it easier to analyse and visualise data without needing to know how to code, and is therefore more accessible to most people. However, both tools require a certain mathematical understanding, and it pays off in both cases to familiarise yourself with basic concepts and functions from statistics and build on this as you become more comfortable with the tools.
If you get good at R, you can make exciting, creative, and striking visual representations of your data.
Now we’ve covered a couple of tools. But the most important tool for a data scientist is statistics and statistical techniques. Therefore, let’s delve a little deeper into what it really is.