What do we call a data science tool?
It has become an umbrella term for quite a variety of sources, like data science software, languages, databases, and platforms. You may think of these tools as something that data scientists use to apply mathematical concepts and their statistical derivatives to the big amounts of information.
Each data science tool serves a goal to make sense out of big data. It may be exploring some tendencies, prediction, and creating a self-learning algorithm. This order matches the growth of the complexity of the tasks the toolkit must solve.
Alongside with the complexity, your end-user may also have an influence over your choice. If you only need to handle your results over to another data scientist, you may stick to programming languages. If you need to report your results to a project manager or other person without a natural science background, you may prefer to include visualization tools in your arsenal.
This overview was written with the needs of different people in mind: whether you are a data science specialist, a business owner, or even an academician.
The following tools have a proven success record and promise to exist for a while. So, take your time to get to know them!
For beginners and lesser big data
Imagine that you are only planning to start with big data, or you already know that the scope will stay rather moderate. In such cases, it may worth conducting data analysis in Microsoft Excel or Access.
A few Pros
Excel and Access will save you some money since they both are natively included in Microsoft Office.
They are also one of the most undervalued data science tools. By now, Excel allows a lot of basic manipulations with the data and covers all essential data formats, such as strings, numbers, and dates. Converting data from one format to another, concatenating and splitting strings, looking for string patterns, doing simple calculations, applying conditioning clauses, descriptive statistics, and unsophisticated time series — all these are available in Excel.
And many Contras
Although, you would soon notice that the Excel way of working with data — the formulas — can get really messy. You do not compile them statement by statement, but rather embed one statement in the other, thus, making a formula difficult to read.
Depending on the hardware, Excel can quickly come to its memory limits. The same operations can be performed in Access that can digest a bit more data than Excel can.
Apart from it, these two Microsoft products offer a little space for maneuver when it comes to making changes in the structure of the data. You can summarize your data using pivot tables, but it gets tricky if you want to do the opposite thing and increase the granularity of the data.
Last but not least, Excel and Access are more suitable for a one-time analysis. Yes, Excel has macros, but they are quite vulnerable. To perform full automation in Excel, you would need to know such a rare programming language as Visual Basic.
For intermediate level
As soon as you need automation, you have to look around for the actual data science software and set up a data storage. Typically, the last one would be an SQL-type database. This leads us to the next must-have in your data science toolkit: the Structured Query Language.
The SQL family
You will use SQL on different stages of your data science career. A database administrator needs to know the differences between MySQL, NoSQL, SQLite, and Postgresql fluently. A user that only writes queries would not get overwhelmed by the variety. A switch between the members of the SQL family is quite easy.
Most of the data science software tools have embedded connectors for different types of SQL databases. SQL remains one of the easiest querying languages to learn, which has not much changed over time.
An SQL database can process queries very fast. For this reason, some versions are barely affordable for a small company or a scientific project without a big budget. And although you feel very comfortable with doing a simple exploration of the data, any complex manipulations or data pipelines are not possible.
At the intermediate level, we rather talk about data analysis than data science. You explore data but do not change the underlying structure significantly.
At this stage, the focus is also on visualizing results. For instance, if you need to present your first findings to the broader audience.
If you want to stay in the Microsoft universe, you may consider using PowerBI. It will involve some additional license costs, but it opens you a door to automation and advanced data visualization. PowerBI allows using a drag-and-drop interface even for the operations that you would “code” in Excel. This replacement of the huge formulas was a great update Microsoft has managed to implement.
Unfortunately, as soon as you go beyond a simple replacement of Excel and do some refined analysis, you will be confronted with the native PowerBI languages: DAX and M. They have two basic disadvantages: they are used only in this single tool and they suffer from this unnecessary complexity that all Microsoft products share.
In this regard, Tableau is more maneuverable, easy to learn, and the charts convince with elegance and unprecedented beauty.
Please, do not get it wrong: Tableau can do pretty advanced things too! Both Tableau and PowerBI has an exhausting list of database connectors and allow not only data visualization but also data preparation and building automated pipelines.
Regarding the last option, PowerBI cannot run scheduled pipelines out-of-the-box: you’ll need to connect it to an Azure gateway.
Which tool to use, may highly depend on your stakeholders. If they prefer dry data fitted in tables and simple charts, you use PowerBI. If you want to go to a conference and showcase some mind-blowing data visualization, then you use Tableau.
Tableau is a better option if your stakeholder is not your colleague, but a client. You can customize your Tableau report so that it looks pretty much like an authentically programmed website so that the client has a feeling of exclusivity.
I personally found Tableau much more intuitive to work with.
For advanced ambitions
As your data grows, your focus moves even more to the process automation. That is what previous tools lack or do not offer in a smoothly working form.
Drag-and-drop solutions: process builder
RapidMiner d is a drag-and-drop tool for data mining; it has a long list of operators that help you to alter your data. Even data analyst in an advanced career stage won’t be disappointed. You can also buy a server license to schedule your processes.
I cannot name any dark sides of RapidMiner! It was nice to use and it offered an interface to show the database tables, all columns in one window. It saved a few seconds of writing an SQL query.
When the processes become complicated, the visual representation of data pipelines — which looks like a mind map — got quite messy. At this stage, I wanted to start coding. And I could! You can insert an R or Python script in one of the boxes in the pipeline to replace a clumsy part.
Drag-and-drop tools for advanced statistics
If you need to go beyond descriptive statistics and analyze correlations, variations, co-variations, and other kinds of dependencies in the data, IBM SPPS is at your disposal.
It is not quite user-friendly and it requires very advanced knowledge of statistics. I mean, it is not enough to know what normal distribution is, but to be able to write mathematical equations explaining normal distribution. You won’t code in this tool, but each operation you perform has so many customization opportunities that you need to be able to tell them from one another.
IBM SPPS is therefore more popular in research institutions. It helps you to build visualizations, but they are far from being fancy. I have not worked with it for ages, but they used to be … only black-and-white back in 2014!
At the beginning of 2010s, SPSS could be run only locally, which meant you needed a powerful computer. It saved the data in its own format, which made you a slave of the PC you used it on.
But clinical testing, advanced social research — they all still get along well with IBM SPSS and they have their good reasons for continuous cooperation.
Data science programming languages
To keep climbing the data scientist career ladder, you’ll come to learning programming sooner or later.
In addition to SQL, R and Python are essential languages for data science. R is made for customizing advanced statistic modeling and I got to know it as an “extension” of IBM SPSS.
I would not say that R is simple to learn, but because there are so many online programming courses around there, I cannot say it is impossible or particularly challenging.
You can create an interactive reporting with R Shiny. I nevertheless find data visualization options of R rather annoying, since the code often looked quite chaotic and the parameters do not really have a solid system.
Also, R tends to be unstable and depends on the tool you run it into. For this reason and because I needed more flexibility as well, I switched to Python.
I don’t want to repeat myself since I have written about Python before. A couple of words, why, and how Python opens new horizons for you. You can program apps and data pipelines and marry them in Python. You can embed Python-based dashboards on your website or online publication.
Exactly as with R, using a Python tool, like Jupyter, will cost you nothing as long as you do not need additional data storage or a cloud computing kernel. You can integrate a Python script into AWS and many others.
For fluent and AI and real-time (big) data
I have not managed to go beyond data science and business intelligence. In one of my jobs, I have supported an AI team that was programming in R. But later from the numerous job announcements, I have learned that Python is more common in this subfield.
Summary: what’s hot and what’s not
For your first approach to the data and if the volume is rather modest:
- Microsoft Excel
- Microsoft Access
We cannot really talk about any advantages, other than you basically need not learn anything new. We get familiar with Excel at school and Access is no rocket science, too. The limitations are:
- not suitable for big amounts of data
- no changes in the data structure
- little to no automation is possible
For data analysis and business decision-making:
These tools enable advanced data analysis; the last two do a great job of building compelling data stories and reports.
The limitations are:
- Analytical functions are pre-defined and cannot be customized
- Expenses are high
- You need some training, which is not always for free
The last two reasons lead to the unpopularity of at least PowerBI in academia.
For advanced and fluent level:
- IBM SPSS
R and Python are both open-source and must-haves for a statistician working in academia. They are highly flexible, the documentation is reached and up-to-date, especially for Python. With the last one, you enter the whole Python universe that offers much more than data science.
The limitations are:
- RapidMiner and IBM SPSS are very expensive
- R and Python programmers are not cheap — a disadvantage you as a business owner may experience
The knowledge of these tools is beneficial, since they are wide-spread and, looking for a new job, you won’t be limited by any exotic applications or languages only few companies need.