Working on a Data Science project involves many activities: gathering and cleaning the data, analyzing and processing it, and finally visualizing and sharing the findings. In each of these, the person in charge of the task (a Data Scientist, a Data Engineer, etc.) can take advantage of the different tools available on the market (some are shown in Figure 1). However, choosing a tool is not easy. There are many things to consider, for example the time available, whether the analysis will be online or offline, the amount of data, the interests of the data consumers, and whether or not you know how to program. Another important aspect is whether the tool is free to use or requires a paid license, which can be a key factor in the decision.
Figure 1. Big Data & Analytics Landscape 2019.
If we know our data and have a clear goal from the beginning of the project, one that tells us where we are heading and what needs to be done, it will be easier to choose which tools to work with.
Data Science Tools
Let's define what a Data Science tool is: a tool used to work with complex data in order to clean, process and analyze it, with the aim of generating useful information.
In a Data Science project, the first thing we need to think about is our data. How are we going to obtain it? Where does it come from? Is it going to be structured or unstructured? There are different techniques for collecting it: we can configure APIs, use sensors, apply web scraping or query a database, to mention a few. I will not explore tools for this phase, since data acquisition is highly dependent on the data's origin and, in many cases, the tools used for it are developed in-house. However, whatever analysis we want to perform, it is key to ensure that the data is up to date and that there is enough space in our environment to store it.
Once we have our raw data, the next steps are to clean, organize and store it. If we do not clean our data, we are very likely to obtain unexpected or unrealistic results. To perform these tasks we can use an ETL tool. ETL (Extract-Transform-Load) is a process that allows us to combine diverse data sources, organize their content and store it in a centralized repository.
Several vendors offer ETL tools, such as IBM InfoSphere DataStage ($$), Amazon AWS Data Pipeline ($$) and Microsoft Azure Data Factory ($$), where $$ marks a tool that requires a paid license. They provide friendly visual interfaces that let users with no programming experience define the information flow, and in most cases they can be integrated with the vendor's own cloud platform and other services. However, these tools sometimes offer more features than are needed to solve the problem at hand. On the open source side, we find tools such as Apache Camel, Apache NiFi, Apache Airflow and Logstash, among others. Open source tools usually include fewer features and can be slightly more complex to use, but in some cases we can adjust their code to tailor them to the needs of our application. And if nothing available on the market suits your requirements, there is always the option of developing your own ETL in your favorite programming language.
The last step of the ETL process is loading the data, which raises the following questions: where are we going to store it, and how? Nowadays there are many flavours of databases to choose from, and we can also choose the location: in-house or in the cloud. Some popular databases are listed in the following table:
Table 1. Comparison between databases.
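To make the home-grown ETL option mentioned above a bit more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not a recommended design: the sample data, the column names (sensor_id, temperature_c), the readings table and the function names are all hypothetical, and an in-memory SQLite database stands in for whatever repository a real project would use.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for a real source (file, API, sensor feed).
RAW_CSV = """sensor_id,temperature_c
a1, 21.5
a2,
a3,not_a_number
a1, 22.0
"""


def extract(text):
    """Extract: read rows from CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace and drop rows with missing or invalid readings."""
    cleaned = []
    for row in rows:
        sensor_id = (row.get("sensor_id") or "").strip()
        raw_value = (row.get("temperature_c") or "").strip()
        if not sensor_id or not raw_value:
            continue  # skip incomplete rows
        try:
            value = float(raw_value)
        except ValueError:
            continue  # skip readings that are not numeric
        cleaned.append((sensor_id, value))
    return cleaned


def load(records, connection):
    """Load: store the cleaned records in a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, temperature_c REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?)", records)
    connection.commit()


def run_etl(text, connection):
    """Run the three steps in order and return the number of rows loaded."""
    records = transform(extract(text))
    load(records, connection)
    return len(records)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(run_etl(RAW_CSV, conn), "rows loaded")
```

Of the four sample rows, only the two complete, numeric ones reach the table, which is the kind of cleaning that keeps unrealistic values out of later analysis. Keeping extract, transform and load as separate functions makes it easier to swap the source or the target without touching the cleaning rules.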
A database is not the only option for storing data. Platforms such as Hadoop allow us to both store and process large data sets.
Once all the data has been stored in the repository, which tools can we use for analysis and processing? If programming is not your strong suit, tools that may interest you include RapidMiner ($$), DataRobot ($$), Trifacta ($$), Excel ($$), SAS ($$) and IBM Watson Studio ($$). These tools have been designed with many features to cover most application scenarios. However, if your scenario falls outside the available options, you may not be able to complete the analysis with them.
On the other hand, if you have good programming skills, two tools lead the market in terms of their use in Data Science projects: Python and R (RStudio). The recommendation is to use R for offline analysis and Python for online analysis and algorithm implementations. Tools like IBM SPSS ($$), MATLAB ($$), BigML ($$), TensorFlow and Weka can be useful when we are interested in Machine Learning. When processing time is key, Spark, BigQuery and Hadoop are worth a look.
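For readers taking the programming route, the following sketch gives a rough idea of what a first analysis step in Python can look like. It uses only the standard library so that it runs anywhere; in practice a library such as pandas is commonly used for this kind of grouping. The readings, sensor names and the summarize function are hypothetical and serve only as an illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical cleaned readings, e.g. rows read back from the repository.
READINGS = [
    ("a1", 21.5),
    ("a1", 22.0),
    ("a2", 19.0),
    ("a2", 20.0),
    ("a2", 21.0),
]


def summarize(readings):
    """Group readings by sensor and compute count, mean and standard deviation."""
    grouped = defaultdict(list)
    for sensor_id, value in readings:
        grouped[sensor_id].append(value)

    summary = {}
    for sensor_id, values in grouped.items():
        summary[sensor_id] = {
            "count": len(values),
            "mean": statistics.mean(values),
            # The sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


if __name__ == "__main__":
    for sensor_id, stats in summarize(READINGS).items():
        print(sensor_id, stats)
```

A summary like this is often the first check of whether the stored data looks plausible before moving on to more elaborate models.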
To finish this entry, I will talk about visualization and information sharing. Completing all the previous steps correctly does not guarantee the success of a project if the findings are not displayed in a clear and adequate manner. When presenting information, it is key to understand the audience, their interests and their background in order to choose the right approach for visualizing and sharing the results. Some popular visualization tools in Data Science projects are Tableau ($$), ggplot2 (works with R) and Matplotlib (works with Python).
In this post, I have mentioned some of the most popular Data Science tools for the different phases of a DS project. At the end of the day, the best tool for you and your project will depend on many variables, but it is always good to have a starting point. If you are not familiar with either R or Python, it would be good to add them to your learning list, along with different data analysis techniques. Of course, this is just the tip of the iceberg, and there are many other relevant skills to acquire.
Tools will help us solve the problem (store data, apply algorithms, etc.), but to obtain the expected results it is important that the people involved in the project also have a good understanding of database management, statistics, data modelling, etc.
About the Author
I am Silvana, a software performance analyst in the Spectrum Scale project.