What is the Purpose of this Blog?
This blog describes how a group of IBMers is discovering the world of data science together by participating in BDA projects that help other IBM teams improve their daily activities. Most of the teammates involved in this initiative started without any data science experience, which means no experience with data science technologies, processes, or data analysis. As the work items of the initiative have moved forward, we have been able to learn and apply different data science concepts in a real project, which has helped us expand our knowledge and skills.
In future blog entries, we will describe our experience and share some of the knowledge we have been acquiring along our data science path. You will then be able to see how we perceive data science, what we think a data scientist is, and how we are applying these concepts in a real project. We hope this know-how is useful to others who are also interested in BDA and data science topics.
I am Gonzalo Ayala, and I work as a Software Developer at IBM, mostly with technologies such as DB2 and Java. Some time ago I became curious about the data science world, and for that reason I joined this initiative, which has been a huge opportunity to improve my skills.
One of the most important things when starting out in data science, I believe, is to clearly understand what data science is and what the activities and responsibilities of a data scientist are, in order to figure out which skills we need to acquire.
What is Data Science?
Data Science is a term that is becoming very important these days. If you look for a definition, you will notice that there are several different opinions. Some refer to Data Science as the study of data; others link Data Science directly with machine learning or Big Data. In my opinion, Data Science is the study of data with the purpose of getting something valuable from its analysis, regardless of the amount of data or the method used to process it.
We apply Data Science every day in our lives. Whether we are engineers, lawyers, housewives, or anything else, we are always analyzing data and making decisions based on that analysis. For example, when we go to the supermarket to buy a product, we analyze the price, the content, and the quality of the brand, and based on those characteristics we make a final decision about which product to buy.
In the strict context of computer science, I see data science as the application of the scientific method to extract value from diverse datasets using statistical and computational techniques and tools.
What is a Data Scientist?
A Data Scientist is a person who studies data and gets something valuable from it. Data by itself doesn’t have any value if nobody analyzes it and gets some benefit from it. A benefit could be economic, an improvement to a process, knowledge that helps us make better decisions in some area, and so on. A Data Scientist should be curious about the data and must be able to discover patterns, behaviors, and trends within the data that nobody else sees.
Depending on its size, a Data Science project can span several different areas, and for that reason a Data Scientist can focus on one part of the process and make an important contribution in that area. For example, we can build a Data Science team with some people who are good at statistics, others who are good at programming, and others with strong skills in visualization and reporting, databases, etc.
With that said, it is important to note that a medium-to-large data science project involves a group of multidisciplinary experts, including data scientists, all working together on different parts of the project.
Data Science in Computer Engineering
Today, the amount of data being generated by human beings is reaching extraordinary levels. Not only humans generate data: electronic devices connected to the internet are also generating information and storing it in huge amounts of space. For that reason, engineers need to develop new technologies adapted to this new era of data. This includes the design of new types of databases, such as NoSQL ones, which break with the paradigm of relational databases, the best known so far.
Data Science involves several important areas, such as mathematics, statistics, and machine learning. As said before, a medium-to-large data science project ideally requires people with knowledge in all of those areas. A good understanding of mathematics and statistics helps us understand our data and forms the basis of our analysis. Knowledge of machine learning techniques lets us apply advanced algorithms that can improve the quality and precision of our solution. Another important area of a Data Science project is visualization, which covers how our results are displayed to the end user. The results must be clear, easy to understand, and useful for whoever is using them.
What is Data Driven Performance?
The Data Driven Performance (DDP) initiative was born from the need to improve the quality of IBM products by using descriptive, predictive, and cognitive analytics during testing processes. It is being developed by volunteer IBMers from different areas of the Guadalajara campus. So far, performance testing in the GPFS team is done manually, and every tester uses their own methods to create the performance testing reports. In addition, the information is collected in wikis, which makes historical information hard to retrieve, so a lot of potentially important data is being wasted and even lost. The objective of DDP is to collect all the important data, centralize it in a single repository, and use it to produce results that will help the team make better decisions in the future. It also aims to help with GPFS configuration tasks by storing historical configuration information and suggesting new setups for particular configurations.
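To make the idea of centralizing results more concrete, here is a minimal sketch in Python, one of the languages we use in the project. It is not the actual DDP code: the record fields (tester, config, throughput_mb_s), the sample values, and the function names are all hypothetical and only illustrate how records collected from different testers could be summarized per configuration once they live in one place.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: field names and values are illustrative only,
# not real GPFS measurements or the real DDP schema.
RESULTS = [
    {"tester": "a", "config": "block_size=1M", "throughput_mb_s": 410.0},
    {"tester": "b", "config": "block_size=1M", "throughput_mb_s": 390.0},
    {"tester": "a", "config": "block_size=4M", "throughput_mb_s": 520.0},
    {"tester": "c", "config": "block_size=4M", "throughput_mb_s": 500.0},
]


def summarize(results):
    """Return the mean throughput per configuration (descriptive analytics)."""
    by_config = defaultdict(list)
    for record in results:
        by_config[record["config"]].append(record["throughput_mb_s"])
    return {config: mean(values) for config, values in by_config.items()}


def suggest_config(results):
    """Return the configuration with the highest mean throughput."""
    summary = summarize(results)
    return max(summary, key=summary.get)


if __name__ == "__main__":
    print(summarize(RESULTS))
    print(suggest_config(RESULTS))
```

Once every tester's results share one structure like this, a simple summary already answers questions that are hard to answer from scattered wiki pages, and the same data could later feed predictive models.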
Why Data Science in DDP?
As mentioned in the section above, DDP provides enough data to justify a Data Science project. Such a project will help the performance team in their daily work, and it will also help the project manager make better decisions and have a clearer picture of the performance status at any time. Besides that, it is a good opportunity for the people working on the project to get closely involved in the Data Science world, learning and applying new technologies, statistics, machine learning, and more.
What Technologies Have We Applied in the Project?
Some of the technologies we use in DDP are MongoDB, Cassandra, Python, Django, Spark, Scala, Py4J, and the necessary connectors between Python and the databases. The relationships defined among these components are shown in the following figure.
What is Next?
This is just the beginning of our journey. In future entries of this blog, other team members will describe more about this initiative, such as its architecture, technologies, and methodologies, as well as data science and big data in general, so stay tuned for future updates.
Ismael Solis Moreno
Silvana De Gyves Avila
Ayrton Didhier Mondragon Mejia
Gloria Eva Zagal Dominguez