Why Do We Need a Data Science Methodology and Data Life-Cycle Management?

By GLORIA EVA ZAGAL DOMINGUEZ posted Wed June 17, 2020 10:22 AM

  

I am @GLORIA EVA ZAGAL DOMINGUEZ, and I work for IBM. I have been involved in business analysis projects for insurance, banking, and retail. My interest in data science arose during my undergraduate studies, especially through neural networks and cryptography; these two subjects boosted my interest in the area. During my studies, I dedicated myself to research topics in data science. For this reason, I joined the DDP initiative, which is a great opportunity to gain skills and learn from the experience of other IBMers. I am currently focused on IBM's data science path, mainly working with DB2-related technologies and MySQL, and using Scrum processes for product development and best practices.

In our previous entry, we established our understanding of data science and briefly introduced DDP. This time, I will discuss data science methodologies and the data life-cycle.

Ask Yourself: Do You Need to Follow a Methodology If It Is Only About Analyzing the Information?

In our daily life, we carry out data analysis for every action we want to perform: from simple things, such as planning our activities for the next day, to more complex decisions, such as taking on a new professional project and defining an action plan. Sometimes we act based on what we observe, or even adapt to fit in and engage with the execution of our activities, but... is this leading us to meet our target?

When I started this adventure, I noticed that even with so much information available, the analysis was sometimes poor. This led to poor interpretations of the data, and from time to time I became confused. Time passed and I could not understand what the failure was. Gradually I came to understand that when there is no planning and we improvise, the outcome won't be something that adds value.

This prompted me to ask myself a number of questions: What would be the right way to work with data? Is there a specific way to interpret that information? Should the results always be favorable? What if the result is not the one expected by the interested parties? After so many questions, I realized that interpreting data helps you find the right solutions to problems, and that it is very important to be clear about the goal from the very beginning.

This is a great challenge for any data scientist: to put themselves in the shoes of other people and to understand the context and the problems from various perspectives, whether technological, business, or social. In this hard work, the data scientist traces the decisions that have been taken, begins to analyze the environment, and from that point starts to work with a methodology.

But you must be wondering: why does all of this help me if I just want to be a data scientist?

To start the path in this world of data science, we will need to follow a methodology that helps us handle the data in our hands and produce insights from it.

What Is the Standard Model that a Data Scientist Follows?

Let’s make an analogy. As an explorer, you must consider that you need to eat, and in order to eat, the first thing you need is food. As a data scientist, to be able to move forward, you need data. Imagine that you are in the forest, alone and surrounded by trees of various species. You have no context of anything, and the only things you have are an appetite and the need to look for food. The first thing you can do is explore the place and start harvesting different fruits. In a similar way, a data scientist begins to collect information from various sources, but these sources can be in different formats. Sources are broadly divided into three categories:

  • Structured data (e.g., CSV files, relational databases).
  • Unstructured data (e.g., e-mails, word processor files, PDF files, images, videos, audio, etc.).
  • Semi-structured data (e.g., HTML, XML, or JSON).

Then, what happens if you don't know how to handle those formats? Once you have your data set, before you panic, consider that you will need to work in different ways and use different tools to deal with this heterogeneous data set and produce useful information.
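To make this concrete, here is a minimal sketch in Python with pandas, using hypothetical file names, of how data from each of the three categories might be brought into a workable form:

```python
import json
import pandas as pd

# Structured data: a CSV export or a relational table (hypothetical file name).
sales = pd.read_csv("sales.csv")

# Semi-structured data: JSON, with nested fields flattened into columns.
with open("events.json", encoding="utf-8") as fh:
    events = pd.json_normalize(json.load(fh))

# Unstructured data: free text, kept as raw strings for later processing.
with open("support_emails.txt", encoding="utf-8") as fh:
    emails = fh.read().splitlines()

print(sales.shape, events.shape, len(emails))
```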

You should note that data sets contain values that may be out of order, or even include null values, erroneous or corrupt data, outliers, and so on. This keeps us away from the ideal scenario, by which I mean executing the analysis without any complications, with all of the data organized, well structured, and clean.

Therefore, it will be necessary to process the information, that is, to prepare your data. This usually produces a smaller set, which can improve the efficiency of your process. The objective is to have a set of clean data that we can use for analysis without modifying the original data obtained. This step is important and requires knowledge of the analysis context to make the appropriate decisions, since cleaning the information is not just cleaning for cleaning's sake. Taking the example of the fruit, here you would have to remove the impurities, the shells, and so on.
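As a small illustration, the sketch below (pandas, with hypothetical file and column names) cleans a copy of the raw data so that the original stays untouched:

```python
import pandas as pd

raw = pd.read_csv("sales.csv")   # hypothetical raw extract
clean = raw.copy()               # work on a copy; the original data stays as obtained

clean = clean.drop_duplicates()                                    # repeated records
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # corrupt values become NaN
clean = clean.dropna(subset=["amount", "date"])                    # rows missing key fields

# Simple outlier rule: keep amounts within 3 standard deviations of the mean.
mean, std = clean["amount"].mean(), clean["amount"].std()
clean = clean[(clean["amount"] - mean).abs() <= 3 * std]
```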

After we have performed the cleaning, we are ready to start sorting the data. Here we store it in a consistent way that matches its semantics. By semantics, I mean the meaning of the data in the context of the situation or the expected results. Because of this, it is very important to be clear about the expected outcome of your analysis from the beginning of the process. It is simple: if you know what you want, it will be easier to prepare the data in a way that produces the expected outcome.
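A small sketch of what this could look like in practice (the file and column names are assumptions): clear column names, proper types, and a sort order that matches how the results will be read, stored separately from the original data.

```python
import pandas as pd

clean = pd.read_csv("sales_clean.csv")  # hypothetical output of the cleaning step

# Give the columns names and types that carry their meaning.
clean = clean.rename(columns={"date": "sale_date", "amount": "sale_amount_usd"})
clean["sale_date"] = pd.to_datetime(clean["sale_date"])

# Sort consistently with the way the analysis will read the data.
clean = clean.sort_values(["sale_date", "sale_amount_usd"]).reset_index(drop=True)

# Store the prepared set separately, leaving the original file untouched.
clean.to_csv("sales_prepared.csv", index=False)
```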

Once you have stored your fruit and sorted it by color, species, or size, we are going to transform it. In our analogy, this means producing meaningful information from the data through the calculation of indicators, trends, alerts, models, and any other type of analytic. This is the analysis itself: taking the whole, breaking it into parts, and producing something useful.
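As an example of this step, the sketch below (pandas, with the hypothetical prepared set from above) derives a monthly indicator, a short-term trend, and a simple alert:

```python
import pandas as pd

sales = pd.read_csv("sales_prepared.csv", parse_dates=["sale_date"])  # hypothetical prepared set

# Indicator: total sales per month.
sales["month"] = sales["sale_date"].dt.to_period("M")
monthly = sales.groupby("month")["sale_amount_usd"].sum().to_frame("monthly_total")

# Trend: 3-month rolling average of the monthly totals.
monthly["trend_3m"] = monthly["monthly_total"].rolling(3).mean()

# Alert: flag months that fall well below the recent trend.
monthly["alert"] = monthly["monthly_total"] < 0.8 * monthly["trend_3m"]

print(monthly.tail())
```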

Then, the outcome of our analysis needs to be visualized. With the example of the fruit in mind, consider the following: what would you do if you saw something weird in the meal you had prepared with the fruit? Would you eat it?

The same goes for visualizing the analytics. Visualization is the “cherry on the cake”, the “tip of the iceberg”, where we are able to appreciate the result of our analysis and make decisions. This is why visualization becomes very important: a bad or incorrect presentation of our results will lead to misunderstanding and bad decision-making.
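To give an idea of this stage, here is a minimal plotting sketch (matplotlib, using the hypothetical monthly indicators from the previous step): a labelled chart that lets the reader judge the result at a glance.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical indicators produced in the transformation step.
monthly = pd.read_csv("monthly_indicators.csv", parse_dates=["month"])

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(monthly["month"], monthly["monthly_total"], label="Monthly total")
ax.plot(monthly["month"], monthly["trend_3m"], label="3-month trend")
ax.set_xlabel("Month")
ax.set_ylabel("Sales (USD)")
ax.set_title("Monthly sales vs. short-term trend")
ax.legend()
plt.tight_layout()
plt.show()
```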

A little bit of patience, we're almost done!

Finally, we communicate the results obtained. That is, we tell the story of what we went through: the challenges we faced, the stumbles, the successes, and above all, the lessons that helped us find the right solutions.

The following diagram illustrates in general terms the broad path that a data scientist walks to produce meaningful information:

 

[Image: general data science methodology, from data collection to communication of results]

 

If you are curious to know more about the subject, there are implementations of this general methodology for data exploration and data mining that you can explore further. These implementations have particular variations and are supported by toolsets, but in general they follow the stages described in this entry. Some examples are:

Also, if you are interested, here you can find the data science methodology course suggested by IBM for the data scientist career path:

 

Data Life-Cycle

As you can see, throughout the stages of the analysis methodology, data passes through different states, from “raw data” to “useful information”. Taking into account the rapid growth and complexity of data, methodologies that can be adapted to particular scenarios are required. It is worth mentioning that there is no single recipe that applies to all projects, so data scientists must determine the type of flow that best fits each specific case of data analysis. This will mostly be based on the characteristics of the data and the nature of the expected results.

In the following schema you can see the difference between the traditional data life-cycle and the Big Data life-cycle, which is mostly defined by data characteristics and analysis requirements. In the case of Big Data, due to characteristics such as volume, variety, velocity, and veracity, the data stages are extremely dynamic and can change very quickly in comparison to traditional data analysis:

[Image: traditional data life-cycle vs. Big Data life-cycle]
 

 

How Does DDP Carry out the Methodology and the Data Cycle?

As mentioned earlier, the Data Driven Performance (DDP) initiative uses large data sets to create information following the analytics methodology described above. We work with large amounts of data in various formats, and we have built various types of integrations to collect, store, analyze, and visualize the data produced during the performance test phases of IBM products. Thus, it helps with decision-making and with improving the development and testing processes.

 

Keep it up!

Maybe you will be our next Data Scientist!!


#GlobalAIandDataScience
#GlobalDataScience