My name is Ismael Solis. I am a data scientist and performance analyst for the IBM Spectrum Scale project, as well as an application architect for the Data-Driven Performance project. Over the last 8 years, I have gained experience in defining the foundations and architecture of Big Data and Analytics (BD&A) applications.
Previous blog entries presented tools and methodologies that can be used during the development of BD&A projects. In this entry, I discuss how to put these methodologies and tools together to define your first BD&A architecture, that is, to determine the components needed to provide the expected functionality and results.
Figure 1 illustrates the most typical components of a BD&A solution. This does not mean that every solution needs all of them; we can integrate only those that solve a particular problem. However, I show all the options to give a broad idea of the components we may need when architecting a BD&A application.
Figure 1. Common Elements in a BD&A Solution Architecture.
In general terms, any BD&A architecture needs to consider the following:
- Data Sources. Data sources are all the entities that generate data, such as third-party systems, machinery, sensors, and social networks, among others.
- ETL modules. Extract, Transform, Load (ETL) modules allow us to collect the data from the different sources and transform it into the formats the system needs to produce insightful information.
- Processing engines. Processing engines such as Spark or Hadoop are required at different stages: first during ETL, for cleaning, blending, and pre-processing the data, and later during the actual data processing and analysis.
- Data Storage. Data needs to be stored somewhere, and depending on the nature of the data and the objective of the analysis, the storage needs certain characteristics, not only at the hardware level but also at the Database Management System (DBMS) level. In most cases, a hybrid solution including SQL and NoSQL is needed.
- Data Analytics platform. To produce insightful data, our solution needs frameworks or tools to analyze the data. These include data browsers, statistical analysis tools, and machine learning libraries.
- Data Visualization framework. One of the most important stages is presenting the analytics to the end user. If the data is not presented correctly, it doesn’t matter whether you are using the best algorithm or the best data processing technology, because the end user will not understand it properly. Therefore, a strong visualization framework is always very important.
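As a rough illustration of how the components listed above fit together, the following is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the sample readings, the function names, and the use of an in-memory SQLite table as the storage layer. A real BD&A solution would replace each stage with the appropriate engine, database, and visualization framework.

```python
import json
import sqlite3
import statistics

# Hypothetical raw readings standing in for a data source (e.g. sensors).
RAW_EVENTS = [
    '{"sensor": "a", "temp_c": "21.5"}',
    '{"sensor": "b", "temp_c": "19.0"}',
    '{"sensor": "a", "temp_c": "bad-value"}',  # malformed reading
    '{"sensor": "a", "temp_c": "22.5"}',
]


def extract(lines):
    """ETL, extract step: parse each raw JSON line into a dict."""
    return [json.loads(line) for line in lines]


def transform(records):
    """ETL, transform step: keep only readings with a numeric temperature."""
    clean = []
    for rec in records:
        try:
            clean.append((rec["sensor"], float(rec["temp_c"])))
        except (KeyError, ValueError):
            continue  # skip records that cannot be cleaned
    return clean


def load(rows, conn):
    """ETL, load step: store the cleaned rows in a SQL table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def analyze(conn):
    """Analytics: compute the mean temperature per sensor."""
    by_sensor = {}
    for sensor, temp in conn.execute("SELECT sensor, temp_c FROM readings"):
        by_sensor.setdefault(sensor, []).append(temp)
    return {s: statistics.mean(v) for s, v in by_sensor.items()}


def visualize(summary):
    """Visualization: a plain-text report standing in for a dashboard."""
    return "\n".join(f"{s}: {m:.1f} C" for s, m in sorted(summary.items()))


def run_pipeline(raw=RAW_EVENTS):
    """Run source -> ETL -> storage -> analytics -> visualization end to end."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(raw)), conn)
    summary = analyze(conn)
    conn.close()
    return summary, visualize(summary)
```

Calling `run_pipeline()` returns both the per-sensor summary and the text report, so each stage can be swapped out independently as the solution grows.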
Keeping in mind the steps of the general methodology for Big Data Analytics, we need two things before we start defining any solution architecture:
- The first is to clearly understand the problem we want to solve. In other words, we need to plainly identify the analytics, statistics, or models we want to derive from the data. If we do not know what we are looking for, it will be like looking for a needle in a haystack without knowing what a needle looks like.
- The second is to clearly understand the data that we have in our hands. If we do, we will be able to decide whether the data is sufficient to estimate the analytics, statistics, or models that we have identified. Moreover, we will be able to find the appropriate ways, mechanisms, or technology to transform the data from its original state into what we are looking for. In simple words, if we do not understand the nature and characteristics of the data, we won’t be able to transform it into meaningful analytics.
With these two points, we have the beginning (an understanding of the raw data) and the end (the analytics or results to be delivered to our users) of the road. It then becomes easier to determine the intermediate points that we need to integrate into our solution by following the simple questions illustrated in Figure 2.
Figure 2. From raw data to insightful analytics.
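One simple way to begin understanding the data in hand is to profile a sample of it. The sketch below is only an illustration under stated assumptions: the function name `profile_records` and the idea of representing records as Python dictionaries are hypothetical, not part of any particular tool. It reports, for each field, how many records are missing it and which value types appear.

```python
from collections import Counter, defaultdict


def profile_records(records):
    """Summarize, per field, the number of missing values and the value types seen."""
    total = len(records)
    fields = set()
    present = Counter()            # field -> count of non-missing values
    types = defaultdict(Counter)   # field -> Counter of type names
    for rec in records:
        for field, value in rec.items():
            fields.add(field)
            if value is None:
                continue  # None is treated as a missing value
            present[field] += 1
            types[field][type(value).__name__] += 1
    return {
        field: {
            "missing": total - present[field],
            "types": dict(types[field]),
        }
        for field in sorted(fields)
    }
```

A field with many missing values or mixed types is an early sign that the data may not be sufficient, or that it needs extra cleaning, before the target analytics can be estimated.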
Considering the start and the end of this road, we need to start thinking about:
- How to collect the data given the characteristics of the data itself and the data sources?
- How to process the data given its size and the velocity of the required analytics?
- How to store the data based on the type of analysis to be conducted and the structure of the data?
- What algorithms do I need to clean, blend, and process the data as well as to build the expected data models?
- How to visualize the final results so that end users can interpret them easily through interactive interfaces such as dashboards, reports, and graphics?
To answer these questions, there are some major considerations:
First, what type of data do we have? Is it structured, semi-structured, or weakly structured? The answer helps identify the characteristics of the technology to use in our application, for example, the type of database used to store the data. SQL approaches and some NoSQL databases such as Cassandra are focused on structured or semi-structured data. Others, such as MongoDB and HBase, can be employed to host weakly structured data, including unformatted text.
Second, the size of the data and how fast it grows. Are we dealing with big data at rest or big data in motion? Understanding this helps determine the appropriate mechanisms to collect the data, the size of the storage, the framework to process it, and the algorithms to produce the results. For example, if we have big data in motion growing at a fast rate, a clustered storage system is preferable to a centralized monolithic approach because it reduces processing time, e.g. NoSQL along with clustered processing engines such as Spark or Hadoop.
Third, the type of analysis we are conducting. Are we doing asynchronous or synchronous analytics? With asynchronous analytics, we already have the data and the results are not needed immediately. With synchronous analytics, on the other hand, we need to produce the results in near real time. This requirement determines the processing engine; for example, batch processing cannot be used here. Also, some tools such as Tableau or IBM Cognos Analytics are designed to produce more interactive dashboards to observe patterns and trends in synchronous analysis.
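The three considerations above can be read as a small set of decision rules. The following Python sketch encodes them for illustration only. The names `DataProfile` and `suggest_architecture` are hypothetical, and the suggestions for data at rest and for asynchronous analytics are assumptions inferred from the discussion rather than firm recommendations; a real selection would also weigh cost, skills, and existing infrastructure.

```python
from dataclasses import dataclass

STRUCTURES = ("structured", "semi-structured", "weakly-structured")


@dataclass(frozen=True)
class DataProfile:
    structure: str     # one of STRUCTURES
    in_motion: bool    # True for fast-growing data in motion, False for data at rest
    synchronous: bool  # True when results are needed in near real time


def suggest_architecture(profile: DataProfile) -> dict:
    """Map the three considerations to coarse technology suggestions."""
    if profile.structure not in STRUCTURES:
        raise ValueError(f"unknown structure: {profile.structure!r}")

    # Consideration 1: the structure of the data drives the database family.
    if profile.structure == "weakly-structured":
        database = "NoSQL store for weakly structured data (e.g. MongoDB, HBase)"
    else:
        database = "SQL database or a NoSQL store such as Cassandra"

    # Consideration 2: size and growth rate drive the storage layout.
    if profile.in_motion:
        storage = "clustered storage"
    else:
        # Assumption: for data at rest, either layout may be adequate.
        storage = "centralized or clustered storage"

    # Consideration 3: the type of analysis drives the processing mode.
    if profile.synchronous:
        processing = "near-real-time processing (batch is not suitable)"
    else:
        # Assumption: results are not needed immediately, so batch can work.
        processing = "batch processing is acceptable"

    return {"database": database, "storage": storage, "processing": processing}
```

For example, `suggest_architecture(DataProfile("weakly-structured", True, True))` suggests a NoSQL store, clustered storage, and near-real-time processing, which matches the fast-growing, data-in-motion scenario described above.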
In summary, the two major considerations to keep in mind when creating a BD&A architecture are the analytics we want to obtain and the data we have available to produce them. Then, think about processing engines, storage, algorithms, and visualization frameworks. Review the different solutions on the market for each data transformation stage, considering the characteristics of your data: volume, velocity, and structure. Also, consider the type of analysis you need to conduct and select the appropriate tools.
In further blog entries, I will review the use of some BD&A tools in particular use cases as well as the considerations you need to make in order to select the best technology for transforming your data.
Thanks for your collaboration on this blog:
@SILVANA DE GYVES AVILA
@AYRTON DIDHIER MONDRAGON MEJIA
@GLORIA EVA ZAGAL DOMINGUEZ
@GONZALO SEBASTIAN AYALA MERCAPIDEZ