Editor’s note: This is the second in a two-part series on big data analytics. Part 1, "IBM Doubles Down on Big Data Analytics," was published in March. This series is based on content initially published in SHARE's President's Corner blog.
In 2012, the amount of global digital data generated is expected to reach about 2.7 zettabytes, a 48 percent increase from 2011. McKinsey Global Institute estimates that by 2009, nearly all sectors in the U.S. economy had an average of at least 200 terabytes of stored data per company with more than 1,000 employees. Many sectors had more than one petabyte of stored data per company. In response, the number of servers (virtual and physical) worldwide will grow 10-fold in the next decade.
This influx of data—both floating around the digital universe and stored by IT organizations—is dubbed big data. The challenge for enterprises can be boiled down to two words: Speed kills.
As David Corrigan, director of strategy for IBM’s InfoSphere portfolio, told ITBusinessEdge in a 2011 interview, “velocity” is one factor that defines big data: “By velocity, we’re talking about the pace at which the information is ingested, so streaming analytics is an example, the pace of huge volumes. You could call it batch, but these really are bursts of information.”
Another v-word defining big data, according to Corrigan, is “variety,” as he explained in the ITBusinessEdge article: “Big data isn’t just about volume. It equally has something to do with the variety of data. In other words, when you’re not just dealing with structured information or semi-structured information, but you get into text and content, video, audio and the need to analyze data from all of those different variety of sources to come up with an answer or to solve a particular use case.”
Data accumulates so quickly it’s difficult for IT organizations to not only maintain enough storage capacity but also keep pace with new architectures, technologies and methodologies springing forth to generate big data analytics. In sum, it’s increasingly difficult to determine appropriate strategies for analyzing—and gleaning value from—big data sets.
Piecing It Together
But there’s hope. Storage economics have changed with the times. The cost of storage has dropped substantially as processing power has sped up, and technologies, such as compression and deduplication, have shrunk capacity requirements. Combined, these developments have helped companies keep even big data sets manageable.
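To make the capacity savings concrete, here is a minimal sketch of block-level deduplication combined with compression. It is an illustration only, not any vendor's implementation: the function names (`dedup_and_compress`, `restore`, `stored_size`) and the fixed 4,096-byte block size are assumptions chosen for the example.

```python
import hashlib
import zlib


def dedup_and_compress(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one compressed copy of each unique block."""
    store = {}   # block hash -> compressed block contents
    recipe = []  # ordered list of hashes needed to rebuild the original data
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            # First time this block content is seen: store it once, compressed.
            store[digest] = zlib.compress(block)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original bytes from the block store and the recipe."""
    return b"".join(zlib.decompress(store[digest]) for digest in recipe)


def stored_size(store) -> int:
    """Total bytes occupied by the compressed unique blocks."""
    return sum(len(compressed) for compressed in store.values())
```

With highly repetitive input, such as the same two blocks written 50 times over, the store holds only two compressed blocks while the recipe still lets the original be rebuilt byte for byte. Real data is rarely that repetitive, so actual savings vary widely.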
At the same time, technology for extracting value from big data has evolved. According to a recent report from The Data Warehousing Institute (TDWI), the emerging category of big data analytics has developed to encompass a collection of techniques and tools enterprises can use to handle enormous data volumes. This tool set includes predictive analytics, data mining, statistics, artificial intelligence and natural language processing.
The IBM Entity Analytics group, for example, develops the InfoSphere Identity Insight Solutions that enable streaming analytics utilizing sets with potentially billions of rows of data—in real time with sub-millisecond decisions. In part, the solutions accomplish this feat by counting “entities” and determining those that are the same. IBM Distinguished Engineer Jeff Jonas, chief scientist of the IBM Entity Analytics group, explains:
“Imagine a giant pile of puzzle pieces—giant—with different colors, sizes, shapes … and you don’t know if there are duplicates, if there are pieces missing or if it’s one puzzle or fifty puzzles,” he says. “We call that big data. What we do in Entity Analytics is we take each puzzle piece and see how it relates to each other. When you do that, it ends up getting this much richer understanding and it allows you to make higher quality decisions. The advantage of big data is when you blend together the blue, green, yellow, magenta puzzle pieces. Then, the quality of your understanding is so much better and your decisions start to get really smart.”
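The idea of deciding which records describe the same entity can be sketched in a few lines. The example below is a simplified illustration and not how IBM's products work: it assumes records are plain dictionaries with optional `email` and `phone` fields, and it treats two records as the same entity whenever they share either identifier after a deliberately loose normalization. The function names are hypothetical.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase the value and drop everything except letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def resolve_entities(records):
    """Group records that share a normalized email or phone number.

    Returns a list of sets of record indexes, one set per resolved entity.
    """
    parent = list(range(len(records)))  # union-find forest over record indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}  # (field, normalized value) -> index of first record with it
    for idx, record in enumerate(records):
        for field in ("email", "phone"):
            value = record.get(field)
            if not value:
                continue
            key = (field, normalize(value))
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(set)
    for idx in range(len(records)):
        groups[find(idx)].add(idx)
    return list(groups.values())
```

Because matches are transitive, a record that shares an email with one record and a phone number with another pulls all three into the same entity, which is the "puzzle pieces fitting together" effect Jonas describes. Production systems use far richer matching rules than this exact-match sketch.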
And there’s Apache Hadoop, an open-source programming framework that supports the processing of massive data sets in distributed computing environments. While the platform hasn’t yet achieved wide-scale adoption, the TDWI report found that 24 percent of IT organizations surveyed are using Hadoop.
Why the buzz about Hadoop? Unstructured data. Studies estimate as much as 90 percent of data being generated in the digital universe is unstructured. It comes from diverse sources that continue to multiply—sensors, devices, Web applications, images, voice, video surveillance and social media. Hadoop breaks down not just volumes of data for query, but also a wide variety of data types. Companies such as IBM and Cloudera are developing commercial tools and services that sit on top of Hadoop.
“It’s a new idea,” said Cloudera CEO Mike Olson in a recent YouTube interview conducted by tech blogger Robert Scoble. “You can ask a question in a reasonable time that touches every single byte and terabyte, not just touches it, but manhandles it…. We’ve never been able to solve problems like that [by thinking in scale], so we don’t even think of questions like that. Thinking in that way is a new skill.”
This means the key to unlocking the value of big data isn’t just having technology to handle the volume. Analytical thinking about big data must also evolve before organizations can design meaningful analytics around it.
“We have to put it all in perspective,” says Neil Raden, vice president and principal analyst at Constellation Research Group, who focuses on analytics and business intelligence (BI). “From 2000 to 2011 the amount of data IT organizations are handling has shown exponential growth. What’s really happened in response is we’ve worked with a set of technologies for exponentially expanding data [capabilities] and we’ve taken them pretty far.”
“The real question is … What are you going to do with it?”
Two business trends are changing the way big analytics are derived from big data—and both have more to do with dollars than zettabytes. The first trend is the global economic recession, which, in many cases, has forced companies to re-evaluate the way they do business and reengineer processes based on the bottom line. That means understanding what customers want—or are saying they want—and being able to respond to those needs on the fly. The second trend is risk mitigation. Big data analysis helps organizations cope with each.
A number of analytical approaches—and technologies—help companies react quickly to business changes, from streaming and predictive analytics to data visualization and sentiment analysis. And their uses span vertical markets.
Streaming analytics essentially is the ability to analyze data in real time. Examples include location information or sensor data, where companies need to react fast to changing scenarios. Last year, Yahoo! open-sourced its S4 platform for developing real-time MapReduce applications. (Developed initially by Google, MapReduce pushes code down to data in Hadoop for analysis.)
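For readers unfamiliar with the MapReduce model, the following sketch shows its three stages (map, shuffle, reduce) as a word count running in a single Python process. It does not use the Hadoop or S4 APIs, and the function names are illustrative assumptions. In a real cluster the map and reduce functions would run in parallel on the machines that hold the data.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["big data", "Big analytics"])` counts "big" twice and the other two words once each. The appeal of the model is that each stage touches only its own slice of the data, which is why it scales across many machines.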
Streaming analytics also has huge potential in healthcare. The University of Ontario Institute of Technology, for example, has been working with IBM to detect changes in streams of real-time data that measure respiration, heart rate and blood pressure. The analytics can be applied to models to compare the differences and similarities of diverse populations of premature babies. The results can be used to tune rules that alert specialists in neonatal intensive care units when symptoms occur in real time.
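A rule that reacts to changes in a stream can be as simple as comparing each new reading with a short trailing window. The sketch below is a generic illustration under stated assumptions, not the method used in the project described above: the window size, the three-standard-deviation threshold and the function name `detect_changes` are all hypothetical choices, and the numbers in the usage note are invented sample values rather than clinical data.

```python
from collections import deque
from statistics import mean, stdev


def detect_changes(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings that deviate sharply from recent history.

    A reading is flagged when it lies more than `threshold` sample standard
    deviations from the mean of the previous `window` readings.
    """
    recent = deque(maxlen=window)  # trailing window of the latest readings
    for index, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # A perfectly flat window has sigma == 0; this sketch skips it
            # rather than flagging every tiny change.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield index, value
        recent.append(value)
```

Because the function is a generator that keeps only the last few readings in memory, it can consume an unbounded stream one value at a time. Fed the made-up sequence 140, 142, 141, 139, 140, 141, 180, 141, it flags only the jump to 180.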
Predictive analytics are not a new concept and sometimes, as is the case with IBM’s InfoSphere Identity Insight Solutions, overlap with streaming analytics. In the mid-1990s, IBM developed a statistics-based program to help the NBA be more predictive about how players play a basketball game. What’s changed with big data analytics is the ability to explore different data types to look at trends, patterns and deviations to predict the probability of outcomes. SAS Institute and IBM are two vendors developing predictive capabilities for big data analytics.
In October 2011, IBM announced new predictive analytics software, SPSS Statistics 20.0, with a mapping feature that can be used across industries for marketing campaigns, retail store allocation, crime prevention and academic assessment. At the same time, the company announced its IBM Content and Predictive Analytics for Healthcare capability that uses natural language processing to help doctors and healthcare professionals advance diagnosis and treatment by understanding the relationships buried in large volumes of clinical and operational data.
Visualizing Big Data
In its “Big Data Analytics” report, TDWI found advanced data visualization (ADV) to have the strongest potential among options for big data analytics. While 20 percent of respondents said they are currently using ADV tools, 47 percent said they will be using them in three years. More importantly, 58 percent were committed to implementing ADV at some point in the future. A number of smaller companies have sprung up—Yellowfin, Tableau, Tibco Spotfire—with ADV products that are simple and visually powerful.
Cornell University began using Tableau’s visually based analytics software nearly five years ago as a reporting tool that, initially, would allow college deans to keep better track of key performance indicators. Today more than 600 employees use it to do all manner of analysis, from dissecting the student applicant pool, evaluating risk and analyzing university expenditures to visualizing faculty salary statistics, keeping track of which students are in what classes and managing contributor relations.
Renee Boucher Ferguson has more than 10 years of experience covering enterprise tech, including ERP, data analytics, cloud computing, infrastructure, storage and mobile platforms, as both a technology journalist and senior research analyst.