By 2020, IDC estimates there will be 35 zettabytes (ZBs) of data, a staggering increase from the current estimated 1.2 ZBs. Let’s put that in perspective: That adds up to roughly 28,200 U.S. Library of Congress print collections, which are estimated at 10 TBs each.
Big data is a worldwide phenomenon that touches everyone who uses a cell phone, searches the Web, trades on Wall Street, formats a report, streams video, or types on a computer.
“Every click of ours is generating data,” says Anjul Bhambhri, IBM’s vice president of big data products. “When the Internet boom started, I don’t think at that time people anticipated that so much unstructured data was going to be created.”
We all contribute to the explosion of big data, loosely defined as datasets that grow beyond the ability of run-of-the-mill database tools to handle. “In a digitized world, consumers going about their day—communicating, browsing, buying, sharing, searching—create their own enormous trails of data,” states a March 2011 McKinsey Global Institute report, “Big Data: The Next Frontier for Innovation, Competition and Productivity.”
The number of servers deployed around the world grew six-fold in the past decade to 32.6 million worldwide, according to Dr. Gururaj Rao, IBM Fellow, Systems and Technology Group. Storage grew 69 percent in the same time period, Rao said during a SHARE conference in August 2011. Meanwhile, he noted, the number of Internet-connected devices is growing at a 42-percent yearly clip.
Big Deal, Indeed
Big data poses a seemingly insurmountable challenge for enterprises in a gamut of industries—retail, finance, healthcare, manufacturing, communications and government—to make sense of the growing volumes of information they produce. According to IDC, most of the data—80 percent or so—is unstructured, which complicates the ability to store, mine, analyze and act upon it.
But let’s say you could do all that efficiently. What would be the benefit? What is the big deal about big data? Benefits range from the mundane to seemingly pie-in-the-sky scenarios:
• Better targeting of consumer products
• Improving traffic flow and urban planning
• Catching and fixing potentially dangerous automobile flaws
• Preventing credit card fraud
• Predicting infection in at-risk newborns
• Saving lives
Enterprises would develop better, safer products they can more precisely target to customers. The healthcare industry could use big data to boost efficiency and quality while reducing costs by 8 percent, according to the aforementioned McKinsey report. Retailers could boost operating margins by more than 60 percent.
Robert Rosen, a former SHARE president currently working in the government, says big data analysis led to a recent Volkswagen recall of nearly 170,000 diesel vehicles over potentially faulty fuel lines. The problem was identified by analyzing vehicle-sensor and repair data. “There’s an example of extracting information from lots of unstructured data,” he says.
Where to Store It All
The benefits of big data analysis are seemingly endless, but it poses some big challenges. Organizations have to figure out where to store it all and implement recovery policies and technology. Industries such as healthcare, law and finance must archive certain types of digital information and have it accessible for recovery in case of legal disputes, audits and data loss.
With data growing at a projected 44 percent clip, it would take millions of storage systems to handle it all. According to the McKinsey report, the United States in 2010 had 16 exabytes (EBs) of storage capacity, while Europe had 11. Combined, they could store only a fraction of the current 1.2 ZBs of data, since 1 ZB equals 1,024 EBs.
“We are not going to have enough disks to store all this data,” says Rosen.
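To see how small that fraction is, the figures quoted above can be run through a quick calculation. This is only a rough sketch: it takes the 16 EB, 11 EB and 1.2 ZB estimates at face value and uses the 1,024 EBs-per-ZB conversion given in the text; the function and variable names are illustrative, not from any report.

```python
# Figures as quoted in the article (McKinsey and IDC estimates).
US_CAPACITY_EB = 16       # U.S. storage capacity in 2010, in exabytes
EUROPE_CAPACITY_EB = 11   # European storage capacity in 2010, in exabytes
CURRENT_DATA_ZB = 1.2     # estimated current data volume, in zettabytes
EB_PER_ZB = 1024          # conversion used in the article


def combined_share(us_eb, europe_eb, data_zb, eb_per_zb=EB_PER_ZB):
    """Return the fraction of the data volume the combined capacity could hold."""
    combined_eb = us_eb + europe_eb
    data_eb = data_zb * eb_per_zb
    return combined_eb / data_eb


share = combined_share(US_CAPACITY_EB, EUROPE_CAPACITY_EB, CURRENT_DATA_ZB)
print(f"Combined capacity: {US_CAPACITY_EB + EUROPE_CAPACITY_EB} EBs")
print(f"Share of the 1.2 ZBs that would fit: {share:.1%}")
```

Under these assumptions, the two regions together could hold only a little over 2 percent of the estimated total.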
Vendors such as IBM, Samsung and GE Global are working hard on developing new technology. Be it laser-based, crystal disks, atomic holographic nanotechnology or something we don’t know about yet, the future of storage technology is critical to our ability to collect, organize and analyze big data. You can increase disk density by only so much, and once we reach the limit, says Rosen, “we’ll need something new.”
A Fine Balance
Another issue big data creates centers on privacy. “Personal data such as health and financial records are often those that can offer the most significant human benefits, such as helping to pinpoint the right medical treatment or the most appropriate financial product,” according to the McKinsey report.
It will take a fine balance, however, to ensure data from medical, financial, human resources and legal records isn’t exposed. Much of that data travels over the public Internet, which means securing it from a technology standpoint is critical. But that’s only part of the challenge: Actually accessing private data for analysis can itself be problematic, and requires thoughtful policy-making.
“Policy makers need to recognize the potential of harnessing big data to unleash the next wave of growth in their economies. They need to provide the institutional framework to allow companies to easily create value out of data while protecting the privacy of citizens and providing data security,” according to the McKinsey report.
Needles in Haystacks
Extracting actionable information from the growing morass of unstructured data is like finding needles in haystacks, Bhambhri says. It’s not easy to identify nuggets collected from laptops, databases, medical devices, smartphones, RFID tags and GPS devices, to name a few, for real-time insights and to spot historical patterns for long-term benefits.
Rosen says enterprises are just beginning to understand the magnitude of the big data challenge. And even though agencies such as NASA have worked on it for decades, he says, “we are still in the early stages.”
IBM is working with enterprises in various industries through its Smarter Planet initiative to collect, analyze and make data actionable. Bhambhri offers several examples:
• Utilities are using analytics on data collected from sensors to prevent malfunctions.
• Credit card companies are analyzing use patterns to spot signs of fraud.
• Marketers are collecting social media data to target their promotions.
In Dublin, Ireland, an IBM InfoSphere Streams project has been collecting traffic data from buses and sensors at intersections. With 4,000 detectors in place in a road system with 700 intersections, the project is receiving 20,000 data records per minute, a pace of more than 300 per second. A thousand Dublin buses engaged in the project are sending 3,000 GPS readings per minute, a rate of 50 per second. On average, each bus sends location data every 20 seconds.
Why so much info from the streets of Dublin? One benefit the project aims to deliver is a system that triggers traffic signals to give any bus that approaches an intersection a green light. In addition to trimming operational costs for diesel fuel and electricity consumed while idling, the project could boost ridership, as citizens opt for timely bus routes to avoid traffic jams.
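The Dublin rates quoted above are easy to cross-check. The short sketch below converts the per-minute figures to per-second rates and confirms that 1,000 buses producing 3,000 GPS readings a minute works out to one reading per bus every 20 seconds. The numbers are the article's; the helper name is illustrative.

```python
def per_second(per_minute):
    """Convert a per-minute count to a per-second rate."""
    return per_minute / 60


# Figures as quoted for the Dublin project.
DETECTOR_RECORDS_PER_MINUTE = 20_000
GPS_READINGS_PER_MINUTE = 3_000
BUSES = 1_000

detector_rate = per_second(DETECTOR_RECORDS_PER_MINUTE)
gps_rate = per_second(GPS_READINGS_PER_MINUTE)

# If readings are spread evenly across the fleet, each bus reports once
# every (buses / readings per second) seconds.
seconds_between_readings = BUSES / gps_rate

print(f"Detector records per second: {detector_rate:.0f}")
print(f"GPS readings per second: {gps_rate:.0f}")
print(f"Seconds between readings per bus: {seconds_between_readings:.0f}")
```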
Meanwhile in Canada, Project Artemis, a collaboration of IBM, the University of Ontario Institute of Technology and Toronto’s Hospital for Sick Children, is collecting data from bedside devices and notes from doctors and nurses to help newborns. The goal is to use data to spot potential signs of life-threatening infection 24 hours in advance.
“Close to 200 pieces of information get generated per second for every baby,” Bhambhri says. It is humanly impossible to properly analyze all this information without technology. Project Artemis uses IBM InfoSphere Streams, a new processing architecture that employs targeted algorithms, to give doctors information in near-real time to make potentially life-saving decisions.
InfoSphere Streams is part of the IBM BigInsights Enterprise Edition analytics platform, which enables rapid, large-scale analysis of diverse data. Built on the open-source Apache Hadoop platform, BigInsights supports unstructured and structured data.
IBM is taking a leadership position in big data with its Smarter Computing initiative. Other companies, such as SAP, Oracle and Google, also have big data initiatives. Microsoft, meanwhile, wants users to view its upcoming SQL Server 2012 release as a platform to help them unlock insights from big data.
SHARE’s Peer Guidance
Along with IT companies, SHARE is taking an active role in helping enterprises with big data by sponsoring conferences and publishing educational materials on the topic. Rao’s Smarter Computing presentation, for instance, was delivered at the organization’s August 2011 conference in Orlando.
SHARE is an independent association with membership from companies large and small in industries such as finance, insurance, manufacturing, retail and utilities, as well as universities and colleges, government organizations and consultants. Its mission is to provide enterprise IT professionals with continuous education and training, and facilitate peer networking.
SHARE is a valuable resource for IT professionals trying to tackle the big data challenge, says Rosen. There’s a lot of information on the subject in journals and case studies, but it has limitations because typically those sources cover only the successes, he adds.
To find out about failures so they can avoid them, IT professionals have to rely on each other, and that is an important role SHARE fulfills as a peer group. “Telling me what works is great, but telling me what doesn’t is even greater to make sure I don’t go down that path,” says Rosen.
One of the paths to taming big data is through the mainframe. The mainframe’s bulk-processing capacity at high rates of speed presents a “real solution” for big data, says Rosen. It has a central role in helping enterprises keep track of all their data.
“What we are seeing with our customers with mainframes,” says Bhambhri, “is they want to look at the data from the instance that it enters the enterprise.” To help those customers, IBM has developed the zEnterprise BladeCenter Extension (zBX), which extends System z mainframe management capabilities across the vendor’s server platform and connects with the large-scale DB2 database for business analytics.
“Our customers that have invested in the mainframe and have workloads on mainframe don’t have to change anything,” Bhambhri says. “What we are providing is something that helps extend their data platform. We are providing capabilities that allow them to analyze large volumes of data.”
Enterprises can run workloads on the mainframe while extracting information for analysis on the BigInsights platform. Organizations don’t necessarily know what they will find out from the data, says Bhambhri, but once they run it through the analytics programs, they are sure to find actionable information useful to the business.
Of course, many companies have no mainframes, but still process large amounts of data. For them, cloud-based computing resources present a way to gain big data insights while keeping costs down. Bhambhri says several are “kicking the tires” in the cloud.
McKinsey says cloud computing knocks down technology barriers and reduces costs, and it lets organizations collaborate with partners and customers on business functions such as R&D, marketing and customer support. In a big data context, you could envision different parties working together to gain and share mutually beneficial insights.
In a June 20, 2011, press release IDC predicted: “Cloud computing will continue to reshape the IT landscape over the next five years as spending on public IT cloud services expands at a compound annual growth rate of 27.6 percent, from $21.5 billion in 2010 to $72.9 billion in 2015. But the impact of cloud services will extend well beyond IT.”
Real-time big data analysis will help drive this growth.
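For readers who want to check the IDC growth figure, a compound annual growth rate ties the starting value, ending value and number of years together as end = start × (1 + rate)^years. The sketch below applies that standard formula to the quoted numbers; the function names are illustrative, and small differences from the published 27.6 percent are to be expected because the dollar figures are rounded.

```python
def cagr(start, end, years):
    """Compound annual growth rate implied by a start value, end value and period."""
    return (end / start) ** (1 / years) - 1


def project(start, rate, years):
    """Value reached after compounding `start` at `rate` for `years` years."""
    return start * (1 + rate) ** years


# Figures as quoted from the IDC press release (billions of U.S. dollars).
SPENDING_2010 = 21.5
SPENDING_2015 = 72.9
QUOTED_RATE = 0.276
YEARS = 5

implied_rate = cagr(SPENDING_2010, SPENDING_2015, YEARS)
projected_2015 = project(SPENDING_2010, QUOTED_RATE, YEARS)

print(f"Growth rate implied by the dollar figures: {implied_rate:.2%}")
print(f"2015 spending implied by the quoted rate: ${projected_2015:.1f} billion")
```

Both results land close to the published figures, so the quoted rate and dollar amounts are consistent with each other.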
Rosen says both the government and private enterprise are seriously looking at how cloud computing can help with big data. Security and reliability, however, remain a concern. “And sometimes the cloud is really complicated because you have to be concerned in some cases where the data is being stored,” he says.
What to Do
Realizing big data’s vast potential will require organizations to understand the value of the large volumes of data they generate and process, and act upon it. On the technology side, innovation is needed to address storage and security challenges, and to continue to fine-tune analytics to make sense of the data. Policy makers also play a critical role in devising strategies that facilitate analysis of big data while protecting the privacy of personal data.
Education, of course, is fundamental. To that end, attendees at SHARE’s next semi-annual event, scheduled for March 2012, are sure to receive plenty of actionable information to help them meet the big data challenge.