Simply put, Apache Spark
is a cluster-based computing framework that was developed in 2009 and became open source in 2010.
In conversations about Spark, you’ll often hear the phrase, “Spark is the operating system for analytics.” Just as Linux is well matched to the mainframe, Spark is equally well suited to both z/OS and Linux. It’s an open-source application framework for performing repetitive analysis of very large volumes of data.
Keep Data in Place
Spark allows you to locate analytics where your data resides, and less data movement means less cost and risk. Historically, vast amounts of data had to be moved to where the processing would take place. Those high volumes of data typically resided in a z Systems environment because of their high value and highly secure nature.
By leveraging Spark’s consistent interfaces together with its rich analytics libraries, programmers can build high-level analytic applications across a wide variety of IT environments. The goal is to use the data in place and federate the analytical processing to best fit the environment. Depending on the type of analytic processing required, Spark offers both batch and real-time solutions.
Another important aspect of Spark that shouldn’t be overlooked is its cost savings. Because Spark runs on the JVM, z Systems transactions, customer applications and IBM solutions that use it can exploit zIIP-eligible MIPS, lowering the overall cost of processing. Adding to this, the IBM z13 system supports up to 10 TB of memory, enabling incredible performance.
Apache Spark completely separates its programmatic interfaces from the Spark Core functions beneath them. This makes it possible to enhance and optimize the layers running below the interfaces, which means enhanced capabilities in security (z/OS System Authorization Facility interfaces); compression technology; single instruction, multiple data (SIMD) optimizations; and improved resource management.
The Spark stack is made up of several components:
- Spark Core and resilient distributed datasets (RDDs)
- Spark SQL
- Spark Streaming
- MLlib, the machine learning library
- GraphX, a distributed graph-processing facility that runs on top of Spark, with APIs for easy access
The Spark Core component, including RDDs, is where it all begins: I/O, task scheduling and dispatching all happen here. RDDs are read-only collections of records, held in memory or on disk, that can be processed by the analytic solution.
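To make the RDD style concrete, here is a minimal sketch in plain Python. This is illustrative only: it mimics the chained, read-only map/filter/reduce transformations that Spark applies to RDDs, but real Spark code would create the RDD with `SparkContext.parallelize` and run the work distributed across a cluster.

```python
from functools import reduce

# Stand-in for an RDD loaded from storage: a read-only collection of records.
records = [3, 7, 1, 9, 4, 12]

# Each step produces a new collection rather than mutating the old one,
# mirroring how Spark transformations yield new (immutable) RDDs.
doubled = list(map(lambda x: x * 2, records))   # map: transform every record
big = list(filter(lambda x: x > 10, doubled))   # filter: keep matching records
total = reduce(lambda a, b: a + b, big)         # reduce: aggregate to one result

print(total)  # → 56
```

In actual Spark, the map and filter steps are lazy and nothing executes until an action such as the final reduce is called, which lets the engine optimize the whole chain at once.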
Spark SQL is a feature that runs on top of Spark Core to provide further use of RDDs, and it can be used to access structured and semi-structured data. Spark Streaming is just that: streaming data in for operations on RDDs. MLlib, the machine learning framework, is very important to the overall Spark picture, as it implements common machine learning algorithms such as regressions, decision trees and correlations, among others. Finally, GraphX is a graph-processing engine with APIs for building and analyzing graphs.
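The kind of structured query Spark SQL supports can be sketched with Python’s built-in sqlite3 module standing in for a Spark session. This is an assumption-laden analogy, not Spark code: in real Spark you would run the same SQL text through `spark.sql(...)` against a DataFrame, distributed across the cluster.

```python
import sqlite3

# sqlite3 stands in for a Spark session; the SQL itself is the point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("A", 120.0), ("B", 75.5), ("A", 30.0), ("C", 210.0)],
)

# Aggregate per account, exactly the shape of query Spark SQL runs over
# structured data held in DataFrames/RDDs.
rows = conn.execute(
    "SELECT account, SUM(amount) FROM transactions "
    "GROUP BY account ORDER BY account"
).fetchall()
print(rows)  # → [('A', 150.0), ('B', 75.5), ('C', 210.0)]
```

The value of Spark SQL is that this familiar declarative style applies unchanged whether the table holds a few rows or billions spread across a cluster.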
Future With Machine Learning
Now that we know a bit about Apache Spark and how it works, let’s discuss the future. Life as we know it is all about predicting results, whether they are positive or negative. In the world of business analytics, that means progress and better use of the data the business generates—a smarter way to approach business. If a human had to do this, it would be next to impossible. Enter the world of machine learning. Simply put, machine learning describes systems that can learn from data.
Examples of machine learning in the world today include self-driving cars, automatic fraud detection and speech recognition. Machine learning is much more desirable than having a programmer code for the myriad use cases that could occur; not only would that be nearly impossible, but the resources and time required would be prohibitive. That said, the Apache Spark APIs are much easier to work with and provide lightning-fast computational speed. Compared to other solutions, processing can be more than 100 times faster.
Any machine learning application requires automation and optimization. As the application continues to learn and evolve, Spark’s support for hyperparameter tuning allows you to determine the best way to train the learning algorithm. Automation and optimization are solid strengths of Spark. Machine learning is all about prescriptive analytics: the art of predicting that something will happen, why it will happen and the best course of action to take once it does. We are seeing this now, and this is our future. I can’t help but think back to watching Terminator as a kid.
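Hyperparameter tuning can be sketched as a simple grid search: train the same algorithm under several candidate settings and keep the one that scores best on held-out data. The sketch below uses only plain Python and a toy "model" (fitting a single constant by gradient steps); in real Spark you would build the grid with spark.ml’s `ParamGridBuilder` and evaluate it with `CrossValidator` across the cluster.

```python
def train(data, learning_rate, steps=100):
    """Toy training loop: fit one constant w to the data by gradient descent."""
    w = 0.0
    for _ in range(steps):
        grad = sum(w - y for y in data) / len(data)
        w -= learning_rate * grad
    return w

def validation_error(w, data):
    """Mean squared error of the fitted constant on held-out data."""
    return sum((w - y) ** 2 for y in data) / len(data)

train_data = [2.0, 4.0, 6.0]
valid_data = [3.0, 5.0]

# Grid search: try each candidate learning rate, keep the best performer.
grid = [0.001, 0.01, 0.1]
best_rate, best_err = None, float("inf")
for rate in grid:
    w = train(train_data, rate)
    err = validation_error(w, valid_data)
    if err < best_err:
        best_rate, best_err = rate, err

print(best_rate)  # → 0.1 (the only rate that converges in 100 steps)
```

Spark’s contribution is scale: each point in the grid is an independent training run, so the cluster can evaluate many candidates in parallel instead of one after another.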
Spark is the OS of analytics. It’s proving to be a cost-effective and efficient way to process very large amounts of data in place. Machine learning and prescriptive analytics will continue to evolve and change the very way business is transacted today. And it runs on IBM z.
Patrick Stanard is a z Systems Architect Manager for IBM. He’s a 34-year professional in the industry spanning roles as a systems programmer, developer, manager, adjunct faculty member and director of operations. He has a Bachelor of Science in CIS from Saginaw Valley State University and an MBA from Michigan State University.