
SPSS Algorithms Optimized for Apache Spark & Spark Algorithms Extending SPSS Modeler

By Steve Barbee posted Fri November 06, 2015 08:00 PM

  
At the Insight 2015 conference last week, IBM announced that more than 15 of its core analytics and commerce solutions, including the SPSS predictive analytics portfolio, are now integrated with Apache Spark. (Press release, Interview with Rob Thomas, VP Product Development, IBM Analytics)



As a result, data scientists using IBM SPSS Modeler now benefit from the performance and scalability advantages of running predictive analytics operations on Apache Spark. This post summarizes the predictive algorithms enabled on Spark and the significant features of each. More information on each algorithm is available online (note that GLE and Linear-AS are categorized under Statistical Models).

Nine Algorithms for Big Data — Spark and/or MapReduce




Nine algorithms were released in 2015 for big data using Spark. They cover predictive analytics methods in five major areas and are available in IBM SPSS Modeler for use with Spark via IBM SPSS Analytic Server. Even customers who have not yet adopted Spark can benefit from these algorithms, since Analytic Server automatically falls back to Hadoop MapReduce processing when Spark is not available. The appropriate algorithms (those performing classification and/or regression) can also be found in the Auto Classifier and Auto Numeric Nodes.



[Image: spark_as_nodes]






    Market Basket, Clustering, Time Series & (Geo)Spatiotemporal

  1. Association Rules for market-basket analysis and frequent item sets

    WHAT'S NEW? Rules can include "nearness" to geospatial features (e.g., within city block X, near street Y)

  2. Spatiotemporal Prediction (STP) for 2D & 3D + time

    WHAT'S NEW? Combine time forecasts with location forecasts by iteratively performing spatial and temporal auto-regression in 2-dimensional or 3-dimensional space

  3. 2-Step Clustering

    WHAT'S NEW? Produce silhouette metrics down to the cluster level, visualize inter-cluster distances and list small outlier or anomalous clusters

  4. Temporal Causal Modeling (TCM) for time series analysis

    WHAT'S NEW? Forecast your target and your predictors together; find direct and indirect causes affecting, or affected by, your target; explore "what if" effects downstream of a change; outlier detection and root cause analysis

    Classification & Regression

  5. Linear Regression (Linear-AS)

    WHAT'S NEW? Over-fit prevention by automatic train/test analysis during feature selection

  6. Generalized Linear Regression (GLE) for Logistic Regression, Poisson Regression, Gamma Regression, Loglinear and Complementary Log-Log Regressions, plus many others using eight distributions (including Tweedie) and 16 link functions

    WHAT'S NEW?
    • Feature selection and regularization by LASSO, Ridge Regression or Elastic Net

    • Automatically detects your target and selects the right distribution and link function

  7. Linear Support Vector Machine (LSVM)

    WHAT'S NEW? Feature selection by L1 or L2 regularization

  8. Tree (CHAID), also called Tree-AS

    WHAT'S NEW? Automatically lists the top 10 classification or regression rules

  9. Random Trees

    WHAT'S NEW? Builds random ensembles of trees with IBM's version of this popular method, which produces accurate models with little overfitting
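
To make the "silhouette metrics down to the cluster level" idea from the 2-Step Clustering item more concrete, here is a minimal sketch in plain Python. It is not the SPSS Modeler implementation: the function name `silhouette_by_cluster`, the Euclidean distance and the convention of scoring singleton clusters as 0 are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def silhouette_by_cluster(points, labels):
    """Return {cluster label: mean silhouette of the points in that cluster}.

    Hypothetical helper for illustration only; uses Euclidean distance.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_distance(i, members):
        # Mean distance from point i to every other point in `members`.
        others = [j for j in members if j != i]
        return sum(dist(points[i], points[j]) for j in others) / len(others)

    scores = defaultdict(list)
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores[label].append(0.0)  # assumed convention for singletons
            continue
        a = mean_distance(i, own)  # cohesion: distance within own cluster
        b = min(  # separation: distance to the nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores[label].append(0.0 if denom == 0 else (b - a) / denom)

    return {label: sum(vals) / len(vals) for label, vals in scores.items()}
```

A value close to 1 indicates a tight, well-separated cluster. A small cluster with a low score is the kind of candidate an outlier or anomaly listing would surface.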








From the descriptions above, you'll see that we've added new functionality (STP, TCM, Random Trees and regularization in LSVM and GLE) for users conducting predictive analytics in Hadoop. We have also converted many existing algorithms to provide distributed, parallelized capability (Association Rules, Tree-AS, Linear-AS, LSVM and GLE). In addition, some existing Modeler functionality is overlapped or superseded by the new Spark-enabled algorithms (Regression, Anomaly, Feature Selection and Logistic Regression).
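
The regularization options mentioned for GLE and LSVM differ in how they treat small coefficients, which is why L1-style penalties are associated with feature selection. The sketch below illustrates this under a strong simplifying assumption: the predictors are orthonormal, so the penalized solution is a closed-form shrinkage of the ordinary least-squares coefficients. The function names and the `lam`/`alpha` parameterization are hypothetical and do not reflect how SPSS or Spark solve the problem.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_shrink(ols_coefs, lam, alpha):
    """Closed-form elastic net solution, ASSUMING orthonormal predictors.

    alpha = 1.0 gives LASSO (L1), alpha = 0.0 gives ridge (L2),
    and values in between mix the two penalties.
    """
    l1 = lam * alpha
    l2 = lam * (1.0 - alpha)
    return [soft_threshold(b, l1) / (1.0 + l2) for b in ols_coefs]


# Illustrative unpenalized coefficients (made-up numbers).
ols = [3.0, -0.4, 0.1, -2.0]
lasso = elastic_net_shrink(ols, lam=0.5, alpha=1.0)
ridge = elastic_net_shrink(ols, lam=0.5, alpha=0.0)
mixed = elastic_net_shrink(ols, lam=0.5, alpha=0.5)
```

In this toy case the LASSO result sets the two small coefficients exactly to zero, which drops those features, while ridge only scales every coefficient down and keeps them all.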



Extend SPSS Modeler with Spark Algorithms

In his October 26, 2015 post, Armand Ruiz explained that Spark integration also enables data scientists to extend Modeler with additional functionality. Typically, data scientists use Modeler's Custom Dialog Builder to create extensions that let non-programmers and novice users exploit R, Spark MLlib algorithms and other Python processes. Python for Spark support in Modeler's Custom Dialog Builder provides access to:


  • Spark & its machine learning library (MLlib)

  • Other common Python libraries, such as NumPy, SciPy, scikit-learn and pandas.


Using Modeler's Custom Dialog Builder to abstract code behind a GUI makes Spark usable for non-programmers.


[Image: spark_stream1105]


Many algorithms from Spark MLlib can be added to Modeler. To assist customers, IBM has already created custom Modeler Nodes for Collaborative Filtering and PageRank and posted them in the IBM SPSS Predictive Analytics Gallery. Beyond these, all of the algorithms listed below are accessible to Modeler through a build process described in the Knowledge Center (see SPSS Modeler Extensions Help -> Supported Languages -> Scripting with Python for Spark; also see SPSS Modeler Extensions Help -> Creating and Managing Custom Nodes).
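
For readers unfamiliar with what a PageRank node computes, the following is a small, self-contained sketch of the underlying power iteration in plain Python. It is an illustration only, not the Spark implementation or the code behind the IBM custom node. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the even redistribution of rank from pages without outgoing links are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return {page: rank} for a graph given as {page: [pages it links to]}.

    Hypothetical helper for illustration; ranks sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Tiny made-up link graph: "c" is linked to by every other page.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
example_ranks = pagerank(example_links)
```

In a custom node, logic like this would sit behind a dialog so that a non-programmer only chooses the input fields and parameters.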




[Image: spark_table1105 (table of Spark MLlib algorithms)]



There is some overlap between Spark MLlib algorithms and native SPSS Modeler Nodes:








Spark MLlib                 SPSS Modeler-AS Node
Linear SVM                  LSVM
Logistic Regression         GLE
Random Forests              Random Trees
LASSO & Ridge Regression    GLE





To learn more about SPSS' integration with Apache Spark, watch this webinar.
More information about the Spark MLlib algorithms is available here.