
Economics of Scaling Machine Learning Workloads: Architectural Lessons from Data Science Engagements

By Siva Anne posted Fri May 22, 2020 10:27 AM

  

The rise of the digital era has fueled a data explosion and the adoption of AI across all industries. With access to growing volumes of data, every business must build scalable Machine Learning solutions that can process massive datasets. Processing at scale imposes a unique set of challenges, and enterprises of all sizes are compelled to engineer solutions that not only scale but also optimize the costs of running those workloads.

Based on lessons learned from Data Science engagements, this is the first part of a two-part blog series that summarizes solution architectures addressing the critical challenge of optimizing the costs associated with scaling Machine Learning workloads.

Scaling Machine Learning

An end-to-end machine learning workload is a multi-stage workflow spanning data preparation, feature engineering, model training, and model scoring. Scaling machine learning requires the capability to scale every phase of that workflow.

 

Multiple factors impact the scalability of an end-to-end machine learning solution.

 
Scalable ML Framework

Machine Learning has existed for years, but the renewed adoption of AI at scale has driven a rapid evolution of ML frameworks. In recent years, the choice of innovative ML algorithms has only expanded further, with significant contributions from open-source communities.

Typically, factors like the use case, support for a specific language (Python, R, Java, or C/C++), performance, and community support influence the choice of ML framework. While ML frameworks like Spark MLlib, TensorFlow, Caffe, and PyTorch scale inherently across systems, the popular Python framework Scikit-learn is limited to a single machine. This limitation imposes constraints when training or scoring on datasets that don't fit into the memory of a single node.

For some models and datasets, the learning curve levels off beyond a certain data size, and training on more data does not yield improved performance. In such cases, the ML pipeline can use a sampling approach and train on representative subsets of the entire dataset. Other alternatives for scaling Scikit-learn are its out-of-core API, which learns incrementally from mini-batches of data, and Dask, a flexible library that scales Python across a cluster of nodes by parallelizing computations. Python packages like NumPy, Pandas, Scikit-learn, and XGBoost can be scaled with Dask with minimal rewrites, as sketched below.
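To make the Dask option concrete, here is a minimal sketch that swaps a single-node Scikit-learn fit for a distributed one. It assumes the dask and dask-ml packages are installed; the file pattern and column names are hypothetical.

```python
# A minimal sketch of scaling a Scikit-learn-style workflow with Dask.
# Assumes the dask and dask-ml packages; the path and column names are hypothetical.
import dask.dataframe as dd
from dask_ml.linear_model import LogisticRegression

# Read the data lazily as many partitions instead of one in-memory DataFrame.
df = dd.read_csv("data/part-*.csv")

# Keep the features and labels as chunked Dask arrays spread across workers.
X = df[["feature_1", "feature_2"]].to_dask_array(lengths=True)
y = df["label"].to_dask_array(lengths=True)

# dask-ml mirrors the familiar Scikit-learn estimator API,
# but fits the model across the partitioned dataset.
model = LogisticRegression()
model.fit(X, y)
```

For the out-of-core alternative, Scikit-learn estimators such as SGDClassifier expose a partial_fit method that consumes one mini-batch at a time instead of the full dataset.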

The inherent capability of an ML framework to scale beyond a single system is critical for building scalable Machine Learning solutions.

Scalable Data Processing

The data preparation and transformation stages of an end-to-end Machine Learning workflow drive data processing requirements. As the data scale increases, Machine Learning solutions must build on scalable data pipelines for data processing.

Distributed computing frameworks like Apache Spark, Apache Flink, and Presto parallelize operations across a cluster to drive data processing at scale. Apache Spark offers a unified analytics engine for SQL, ML, graph, and stream processing of data. Apache Flink is best known for its continuous-flow operators for real-time processing of streaming data. Presto is primarily a distributed SQL query engine that heavily parallelizes queries for the best performance on big data.

Apache Spark leads the pack as the popular choice for most implementations. Enterprise database vendors offer optimized Spark connectors that push down SQL processing to boost performance, and popular ML packages like XGBoost and LightGBM provide distributed implementations for Spark, making it a natural choice. With support for an extensive range of data sources, data formats, and language environments, Apache Spark has become the standard for scaling data processing in ML pipelines.
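As a concrete illustration, here is a minimal PySpark sketch of a data-preparation step that scales across a cluster; the input path, column names, and aggregation are hypothetical.

```python
# A minimal PySpark sketch of a data-preparation step that scales across a cluster.
# The input path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-prep").getOrCreate()

# Spark parallelizes both the read and the aggregation across executors.
events = spark.read.parquet("s3a://example-bucket/events/")

features = (
    events.filter(F.col("event_type") == "purchase")
          .groupBy("customer_id")
          .agg(F.count("*").alias("purchase_count"),
               F.avg("amount").alias("avg_purchase_amount"))
)

# Write the engineered features back out for the training stage.
features.write.mode("overwrite").parquet("s3a://example-bucket/features/")
```

The same DataFrame code runs unchanged whether the cluster has two executors or two hundred, which is what makes Spark attractive as data volumes grow.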

Dynamic, High Compute Capacity

High compute resources are essential to driving performance at scale. The economics of owning high-compute resources compel organizations to either share on-premises HPC clusters across workloads or leverage Cloud computing, which offers flexible, on-demand compute capacity at economical price points.

Scaling Machine Learning requires engineering pipelines to either leverage shared, on-premises HPC clusters or burst to the Cloud and use dynamically provisioned, on-demand compute resources.
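As one illustration of bursting to on-demand capacity, here is a minimal sketch using Dask's cluster managers. It assumes the dask-cloudprovider package and configured AWS credentials; the worker counts are hypothetical.

```python
# A minimal sketch of bursting an ML pipeline to on-demand cloud compute with Dask.
# Assumes the dask-cloudprovider package and configured AWS credentials;
# the worker counts below are hypothetical.
from dask_cloudprovider.aws import FargateCluster
from dask.distributed import Client

# Provision worker containers on demand; capacity is billed only while it runs.
cluster = FargateCluster(n_workers=4)

# Scale elastically with the workload, then shrink back when idle.
cluster.adapt(minimum=2, maximum=50)

client = Client(cluster)
# ... run distributed data preparation or model training here ...

client.close()
cluster.close()
```

The same pattern applies to shared on-premises clusters through Dask's other cluster managers, such as dask-jobqueue for HPC schedulers.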

Beyond the technical essentials, the enterprise IT landscape imposes additional factors that impact the viability of Machine Learning workloads.

Flexible Deployment for On-Premises, Hybrid Cloud, or Multi-Cloud

For most enterprise organizations, not all of their data exists in one place. And for many industries, regulatory compliance and security requirements restrict moving data. These enterprise constraints, coupled with the gravity of data at scale, demand that ML processing move closer to the data rather than moving data around. Some workloads are the best fit for the Cloud, while others fit best on-premises or in a hybrid cloud.

For enterprises, business needs mandate flexibility in the deployment of ML workloads. In addition to architecting for scale, Machine Learning solutions need to execute in on-premises, hybrid cloud, or multi-cloud environments without significant rework.

Skilled Staff

Open-source contributions are fueling rapid innovation, and organizations aspiring to adopt the latest innovations must keep up with the need for advanced technical skills. On one end of the spectrum, born-in-the-cloud organizations like Netflix, Uber, and Facebook lead innovation by building open software stacks to address their business needs; this approach requires staff with cutting-edge technical skills to build, run, and manage their ML workloads. On the other end of the spectrum are traditional businesses that rely on off-the-shelf, vendor-built enterprise platforms to build ML solutions, as these tools help lower the skills barrier.

For organizations of all sizes, limited access to highly skilled staff is the most critical barrier constraining technology choices for building scalable Machine Learning applications.

Summary

Enterprises must build scalable Machine Learning solutions to process growing volumes of data. Multiple factors like the choice of ML and data processing frameworks, access to scalable compute resources, flexibility in deployment options, and skilled staff impact the scalability of Machine Learning workloads. 

In the second part of this two-part blog series, we'll look at architectures that help optimize computing costs for running ML workloads at scale.

Need help kick-starting your data science project with the right expertise, tools, and resources? 
Request a free consultation: ibm.co/DSE-Consultation
Connect with us: Visit ibm.co/DSE-Community to explore our resources and learn more about Data Science and AI Elite.





#GlobalAIandDataScience
#GlobalDataScience