Global AI and Data Science

Infrastructure for Data Science

By Austin Eovito posted Mon October 14, 2019 05:20 PM

By Austin Eovito and Vikas Ramachandra

Other blogs in MLOps series:
Operationalizing Data Science
Infrastructure for Data Science
Model Development and Maintenance
Model Deployment
Model Monitoring

Introduction

Ad hoc data science operations bear little resemblance to traditional DevOps in how they handle development and deployment. Solutions to the infrastructural challenges posed by data science applications are well understood where they overlap with traditional software development. However, where these applications diverge from traditional development, firms may not have the infrastructural capacity or architecture to accomplish their goals. Furthermore, specific challenges such as edge computing and heterogeneous data sources call for research and investment in infrastructure to support data science teams.

Solutions from Software Engineering

Data science operations share three broad classes of challenges with software engineering organizations: infrastructure abstraction, infrastructure management, and enterprise readiness. Infrastructure abstraction refers to setting up the distributed cloud system so that the details of the underlying hardware are hidden. Infrastructure management refers to operating the resources offered by providers such as Amazon Web Services (AWS), for example a set of virtualized server instances [1]. Enterprise readiness refers to an organization’s ability to withstand and respond to cyber-attacks [3].

Many firms have heterogeneous compute, storage, and networking environments. These environments comprise public and private clouds and computing models such as virtual machines, containers, and serverless computing. To increase usability and interoperability, a layer of abstraction on top of the compute infrastructure allows data scientists and software engineers to carry out computations independently of the underlying hardware.
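As a rough illustration of such an abstraction layer, the sketch below lets calling code submit a job against a common interface without knowing where it runs. Everything here is hypothetical: the `ComputeBackend` interface, the two backend classes, and the image name are invented for illustration, and the container backend only simulates execution locally so the example stays runnable.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class ComputeBackend(Protocol):
    """Hypothetical interface: anything that can run a function on some data."""
    name: str

    def run(self, fn: Callable[[list[float]], float], data: list[float]) -> float: ...


@dataclass
class LocalBackend:
    """Runs the job in the current process (stands in for a VM or bare metal)."""
    name: str = "local"

    def run(self, fn, data):
        return fn(data)


@dataclass
class ContainerBackend:
    """Placeholder for a container runtime."""
    name: str = "container"
    image: str = "example/ds-runtime:latest"  # hypothetical image name

    def run(self, fn, data):
        # Assumption: a real backend would ship the job to a container;
        # here execution is simulated locally.
        return fn(data)


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def submit(backend: ComputeBackend, data: list[float]) -> float:
    """Caller code depends only on the interface, not on the hardware."""
    return backend.run(mean, data)


results = {
    backend.name: submit(backend, [1.0, 2.0, 3.0])
    for backend in (LocalBackend(), ContainerBackend())
}
print(results)
```

The point of the design is that `submit` never changes when a new backend is added; only a new class implementing `run` is needed.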

Management and maintenance of compute infrastructure require logging, analytics, and dashboards. Enterprise-class firms face a common set of challenges with their compute infrastructure. These include security and scalability of the infrastructure; compliance with external standards such as the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry Data Security Standard (PCI-DSS), as well as with internal policies; and preparation for future audits. Large firms may also need to facilitate the chargeback of usage fees to sub-organizations in proportion to their use of the infrastructure.
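Proportional chargeback comes down to simple arithmetic. The sketch below is one possible way to split a shared bill by usage; the team names, hours, and total cost are made-up numbers, and real chargeback schemes may weight resources differently.

```python
def chargeback(total_cost: float, usage_hours: dict[str, float]) -> dict[str, float]:
    """Split total_cost across teams in proportion to their recorded usage."""
    total_hours = sum(usage_hours.values())
    if total_hours <= 0:
        raise ValueError("no usage recorded")
    return {
        team: round(total_cost * hours / total_hours, 2)
        for team, hours in usage_hours.items()
    }


# Hypothetical monthly bill and per-team compute hours.
bill = chargeback(10_000.0, {"research": 500.0, "marketing": 300.0, "finance": 200.0})
print(bill)  # {'research': 5000.0, 'marketing': 3000.0, 'finance': 2000.0}
```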

Because data scientists and software engineers share an overlapping subset of problems, neither group is likely to arrive at the best solution independently. Traditional software engineering has had time to produce well-studied and well-documented solutions to these problems, so data scientists should partner with infrastructure IT and DevOps teams to adapt these established solutions to data science operations.

Problems Unique to Data Science

Conversely, data science operations also encounter bespoke infrastructure problems that have no direct parallel in software engineering. The first of these is support for the toolchains needed for data science. There are many different and fast-growing tools in data science today, each with its own universe of libraries, modules, and connectors. Each of these, in turn, has multiple and sometimes incompatible versions, which adds maintenance requirements as the toolchains evolve independently.
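One modest way to keep version drift visible is to compare the versions a project expects against what is actually installed. The sketch below uses the standard library's `importlib.metadata`; the helper name `find_mismatches` and the pinned versions are illustrative assumptions, not a recommendation for specific releases.

```python
from importlib import metadata


def find_mismatches(pins, installed=None):
    """Return {package: (wanted, found)} for every pin that is not satisfied.

    If `installed` is given, it is used instead of querying the environment,
    which makes the check easy to exercise without installing anything.
    """
    mismatches = {}
    for package, wanted in pins.items():
        if installed is not None:
            found = installed.get(package)
        else:
            try:
                found = metadata.version(package)
            except metadata.PackageNotFoundError:
                found = None  # package is not installed at all
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Hypothetical pins and a hypothetical environment snapshot.
pins = {"numpy": "1.17.2", "pandas": "0.25.1"}
snapshot = {"numpy": "1.17.2", "pandas": "0.24.0"}
print(find_mismatches(pins, snapshot))  # {'pandas': ('0.25.1', '0.24.0')}
```

Calling `find_mismatches(pins)` without a snapshot checks the live environment instead.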

Data science owes much of its gains in efficiency and throughput to graphics processing unit (GPU) support in libraries, which reduces the overall maintenance of codebases (code can be parallelized in dynamic languages like Python) and increases the workload that can be handled across the network. Many data science tools have specialized libraries for faster execution on GPUs. Data scientists’ preference for this specialized support further compounds the challenge of maintaining multiple, rapidly evolving tools.

Data scientists and DevOps teams can manage this additional complexity, along with the infrastructure abstraction challenges mentioned earlier, by using, for example, Docker containers to host data science models and their associated toolchains. They can also make the model production-ready by using Kubernetes to provide elasticity and explicit allocation of compute resources to an instance [1]. Finally, data scientists working with DevOps can layer REST APIs on top of the models so that the models can be invoked independently of the toolchain supporting them.
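To make the REST layer concrete, here is a minimal sketch using only Python's standard library. The "model" is a stand-in linear function with invented weights, and the `/predict` path, port, and JSON shape are assumptions chosen for illustration; a production service would more likely use a dedicated web framework inside the container.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

WEIGHTS = [0.5, -0.25]  # hypothetical model coefficients
BIAS = 1.0


def predict(features):
    """Stand-in model: a fixed linear function of the inputs."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handle_predict(body: bytes):
    """Turn a JSON request body into (HTTP status, JSON-serializable result)."""
    try:
        payload = json.loads(body)
        score = predict([float(x) for x in payload["features"]])
    except (ValueError, KeyError, TypeError) as exc:
        return 400, {"error": str(exc)}
    return 200, {"prediction": score}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_predict(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Blocks forever; call explicitly to start the service."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Clients only need to POST JSON such as `{"features": [2, 4]}` to `/predict`; they never import the model's libraries, which is what decouples callers from the toolchain.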

Virtual machines (VMs) emulate hardware systems, allowing multiple operating systems to run on the same physical server [1]. This means that Windows 10 can run on the same machine, and at the same time, as Linux. Containers are the lightweight analogue of virtual machines: they share the host's operating system kernel rather than bundling a full operating system, and are typically mere megabytes as opposed to gigabytes in size [1]. Both allow data scientists to efficiently access tools that would otherwise be segregated by operating system or proprietary licensing.

Return on Infrastructure Investments

Personnel in both data science and traditional software engineering should exploit the overlap between the two fields to avoid reinventing the wheel. Advances in data science infrastructure are already enabling communication between tools that would otherwise be siloed by language and use case. Although the abstractions described above carry an upfront cost, the investment pays off through fewer errors and less unexpected behavior, as well as improved economies of scale and reduced future technical debt.

Resources

[1] Chamberlain, D. (March 16, 2018). Containers vs. Virtual Machines (VMs): What’s the Difference? Retrieved from: https://blog.netapp.com/blogs/containers-vs-vms/

[2] Kolb, S. (2019). On the Portability of Applications in Platform as a Service. Retrieved from: https://books.google.com/books?id=H4uRDwAAQBAJ&pg=PA46&lpg=PA46&dq=#v=onepage&q&f=false

[3] Miller, B. A. (March 09, 2018). Enterprise Readiness: Is Your Organization Prepared. Retrieved from: https://www.advisorycloud.com/board-of-directors-articles/enterprise-readiness-is-your-organization-prepared
