Model Development and Maintenance

By Austin Eovito posted Mon October 21, 2019 02:38 PM

  

By Austin Eovito and Vikas Ramachandra

Other blogs in MLOps series:
Operationalizing Data Science 
Infrastructure for Data Science
Model Development and Maintenance   
Model Deployment
Model Monitoring


Introduction

In recent years, applied data scientists have found themselves rushing models to production in response to business needs. In many cases, models are expedited through the development process at the expense of long-term system health. Compounding this issue, data-intensive workloads introduce problems of their own, such as computational complexity, scalability, and data governance [5]. As firms mature in their use of data science, they need to develop systems for managing the lifecycle of the models they deploy.

Initial Creation

Once the initial business solution has been ideated, data scientists begin the ‘prototype,’ or research, phase, in which they experiment with data scope and resourcing, model choices, and model parameters. It is not uncommon to see these experiments carried out on laptops or other non-scalable computing environments. This inadvertently creates technical debt that must be addressed before deploying models to a production environment can yield business value.

Broadly defined, technical debt is a phrase that captures short-term gains in speed made at the expense of long-term costs in software engineering practice (glue code, deferred refactoring, etc.). In machine learning applications, small-scale prototypes can obscure the fragility, or resistance to change, of a full-scale system [4].

To the extent that data scientists have access to well-maintained data science infrastructure, technical debt can be reduced or eliminated at low cost. Such infrastructure standardizes computing primitives, programming language versions, and library versions, which in turn reduces the overhead associated with model deployment.

Similarly, in many cases a model is needed to augment a software application. Pressed for time, a data scientist may make the design choice to tightly couple their model to the application. Since modern applications are often hosted in the cloud or have access to cloud resources at runtime, it is advantageous to buffer the model from the application so that the model can have a lifecycle independent of the application's scope. Decoupling the model from the application allows the model to be improved incrementally and independently of the application, whilst promoting reusability.
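In practice, decoupling can be as simple as having the application depend on a narrow prediction interface rather than on a concrete model. A minimal sketch in Python (the class and function names here are illustrative, not drawn from any particular framework):

```python
from abc import ABC, abstractmethod


class ModelClient(ABC):
    """Narrow interface the application depends on, instead of a concrete model."""

    @abstractmethod
    def predict(self, features: list[float]) -> float: ...


class LocalModel(ModelClient):
    """In-process stand-in; could later be replaced by a client
    that calls a remotely hosted model endpoint instead."""

    def predict(self, features: list[float]) -> float:
        return sum(features) / len(features)  # toy scoring logic


def score_order(client: ModelClient, features: list[float]) -> str:
    # Application code sees only the interface, so the model can be
    # retrained, redeployed, or swapped without touching this function.
    return "high" if client.predict(features) > 0.5 else "low"
```

Because `score_order` depends only on `ModelClient`, swapping `LocalModel` for, say, an HTTP client against a hosted endpoint requires no change to application code.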

Subsequent Updates

In many ways, data science models are like software libraries: they evolve as data scientists incorporate feedback from users and respond to changes in computing infrastructure (such as new module availability) or in the underlying data distribution. As a model evolves, it is logical to treat new versions of the model as one would new versions of a library, remaining wary of deprecated features.

In particular, it is helpful to version (or label) releases of a model and to associate release notes with each version. Doing so makes the contents of any particular version of the model explicit. Equally, it gives data scientists, software engineers, and DevOps teams a clean way to reference the appropriate model for a given use, such as deployment to production or testing.
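One lightweight way to do this is to write each serialized model into a version-labeled directory alongside a machine-readable release-notes file. A sketch, assuming pickle-serializable models (the directory layout and field names are illustrative):

```python
import json
import pickle
from pathlib import Path


def save_model_release(model, version: str, notes: str, root: str = "models") -> Path:
    """Persist a model under a version label with accompanying release notes."""
    release_dir = Path(root) / version
    release_dir.mkdir(parents=True, exist_ok=True)

    # Serialized model artifact for this release.
    with open(release_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)

    # Release notes stored next to the artifact, so any consumer of the
    # model can see what changed in this version.
    (release_dir / "RELEASE_NOTES.json").write_text(
        json.dumps({"version": version, "notes": notes}, indent=2)
    )
    return release_dir
```

In a mature setup the same role is played by artifact stores or experiment-tracking tools, but even this minimal convention makes each release explicit and referenceable.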

Reuse

Model development and maintenance is expensive, given the typical cost of the human capital, data sources, and compute infrastructure involved. It is therefore logical to extend the life of a model by promoting the reuse of existing models. A prerequisite for reuse is discoverability, i.e., knowing where the model lives. Model versioning and release notes complement discoverability, but most firms with large or growing data science teams need to go further by creating a single, visible repository of models.
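Such a repository can start as little more than a catalog mapping model names and versions to artifact locations. A hypothetical in-memory sketch (a production registry would persist this catalog and enforce access control):

```python
class ModelRegistry:
    """Single, visible catalog mapping (name, version) to artifact locations."""

    def __init__(self):
        self._entries = {}

    def register(self, name: str, version: str, uri: str, notes: str = ""):
        # Record where the versioned artifact lives and what changed.
        self._entries[(name, version)] = {"uri": uri, "notes": notes}

    def latest(self, name: str):
        """Return the newest registered version of a model and its entry."""
        versions = [v for (n, v) in self._entries if n == name]
        if not versions:
            raise KeyError(f"no model registered under {name!r}")
        version = max(versions)  # assumes lexicographically sortable version labels
        return version, self._entries[(name, version)]
```

With a catalog like this, a colleague looking for an existing churn model can discover it by name instead of rebuilding it from scratch.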

Conclusion

Techniques developed within software engineering and traditional developer operations can be extrapolated and applied to the development of data science applications. In doing so, data science operations become more efficient, both by reducing hurdles to deploying new models and by extending the useful lifecycle of existing models.

Resources

[1] FirstMark. (January 25, 2018). There are 3 main types of technical debt. Here’s how to manage them. Retrieved from: https://hackernoon.com/there-are-3-main-types-of-technical-debt-heres-how-to-manage-them-4a3328a4c50c

[2] Vimarsh Karbhari. (August 7, 2018). Technical Debt in Data Science Series – Part 1. Retrieved from: https://medium.com/acing-ai/technical-debt-in-data-science-series-part-1-7b44c10c660a

[3] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M. (2014). Machine Learning: The High Interest Credit Card of Technical Debt. Retrieved from: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf

[4] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J., Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. Retrieved from: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

[5] Islam, M., Buyya, R. (December 3, 2018). Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments. Retrieved from: https://arxiv.org/pdf/1812.00563.pdf
