
Model Deployment

By Austin Eovito posted Wed October 30, 2019 12:43 PM

  

By Austin Eovito and Vikas Ramachandra

Other blogs in the MLOps series:
Operationalizing Data Science 
Infrastructure for Data Science
Model Development and Maintenance   
Model Deployment
Model Monitoring


Introduction

Efficient, rapid deployment of both ephemeral and permanent systems is of paramount importance to data scientists. Large data science teams may deploy hundreds or thousands of models in a given year, whereas their smaller counterparts may deploy only a handful. Irrespective of size, these teams require a scale-agnostic process to balance the development and deployment of heterogeneous systems.

Validation and Live-Testing

Candidate models usually go through a validation phase before deployment. The validation phase encompasses all aspects of a candidate model’s performance, as opposed to focusing solely on predictive validity, the primary concern of most data scientists. Data scientists constantly tinker with feature engineering, model parameters, and hyperparameters to eke out incremental increases in predictive validity, while latency and throughput characteristics receive less attention.

However, when a model is deployed in a production environment, latency and throughput characteristics matter just as much as predictive validity. Depending on the context in which the model is used, latency and throughput can greatly affect the end-user experience of the application the model supports. They also directly impact cost: more infrastructure may be needed to achieve acceptable latency and throughput from a model.
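
To make this concrete, the sketch below times a model's prediction calls to estimate latency percentiles and throughput. It is a minimal sketch, not a prescribed tool: predict_fn and batches are hypothetical placeholders for whatever serving interface and sample data a team actually has.

```python
import statistics
import time

def benchmark(predict_fn, batches, warmup=5):
    """Estimate per-batch latency (seconds) and throughput (rows/second)."""
    for batch in batches[:warmup]:            # warm up caches/JIT before timing
        predict_fn(batch)

    latencies, rows = [], 0
    for batch in batches[warmup:]:
        start = time.perf_counter()
        predict_fn(batch)
        latencies.append(time.perf_counter() - start)
        rows += len(batch)

    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": sorted(latencies)[int(0.95 * len(latencies)) - 1],
        "throughput_rows_per_s": rows / sum(latencies),
    }
```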

At scale, the process of running candidate models through a battery of tests that define the prediction, latency, and throughput characteristics of the model should be automated [1]. Models that do not perform reasonably across all three criteria should be blocked from deployment and iterated upon. Further, models that are accepted for deployment should go through a live-testing phase in which a fraction of the load serviced by a predecessor model is routed through the candidate model (i.e., the candidate is A/B tested). The candidate model should replace its predecessor only if it has better aggregate performance metrics in the live test.
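
A minimal sketch of such an automated gate and live-test decision follows. The threshold values and metric names are illustrative assumptions (they reuse the keys from the benchmark sketch above); real criteria would come from the application's service-level requirements.

```python
# Illustrative thresholds only -- real values depend on the application.
THRESHOLDS = {
    "min_accuracy": 0.90,
    "max_p95_latency_s": 0.050,
    "min_throughput_rows_per_s": 500,
}

def passes_gate(metrics):
    """Block deployment unless prediction quality, latency, and throughput all clear the bar."""
    return (metrics["accuracy"] >= THRESHOLDS["min_accuracy"]
            and metrics["p95_latency_s"] <= THRESHOLDS["max_p95_latency_s"]
            and metrics["throughput_rows_per_s"] >= THRESHOLDS["min_throughput_rows_per_s"])

def promote_after_live_test(candidate, incumbent):
    """Replace the incumbent only if the candidate's aggregate live-test metrics are at least as good."""
    return (candidate["accuracy"] >= incumbent["accuracy"]
            and candidate["p95_latency_s"] <= incumbent["p95_latency_s"]
            and candidate["throughput_rows_per_s"] >= incumbent["throughput_rows_per_s"])
```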

Upgrades and Downgrades

For professional data science operations at scale, not only is it important to automate the validation of new candidates, but it is also important to use a version management tool (such as GitHub) to promote candidate models to live-test and then to full deployment. Conversely, if model performance degrades, the version management tool should be able to downgrade to an older version of the model with stable performance.
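
One way to sketch this promotion and rollback flow is with git tags as the version-management mechanism; the tag naming scheme below is purely an illustrative assumption, and a dedicated model registry would serve the same purpose.

```python
import subprocess

def tag_model(version, stage):
    """Point a stage tag (e.g. live-test, production) at the current commit."""
    tag = f"model-{stage}-v{version}"
    subprocess.run(["git", "tag", "-f", tag], check=True)
    subprocess.run(["git", "push", "-f", "origin", tag], check=True)

def promote_to_live_test(version):
    tag_model(version, "live-test")    # serve a fraction of traffic first

def deploy_to_production(version):
    tag_model(version, "production")   # full deployment

def rollback(previous_stable_version):
    tag_model(previous_stable_version, "production")  # repoint production at a stable version
```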

Model Statistics

As models move from validation to live testing to full production deployment, it is helpful to record their performance metrics systematically. Keeping score matters not only for deciding which models to promote and which to send back for rework, but also for model discovery and reusability. Data scientists working on new projects ought to investigate previous models that have been deployed with good results, and such investigation is difficult without good statistics on past usage.
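
As one possible (not prescribed) approach, each evaluation could append a row to a lightweight scoreboard so past models remain discoverable. SQLite and the schema below are illustrative assumptions.

```python
import datetime
import json
import sqlite3

def record_metrics(db_path, model_name, version, stage, metrics):
    """Append one scoreboard row per (model, version, stage) evaluation."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS model_metrics (
                        recorded_at TEXT, model TEXT, version TEXT,
                        stage TEXT, metrics_json TEXT)""")
    conn.execute("INSERT INTO model_metrics VALUES (?, ?, ?, ?, ?)",
                 (datetime.datetime.now(datetime.timezone.utc).isoformat(),
                  model_name, version, stage, json.dumps(metrics)))
    conn.commit()
    conn.close()

# Example usage with made-up values:
# record_metrics("scoreboard.db", "churn-model", "1.4.2", "live-test",
#                {"accuracy": 0.91, "p95_latency_s": 0.034})
```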

The most obvious machine learning metric is correctness, the extent to which a model “gets things right” [1]. Equal in importance, however, is the degree of overfitting, the extent to which a model is optimized only for its training set and fails to generalize when tested on unseen data. For example, consider fitting a simple linear regression. If the underlying phenomenon is linear, we have reason to expect a good fit. An even better fit would result, however, if we simply connected all of the dots. By connecting the dots, we could fit the training data perfectly (overfitting), but given a new dataset the odds would be against us, as this model would fail to generalize.
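
The toy example below makes this concrete: ten noisy points are drawn from a linear relationship, then fit with both a straight line and a degree-9 polynomial that effectively connects the dots. The training error of the high-degree fit is near zero, but its test error is typically noticeably worse than the linear fit's.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, size=10)    # underlying phenomenon is linear
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + rng.normal(0, 0.1, size=100)

for degree in (1, 9):                                   # straight line vs. connect-the-dots
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```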

Additional metrics of importance include robustness (how a model performs when provided with data of suboptimal quality), fairness, security, and efficiency [1]. How can we be sure a model is not unlawfully discriminating against protected classes unless we have quantified its past performance? How can we ensure that a machine learning system’s components are secure from illicit access? How do we ensure that a machine learning model is not computationally prohibitive?
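
Two of these checks can be sketched concretely. The measures below (accuracy drop under input noise for robustness, and a demographic-parity gap for fairness) are just one choice among many, and the predict function, features, labels, and group attribute are hypothetical placeholders.

```python
import numpy as np

def robustness_drop(predict_fn, X, y, noise_std=0.1, seed=0):
    """Accuracy lost when Gaussian noise degrades input quality."""
    rng = np.random.default_rng(seed)
    clean_acc = np.mean(predict_fn(X) == y)
    noisy_acc = np.mean(predict_fn(X + rng.normal(0, noise_std, X.shape)) == y)
    return clean_acc - noisy_acc

def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rates across protected groups."""
    rates = [np.mean(predictions[groups == g]) for g in np.unique(groups)]
    return max(rates) - min(rates)
```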

Conclusion

A systematic record of performance metrics allows data scientists to evaluate models against several competing priorities. Machine learning models must excel not only in terms of predictive validity but also in terms of generalizability, robustness, efficiency, fairness, and domain-specific metrics. Finally, machine learning models must be scalable if the hype surrounding big data is to live up to its lofty ideals.

 

Resources

[1] Machine Learning Testing: Survey, Landscapes and Horizons. https://arxiv.org/pdf/1906.10742.pdf
