ML Ops Day, OSCON 2019

By Paco Nathan posted Mon September 09, 2019 04:06 PM


Tuesday, July 16, 2019: at the OSCON open source conference this year in Portland, we hosted a track called ML Ops Day: Managing the end-to-end ML lifecycle – sponsored by the IBM Data Science Community. Even though our call for proposals for the track was open for only a few weeks, talks from enterprise organizations poured in, exploring many aspects of the machine learning lifecycle.

The emerging shape of ML Ops

Our goal for the track had been to host a discussion about the emerging practices for ML Ops across different companies. Effectively, we wanted to compare notes and share best practices. As it turned out, many other people wanted to join in that discussion.

It’s interesting to note: the track had been titled “AI Ops Day” originally. We received early feedback, including a few talk proposals, that AI Ops tends to mean “AI used in Operations” in contrast to our intended theme of an end-to-end ML lifecycle. Red Hat had been using the term “ML Ops” for the latter, so we decided to standardize on that. It turned out to be a good call, plus a useful lesson learned.

Now let’s unpack the definition for ML Ops further so that we’re communicating in the same language. What does this term mean?

Consider how the software development life cycle (SDLC) is well-defined at this point: planning, creating, testing, deploying, maintaining – or some variant, depending on your software methodology. The gist remains consistent. Computer software runs “logic” in hardware and the test suites are repeatable, so it’s a relatively deterministic process. With machine learning, we’re working with probabilistic systems instead. Rather than writing code as instructions, we’re guiding these systems to learn from data. The life cycle required for a probabilistic system is much different. For example, testing is considerably more complex since probabilistic workflows tend to have some randomness built in, and are difficult to reproduce exactly from one run to the next.
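To make the point about randomness concrete, here’s a minimal sketch in Python using only the standard library. The function `noisy_accuracy` is a hypothetical stand-in for any train-and-evaluate run, not something taken from the talks. It simply shows that an unseeded run can drift from one invocation to the next, while pinning the random seed gives a test suite something repeatable to assert against.

```python
import random


def noisy_accuracy(seed=None):
    """Hypothetical stand-in for a train-and-evaluate run with built-in randomness."""
    rng = random.Random(seed)
    # Simulate 1,000 predictions, each correct with probability 0.9.
    hits = sum(rng.random() < 0.9 for _ in range(1000))
    return hits / 1000


# Unseeded runs generally differ slightly from one run to the next.
unseeded = [noisy_accuracy() for _ in range(3)]

# Fixing the seed makes the run repeatable, so a test can assert on it.
seeded = [noisy_accuracy(seed=42) for _ in range(3)]
assert seeded[0] == seeded[1] == seeded[2]
```

Real pipelines usually have more sources of nondeterminism than a single seed (data ordering, parallelism, hardware), so treat this as an illustration of the idea rather than a complete recipe.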

The risks in the ML life cycle are also different since machine learning models have become pervasive in so many aspects of everyday consumer life – so much of which is tightly regulated. As machine learning models help automate important decisions in a wide variety of industries – banking, health care, airline schedules, telecom, shopping, entertainment, and so on – they become subject to much scrutiny about compliance, audits, needs for explainability, concerns about fairness and bias, privacy laws, security concerns, etc. Many of those activities are regulated, for important reasons. While more traditional software engineering similarly has security concerns, audits, etc., the stakes are not nearly as high: code can be debugged. Machine learning, especially when driven with large scale data, is substantially more difficult to trace and “debug” compared with coding. That typically becomes a matter of advanced statistical analysis in place of setting breakpoints or doing stack traces in an IDE – an integrated development environment such as Eclipse or IntelliJ.

When preparing for the ML life cycle, the steps are very different. First, you’ll likely need to spend a substantial amount of time cleaning up the data. That’s before exploring the data to identify predictive features. Only after you’ve started identifying features from your cleaned training set might you start building a model – followed closely by trying to understand why the model makes specific predictions.

It’s additionally complex because of the many roles involved in your ML process. The people who handle early stages of that workflow (e.g., data prep) may not be the same people who build and tune your machine learning models. The people who build your ML models are often not the people who deploy and manage models in production. They are most certainly not the people who’ll audit those models, either. Among those roles involved, who decides when an ML model in production use must be retrained or retired?

The description above gives shape to the story of ML Ops, although we don’t know the full story yet since it’s still unfolding and evolving. We do know the intent and expected outcome: ML Ops is the practice of taking something really powerful, such as a machine learning model and its adjacent pipeline, and making its life cycle repeatable and scalable – and, most importantly, having it drive impact for business needs.

Why ML Ops as a focus?

First: each story about data-driven operations is complicated and unique. Data coming from a single data source can be different from one minute to the next. Data gets used to train and test ML models during their development, and those models then get used to automate decisions or predictions in production. When you consider how ML pipelines in enterprise tend to aggregate data from many different data sources, the complexity of these problems escalates.

Second: from an enterprise perspective, the stories of ML solutions are extremely valuable. Predictive models can make the difference at critical steps: retaining a customer, converting their shopping cart to a completed sale, getting a good first (and lasting) impression of your company as an innovator – or, conversely, creating a bad experience for the customer and having them forever be a critic of your business.

We were excited to hear how the presenters in the track had developed solutions for managing their ML pipelines, particularly by using open source.


One of the reasons IBM wanted to sponsor this track is that its traditional clientele are large enterprise organizations, which are now almost all trying to make sense of ML pipelines and related practices – on one level or another. Talent in this field is in short supply. Some of these technologies can be expensive to use, although they're getting cheaper. Meanwhile, IBM has a mission to help bring machine learning capabilities to all, so we can all participate in the AI economy responsibly. Consequently, there are many different participants and stakeholders in this emerging field of ML Ops.

These are important challenges for enterprise teams to discuss right about now: if we practitioners don't build these enterprise pipelines correctly, the popular services which affect consumers could degrade in the worst possible ways. Spanning a range of large risks – model bias, ethical conflicts, compromised privacy, attack surfaces, and so on – the impact of ML models and the pipelines that produce them affects so much more than our digital shopping carts.

We must be able to do more than simply train and deploy ML models. Instead we need to treat them as living products. Ask: Are they healthy? Are they representative? Are they biased? The sum of those concerns creates a substantially different, new definition for Operations.

Also, the setting of OSCON was quite helpful. There were tutorials about closely adjacent tools and skills happening throughout the conference. For example, how to handle bias in ML pipelines by Ana Echeverri and Trisha Mahoney, and how to train deep learning models by Patrick Titzler, Va Barbosa, and Jeremy Nilmeier. Ideally, people participating in the ML Ops track could augment it with other tutorials, to help build well-rounded careers in AI.


Here are links to the videos for each of the talks, and also links to slides where those are available:

Paco Nathan from Derwen AI

“Model as a service for real-time decisioning”
Niraj Tank, Sumit Daryani from Capital One
(video, slides)

“Machine learning vital signs: Metrics and monitoring of AI in production”
Donald Miner from Miner & Kasch
(video, slides)

“AI pipelines powered by Jupyter notebooks”
Luciano Resende from IBM
(video, slides)

“Kubernetes for machine learning: Productivity over primitives”
Sophie Watson, William Benton from Red Hat
(video, slides)

“Practical DevOps for the busy data scientist: Alice’s adventures in DevOpsland”
Tania Allard from Microsoft

“Machine learning infrastructure at GitHub using Kubernetes”
Michal Jastrzębski, Hamel Husain from GitHub

“The OS for AI: How serverless computing enables the next gen of machine learning”
Jonathan Peck from Algorithmia
(video, slides)

“Democratizing AI: Making deep learning models easier to use through containerization and microservices”
Saishruthi Swaminathan, IH Jhuo from IBM

“End-to-end ML streaming with Kubeflow, Kafka, and Redis at scale”
Nick Pinckernell from Comcast

Key Takeaways

Each of the talks provided excellent insights and many suggestions. There’s a Twitter moment which links the tweet threads about each presentation in the track, capturing the related discussions online. In addition to the speakers, it’s generally quite helpful to keep note of the “hallway track”: Q&A from the audience, sidebar discussions during breaks, and what’s referenced online as well.

If you are new to this topic overall, start with the presentation by Sophie Watson and William Benton from Red Hat. Their coverage of ML Ops from soup to nuts is comprehensive and highly accessible to a wide audience. It provides an excellent introduction, beautifully illustrated and brilliantly narrated.

Credit: Sophie Watson, William Benton @ Red Hat


A rare gem revealed at ML Ops Day was the talk by Donald Miner, where he explored the core descriptions for ML Ops concerns, then suggested a set of vital signs to monitor. That talk represents a “phenomenology” – in other words, when exploring a new area of analysis, you generally want to begin with a qualitative description to identify issues and concerns. That happens before quantitative work can start to measure and model processes. Think of word problems in math: how do you determine from a text description what the variables and equations need to be? 

Credit: Donald Miner @ Miner & Kasch

Pull quotes:

  • “Humans love to give feedback on AI. Give them the ability to do that … It’s maybe the most obvious (thing to do), but it’s easily overlooked.”
  • “Track what your models are doing...and then you should watch it … I didn’t say anything too far out of the obvious, but nobody’s doing this.” 

Look toward Donald’s talk as helping to establish structure for other discussions about enterprise practices. Especially going forward, this provides a good framework for how to evaluate work in ML Ops. The vital signs provide a rationale for which metrics to monitor and, more importantly, why to monitor them. From a pragmatic view, that can serve as a checklist for starting or evaluating your own ML Ops practice.
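As a rough illustration of the two pull quotes above (track what your models are doing, and give humans a way to provide feedback), here’s a minimal Python sketch. The `ModelMonitor` class, its method names, and the “approve”/“deny” labels are all hypothetical, invented for this example; they are not the specific vital signs from Donald’s talk. The sketch counts every predicted label and, when a person supplies a label of their own, tracks how often the model agreed with them.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ModelMonitor:
    """Hypothetical monitor: logs predictions and optional human feedback."""
    predictions: Counter = field(default_factory=Counter)
    feedback_total: int = 0
    feedback_agree: int = 0

    def record(self, predicted_label, human_label=None):
        # Track what the model is doing: count every predicted label.
        self.predictions[predicted_label] += 1
        # Human feedback is optional; record agreement when it is given.
        if human_label is not None:
            self.feedback_total += 1
            self.feedback_agree += int(human_label == predicted_label)

    def label_share(self, label):
        # Fraction of all predictions that produced this label.
        total = sum(self.predictions.values())
        return self.predictions[label] / total if total else 0.0

    def agreement_rate(self):
        # None signals "no feedback yet" rather than a misleading 0.0.
        if self.feedback_total == 0:
            return None
        return self.feedback_agree / self.feedback_total


monitor = ModelMonitor()
monitor.record("approve", human_label="approve")
monitor.record("approve")
monitor.record("deny", human_label="approve")
monitor.record("approve", human_label="approve")
```

A sudden shift in `label_share` or a falling `agreement_rate` would be the kind of signal worth watching; in practice these counts would be exported to whatever metrics system a team already runs.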

Talks from Capital One and Comcast served to bookend the track, opening and closing with interesting case studies of well-established enterprise practices for ML Ops. Niraj Tank and Sumit Daryani from Capital One opened the day with a detailed description of their “Model-as-a-Service” platform. It provides patterns for comprehensive workflows for both model building (data science teams) and model serving (production engineers). This creates a separation of concerns where those two kinds of teams can work independently, each with their own specific process and tooling, but still work closely together on the ML models for sophisticated features of model deployment and monitoring. It was also interesting to hear about risk management and oversight by model governing bodies (compliance) involved at several steps within this architecture.

Credit: Niraj Tank, Sumit Daryani @ Capital One

Pull quotes:

  • “The model is just one part of your whole business application. That’s traditionally how everyone approaches it – we’ll package it as one application. So the model becomes a part of that large-scale application. What that means is that the data scientist that has gone to school/university to study about modeling and data science, they’re not engineers. They don’t understand the models are a product … They have to rely on engineering teams … That makes it hard to go to production.”

Nick Pinckernell from Comcast closed the day with a detailed description of their platform, which handles both streaming and on-demand analytics. It is based on an architecture which integrates several key open source components, including Apache Kafka, Apache Spark, Kubernetes, Kubeflow, Redis, and Seldon Core. Requirements for the project included zero code refactoring or rewriting between research and production. Key metrics include model invocation times and multiple latency measures. Other highlights include taking care to manage the backpressure for streaming data, as well as accelerating feature aggregation through Redis. It’s a case study in system architecture best practices for machine learning platforms at scale.
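To give a feel for what “model invocation times and multiple latency measures” can look like in code, here’s a small Python sketch using only the standard library. It is not Comcast’s implementation: `timed_call` and `fake_model` are hypothetical names, and the choice of p50 and p99 as the latency measures is our assumption for illustration.

```python
import statistics
import time


def timed_call(fn, *args):
    """Return (result, elapsed_seconds) for a single invocation of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_model(x):
    # Hypothetical stand-in for a real model invocation.
    return 2 * x + 1


latencies = []
for i in range(200):
    _, elapsed = timed_call(fake_model, i)
    latencies.append(elapsed)

# statistics.quantiles with n=100 returns 99 cut points; index 49 is the
# median (p50) and index 98 is the 99th percentile (p99).
cuts = statistics.quantiles(latencies, n=100)
p50, p99 = cuts[49], cuts[98]
print(f"p50={p50 * 1e6:.1f}us p99={p99 * 1e6:.1f}us")
```

Reporting a tail percentile alongside the median matters because a streaming system can look healthy on average while a small share of slow invocations builds up backpressure.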

Credit: Nick Pinckernell @ Comcast

Overall, we lost track of how many presentations cited the “Hidden Technical Debt in Machine Learning Systems” paper by Sculley, et al., from NeurIPS 2015. That’s quite telling. It’s probably best to study that paper if you haven’t read it yet. One other keen insight, by Michal Jastrzębski from GitHub:

  • “CI/CD is generally an unsolved problem. If you want to write a great thing for machine learning … I think that’s a great project right there.”


Many thanks to all of our speakers who participated, for the insights and discussions they shared. Thanks also go out to the IBM Data Science Community for sponsoring and making this track possible, and to the kind staff from O’Reilly Media who helped so much along the way to produce this event.

We hope this introduction gives you a better understanding of the range of concepts, questions, and practices in ML Ops, and we look forward to continuing the dialogue here online and in other events in the future!

co-author: William Roberts