Global Data Science Forum

Taxonomy of Failures in Deep Learning Systems

By Michael Mansour posted Tue December 03, 2019 03:12 PM


Before this paper, no in-depth analysis of failures in systems or projects that implement deep learning with PyTorch, TensorFlow, and Keras had been published. Previous studies only characterized problems in the DL frameworks themselves. The authors offer an approach for categorizing the many types of failures. While they don’t offer guidance on avoiding or solving these failures, the outcomes of the study should interest anyone working with deep learning, since they help identify where trouble spots might appear.

Undertaking a rigorous meta-level analysis of data from nearly 500 Stack Overflow Q&As, 600 GitHub issues and PRs from popular projects, and interviews with 20 practitioners, the authors generalize the issues into 5 main areas, each of which may branch into several subcategories.

Main Fault Taxonomy Categories

  • Model
    • Examples of “model” faults include choosing the wrong type of DNN or making poor architectural decisions, such as using too many layers. These faults mainly impact performance.
  • Inputs
    • Input shape, such as the dimensions or an accidentally transposed matrix, is easy to get wrong when crafting a new architecture. Data validation between functions in a pipeline also appears to be problematic: incorrect data types cause bugs, and some issues are silent, such as reverse-ordered data being fed to the model. These faults affect the actual operation of the system.
  • Training
    • This is the largest category in the taxonomy and comprises mostly critical failures. It includes training data quality and pre-processing, hyperparameter tuning, and testing/validation.
  • GPU Usage
    • GPUs account for a number of highly specific issues in the authors’ dataset. Problems tend to appear at inconvenient times, after a significant amount of computation has already been spent, and are further compounded by parallelism and by attempts to share the device among subprocesses.
  • API Usage
    • Using a framework’s API incorrectly. This is caused by misunderstanding the underlying functionality of an endpoint, or possibly by a change to that endpoint in a newer release of the framework.
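
To make the “Inputs” category concrete, here is a minimal sketch of the kind of shape and dtype check that could sit between pipeline stages. It is not taken from the paper: `InputSpec`, `InputSpecError`, and `FakeArray` are hypothetical names invented for this illustration, and the check only assumes that a batch exposes `.shape` and `.dtype` attributes, as NumPy arrays and PyTorch or TensorFlow tensors do.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


class InputSpecError(ValueError):
    """Raised when a batch does not match the expected specification."""


@dataclass(frozen=True)
class InputSpec:
    # Hypothetical helper, not part of any framework.
    # Use None for a dimension that may vary (for example, batch size).
    shape: Tuple[Optional[int], ...]
    dtype: Any = None  # compared with !=; None skips the dtype check

    def check(self, batch: Any, name: str = "input") -> None:
        actual = tuple(batch.shape)
        if len(actual) != len(self.shape):
            raise InputSpecError(
                f"{name}: expected {len(self.shape)} dimensions, got shape {actual}"
            )
        for axis, (want, got) in enumerate(zip(self.shape, actual)):
            if want is not None and want != got:
                raise InputSpecError(
                    f"{name}: axis {axis} should be {want}, got {got} (shape {actual})"
                )
        if self.dtype is not None and batch.dtype != self.dtype:
            raise InputSpecError(
                f"{name}: expected dtype {self.dtype}, got {batch.dtype}"
            )


@dataclass
class FakeArray:
    """Stand-in for a real tensor so the sketch runs without any framework."""
    shape: Tuple[int, ...]
    dtype: str


spec = InputSpec(shape=(None, 28, 28), dtype="float32")
spec.check(FakeArray(shape=(64, 28, 28), dtype="float32"))  # passes silently

try:
    # Batch axis accidentally moved to the end.
    spec.check(FakeArray(shape=(28, 28, 64), dtype="float32"))
except InputSpecError as err:
    print(err)
```

A check like this only catches mismatches that are visible in the shape or dtype. It would not notice a transposition between two axes of equal size, nor the silent reverse-ordered data mentioned above, which need checks on the values themselves.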

The fault categories are also rated by how detrimental they are and by the effort required to rectify them.



Some of the exemplar comments on each of these issues are quite interesting. The authors notice a large divide, along expertise lines, in the types of issues they see. One class of issues arises from those new to the field, who might not know best practices or might lack intuition into why a NN behaves the way it does. The second class stems from experienced practitioners who may be attempting a complicated new architecture or facing a deployment issue. This suggests that best practices and guidance should follow a gradation of expertise in order to help a larger number of people succeed.

“most engineers in the industry have more experience in implementing, in tracing the code”, so their problems are not like “how should I stack these layers to make a valid model, but most questions on SO are like model building or why does it diverge kind of questions”. Another argument was that the questions on SO are mostly “sort of beginner questions of people that don’t really understand the documentation” and that they are asked by people who “are extremely new to linear algebra and to neural networks”.

From their interviews with professionals, the authors report a general dissatisfaction with DL framework documentation. Poor documentation lets faults happen more easily, and the finding also suggests that few resources have been written on building DL systems.