Global Data Science Forum

Andreessen Horowitz’s Tips for ML Success in Production in Startup Contexts

By Michael Mansour posted Wed August 26, 2020 06:21 PM



As more startups build AI-centric products, a common pattern is emerging in the problems they face: the underlying “economies of [AI] data”. The phrase draws a parallel to economies of scale, but the real relationship is inverse. Much of the problem space that startups are trying to solve lies in a long tail of available data. As more data is collected and edge cases multiply, the marginal returns decrease at an exponential rate; tweaking and tuning in this regime eats up most of a data scientist’s time. That long-tail data is also hard to collect and maintain, yet it may represent critical failure modes of a product.
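One way to picture the long tail is a toy simulation. The sketch below assumes, purely for illustration, that input frequencies follow a Zipf-like distribution; that distribution, the function names, and the numbers are assumptions of this sketch, not figures from the original article. It shows how much traffic the most common inputs cover, and how little each additional rare input adds.

```python
def zipf_weights(n_items, exponent=1.0):
    """Normalized Zipf-like frequencies for items ranked 1..n_items (an assumed toy model)."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def coverage(weights, k):
    """Share of total traffic covered by the k most frequent items."""
    return sum(weights[:k])


# Hypothetical product with 10,000 distinct kinds of input.
weights = zipf_weights(10_000)

for k in (10, 100, 1_000, 10_000):
    print(f"top {k:>6} inputs cover {coverage(weights, k):.1%} of traffic")
```

Under this assumed distribution, each step from 10 to 100 to 1,000 inputs costs ten times more coverage work while adding a smaller share of traffic per input, which is the inverse-of-scale effect described above.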

There’s no easy way to directly solve a dataset issue in a complex problem space. However, Andreessen Horowitz offers some tricks for reformulating the problem to shorten the data long tail:

  • Componentize the problem: Instead of trying to solve one global problem, break it into components so that each model tackles a single slice of the data.  Deep domain expertise helps guide where those delineations should be made.

  • Build around the long tail: Consider reducing the number of acceptable user inputs with something as simple as an auto-complete feature.  This truncates the tail.

  • Build an edge-case engine: Focus on gathering samples of data from the long tail in a repeatable fashion.  It’s expensive, but it pays off and may enhance an active-learning solution.
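As a rough illustration of the last two tips, the sketch below snaps free-text input onto a small set of supported queries (an auto-complete stand-in) and records anything it cannot match so those samples can be reviewed later. The query list, the `edge_case_queue` name, and the 0.6 similarity cutoff are all hypothetical choices for this example; the original article does not prescribe an implementation.

```python
import difflib

# Hypothetical set of inputs the product chooses to support.
SUPPORTED_QUERIES = [
    "reset password",
    "update billing address",
    "cancel subscription",
    "track order",
]

# Hypothetical store of unmatched inputs, collected for later labeling.
edge_case_queue = []


def suggest(user_text, n=3, cutoff=0.6):
    """Return up to n supported queries that closely resemble the user's text."""
    normalized = user_text.lower().strip()
    return difflib.get_close_matches(normalized, SUPPORTED_QUERIES, n=n, cutoff=cutoff)


def handle(user_text):
    """Map input to a supported query, or log it as an edge case and return None."""
    suggestions = suggest(user_text)
    if not suggestions:
        # Nothing close enough: keep the raw sample from the long tail.
        edge_case_queue.append(user_text)
        return None
    return suggestions[0]
```

Calling `handle("reset pasword")` resolves the typo to a supported query, while an unrecognized input returns `None` and lands in `edge_case_queue`. In a real system the queue would feed a labeling or active-learning loop rather than an in-memory list.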