Global Data Science Forum

Deploying an Internal Browsable Knowledge Repository with Jupyter Notebooks

By Michael Mansour posted Wed August 26, 2020 05:38 PM




As an organization grows, how do we make sure that an insight uncovered by one person travels beyond its intended recipient? Airbnb addresses this with its knowledge repository project, an open-source framework for self-hosting a browsable collection of notebooks, code, and experiment write-ups that anyone inside the company can access. The tool offers:

  • Reproducibility: Everything to reproduce the results is contained in a single place, from DB queries to code.
  • Quality: Posts are submitted through Git pull requests, so they can be reviewed before they are merged (how strictly depends on your organization’s culture).
  • Consumability: Code can be hidden so non-technical colleagues can get to the results faster.
  • Discoverability: Structured metadata and tags make posts easy to search. It also comes with a topic-based feed, so users can follow new posts much as they would with RSS.
  • Learning: Scientists no longer have to reinvent the wheel -- it’s easy to borrow code from a past project to jump-start a new analysis.
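To make the Consumability point concrete, here is a minimal sketch of a "results only" view of a notebook. It is not how Airbnb's tool renders posts; it only relies on the standard Jupyter notebook JSON layout (a "cells" list whose entries have a "cell_type", a "source", and, for code cells, "outputs"). The function name reader_view and the sample notebook are made up for illustration.

```python
def reader_view(notebook):
    """Return markdown prose and printed output, leaving out the code itself."""
    blocks = []
    for cell in notebook.get("cells", []):
        if cell["cell_type"] == "markdown":
            # "source" may be a string or a list of strings; join handles both.
            blocks.append("".join(cell["source"]))
        elif cell["cell_type"] == "code":
            # Skip the code source, keep only text that was printed to a stream.
            for output in cell.get("outputs", []):
                if output.get("output_type") == "stream":
                    blocks.append("".join(output["text"]))
    return blocks


# Hypothetical two-cell notebook, already loaded from its .ipynb JSON.
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Churn analysis\n", "Summary of findings."]},
        {
            "cell_type": "code",
            "source": "print(df.churn.mean())",
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["0.12\n"]}],
        },
    ]
}

for block in reader_view(sample_notebook):
    print(block)
```

A real renderer would also handle charts and tables, which live in other output types; the sketch keeps only printed text to stay short.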

How does it work?

  • A user submits a write-up and Jupyter notebook as a post through a UI, and this is committed to GitHub
  • The post metadata is stored in a SQL database
  • Once its pull request is merged, the post is added to the repository
  • A Flask app running on Gunicorn serves the posts
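The metadata step above can be pictured with a small sketch. This is not the tool's actual schema: the table names, columns, and helper functions (add_post, find_by_tag) are hypothetical, and an in-memory SQLite database stands in for whatever SQL database a real deployment would use. It only illustrates how storing tags alongside posts makes them searchable.

```python
import sqlite3

# In-memory database so the sketch is self-contained (an assumption, not the tool's setup).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL,
        author TEXT NOT NULL,
        path   TEXT NOT NULL
    );
    CREATE TABLE tags (
        post_id INTEGER NOT NULL REFERENCES posts(id),
        tag     TEXT NOT NULL
    );
    """
)


def add_post(conn, title, author, path, tags):
    """Store one post's metadata and its tags; return the new post id."""
    cur = conn.execute(
        "INSERT INTO posts (title, author, path) VALUES (?, ?, ?)",
        (title, author, path),
    )
    post_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO tags (post_id, tag) VALUES (?, ?)",
        [(post_id, tag) for tag in tags],
    )
    return post_id


def find_by_tag(conn, tag):
    """Return the titles of all posts carrying the given tag, oldest first."""
    rows = conn.execute(
        "SELECT p.title FROM posts p JOIN tags t ON t.post_id = p.id "
        "WHERE t.tag = ? ORDER BY p.id",
        (tag,),
    )
    return [row[0] for row in rows]


# Hypothetical posts, for illustration only.
add_post(conn, "Churn drivers", "alice", "projects/churn.ipynb", ["churn", "retention"])
add_post(conn, "Pricing test", "bob", "projects/pricing.ipynb", ["pricing"])
add_post(conn, "Churn model v2", "alice", "projects/churn_v2.ipynb", ["churn"])

print(find_by_tag(conn, "churn"))
```

Keeping tags in their own table means a post can carry any number of them, and a topic feed becomes a query over the same table.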

The tool has fairly extensive documentation covering both usage and deployment, which makes it approachable and explains how it works under the hood.

My Thoughts

The lack of efficient, shared knowledge transfer across team silos within a company likely contributes to a non-negligible amount of wasted time and resources. Sure, tools like Confluence or wikis exist, but they don’t offer a common platform for hosting data-science-specific experiments, code, and results. At a former company, we mused about homebrewing something similar so that the outputs of the R&D division could reach engineering stakeholders in other silos. Beyond R&D, any data science group could demonstrate its value internally by letting other teams discover and easily implement its solutions.

The highlight of this tool is that it lets you share results with multiple audience types, from non-technical users to other engineers. To date, there does not appear to be even a good paid solution for this problem, outside of perhaps cloud-hosted notebooks with Domino Data Lab -- and even that doesn’t appear to have the same features. Unlike with paid solutions, extracting and transferring your repo should be straightforward with Airbnb’s open-source framework, since the posts live in a Git repository.

If you plan to implement this tool internally, I’d love to hear about your use cases and the benefits you find. It’s also worth sharing non-traditional use cases to help inspire others!