This article introduces a landscape diagram which shows 50 or so of the most popular Python libraries and frameworks used in data science. Landscape diagrams illustrate components within a technology stack alongside their complementary technologies. In other words, “How do the parts fit together?”
For example, where do some of the IBM-sponsored projects fit into a broader context of the open source Python ecosystem for data science work? Jupyter Enterprise Gateway is a good example, bridging a gap between Project Jupyter and Apache Spark, and allowing Jupyter notebooks to run atop enterprise-grade cluster computing.
Landscape diagrams provide useful learning materials, helping people conceptualize and discuss complex technology topics. Of course it’s important to keep this diagram curated and updated as the Python ecosystem evolves. We’ll do that.
One caveat: trying to fit lots of complex, interconnected parts into a neatly formatted 2D grid is a challenge. Any diagram must “blur the lines” of definitions to simplify the illustration, and those definitions could be debated at length. On the one hand, the diagram is not an exhaustive list: we chose popular libraries within widely used categories, but had to skip some. For example, we didn’t go into the varied universe of audio processing libraries, which get used in speech-to-text recognition work. On the other hand, let’s talk about that! Let us know about any updates that you would suggest. We’ll start a forum discussion here.
Why Python? At this point, Python has become the lingua franca of data science work. The language is relatively simple to learn, and there are amazing open source libraries for just about any need you can imagine. If not, it’s quick to create new libraries as needed. Python also gets lots of use in production for web apps, operations, etc., so it’s good for integrating data science projects into larger application contexts. While other languages such as R, Java, and Scala also work well for data science – let’s focus on Python for now, otherwise that landscape diagram will get really crowded.
Let’s also talk about versions. There were significant changes between the 2.x and 3.x versions of the language. If you’re just getting started in Python programming, I recommend starting with Python 3.x unless your organization still requires Python 2.7 – some do, for reasons that are varied and complex. However, if you have an opportunity to start with the latest stable version (currently Python 3.7.1), that’s the best approach.
If you are just getting started with Python coding, here are two popular, highly recommended resources:
You may also hear Python developers talk about something called “PEP8”. That’s the current style guide for Python coding, which has excellent advice plus lots of useful examples. The documentation in general provides many coding examples.
Guido van Rossum, the original author of Python, noted that “code is read much more often than it is written,” so readability became an essential aspect of the language. Allen Downey (the author cited above) found that Python’s readability gives it unique properties. For example, if you’re familiar with pseudocode used to describe algorithms in academic papers, Downey has shown how Python implementations can be more concise than equivalent pseudocode. Using concise, readable code comes in handy for data science teams, where people must understand each other’s code and be able to reproduce results.
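For a small taste of that readability (my own example, not one of Downey’s), here is Euclid’s algorithm for the greatest common divisor. The Python reads about as tersely as the pseudocode you’d find in an algorithms paper:

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor via Euclid's algorithm."""
    while b:
        a, b = b, a % b
    return a

print(gcd(1071, 462))  # 21
```

Three lines of logic, no boilerplate – that conciseness is a big part of why data science teams can read and reproduce each other’s work in Python.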
The layer at the base of the landscape diagram is labeled package management. In other words, how to install and update the Python libraries you’ll need.
Python has two recommended options for that:
- pip: a general-purpose manager for Python packages, the “official” one, which uses the PyPI package index
- conda: a cross-platform environment manager, which is language-agnostic, and uses the Anaconda distribution
Note that pip and conda are quite different tools, intended for very different purposes. Python package management just happens to be the one thing they have in common – and they are mostly equivalent for that purpose. A few years ago, I might have said “Conda is more popular among academic researchers and Pip is more popular for production work”, but that’s no longer quite the case. If your organization uses one or the other, then it’s best to follow that practice; trying to mix them can lead to troubles. See also an excellent discussion about these two tools in “Conda: Myths and Misconceptions” by Jake VanderPlas.
Having chosen a package manager, next it’s highly recommended to use virtual environments before you begin installing packages. In other words, create an environment which has its own installation directories, to avoid modifying the Python libraries needed in other environments. Depending on your projects, you may end up having several virtual environments active at the same time.
If you’re working with conda, it has features to set up virtual environments. With pip, there are multiple options. For all the details, see “Improved Workflows With Isolated Jupyter Environments” by Colin Carroll – especially the history on slides 11–14. I recommend using virtualenv, which works well and appears in many of the code examples online.
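As a side note, Python 3 ships with a `venv` module in the standard library that does much the same job as virtualenv, and you can even drive it programmatically. A minimal sketch (the directory name here is just an example):

```python
# Create an isolated environment programmatically -- the same thing
# `python -m venv demo-env` does from the command line.
import tempfile
import venv
from pathlib import Path

env_dir = Path(tempfile.mkdtemp()) / "demo-env"
venv.create(env_dir, with_pip=False)  # with_pip=False keeps this fast

# The new environment gets its own config file and directories,
# so packages installed there won't touch other environments.
print((env_dir / "pyvenv.cfg").exists())  # True
```

From the shell you would then activate the environment before installing packages, which keeps each project’s dependencies separate.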
Currently on my laptop I’m using pip and virtualenv, switching among four virtual environments for various Python projects. Most of my coding examples will show how to use those, but those examples can be translated into conda as well.
Speaking of managing libraries, check out libraries.io which automatically keeps track of the packages on which your code repositories depend, even across many different package managers or languages.
One layer up from the bottom, let’s consider how to run Python applications. Code needs to run somewhere. Your data science work probably has requirements for security, data privacy, resource management, monitoring, and so on. In enterprise environments, running code within those compliance requirements can be challenging.
Of course one simple way to run Python programs is from the command line. That may be fine for a personal dev/test loop; however, when you need to collaborate and when your code needs to run in production, there are other options. Those options can also help with parallelizing workloads, since some use cases may require multiple servers for scale and speed.
Apache Spark became one of the most popular frameworks for data science workflows, and it’s easy to use PySpark to run Python code. You can do that in standalone mode for Spark, i.e., from a command line on your laptop, or on a Spark cluster which can help with parallelizing workloads.
Project Jupyter has also become quite popular for managing data science workflows, and Python is the “py” part of the name “Jupyter”. Peter Parente tracks an estimate for the number of public Jupyter notebooks on GitHub, which recently passed 3 million. Jupyter provides ways to edit code, visualize results, and edit documentation – all within the same document. Colleagues can re-run your notebook to repeat your analysis, or they might adapt it to use other data sources, different parameter settings, etc. JupyterLab is now the recommended way to edit and run notebooks.
While not strictly speaking a Python library, if you want to let many (thousands?) of people run the same notebooks across your organization, JupyterHub provides ways to spawn and manage many concurrent instances.
Jupyter Enterprise Gateway provides enterprise-grade resource management for workflows built with Jupyter. Data science teams can leverage distributed resources and dramatically expand the maximum number of kernels running in parallel. With added enterprise-level security, any team that uses JupyterLab in its workflows is a potential user. It’s built to take advantage of Big Data tooling, distributed resources, and enterprise security, while retaining all of the UX and integration capabilities of JupyterLab.
Distributed frameworks such as Spark or JupyterHub tend to require systems engineering overhead. Spark uses Scala and sometimes requires sophisticated JVM tuning to work around performance bottlenecks. Moreover, PySpark requires use of special data collections to achieve scale and parallelism. Often that doesn’t translate directly to the popular Python libraries used for analysis and modeling – which means you may need to rewrite code to make it run efficiently in Spark.
Other options require less overhead, still run Python code at scale, and require even less code rewriting. Dask is a popular library for parallelized workloads in Python that can scale out to HPC supercomputers – or simply run on your laptop. It follows the same idioms and data structures as popular Python packages for analysis and modeling, such as NumPy and scikit-learn. PyWren provides similar ways to scale out Python code on serverless cloud APIs, without needing to rewrite data structures or manage a cluster, and it’s proven valuable for ad-hoc queries where building a full pipeline may be overkill.
Ray is another distributed framework in Python, from UC Berkeley’s RISElab, which is used for multi-agent reinforcement learning. Frankly, there’s much more to this framework: think of it almost as “next-generation Spark”, from the same lab five years later – this time around with advanced machine learning at its core. Large production use cases for RL are beginning to surface in industry; stay tuned, you’ll hear more about this soon.
Sometimes data science projects need to run as microservices, or provide OpenAPI integration, or for other reasons get built into web apps. Clearly there’s been lots of focus on using a combination of Docker and Kubernetes for managing microservice architecture based on containers. While it’s possible to expose a simple service with Jupyter using the Jupyter Kernel Gateway, you may need to move up to other web frameworks to implement services at scale and for other performance or compliance concerns in enterprise (see more about this in the next section). Flask is a popular “microframework” in Python, and it’s easy to use. Gunicorn is a WSGI-compliant HTTP server which is quick (3 lines of Python) to integrate with Flask. Translated: Gunicorn is a great way to plug Flask web apps directly into high-performance, highly-secure web server frameworks such as Nginx. That can be a great way to fit your data science work into your organization devops practices (e.g., security measures, load balancing, edge cache, etc.) even for sophisticated environments.
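The reason Gunicorn and Flask snap together so easily is that they both speak WSGI: a Flask app ultimately exposes a WSGI callable, and Gunicorn serves any such callable. Here’s a hand-rolled, stdlib-only sketch of that interface (the `app` and `call` names are mine, just for illustration):

```python
# A WSGI application is just a callable taking (environ, start_response).
# Flask apps expose exactly this interface under the hood, which is why
# Gunicorn (or any WSGI server) can serve them without special glue.
def app(environ, start_response):
    body = b'{"status": "ok"}'
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# Exercise the app without a server by faking the WSGI handshake:
def call(app, path="/"):
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    body = b"".join(app(environ, start_response))
    return captured["status"], body

print(call(app))  # ('200 OK', b'{"status": "ok"}')
```

In practice you’d write the handler with Flask’s decorators rather than raw WSGI, and point Gunicorn at the Flask app object; the sketch just shows the contract between them.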
Moving up the stack, sometimes you need to integrate with networked resources. Perhaps you’re providing data, or consuming data, or sharing data among other apps?
PyArrow is a Python binding for Apache Arrow. It’s relatively new, and provides a “cross-language development platform for in-memory data … organized for efficient analytic operations on modern hardware.” For example, you might have a bunch of data collected in Node.js apps, with analysis running in Spark Streaming, plus other reporting in Python … and for IoT applications, that’s a plausible scenario. PyArrow lets you share data among different technology stacks, with zero-copy; the applications share each other’s memory directly, which is super fast and efficient. It’s already becoming integrated into Spark.
Scrapy is a popular Python framework for web crawling and web scraping. For example, you may need to crawl millions of web pages and collect data from them, to create a custom search engine. Requests provides a popular API for making web requests – as a more general case than Scrapy – and it’s quite easy to use. Highly recommended if you need to call APIs to get data, run ML models, etc. The project’s tagline speaks volumes:
Requests is the only Non-GMO HTTP library for Python, safe for human consumption.
Flasgger runs in the opposite direction. When you need to publish an API for your data science work, use this as a simple way to publish OpenAPI specs and Swagger tools for web apps running on Flask. Translated: create a self-documenting API from your Python code, where people can test their integrations in a browser. Here’s a simple example: experiment with the /api/v1/info endpoint at https://derwen.ai/apidocs/ on my website.
Istio provides a way to deploy internal microservices with enterprise-grade management for scale, security, monitoring, etc. Translated: data science teams can create apps which devops teams love to run.
Note that the combo of Flask + Gunicorn + Flasgger + Istio is great for turning Python data science apps into enterprise-grade microservices deployed at scale. Translated: moving ML models into production rapidly with very little extra code required, while keeping the operations teams happy.
Step one in data science: get your data prepared. Step two in data science: go back and spend more time getting your data prepared. Face it, you spend lots of time preparing data, accessing data, loading data, etc. We could write volumes about this layer, where oh so many Big Data tools and data science platforms tend to focus. Assuming that tools like Spark help with ETL and loading data from many formats already … let’s consider a few of the other popular data access methods/frameworks that you may need to code yourself.
SQLAlchemy is a popular “swiss army knife” for accessing SQL databases from Python. It supports a wide range of database platforms and their features. It’s built to be DBA-friendly and includes a fully functional ORM (object-relational mapper) – much like Hibernate in the J2EE world.
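A minimal sketch of SQLAlchemy in action, run against an in-memory SQLite database so there’s nothing to install or configure (the table and data here are made up for illustration):

```python
# Connect, create a table, insert rows, and query -- the same code
# would work against Postgres, MySQL, etc. by changing the URL.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # in-memory SQLite

with engine.connect() as conn:
    conn.execute(text("CREATE TABLE users (id INTEGER, name TEXT)"))
    conn.execute(text("INSERT INTO users VALUES (1, 'Ada'), (2, 'Grace')"))
    rows = conn.execute(text("SELECT name FROM users ORDER BY id")).fetchall()

print([r[0] for r in rows])  # ['Ada', 'Grace']
```

For larger projects you’d typically define mapped classes and let the ORM generate the SQL, but raw `text()` queries like these are handy for quick data access work.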
Pillow is the friendly fork of the Python Imaging Library (PIL), providing image processing in Python. Image and video data are currently among the most popular sources of data for deep learning. Here’s a popular way to read image files.
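A tiny Pillow sketch, generating an image in memory rather than reading one from disk so it runs anywhere (the size and color are arbitrary):

```python
# Create a solid-color image, then downscale it -- the same
# Image API you'd use after Image.open("photo.jpg").
from PIL import Image

img = Image.new("RGB", (64, 64), color=(255, 0, 0))
thumb = img.resize((16, 16))

print(thumb.size)  # (16, 16)
```

In a real pipeline you’d call `Image.open()` on your files and convert the result to a NumPy array for modeling.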
BeautifulSoup is one of the most popular packages for reading HTML and XML documents. It’s especially forgiving about bad HTML formatting, which basically describes most of the Internet. BeautifulSoup also converts documents into Unicode automagically, and for that it nearly deserves a Nobel Prize. Once you’ve scraped millions of web pages for your NLP project, use BeautifulSoup to convert the HTML markup into text data which NLP libraries can parse.
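A quick BeautifulSoup sketch, deliberately fed sloppy HTML with unclosed tags (the markup here is made up) to show how forgiving the parser is:

```python
# Parse messy HTML, then extract the visible text and the links.
from bs4 import BeautifulSoup

html = "<html><body><p>Hello, <b>world</b><p>Second paragraph <a href='/x'>link</a>"
soup = BeautifulSoup(html, "html.parser")

print(soup.get_text())                          # all visible text, tags stripped
print([a["href"] for a in soup.find_all("a")])  # ['/x']
```

That `get_text()` call is exactly the step you’d run over scraped pages before handing the text to an NLP library.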
Moving up the stack to the data representation layer — after you’ve loaded data into your data science workflows, it needs to “live” somewhere, preferably in efficient data structures. Roughly speaking, this layer is where feature engineering typically happens. Depending on the use case and types of data involved, there are several popular options.
Pandas is probably the most thoroughly data science-y Python package in existence. If our landscape diagram showed only one rectangle, that rectangle would read “Pandas”. Python for Data Analysis by Wes McKinney gives all the details. Once you’ve accessed your data – via SQL queries, reading image files, scraping HTML pages, etc. – then slice and dice that data in Pandas prior to the next stages of your workflow, i.e., for visualization, reporting, feature engineering, modeling, evaluation, etc.
Right alongside the Pandas library, NumPy is “the fundamental package for scientific computing with Python.” In other words, NumPy is the workhorse for Python data. Note that in data science and machine learning, so much of what we do is work with large arrays and matrices, essentially running lots of linear algebra, over and over. NumPy provides highly optimized data structures for lots of linear algebra.
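As a small example of that linear-algebra workhorse role, here’s NumPy solving a 2×2 linear system without any explicit loops – the same operation data science code runs at much larger scale:

```python
# Solve A x = b for x using NumPy's optimized linear algebra routines.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
print(x)  # [2. 3.]
```

Under the hood this dispatches to compiled LAPACK code, which is why NumPy stays fast even when the arrays get large.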
There are times when Big Data gets too big for its own good. For example, if you need to count billions of items so that you can divide one really enormous number by another really enormous number, that can create unnecessary processing bottlenecks. If your end result is to calculate ratios and you only need a few significant digits, say within a 95% confidence interval, then odds are good that you’re performing at least two orders of magnitude more compute than needed. Approximate instead. As they said at Twitter – where probabilistic data structures became a significant advantage in Big Data work – “Hash, don’t sample.” The datasketch library provides some of the better implementations of probabilistic data structures in Python. For examples, check out my tutorial of the same name. As an alternative or complement to NumPy, this becomes extremely useful for NLP applications in particular and for feature engineering in general.
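To show the idea behind those structures, here’s a toy MinHash in pure Python – a sketch of the concept only, not the datasketch API, which is far better optimized:

```python
# Estimate Jaccard similarity of two sets from small fixed-size
# signatures, instead of comparing the full sets -- the MinHash idea.
import hashlib

def minhash_signature(items, num_hashes=128):
    """One minimum per seeded hash function."""
    return [
        min(int(hashlib.sha1(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of hash slots where the minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = set("abcdefgh")
b = set("abcdefxy")
est = estimated_jaccard(minhash_signature(a), minhash_signature(b))
# true Jaccard = |a & b| / |a | b| = 6 / 10 = 0.6; est lands near that
```

Signatures are tiny and mergeable, so you can compare millions of sets cheaply – exactly the trade of a few significant digits for orders of magnitude less compute.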
Modin is Pandas on Ray. In other words, scale-out a Pandas workflow by changing a single line of code. Again, this is basically next-gen Spark five years later – plus Ray is more idiomatic for Python libraries for data science.
Moving over to the NLP region of the diagram, I cannot say enough good things about spaCy. If you need to do natural language work in Python – for example, text analytics – use spaCy. It’s the most advanced, fastest, most popular NLP package in Python, and it plays well with others. Note that spaCy is an “opinionated API”, which is to say that its authors included what is needed, but not the entire kitchen sink.
NLTK comes from the previous generation of natural language libraries for Python. In contrast to spaCy, generally NLTK is (a) slower, (b) less advanced, and (c) includes not only the kitchen sink, but many kitchen sinks stacked inside other kitchen sinks. Even so, you’ll encounter lots of code that still uses NLTK.
RDFLib is a Python library for working with RDF, OWL, and other semantic web formats. Translated: how to read and write from knowledge graphs. Note that RDF comes from an earlier generation of AI work, more than a decade ago. Techniques may have evolved, but the data formats remain somewhat standardized. Knowledge graph use cases are trending, since they provide good ways to add some of the context which deep learning approaches tend to miss.
Analysis and Modeling
Moving up another layer, since we’ve prepared our data, performed feature engineering, and transformed the data, now it’s ready to train ML models or run through other kinds of analysis.
SciPy is the “fundamental library for scientific computing.” As such, it straddles two layers in our landscape diagram: on the one hand SciPy provides numerical analysis, advanced linear algebra, plus a whole range of Python-atop-FORTRAN code for scientific computing … on the other hand SciPy includes functions for reading special formats, such as image files. In general, this library pairs well with NumPy.
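A quick sketch of that NumPy/SciPy pairing – numerically minimizing a simple function, one of the many numerical-analysis routines SciPy wraps (the function here is made up for illustration):

```python
# Minimize f(x) = (x - 3)^2 + 1; the true minimum is at x = 3, f = 1.
from scipy import optimize

result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)
print(round(result.x, 3))  # 3.0
```

The same `optimize` module covers root finding, curve fitting, and constrained optimization – all driven by decades-old, battle-tested FORTRAN underneath.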
The scikit-learn family of algorithms is arguably the most popular machine learning library for Python. It fits well sandwiched between Pandas + NumPy + SciPy below and Matplotlib above. Frankly, I use scikit-learn more than any other ML library. For an excellent (and very popular) guide, see Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.
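Nearly everything in scikit-learn follows the same fit/predict pattern; here’s a minimal sketch on toy, clearly separable data (the numbers are made up for illustration):

```python
# The canonical scikit-learn workflow: construct, fit, predict.
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]  # one feature
y = [0, 0, 0, 1, 1, 1]                             # two classes

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[1.5], [10.5]]))  # [0 1]
```

Swap `LogisticRegression` for a random forest or an SVM and the rest of the code stays the same – that uniform API is a big part of scikit-learn’s popularity.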
StatsModels is a general purpose statistics package in Python, used alongside NumPy. Results get tested against existing statistical packages to ensure correctness. I especially enjoy the “Pitfalls” and “Examples” sections in their documentation.
Deep learning has become so widespread that Python hosts several of the most popular frameworks:
- TensorFlow from Google, which is the most highly popularized approach
- Theano, which stopped development last year but is still widely used
- Keras provides an easy-to-use abstraction layer atop TensorFlow and Theano
- PyTorch from Facebook, which has been gaining share steadily over TensorFlow
AllenNLP provides deep learning for NLP, built on top of PyTorch. Hint: this research project competes with research based on TensorFlow, and sometimes they publish papers within weeks of each other to “one up” benchmark results.
Rasa has gained a large following as one of the more popular NLU (natural language understanding) libraries in Python. It’s particularly good for classifying intents – in other words, building chatbots and voice apps. You can build pipelines atop spaCy, scikit-learn, TensorFlow, and other base NLP+ML technologies, depending on the use case.
Gensim is another popular library for topic modeling, vector embedding, and related text mining algorithms in Python.
Moving over to the graph region of the diagram, NetworkX gets my vote for “most under-appreciated, poised to become huge” library in Python data science work. It provides a package for creating, manipulating, and analyzing graphs in memory. Frankly, graph databases tend to get in the way of serious graph algorithm work, especially for large-scale knowledge graph work. NetworkX allows you to work with large graphs in memory, customizing graph algorithms and analysis for your use case, generally in ways that are much faster and more flexible than graph database frameworks allow. If you need to create a knowledge graph, look toward NetworkX as an excellent tool.
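Here’s a tiny NetworkX sketch – build a graph in memory from edge pairs and run an algorithm over it, no database required (the node labels are just for fun):

```python
# Build an undirected graph and find a shortest path between nodes.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("spaCy", "NLP"), ("NLTK", "NLP"),
    ("NLP", "Python"), ("NetworkX", "Python"),
])

path = nx.shortest_path(G, "spaCy", "NetworkX")
print(path)  # ['spaCy', 'NLP', 'Python', 'NetworkX']
```

The same few lines scale to centrality measures, community detection, PageRank, and the other algorithms a knowledge graph project typically needs.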
PyMC3 provides a popular Python package for Bayesian statistical modeling, probabilistic programming, advanced machine learning algorithms, and much more. If you need to run MCMC, you’re probably well-acquainted with PyMC3 already.
Airflow is a Python framework – originally from Airbnb – for building, running, and monitoring workflows. It’s interesting to see how Airflow, AllenNLP, TensorFlow, Rasa, etc., are beginning to define another emerging layer for orchestrating pipelines and workflows.
Moving up to the visualization layer: got data? (check), got features? (check), got models? (check), and now you need to view results, to evaluate your analysis and modeling work. See the book Grammar of Graphics by Leland Wilkinson for details of the theory underlying some of these packages.
The landscape diagram lists six of the most popular general-purpose visualization libraries in Python, because this part of the tech stack is especially crucial in data science:
- Matplotlib, arguably the most widely used although sometimes a bit difficult to understand for the “uninitiated”
- Seaborn, an abstraction layer based on Matplotlib which produces beautiful graphics, easier to use
- Altair, which is a declarative statistical visualization library, i.e., more concise and simpler to understand
- Bokeh, built for interactive data visualization on web pages (e.g., Jupyter)
- plotnine, also declarative and a Python implementation of ggplot2
- Plotly, an online editor for interactive D3 charts
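To give a feel for the base layer most of these build on, here’s a minimal Matplotlib sketch that renders a chart off-screen and writes it to a PNG in memory (the data is made up):

```python
# Render a simple line chart headlessly and save it as PNG bytes.
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")

buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)

print(buf.getvalue()[:8])  # the PNG file signature
```

In a Jupyter notebook you’d skip the buffer and the figure would render inline; Seaborn, plotnine, and friends produce these same figure objects underneath.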
What if your data visualization needs to be represented as a map? In other words, what if you’re working with geospatial data? Cartopy builds atop Matplotlib to create base map layers (shapes, etc.) over which you can plot other analytics and visualization layers. GeoPandas extends Pandas to work efficiently with geospatial data, using Shapely for its geometric operations and plotting support. Rasterio renders raster data, such as satellite images.
Back over in the graph region of the diagram, Pydot is a Python library adapting GraphViz. Pydot pairs nicely with NetworkX. This comes in especially handy in NLP use cases and knowledge graph work.
Explainability, Fairness, Bias, Ethics
In the top layer of the landscape diagram we show a set of Python packages which address explainability of machine learning models, and also address issues of fairness, bias, and ethics in data science.
AIF360 is more formally known as the “AI Fairness 360 toolkit” from IBM. This detects unwanted biases entering a machine learning pipeline and helps to mitigate those biases.
Skater provides a Python package for ML model interpretation. It builds atop multiple strategies, including LIME, Bayesian rule lists, and deep learning model interpreters.
The deon package is an ethics checklist for data scientists. It’s built to integrate simply into Git code repos, and become used as part of a data science team’s engineering process.
Aequitas is a bias audit tool for risk assessment of ML model use cases.
Okay, that’s been a whirlwind tour of data science in Python. While many data science projects may use only a handful of these libraries, hopefully some of these introduce new features and techniques which help enhance your practice.
Again, this is intended as a basis for discussion, and we’ll start a forum here. This landscape diagram is a proverbial “Version 1.0” – so in particular, let’s discuss how you’d suggest improving it.