How about a tour through some useful resources for learning data science? Recently a few friends who teach data science and adjoining fields were asking online about resources for their new courses. I took notes since these kinds of resources are great for both people just starting to learn data science and also for people on data teams – to enhance your projects.
One thread that caught my attention was where Allen Downey asked on Twitter about recommendations for data visualization. Allen is an engineering professor at Olin College, a noted thought leader for how to use Jupyter in education, and a highly recommended author. His books such as Think Python, Think Stats, Think Bayes, etc., are required reading in most of the courses I teach. The results of that Twitter quest? Lots of suggestions, which were summarized in the blog post, “The Library of Data Visualization” last month.
Let’s take a tour through that summary. First up, Claus Wilke has been tweeting for nearly a year about sections of his upcoming Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures, and I’ve really been looking forward to it. A preview of the book is available online, but the full text is also available! It provides excellent examples and goes through the mechanics of how to visualize data. I especially appreciate how it conveys an experienced, confident voice about what to do and what not to do, regardless of how many interesting features our favorite Python libraries may provide.
Another good book recommendation is Kieran Healy’s Data Visualization: A practical introduction. It is excellent material, with in-depth discussions about the “how” and “why” of data visualization. It expands on work by Edward Tufte: if you’re interested in data visualization and haven’t read Tufte, then drop what you’re doing and go sign up for one of his courses. It’s the best $380 I’ve ever spent on training. Especially when hand-written examples of data viz by da Vinci begin to circulate through the class! The cost of the course includes $150 worth of quite useful books.
While we’re talking about the foundations of the field, two other core texts in my data visualization canon are The Elements of Graphing Data by William Cleveland and The Grammar of Graphics by Leland Wilkinson. So much of what you’ll find in the data viz libraries you use is based on theoretical perspectives which originate from those two books.
Other recommendations on Allen’s list include a compilation called Beautiful Visualization: Looking at Data through the Eyes of Experts edited by Julie Steele and Noah Iliinsky, and Interactive Data Visualization for the Web by a good friend and colleague Scott Murray. Scott’s book is particularly helpful for when you want stakeholders to get hands-on experience with your data science analysis and results.
In addition to books, a wealth of other online resources were recommended. Nathan Yau’s FlowingData is probably the top website recommendation that comes to mind for me. Words would not do it justice; this is such a fantastic resource. It gets updated frequently, so keep checking back. Nathan also has a book out called Visualize This.
Sometimes resources show up in unexpected places: I’ve found that Pinterest is good for turning up data visualization resources, and I have a Pinterest board about Data Science. That helps to generate interesting recommendations as other people save related links. Another great resource is The Graphic Continuum by Jon Schwabish and Severino Ribecca. If we talk about how Wilkinson provided a grammar for data visualization, the infographics on Continuum provide a vocabulary. It’s important for teams to have a working vocabulary about data visualization and related best practices.
Junk Charts and Terrible Maps provide hilarious, and sometimes horrific, counter-examples for what not to do with data visualization tools. Tufte has written about data ink as a metric for quantifying the “badness” of terrible charts. For what it’s worth, years ago I wrote an open source Java tool to calculate Tufte’s metric – which should probably be updated in Python.
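To make the data-ink idea concrete, here is a minimal Python sketch of the ratio of data ink to total ink. It is not a port of the Java tool mentioned above. It assumes a chart has already been reduced to two hypothetical boolean pixel grids, `ink_mask` and `data_mask`; how you would extract those masks from a real chart is left open, and the function name is made up for illustration.

```python
def data_ink_ratio(ink_mask, data_mask):
    """Return the share of inked pixels that encode data.

    Both arguments are equally sized grids (lists of rows) of booleans.
    ink_mask marks every pixel that carries any ink; data_mask marks the
    pixels that represent data. Both masks are hypothetical inputs.
    """
    total_ink = 0
    data_ink = 0
    for ink_row, data_row in zip(ink_mask, data_mask):
        for has_ink, is_data in zip(ink_row, data_row):
            if has_ink:
                total_ink += 1
                if is_data:
                    data_ink += 1
    if total_ink == 0:
        raise ValueError("chart contains no ink")
    return data_ink / total_ink


# Toy 2x4 "chart": six inked pixels, three of which encode data.
ink = [[True, True, True, False],
       [True, True, True, False]]
data = [[True, False, False, False],
        [True, True, False, False]]
ratio = data_ink_ratio(ink, data)
print(ratio)  # 0.5
```

A ratio closer to 1.0 means more of the ink is doing real work, which is the direction Tufte argues for.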
Speaking of Python, there are three good resources online specifically about Python data visualization:
Last but not least from Allen’s list, there’s the Data Stories podcast with Enrico Bertini and Moritz Stefaner. While a podcast that covers visualization brings to mind a famous quote – how William Burroughs described writing about music as “dancing about architecture” – in fact I’ve learned so much in data science through podcasts. Another outstanding podcast is the Data Show by my friend and co-author Ben Lorica.
Continuing our tour of data science resources, beyond Allen Downey’s excellent data viz list, let’s look at other resources online. R2D3 (aka Stephanie Yee and Tony Chu) has an ongoing experiment called A visual introduction to machine learning – see Part 1 and Part 2. This has to be one of the best examples of an “online book” that I’ve ever seen. Another fine set of examples is in Distill plus its parent organization Parametric Press in general. They present really interesting ways to use notebooks for interactive publications. The “Feature Visualization” one in particular helped me understand more clearly how deep learning works. Also, no discussion of data visualization and notebooks would be complete without checking out Observable by the D3 creator Mike Bostock.
For a data scientist, the act of obtaining good datasets is an essential step long before visualizing. Our next stop on the tour is the Data.gov open data repository hosted by the US federal government. I was working recently with the Housing Affordability Data System datasets provided there, which link HUD surveys about relative cost of living over time in different metropolitan areas with US Census data. These kinds of datasets are rich. They provide a good basis for time series analysis, geospatial work, linked data, NLP, etc. On the one hand, I think it’s phenomenal how much data there is available through the federal government in the US. On the other hand, working with those resources can be difficult. For example, the metadata can be complicated to follow. That reinforces the maxim that as data scientists we spend most of our time cleaning up data.
Of course, Data.gov is merely one of many large repositories for open data. When I was working in Madrid late last year, I got to meet with researchers from the European Space Agency who’d developed an open source portal called ESASky. I will try to avoid puns (“sky’s the limit”, “astronomically large datasets”, etc.) but seriously they make a lot of excellent data available, plus they’re eager to work with data science teams to help demonstrate interesting uses. For example, another good friend Gema Parreño created a predictive modeling project in Python called Deep Asteroid which won a NASA machine learning challenge. Gema’s project uses deep learning and reinforcement learning to predict trajectories of “near Earth objects” (NEOs), i.e., when are asteroids likely to strike Earth? Open source and open data in AI, for the win!
At a more human scale, I like using metro bike share data in the course I teach. A really good bike share dataset is Capital Bikeshare. Much like the federal datasets, the many varied bike share datasets have multiple useful features and applications: text, time series, geospatial, linked data, to name a few. For those who’ve watched my Introduction to Apache Spark video course from 2015, the section on graph algorithms leveraged data from Capital Bikeshare to approximate trip times in Google Maps, using Dijkstra’s algorithm implemented in Spark. Bike share datasets are available as open data from many major cities, and given their geographic focus they can be somewhat simpler to use than federal datasets – or at least more contextual. In other words, bike share stations can be associated with neighborhood data, so you can join datasets for interesting applications. For example, here’s an open source project called CoPA which took two datasets from the City of Palo Alto open data portal – one about tree locations and another about road busy times – then joined with fitness mobile app data which we’d collected. The results estimate “Where can I find a quiet, shady place in summer to walk and take a mobile call in Palo Alto?” That was intended as an example for how to leverage city data to create novel apps. I look forward to seeing the apps you make using open data and data science.
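To give a feel for the graph-algorithm idea mentioned above, here is a small sketch of Dijkstra’s algorithm in plain Python. It is not the Spark implementation from the video course. The station names and trip durations are made up for illustration, and with real bike share data you would presumably derive the edge weights from typical trip times between station pairs.

```python
import heapq


def shortest_trip_times(graph, start):
    """Dijkstra's algorithm over a dict-of-dicts graph.

    graph[a][b] is the (non-negative) trip time in minutes from station
    a to station b. Returns the shortest known time from start to every
    reachable station.
    """
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        elapsed, station = heapq.heappop(queue)
        if elapsed > best.get(station, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbor, minutes in graph.get(station, {}).items():
            candidate = elapsed + minutes
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best


# Hypothetical stations and typical trip durations in minutes.
stations = {
    "A": {"B": 7, "C": 15},
    "B": {"C": 5, "D": 12},
    "C": {"D": 4},
    "D": {},
}
times = shortest_trip_times(stations, "A")
print(times)  # {'A': 0.0, 'B': 7.0, 'C': 12.0, 'D': 16.0}
```

Here the direct A to C ride takes 15 minutes, but going through B takes 12, so the algorithm prefers the two-hop route.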
Switching gears a bit, it’s really important to stay current in data science. That can be challenging. For the next stop on our tour, let’s look at resources for keeping up to date on the latest research. With so much research happening in machine learning, simply trying to keep up with the top papers becomes an enormous challenge!
One of my areas of expertise is in natural language – NLP, NLU, NLG, etc. – and there’s been such a flurry of research over the past two years that it is difficult to even try to find the “State of the Art” results published from ICML. A good friend and colleague Daniel Vila Suero – who’s an expert in natural language and knowledge graph work – did a talk recently in Spain to survey the recent NLP research. For an audience peppered with NLP experts, we were shocked by how much we’d missed! My biggest takeaway from Daniel’s talk was to track the latest and greatest work through the handy site NLP Progress by Sebastian Ruder. That provides a compendium of the latest work in natural language, both in terms of research papers and associated open source projects. Each section includes “Current SOTA” for state of the art benchmarks along with the metrics and datasets used to obtain those benchmarks. In other words, it’s heading towards a democratization of science: you can get the tools, the data, the best models, and latest methods all described in one convenient location.
Recently, Sebastian Ruder extended the SOTA compendium work to cover much more than NLP – for a much wider range of machine learning. Check out Papers with Code. That leverages mutual open data licensing agreements and open source web scrapers among a community of research groups, which include AI progress through data and SQuAD. It’s astounding both as a center of knowledge about data science as well as a foundation for reproducible science.
Effectively, our work in data science is largely about leveraging reproducible science on behalf of business needs in industry. In other words, can multiple teams within an enterprise organization develop similar insights about the organization from its data? If so, analytics work in that organization is probably in good shape. If not, that organization may have fundamental data infrastructure issues to fix before much data science will get accomplished. Coming back full-circle for this point, another great Twitter thread by Allen Downey recently was about what he called the “Reproducible Science Starter Kit”. It’s recommended reading for data science teams to level up their effectiveness in enterprise.
By the way, if you’d like to flex some data science skills towards building SOTA compendiums like Sebastian Ruder has been showing, check out this thread by Andrew Mauboussin. That’s for a “news aggregator” called PCA News which suggests papers to read. The gist is that pre-print research papers get published early through arXiv, so then PCA News follows a set of Twitter accounts for people who read and discuss recent research on arXiv, parsing and aggregating tweets to help keep track of research. Chip Huyen took this approach a step further and published the open source project SOTAWHAT also as an arXiv aggregator, if you’d like to experiment with building your own.
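If you do want to experiment, the core of the aggregation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how PCA News or SOTAWHAT actually work. It assumes you already have tweet texts with expanded arXiv links (the sample tweets are made up), and `rank_papers` is a hypothetical helper name.

```python
import re
from collections import Counter

# New-style arXiv identifiers look like 1810.04805, optionally followed
# by a version suffix such as v2, which is ignored here.
ARXIV_ID = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)


def rank_papers(tweets):
    """Count how many tweets mention each arXiv paper, most mentioned first."""
    counts = Counter()
    for text in tweets:
        # Count each paper at most once per tweet.
        for paper_id in set(ARXIV_ID.findall(text)):
            counts[paper_id] += 1
    return counts.most_common()


# Made-up tweets for illustration.
tweets = [
    "Great read: https://arxiv.org/abs/1810.04805",
    "This one https://arxiv.org/pdf/1810.04805v2 is worth your time",
    "Also interesting: https://arxiv.org/abs/1706.03762",
]
ranking = rank_papers(tweets)
print(ranking)  # [('1810.04805', 2), ('1706.03762', 1)]
```

From there, the interesting design choices are which accounts to follow and how to weight their mentions, which is where a project like this can become your own.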
Other Essential Skills
Last stop on this tour is the soft skills arena, which is also very important in data science. For example, speaking and presenting are a big part of the job. If you’re like me, you probably got started in analytics more through the science, math, and engineering, while speaking was a secondary consequence of the role. To be candid, although I spend a lot of time presenting onstage, realistically I’m an introvert. Speaking was a learned skill for me. I’ve found that many of the best data scientists on my teams have also been deeply introverted, and some were quite apprehensive about public speaking. One helpful resource for that is Toastmasters. I’ve even seen companies start a new Toastmasters club as a way of helping their data science teams and other staff get more comfortable with presenting.
Another frequent question that I hear is “How does one get involved in team projects to practice data science?” In other words, before getting hired into a role, what are some resources for learning through project-based experience within a team context? An excellent way to get experience is by volunteering in community and nonprofit projects, for example through DataKind. There are chapters world-wide where mission-driven organizations – charities, nonprofits, foundations, local governments, etc. – can submit projects. Then data scientists volunteer time to work on data analytics. Of course, DataKind is one of many organizations which help coordinate data for social good. Think of it as a proactive way to seek mentoring-on-the-job. Highly recommended.
TL;DR: List of Resources
If you don’t have time to read the full context of why these resources are so great (though I strongly recommend doing so), or if you simply want to have a reference to quickly click back to, here’s the list of resources quoted or linked in the above article.
Thank you for joining this tour. I hope some of these resources are helpful. Many thanks to @William Roberts for much help on this article. If some of these names/resources sound familiar, you may be remembering personas like Ed Tufte from my other post “What is Data Science?” or conversations about data visualization in “A landscape diagram for Python data.”
As always I’m eager to hear your ideas and suggestions. See you on the forums!