is a free resource of over 29,000 scholarly articles (as of 3/20/2020) about COVID-19 and the coronavirus family of viruses. The Allen Institute for Artificial Intelligence and Semantic Scholar have been working with other partners to "mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease." The dataset is being collected by crawling various websites, research libraries, and research preprint servers, including PMC and medRxiv, and standardizing the articles into a JSON format with paper IDs, titles, author lists, abstracts, citations, and the articles' full text.
Kaggle and Preprocessing CORD-19
Kaggle is hosting a COVID-19 Open Research Dataset Challenge, which challenges data scientists and developers to use the CORD-19 dataset to better our understanding of the virus. The challenge is built around nine tasks, each with a $1,000 prize and an initial deadline of April 16th. I wanted to give IBM Data Science Community members a head start on these tasks by providing a notebook that collects the JSON files that make up this dataset and transforms them into a Pandas DataFrame for analysis in Python, and into a CSV for use in Python and everywhere else. This notebook can be found here (the CSV it produces can be found here), and was made possible by Xing Han Lu's notebook on the same topic. His notebook and many others working on the CORD-19 dataset are being made publicly available on Kaggle. We will soon be walking through some interesting data science, deep learning, and NLP techniques that can be used on this dataset. If there is anything in particular you'd like to see, please let me know in the comments, or feel free to post about your own CORD-19 notebooks here.
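For readers who want a feel for the collect-and-transform step before opening the notebook, here is a minimal sketch of the idea. It is not the notebook's actual code. The field names (`paper_id`, `metadata`, `authors`, `abstract`, `body_text`) are assumptions about the JSON layout based on the description above, and the helper names `paper_to_row`, `load_papers`, and `write_csv` are hypothetical, so check them against the files you download.

```python
import csv
import json
from pathlib import Path

# Column order for the flattened output (an illustrative choice).
FIELDNAMES = ["paper_id", "title", "authors", "abstract", "text"]


def _join_text(sections):
    """Concatenate the "text" field of a list of paragraph dicts."""
    return "\n\n".join(s.get("text", "") for s in sections or [])


def paper_to_row(paper):
    """Flatten one parsed paper into a flat dict (assumed schema)."""
    metadata = paper.get("metadata", {})
    authors = [
        " ".join(part for part in (a.get("first", ""), a.get("last", "")) if part)
        for a in metadata.get("authors", [])
    ]
    return {
        "paper_id": paper.get("paper_id", ""),
        "title": metadata.get("title", ""),
        "authors": "; ".join(authors),
        "abstract": _join_text(paper.get("abstract")),
        "text": _join_text(paper.get("body_text")),
    }


def load_papers(json_dir):
    """Read every .json file under json_dir and return one row per paper."""
    rows = []
    for path in sorted(Path(json_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            rows.append(paper_to_row(json.load(fh)))
    return rows


def write_csv(rows, csv_path):
    """Write the flattened rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `rows = load_papers("path/to/json_dir")` and then `pandas.DataFrame(rows)` gives a DataFrame with one paper per row, and `write_csv(rows, "cord19.csv")` produces a CSV for use elsewhere. Both paths are placeholders.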
For more on CORD-19, learn how to train models to generate abstracts in TensorFlow here.