Keeping track of all the data efforts regarding the Coronavirus is hard. That is why I have gathered all the relevant datasets and data efforts in one place. The list is being updated on a daily basis, so check it out often.
Since the coronavirus outbreak began, research institutes and governments have released many databases publicly so that research groups (and independent individuals) can analyze data on the spread of the virus. These databases are scattered across numerous initiatives and sources. The purpose of this blog post is to organize all the major open databases and data initiatives around the world. Know another important repository? Feel free to add it in the comments or through this form.
Datasets and data challenges:
COVID-19 Open Research Dataset Challenge (CORD-19)
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
COVID19 Global Forecasting
The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) in an attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO).
Oxford Covid-19 Government Response Tracker
Governments are taking a wide range of measures in response to the COVID-19 outbreak. The Oxford COVID-19 Government Response Tracker (OxCGRT) aims to record these unfolding responses in a rigorous, consistent way across countries and across time.
Novel Corona Virus 2019 Dataset (Day level information on covid-19 affected cases)
From the World Health Organization: on 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people. Daily-level information on the affected people can therefore give some interesting insights when it is made available to the broader data science community. Johns Hopkins University has made an excellent dashboard using the affected cases data. Data is extracted from the associated Google Sheets and made available here.
MIDAS 2019 Novel Coronavirus Repository
The MIDAS Coordination Center released an online portal for COVID-19 modeling research. The portal improves navigation and search of COVID-19 information. Moving forward, we will use the online portal as the landing page for COVID-19 data and information, and the COVID-19 GitHub repository for sharing computable (CSV) files with data, parameter estimates, software, and metadata. All community contribution functionality of this repository will be maintained, so continue to send pull requests or issues for questions or contributions!
COVID-Net and COVIDx Dataset
The COVID-19 pandemic continues to have a devastating effect on the health and well-being of the global population. A critical step in the fight against COVID-19 is effective screening of infected patients, with one of the key screening approaches being radiological imaging using chest radiography. Early studies found that patients infected with COVID-19 present characteristic abnormalities in chest radiography images. Motivated by this, a number of artificial intelligence (AI) systems based on deep learning have been proposed, and results have been shown to be quite promising in terms of accuracy in detecting patients infected with COVID-19 using chest radiography images.
Johns Hopkins Virus Dashboard Repository
This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).
GISAID — Global Initiative on Sharing All Influenza Data
Laboratories around the world are generating genome sequences and related clinical and epidemiological data for the newly emerging coronavirus (hCoV-19) at an unprecedented rate, and these are rapidly made available via GISAID. The pandemic virus was first identified in late December 2019 in Hubei Province, where patients were suffering from respiratory illnesses such as pneumonia. Since then, hCoV-19 has been detected across the globe.
COVID-19 Coronavirus data (EU)
The dataset contains the latest available public data on COVID-19 including a daily situation update, the epidemiological curve and the global geographical distribution (EU/EEA and the UK, worldwide). On 12 February 2020, the novel coronavirus was named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) while the disease associated with it is now referred to as COVID-19. ECDC is closely monitoring this outbreak and providing risk assessments to guide EU Member States and the EU Commission in their response activities.
Tableau Aggregated Coronavirus Data Set
We are facing an unprecedented public health crisis with the Coronavirus (Covid-19) outbreak. We believe that data-driven decisions, and people working together for the greater good, are the best way through this difficult time. Right now, it’s more important than ever to have the resources to answer critical questions that matter to your organization and people. This includes having access to timely, detailed, and trustworthy data to think quickly and move fast. We have gathered the power of our Tableau Community and our technology to create a free Covid19 Data Resource Hub to help you make confident decisions with data.
Coronavirus (Covid-19) Data in the United States
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak. We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak. The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
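Because these files hold cumulative counts, a common first step is to turn them into daily increases. Here is a minimal sketch of that step, not taken from the Times repository itself. The column names (`date`, `state`, `cases`), the placeholder states, and the numbers are all assumptions made up for illustration, and `daily_new_cases` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up illustrative rows. The column layout (date, state, cases) is an
# assumption for this sketch; check the real file's header before reusing it.
SAMPLE_CSV = """date,state,cases
2020-03-01,State A,1
2020-03-02,State A,3
2020-03-03,State A,6
2020-03-01,State B,2
2020-03-02,State B,2
2020-03-03,State B,5
"""


def daily_new_cases(rows):
    """Turn cumulative counts into per-day increases, state by state."""
    by_state = defaultdict(list)
    for row in rows:
        by_state[row["state"]].append((row["date"], int(row["cases"])))

    daily = {}
    for state, series in by_state.items():
        series.sort()  # ISO dates (YYYY-MM-DD) sort chronologically as strings
        previous = 0
        increments = []
        for date, cumulative in series:
            # The first day's increase is the cumulative value itself.
            increments.append((date, cumulative - previous))
            previous = cumulative
        daily[state] = increments
    return daily


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
daily = daily_new_cases(rows)
for state, increments in daily.items():
    print(state, increments)
```

With real data, a negative increase can appear when a health department revises an earlier count downward, so it is worth checking for those before plotting.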
Courses, visualization and more
CS472 Data Science and AI for COVID-19
This project class investigates and models COVID-19 using tools from data science and machine learning. We will introduce the relevant background for the biology and epidemiology of the COVID-19 virus. Then we will critically examine current models that are used to predict infection rates in the population as well as models used to support various public health interventions (e.g. herd immunity and social distancing). The core of this class will be projects aimed to create tools that can assist in the ongoing global health efforts. Potential projects include data visualization and education platforms, improved modeling and predictions, social network and NLP analysis of the propagation of COVID-19 information, and tools to facilitate good health behavior, etc. The class is aimed toward students with experience in data science and AI, and will include guest lectures by biomedical experts.
This article originally appeared on Towards Data Science on March 26, 2020.
Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.