Global Data Science Forum

CORD-19: Filtering the Dataset and Generating Abstracts

By Nick Acosta posted Thu April 02, 2020 05:15 PM



Last week I introduced the CORD-19 dataset and showed how to bring it from a collection of JSON files into Python for analysis. I want to continue on this topic and demonstrate how I processed the data further and created a character-based RNN that generates new abstracts.

For review, the CORD-19 dataset is a collection of over 25,000 research papers on the coronavirus. Each entry in the dataset has a number of different fields, including the paper's title, authors, full text, and abstract. According to the University of Wisconsin, the abstract of a research paper is a "short summary that prepares readers to follow the detailed information, analyses, and arguments in the full paper." 

An example of a written abstract 


A recurrent neural network can generate text by predicting the next character in a sequence of characters, as in this example from TensorFlow, but it cannot be used on these abstracts without addressing some issues first. The abstracts of these research papers contain over 30 million characters. Because the dataset contains papers from research institutions all over the world, the abstracts are composed of 936 unique characters, including the N'ko character ߚ and the feminine ordinal indicator ª. The last layer of the RNN is a softmax layer that predicts the next character based on the sequence of characters in the input, and the number of nodes it contains is equal to the number of possible characters it could generate. Keeping this at 936 would greatly increase the number of parameters the model has to train, leading to slower training times and a greater risk of over- or underfitting. It could also produce nonsensical output that mixes languages, especially since many characters appear very infrequently: the Sundanese letter ᮊ, for example, appears only twice in the abstracts, or less than 0.00001% of all characters. This notebook shows how to reduce the number of unique characters by nearly 80%, from 936 to 193, while still retaining over 99% of the abstracts.
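The notebook's exact filtering steps are not reproduced here, so the following is only a minimal sketch of one way such a reduction could work: count how often each character occurs across all abstracts, then drop any abstract that contains a character rarer than some threshold. The function name `filter_abstracts` and the `min_count` parameter are hypothetical, chosen for illustration, and the threshold value is an assumption rather than the one used in the notebook.

```python
from collections import Counter


def filter_abstracts(abstracts, min_count=2):
    """Drop abstracts that contain any character seen fewer than min_count times.

    Hypothetical helper: the threshold and the drop-the-whole-abstract policy
    are assumptions, not necessarily what the original notebook does.
    """
    # Count every character across the whole collection of abstracts.
    counts = Counter(ch for text in abstracts for ch in text)

    # Characters frequent enough to stay in the vocabulary.
    allowed = {ch for ch, n in counts.items() if n >= min_count}

    # Keep only abstracts made up entirely of allowed characters.
    kept = [text for text in abstracts if set(text) <= allowed]

    # The vocabulary the softmax layer would need to cover after filtering.
    vocab = sorted({ch for text in kept for ch in text})
    return kept, vocab


if __name__ == "__main__":
    sample = ["abc abc", "cab bca", "abc ᮊ"]
    kept, vocab = filter_abstracts(sample, min_count=2)
    print(len(kept), "abstracts kept;", len(vocab), "unique characters")
```

Dropping whole abstracts, rather than deleting the rare characters from them, keeps the remaining text intact. The trade-off is losing a small share of the data.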

Generating Abstracts

I have created another notebook that uses the resulting filtered abstracts to generate new ones with the TensorFlow example linked above. Soon, I will show how the generated output can be improved with newer NLP techniques such as transformers.
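To make the character-level setup concrete, here is a rough sketch in plain Python of how text is typically prepared for this kind of model: each character is mapped to an integer ID, the text is cut into fixed-length chunks, and each chunk yields an input sequence plus a target sequence shifted one character ahead. The name `make_training_pairs` and the `seq_length` parameter are hypothetical; the notebook itself relies on TensorFlow utilities for this step.

```python
def make_training_pairs(text, seq_length):
    """Build (input, target) ID sequences for next-character prediction.

    Hypothetical helper for illustration; it stands in for the dataset
    pipeline that a TensorFlow-based notebook would normally use.
    """
    # One integer ID per unique character, in sorted order.
    vocab = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    ids = [char_to_id[ch] for ch in text]

    pairs = []
    step = seq_length + 1  # each chunk holds one extra character for the target
    for start in range(0, len(ids) - seq_length, step):
        chunk = ids[start:start + step]
        # Input is the chunk without its last character;
        # target is the same chunk shifted one position to the right.
        pairs.append((chunk[:-1], chunk[1:]))
    return vocab, pairs


if __name__ == "__main__":
    vocab, pairs = make_training_pairs("abcabc", seq_length=2)
    print(vocab)
    print(pairs)
```

The length of `vocab` is what sets the width of the model's final softmax layer, which is why trimming rare characters before this step shrinks the model.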

An example of a generated abstract