Global Data Science Forum

Call for Code Useful Datasets

By Susan Malaika posted Fri February 21, 2020 08:55 AM


Introduction to open datasets and the importance of metadata
More data is becoming freely available as institutions and research publications require that datasets be published along with the publications that refer to them. For example, the journal Nature instituted a policy for authors to declare how the data behind their published research can be accessed by interested readers.

To make it easier for tools to find out what's in a dataset, authors, researchers, and suppliers of datasets are being encouraged to add metadata to their datasets. Datasets use various forms of metadata. For example, the US Government site uses the standard DCAT-US Schema v1.1, whereas the Google Dataset Search tool relies mostly on tagging. Many datasets have no metadata at all. That's why you won't find all open datasets through search: you also need to go to known portals and explore whether portals exist for the region, city, or topic of your interest. If you are deeply curious about metadata, you can see how DCAT aligns with other metadata vocabularies in the DCAT specification dated February 2020.

The datasets themselves come in a variety of forms for download, such as CSV, JSON, GeoJSON, and zip. Sometimes datasets can be accessed through APIs.
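To make the role of metadata concrete, here is a minimal Python sketch of how a tool might read a catalog laid out in the style of a DCAT-US v1.1 data.json file and list what can be downloaded. The catalog record, its title, and the example.org URLs are invented for illustration, and the helper name list_downloads is hypothetical; real catalogs carry many more fields.

```python
import json

# Hypothetical catalog in the shape of a DCAT-US v1.1 "data.json" file.
# The dataset, URLs, and values are invented for illustration only.
SAMPLE_CATALOG = """
{
  "dataset": [
    {
      "title": "Example Rainfall Measurements",
      "description": "Made-up record used only to show the metadata layout.",
      "keyword": ["weather", "rainfall"],
      "modified": "2020-01-15",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "distribution": [
        {"mediaType": "text/csv",
         "downloadURL": "https://example.org/rainfall.csv"},
        {"mediaType": "application/json",
         "downloadURL": "https://example.org/rainfall.json"}
      ]
    }
  ]
}
"""


def list_downloads(catalog_text):
    """Return (title, media type, URL) for every downloadable distribution."""
    catalog = json.loads(catalog_text)
    rows = []
    for dataset in catalog.get("dataset", []):
        for dist in dataset.get("distribution", []):
            url = dist.get("downloadURL")
            if url:  # some distributions only offer an accessURL (e.g. an API)
                rows.append((dataset.get("title"), dist.get("mediaType"), url))
    return rows


for title, media_type, url in list_downloads(SAMPLE_CATALOG):
    print(f"{title}: {media_type} -> {url}")
```

When a dataset ships with metadata like this, a search tool can filter on keyword, license, or media type without opening the data files. That is exactly what is lost when no metadata is supplied.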

Datasets are also becoming available through government initiatives to open up data. In the US, more than 250,000 datasets are available for developers to use. A similar initiative in India has more than 350,000 resources available.

Companies like IBM sometimes provide access to data, such as weather data, or give tips on how to process freely available data. One example is An Introduction to the NOAA (National Oceanic and Atmospheric Administration) weather data for JFK Airport, which was used to train the open source Model Asset Exchange Weather Forecaster; you can see the model artifacts on GitHub. You may also be interested in IBM's Data Asset eXchange (DAX), where you can explore useful data sets for enterprise data science.
You can also register to access IBM Weather Operations Center (WOC) data sets and Geospatial Analytics capabilities at https// These data sets are normalized and easy to use. To get access to the WOC, send email to  indicating that you are a Call for Code participant. Include your name, email address, and your company or organization and IBMid. You should gain access within 2 working days.

When developing a prototype or training a model during a hackathon, it is great to have access to relevant data to make your solution more convincing. There are many public datasets available to get you started. We’ll go over some of the ways to find them, along with access considerations. Note that some of the datasets may require pre-processing before they can be used, e.g., to handle missing data, but for a hackathon they are often good enough.
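As an illustration of the kind of pre-processing mentioned above, the following Python sketch fills missing values in one CSV column with that column's mean. The sample data, the column names, and the helper name fill_missing_with_mean are all made up for this example. Mean imputation is only one simple option; dropping incomplete rows or interpolating may suit your data better.

```python
import csv
import io
from statistics import mean

# Invented sample data: one temperature reading is missing.
RAW_CSV = """station,date,temp_c
A,2020-02-01,4.0
A,2020-02-02,
A,2020-02-03,6.0
"""


def fill_missing_with_mean(csv_text, column):
    """Parse CSV text and replace empty values in `column` with the column mean."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Collect the values that are actually present.
    known = [float(r[column]) for r in rows if r[column].strip()]
    if not known:
        raise ValueError(f"no usable values in column {column!r}")
    fallback = mean(known)
    # Convert present values to float and substitute the mean for gaps.
    for r in rows:
        r[column] = float(r[column]) if r[column].strip() else fallback
    return rows


cleaned = fill_missing_with_mean(RAW_CSV, "temp_c")
print([r["temp_c"] for r in cleaned])
```

For a hackathon prototype this level of cleaning is often enough. For anything longer-lived, record which values were imputed so that they can be revisited later.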

Ways to Find Datasets: Dataset Search

You can use Google Dataset Search. With the dataset search tool you can locate datasets through keywords such as a country or city name, or a category such as medical or agriculture. There are additional filters you can apply, such as how recently the dataset was updated, the download format (e.g., JSON or image), usage rights (commercial or non-commercial), and whether the dataset is free. Dataset search is a great tool for datasets where metadata (such as tags) has been supplied with the dataset. However, some datasets do not yet have metadata in the form that Google Dataset Search uses, and that's when you go to locations where there are many datasets. Of course, some datasets can be found using both methods.

Ways to Find Datasets: Go to locations where there are many datasets

Many governments and institutions, such as the UN and the World Bank, provide datasets. The following are some examples:

Dataset Aggregator Sites and Miscellaneous Catalogs: Some sites collate datasets sourced from other locations into categories, including datasets from the sites above. These sites are worth a look: some charge for specialized access, but they give you an idea of what's available. Examples of sites that aggregate collections of datasets or provide introductions to open datasets include:

License and Privacy Considerations

It is easier to use factual datasets, such as measurements, tabular data, land mass, reservoirs, and weather, and to avoid personal data, such as names or pictures of people, which may raise privacy concerns that vary from country to country.

Occasionally you will find datasets that state they are for academic use only. The owners are usually fine with the dataset being used in a hackathon setting, but it is best to check. An example of such a dataset is a multimodal (image and text) Deep Learning For Disaster Response dataset, which states that it is available for download only for academic purposes. In this case, we have confirmed with the author that she agrees to the dataset being used in hackathons, particularly those for social good. You can take a similar approach. Please note that if you move on and start selling the software you created in the hackathon, or make it part of a product, you should not use datasets that are marked for academic use.

Many datasets that specify a license use a Creative Commons (CC) license. An example of such a dataset is the earthquake data EEW. Be aware that the CC BY-NC variant means that the dataset cannot be used for commercial purposes.
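If you are collecting several datasets, a quick automated check of the license names can flag the non-commercial ones early. The Python sketch below is a rough heuristic, not legal advice: the function name allows_commercial_use is hypothetical, it only inspects Creative Commons style names such as "CC BY-NC 4.0", and it returns None for anything else so that you read the license text yourself.

```python
import re


def allows_commercial_use(license_name):
    """Heuristic check of a Creative Commons license name.

    Returns False if the name carries the NC (NonCommercial) element,
    True for other CC names, and None when the name is not recognizably
    a CC license and needs a manual read.
    """
    # Split e.g. "CC BY-NC-SA 4.0" into ["CC", "BY", "NC", "SA", "4.0"].
    tokens = re.split(r"[\s\-_/]+", license_name.strip().upper())
    if not tokens or tokens[0] not in ("CC", "CC0"):
        return None
    return "NC" not in tokens


for name in ["CC BY 4.0", "CC BY-NC 4.0", "CC0 1.0", "Academic use only"]:
    print(name, "->", allows_commercial_use(name))
```

A None result covers cases such as the academic-use datasets described earlier, where the right step is to contact the owner.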