Global Data Science Forum

Where is the data I need?

By Jacques Roy posted Fri January 25, 2019 01:41 PM

Everybody knows that without data, there is no machine learning or AI.
Everybody looks for data in corporate repositories, on the internet, in social media.

As for me, I even have trouble finding data I found before!

Wouldn't it be nice to share what we found with others? Of course, we have to control
what we share. Some data may be sensitive, some fields or attributes must be masked.

That's what catalogs are for. Do you need only one catalog? No.
You may have one catalog for experimentation and unofficial projects. You can have
a catalog for development, one for production, one as an official repository for
any project, and so on.

Here's another important point: you don't want to copy data from one place to another
if it can be avoided. There are some datasets where you won't mind but, especially for
corporate data, you don't want to create another copy and potentially introduce another
version of the "truth".

This is where being able to point to the data instead of copying it is important.
For example, the IBM Knowledge Catalog supports an array of data sources from IBM
and third parties, and the list keeps growing:

[Figure: Catalog connection sources]

What if you want to share more than just data? In data science projects, you
may also want to share notebooks and models.
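
To make these ideas a bit more concrete, here is a minimal sketch in Python of a
catalog whose entries point to assets instead of copying them, and that masks
sensitive fields before sharing. It is only an illustration: the class names, the
locations, and the field names are all made up, and this is not how the IBM
Knowledge Catalog API looks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    """A catalog entry that points to an asset instead of holding a copy."""
    name: str
    asset_type: str      # e.g. "data", "notebook", "model"
    location: str        # hypothetical URI of the original source
    masked_fields: frozenset = field(default_factory=frozenset)


class Catalog:
    """A toy catalog: a named collection of entries (all names are illustrative)."""

    def __init__(self, name):
        self.name = name
        self._entries = {}

    def register(self, entry):
        # Only the reference is stored; the data itself stays where it is.
        self._entries[entry.name] = entry

    def find(self, asset_type):
        # Return every entry of the requested asset type.
        return [e for e in self._entries.values() if e.asset_type == asset_type]

    def mask(self, entry_name, row):
        # Hide the values of fields marked as sensitive for this entry.
        entry = self._entries[entry_name]
        return {k: ("***" if k in entry.masked_fields else v) for k, v in row.items()}


# One catalog per purpose, e.g. development vs. production.
dev = Catalog("development")
dev.register(CatalogEntry("customers", "data",
                          "db2://example-host/SALES.CUSTOMERS",
                          frozenset({"email"})))
dev.register(CatalogEntry("churn-notebook", "notebook",
                          "file:///projects/churn/explore.ipynb"))

masked = dev.mask("customers", {"id": 7, "email": "a@example.com"})
notebooks = [e.name for e in dev.find("notebook")]
```

Because each entry holds only a location, there is still a single copy of the
customer data, and anyone browsing the catalog sees the email field masked.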

So, if you are planning to deal with a lot of data and other types of assets,
corporate and otherwise, think about looking into a good catalog platform.

For more on data science, take a look at the latest byte-size data science video at: