When developing a classification model, ML practitioners often use stratified sampling to generate a subset of samples from the original dataset. For a specified input column, stratified sampling generates a subset that preserves the distribution of values in the input column. Let’s say, we have a dataset of 1000 customer records of which 550 belong to male customers and the rest to female. If we want to get 10% of the samples by stratifying on the gender column, we will get a subset of 100 customer records with the ratio between male and female customers will be close to 0.55:0.45.
Popular ML development environments, such as open-source Python-based IDEs, require that the original dataset be first loaded into memory before we generate stratified samples from it. This can be a challenge if the dataset is large and stored in a separate system, such as a relational database management system (RDBMS), which is the most common type of storage of enterprise ML datasets. Fetching a large dataset from a RDBMS source over the network, loading it to the memory of a development environment, and then finally generating stratified samples is not efficient.
IBM Db2 provides an elegant solution to the above sampling use case. Db2’s in-database ML library includes a stored procedure (SP) for random sampling with support for stratified sampling.
Here’s an example of how I used it in a demo project recently:
I have the following table, called GOSALES, with 60252 synthetic customer records.
This table’s GENDER column has 3 types of values: Female (F), Male (M), or null. These values have the following distribution:
I generated a 10% subset of records from this table using the following SQL statement,
call IDAX.RANDOM_SAMPLE('intable=SHAIKHQ.GOSALES, fraction=0.1, outtable=GOSALES_SAMPLE, by=GENDER')
The above SQL statement calls Db2’s random sampling SP with four parameters:
- intable = SHAIKHQ.GOSALES (name of the table where the original dataset is stored.)
- fraction = 0.1 (the generated subset will have 10% of the records from the original table.)
- outtable = GOSALES_SAMPLES (the name of the output table, which the SP will create automatically, to store the generated samples)
- by = GENDER, (the name of the column on which to stratify)
After executing the above SQL statement, I have a new Db2 table with 6026 records (=10% of the original table with 60252 records.)
This generated subset has the following distribution of values in the gender column. This distribution is identical to the distribution of values in the gender column of the original dataset.
Now I can use this stratified subset of my ML dataset to develop my ML pipeline inside Db2, using Db2's in-database ML stored procedures, or I can copy this dataset to an external ML development environment. In the latter case, copying the dataset will be faster, since I do not need to copy the original larger dataset over the network, nor do I need to generate the samples in my dev environment.
To learn more about Db2's Random Sample stored procedure and other in-database ML stored procedures, check out the following links.
IDAX.RANDOM_SAMPLE - Random sampling
In-database machine learning with IBM Db2