When developing a classification model, ML practitioners often use stratified sampling to generate a subset of samples from the original dataset. For a specified input column, stratified sampling generates a subset that preserves the distribution of values in the input column. Let’s say, we have a dataset of 1000 customer records of which 550 belong to male customers and the rest to female. If we want to get 10% of the samples by stratifying on the gender column, we will get a subset of 100 customer records with the ratio between male and female customers will be close to 0.55:0.45.
Popular ML development environments, such as open-source Python-based IDEs, require that the original dataset be first loaded into memory before we generate stratified samples from it. This can be a challenge if the dataset is large and stored in a separate system, such as a relational database management system (RDBMS), which is the most common type of storage of enterprise ML datasets. Fetching a large dataset from a RDBMS source over the network, loading it to the memory of a development environment, and then finally generating stratified samples is not efficient.
IBM Db2 provides an elegant solution to the above sampling use case. Db2’s in-database ML library includes a stored procedure (SP) for random sampling with support for stratified sampling.
Here’s an example of how I used it in a demo project recently:
I have the following table, called GOSALES, with 60252 synthetic customer records.
![](https://dw1.s81c.com//IMWUC/MessageImages/18e420128c3e416ba763b981117f2b39.png)
This table’s GENDER column has 3 types of values: Female (F), Male (M), or null. These values have the following distribution:
![](https://dw1.s81c.com//IMWUC/MessageImages/0824832105924d07a89ef1b127c96387.png)
I generated a 10% subset of records from this table using the following SQL statement,
call IDAX.RANDOM_SAMPLE('intable=SHAIKHQ.GOSALES, fraction=0.1, outtable=GOSALES_SAMPLE, by=GENDER')
The above SQL statement calls Db2’s random sampling SP with four parameters:
- intable = SHAIKHQ.GOSALES (name of the table where the original dataset is stored.)
- fraction = 0.1 (the generated subset will have 10% of the records from the original table.)
- outtable = GOSALES_SAMPLES (the name of the output table, which the SP will create automatically, to store the generated samples)
- by = GENDER, (the name of the column on which to stratify)
After executing the above SQL statement, I have a new Db2 table with 6026 records (=10% of the original table with 60252 records.)![](https://dw1.s81c.com//IMWUC/MessageImages/7b0d62a7c7ec4742948f50241c957581.png)
This generated subset has the following distribution of values in the gender column. This distribution is identical to the distribution of values in the gender column of the original dataset.
![](https://dw1.s81c.com//IMWUC/MessageImages/745fc1c8ef2b4b4189c43316b59429cd.png)
Now I can use this stratified subset of my ML dataset to develop my ML pipeline inside Db2, using Db2's in-database ML stored procedures, or I can copy this dataset to an external ML development environment. In the latter case, copying the dataset will be faster, since I do not need to copy the original larger dataset over the network, nor do I need to generate the samples in my dev environment.
To learn more about Db2's Random Sample stored procedure and other in-database ML stored procedures, check out the following links.
IDAX.RANDOM_SAMPLE - Random sampling
In-database machine learning with IBM Db2
#Db2