Caret Library: Model Training and Parameter Tuning

By Neeraj Jangid posted Mon August 26, 2019 08:11 PM

  

Introduction

Today we will be using a fascinating R library that is widely used for automating model building and repeatedly testing algorithms: the CARET package (Classification And REgression Training). CARET is well known for streamlining the process of building predictive models. The package mainly contains tools for:

  1. Data Splitting
  2. Pre-processing
  3. Feature Selection
  4. Model Tuning using Resampling
  5. Variable Importance Estimation

We will do a little of everything today: we will first pre-process our data, then build a model via model training and parameter tuning, and finally produce a few plots to better understand the model we built.

Parameter Tuning

As mentioned, CARET has functions to streamline the model building and evaluation process. The most extensively used CARET function is train(), which can be used to:

  1. Evaluate, using resampling, the effect of model tuning parameters on performance
  2. Select an optimal model across the parameters defined
  3. Estimate model performance from a training set

The Model

Preparing Data

Let us better understand things with an example. For this example we will be using the very popular Boston dataset available in the MASS library. Below is the listing to load the required library and bring the data into the R environment.

If you already know the Boston dataset, the following list of fields is just a refresher:

There are 14 attributes in each case of the dataset. They are:

  1. CRIM – per capita crime rate by town
  2. ZN – proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS – proportion of non-retail business acres per town.
  4. CHAS – Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX – nitric oxides concentration (parts per 10 million)
  6. RM – average number of rooms per dwelling
  7. AGE – proportion of owner-occupied units built prior to 1940
  8. DIS – weighted distances to five Boston employment centres
  9. RAD – index of accessibility to radial highways
  10. TAX – full-value property-tax rate per $10,000
  11. PTRATIO – pupil-teacher ratio by town
  12. B – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT – % lower status of the population
  14. MEDV – Median value of owner-occupied homes in $1000’s
Listing for loading data in R
View of the first 10 entries in the Boston data set
Summary of the Boston data
Structure of the Boston data
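The listing above was originally shown as an image; here is a minimal sketch of what it likely contained, assuming the data comes straight from the MASS package:

  library(MASS)      # the Boston housing data ships with the MASS package
  data(Boston)

  head(Boston, 10)   # first 10 rows of the data set
  summary(Boston)    # summary statistics for every attribute
  str(Boston)        # structure: 506 observations of 14 variables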

Our next step involves loading all the packages required for this example, including the CARET library, into the R environment, followed by setting a seed for reproducibility. Below is the listing.

Listing for loading libraries and setting seed
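A sketch of that listing; the seed value itself is arbitrary, and gbm is loaded here because it is the back end train() will use later:

  library(caret)   # model training, tuning and visualization
  library(gbm)     # gradient boosting engine used by train(method = "gbm")
  set.seed(825)    # any fixed value makes the resampling reproducible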

If you do not already have the above libraries installed, you will first have to install them with the install.packages() function. For example, to install the CARET package you run install.packages("caret") and then load it, as in the listing above, with library(caret).

The final step in preparing the data is splitting it into train and test sets for building our model. For this we use caret's built-in split function, createDataPartition(). We split on the class variable CHAS – the Charles River dummy variable (1 if the tract bounds the river; 0 otherwise). Below is the listing for splitting our dataset into train and test sets.

Splitting into train and test datasets
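A sketch of the split, assuming an 80/20 partition (the original listing is not shown here, so the proportion and object names are illustrative):

  # createDataPartition() keeps the split balanced with respect to chas
  inTrain  <- createDataPartition(Boston$chas, p = 0.8, list = FALSE)
  training <- Boston[inTrain, ]
  testing  <- Boston[-inTrain, ]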

Parameter Tuning

Moving ahead, we will now tune, or customize, the parameters for our model fitting. To do this we use CARET's trainControl() function. We need to define a resampling method, and we use repeatedcv, i.e. repeated cross-validation with 10 folds repeated 10 times. These numbers are chosen somewhat arbitrarily and may vary with the application. Below is the listing.

Listing for parameter tuning control
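Under those settings the listing would look roughly like this (the object name fitControl is only illustrative):

  fitControl <- trainControl(method  = "repeatedcv",  # repeated k-fold cross-validation
                             number  = 10,            # 10 folds
                             repeats = 10)            # repeated 10 times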

Modeling

The next step involves building our model, and the model we are deploying is GBM, or Gradient Boosting Machine; it is also called MART (Multiple Additive Regression Trees) or GBRT (Gradient Boosted Regression Trees). In this model, at each iteration a new regression tree is fitted to improve the current prediction. Below is the listing for fitting the GBM model.

Listing for fitting GBM Model
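A sketch of the fitting call, assuming chas is regressed on all other attributes of the training set (object names, as above, are illustrative):

  gbmFit <- train(chas ~ ., data = training,
                  method    = "gbm",        # gradient boosting machine
                  trControl = fitControl,   # repeated CV defined above
                  verbose   = FALSE)        # silence gbm's iteration log
  gbmFit                                    # printing the fit shows the resampled results per tuning value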
View of our GBM Model

The above model view suggests that the model with n.trees = 50, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10 is the optimal model. Below is a brief description of these parameters:

n.trees – the number of trees (boosting iterations) fitted by the GBM model

interaction.depth – the number of splits to perform on each tree (starting from a single node). Each split increases the total number of nodes by 3 and the number of terminal nodes by 2.

shrinkage – the learning rate. In GBMs, shrinkage is used to reduce, or shrink, the impact of each additionally fitted base learner (tree). It reduces the size of the incremental steps and thus penalizes the importance of each consecutive iteration.

n.minobsinnode – at each step of the GBM a new decision tree is constructed, and this parameter controls when those growing trees stop splitting. The furthest we can go is to split each node until there is only 1 observation in each terminal node, which would correspond to an n.minobsinnode value of 1. The gbm package in R uses a default value of 10.

This model has an RMSE (Root Mean Square Error) of 0.24, an R-squared value of 95.77% and an MAE (Mean Absolute Error) of 0.1279, which are good numbers. For this optimal model, below are the summary and a graph showing the relative influence of the other independent variables on chas.

Summary of our GBM Model
Plot of model summary showing relative influence
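The relative-influence table and plot come from gbm's summary method applied to the fitted model; a minimal sketch:

  summary(gbmFit$finalModel)   # relative influence of each predictor, with a bar plot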

Modeling Grid

Now, for this model, we will use the expand.grid() function (from base R), which creates a data frame from all combinations of the supplied vectors or factors. Using this we can customize interaction.depth, n.trees, shrinkage and n.minobsinnode. Below is the listing that defines a grid for our GBM model.

Listing for defining GBM GRID
nrow of GBM GRID
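A sketch of such a grid; the specific values are assumptions, chosen only so that the reported optimum (n.trees = 50, interaction.depth = 5) is among the candidates:

  gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                         n.trees           = (1:30) * 50,
                         shrinkage         = 0.1,
                         n.minobsinnode    = 10)
  nrow(gbmGrid)   # 3 x 30 x 1 x 1 = 90 candidate parameter combinations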

Our last step is fitting a new GBM model with the GBM grid defined above. Below is the listing to fit the new model.

Listing to fit a model with new GBM GRID
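A sketch of the refit, passing the grid through the tuneGrid argument (names again illustrative):

  gbmFit2 <- train(chas ~ ., data = training,
                   method    = "gbm",
                   trControl = fitControl,
                   tuneGrid  = gbmGrid,   # search over our custom grid instead of the default
                   verbose   = FALSE)
  gbmFit2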
View of our new model with highlighted optimal solution of n.trees = 50 and interaction.depth = 5
Summary plot of our new model showing relative influence

This is just a demonstration of how we can define a GBM, or any other model, using CARET to find the optimal model for training and prediction. We also learned how to define a grid for any model as per our requirements. In this way we can define, or rather tailor, models that best suit our application, purpose or use. The CARET library helps us do this for almost all the models you can think of!

Visualizing our Model

Now we will demonstrate some visualization techniques in CARET using caretTheme(); as humans, we understand a model better through visuals than through numbers alone.

Visualizing our model
Plot of our model with CARET theme
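A sketch of the themed plot; caretTheme() supplies the lattice settings that caret's plot method for train objects uses:

  trellis.par.set(caretTheme())   # apply caret's lattice theme
  plot(gbmFit2)                   # performance (RMSE) across the tuning grid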

We can also plot using our go-to plotting function, ggplot, as listed below:

Plotting model with ggplot
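caret provides a ggplot method for train objects, so the same tuning results can be rendered as:

  library(ggplot2)   # usually attached together with caret
  ggplot(gbmFit2)    # tuning results rendered with ggplot2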

Conclusion

To conclude, we can confidently state that CARET is a very useful library for streamlining, or automating, model training, prediction, and model visualization with CARET themes. And the best part is that we can do this for almost any model you can think of. There is a lot more to come in this library series, because if we know our libraries and the source code of our functions well, we can define our models better.

Previously in the library series, the post on the Keras library can be found here:

https://thedatum.data.blog/2019/08/11/keras-library-understanding-optimizer/

If you are interested in more such content, then like, share & subscribe to The Datum for exciting Data Science and Machine Learning algorithms each week!

Interested in such readings on Data Science and Machine Learning? Like, share and subscribe to The Datum for weekly algorithm updates. We are currently running the Library Series, in which we make the best use of R libraries and their documentation so that we better understand source code and functions, thus making the best of our analytical algorithms!

Your Data Scientifically,
The Datum.


#Trending-blog
#Trending-blog-home
#Hands-on-feature

#GlobalAIandDataScience
#GlobalDataScience
#Hands-on

Comments

Mon February 03, 2020 02:50 PM

Hi Neeraj @Neeraj Jangid, Is Caret an all-in-one library for R covering the entire data science flow?