Today we will be using a fascinating R library that is widely used to automate model building and the repeated testing of our algorithms: the CARET package (Classification And REgression Training). CARET is well known for streamlining the process of building predictive models. The package mainly contains tools for:
1. Data Splitting
2. Pre-Processing
3. Feature Selection
4. Model Tuning using Resampling
5. Variable Importance Estimation
We will do a little of everything today: first we pre-process our data, then we build a model through model training and parameter tuning, and lastly we draw a few plots to better understand the model we built.
As mentioned, CARET has functions that streamline the model building and evaluation process. The most widely used CARET function is train(), which can be used for the following:
- Evaluating, using resampling, the effect of model tuning parameters on performance
- Selecting the optimal model across these parameters
- Estimating model performance from a training dataset
Let us understand this better with an example. We will use the very popular Boston dataset available in the MASS library. The first step is to load the required library and load the data into the R environment.
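A minimal sketch of this loading step follows. It only assumes that the MASS package is installed; the inspection calls (dim, str, head) are additions for illustration.

```r
library(MASS)   # ships the Boston housing data frame

data(Boston)    # load the dataset into the workspace
dim(Boston)     # 506 rows, 14 columns
str(Boston)     # column names are lower case, e.g. chas and medv
head(Boston)    # first few cases
```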
As a refresher, here are the fields of the Boston dataset. There are 14 attributes in each case of the dataset (in MASS the column names are lower case, and B is stored as black). They are:
- CRIM – per capita crime rate by town
- ZN – proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS – proportion of non-retail business acres per town.
- CHAS – Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX – nitric oxides concentration (parts per 10 million)
- RM – average number of rooms per dwelling
- AGE – proportion of owner-occupied units built prior to 1940
- DIS – weighted distances to five Boston employment centres
- RAD – index of accessibility to radial highways
- TAX – full-value property-tax rate per $10,000
- PTRATIO – pupil-teacher ratio by town
- B – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT – % lower status of the population
- MEDV – Median value of owner-occupied homes in $1000’s
Our next step is to load all the required packages for this example, including the CARET library, into the R environment, and then to set a seed for reproducibility.
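This step might look like the sketch below. The seed value 123 is an arbitrary assumption, and gbm is loaded explicitly because it is the backend for the model fitted later on.

```r
# Packages used in this walkthrough
library(caret)  # model training and tuning helpers
library(MASS)   # Boston dataset
library(gbm)    # gradient boosting backend behind caret's "gbm" method

# Fix the random number generator so splits and resamples are repeatable.
# The value 123 is an arbitrary choice for this sketch.
set.seed(123)
```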
If you do not already have these libraries installed, you will first have to install them with install.packages(). For example, to install the CARET package, run install.packages("caret") and then load it with library(caret).
Our final step in preparing the data is splitting it into train and test datasets for building our model. For this we use caret's built-in split function, createDataPartition(). We split on a single class variable, CHAS, the Charles River dummy variable (1 if tract bounds river; 0 otherwise).
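One way to sketch this split is shown here. The 75/25 ratio, the seed and the object names (inTrain, training, testing) are assumptions rather than values from the article. chas is passed as a factor so that createDataPartition() stratifies on the two classes.

```r
library(caret)
library(MASS)

set.seed(123)  # arbitrary seed, assumed for this sketch

# Stratify on chas by passing it as a factor, so both classes appear in
# each subset. The 75/25 ratio is an assumption.
inTrain <- createDataPartition(factor(Boston$chas), p = 0.75, list = FALSE)

training <- Boston[inTrain, ]   # rows selected for model building
testing  <- Boston[-inTrain, ]  # remaining rows held out for testing

dim(training)
dim(testing)
table(training$chas)  # class balance in the training set
```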
Moving ahead, we will now customize how the tuning parameters of our model are evaluated during fitting. To do this we use the trainControl() function of CARET. We need to define a resampling method, and we are using repeatedcv, or repeated cross-validation, with 10 folds repeated 10 times. These numbers are arbitrary and may vary with the application.
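The resampling scheme described above can be sketched as follows. The object name fitControl is an assumption.

```r
library(caret)

# 10-fold cross-validation, repeated 10 times
fitControl <- trainControl(
  method  = "repeatedcv",  # repeated k-fold cross-validation
  number  = 10,            # number of folds
  repeats = 10             # number of complete repetitions
)
```

The resulting object is passed to train() through its trControl argument.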
The next step is building our model. The model we are deploying is GBM, or Gradient Boosting Machine, also known as MART (Multiple Additive Regression Trees) or GBRT (Gradient Boosted Regression Trees). At each iteration of this model, a regression tree is fitted to improve the prediction.
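A sketch of fitting such a model with train() is given below. The seed, the 75/25 split and the object names are assumptions, so the numbers it prints will not necessarily match those quoted in this article. chas is left numeric, so caret treats the task as regression and reports RMSE, R-squared and MAE; it may warn that the outcome has only two distinct values.

```r
library(caret)
library(MASS)
library(gbm)

set.seed(123)  # arbitrary seed
inTrain  <- createDataPartition(factor(Boston$chas), p = 0.75, list = FALSE)
training <- Boston[inTrain, ]

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Fit a GBM over caret's default tuning grid for method = "gbm"
gbmFit <- train(
  chas ~ .,               # chas as outcome, all other columns as predictors
  data      = training,
  method    = "gbm",
  trControl = fitControl,
  verbose   = FALSE       # passed through to gbm to silence per-tree output
)

gbmFit           # resampling results for each candidate model
gbmFit$bestTune  # tuning values of the selected model
```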
The model output suggests that the model with n.trees = 50, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10 is the optimal model. Following is a brief description of the parameters stated above:
n.trees – the number of trees (boosting iterations) fitted by the GBM model.
interaction.depth – the number of splits to perform on a tree (starting from a single node). In gbm, each split increases the total number of nodes by 3 and the number of terminal nodes by 2, because a node is split into a left node, a right node and a node for missing values.
shrinkage – the learning rate. In GBMs, shrinkage reduces, or shrinks, the impact of each additionally fitted base learner (tree). It reduces the size of the incremental steps and thus penalizes the importance of each consecutive iteration.
n.minobsinnode – at each step of the GBM model a new decision tree is constructed, and this parameter controls when these trees stop growing: it is the minimum number of observations allowed in a terminal node. The furthest we can go is to split until there is only 1 observation in each terminal node, which corresponds to an n.minobsinnode value of 1. The gbm package in R has a default value of 10.
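The node arithmetic behind interaction.depth can be made concrete with a small helper. node_counts is a hypothetical function written for this illustration, and it assumes the convention stated above: every split adds 3 nodes, 2 of which are terminal.

```r
# Node counts for a single tree after a given number of splits, assuming
# each split adds 3 nodes in total and 2 terminal nodes.
node_counts <- function(splits) {
  data.frame(
    splits         = splits,
    total_nodes    = 1 + 3 * splits,  # start from the single root node
    terminal_nodes = 1 + 2 * splits   # the root counts as terminal before any split
  )
}

node_counts(1:3)
```

For interaction.depth = 2, the optimal value reported above, this gives 7 nodes, 5 of them terminal.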
This model has an RMSE (Root Mean Square Error) of 0.24, an R-squared value of 95.77% and an MAE (Mean Absolute Error) of 0.1279. These are good performance metrics. For this optimal model, the summary and its graph show the relative influence of the other independent variables on chas.
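A relative influence summary of this kind can be obtained from the final gbm object, as sketched here. To keep the sketch quick it uses a lighter 5-fold control instead of the 10 x 10 scheme; that choice, the seed and the split ratio are assumptions, so the output will differ from the figures quoted above.

```r
library(caret)
library(MASS)
library(gbm)

set.seed(123)  # arbitrary seed
inTrain  <- createDataPartition(factor(Boston$chas), p = 0.75, list = FALSE)
training <- Boston[inTrain, ]

# Lighter resampling, used here only to keep the example fast
quickControl <- trainControl(method = "cv", number = 5)

gbmFit <- train(chas ~ ., data = training, method = "gbm",
                trControl = quickControl, verbose = FALSE)

# summary() on the underlying gbm object returns the relative influence of
# each predictor and, with plotit = TRUE, draws it as a bar chart.
influence <- summary(gbmFit$finalModel, plotit = TRUE)
influence
```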
Next, for this model, we will use the expand.grid function from base R, which creates a data frame from all combinations of the supplied vectors or factors. Using this we can customize interaction.depth, n.trees, shrinkage and n.minobsinnode, and so define a grid for our GBM model.
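A grid of this kind can be sketched as follows. The candidate values and the name gbmGrid are illustrative assumptions; the column names must match the four gbm tuning parameters exactly.

```r
# Every combination of the supplied values becomes one row (one candidate model)
gbmGrid <- expand.grid(
  interaction.depth = c(1, 3, 5),
  n.trees           = c(50, 100, 150),
  shrinkage         = c(0.01, 0.1),
  n.minobsinnode    = 10
)

nrow(gbmGrid)  # 3 * 3 * 2 * 1 = 18 candidate models
head(gbmGrid)
```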
Our last step is fitting a new GBM model with the GBM grid defined above.
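Fitting with a custom grid only requires passing it to train() through the tuneGrid argument. The sketch below assumes the seed, the split, the grid values and the name gbmFit2; none of these come from the article.

```r
library(caret)
library(MASS)
library(gbm)

set.seed(123)  # arbitrary seed
inTrain  <- createDataPartition(factor(Boston$chas), p = 0.75, list = FALSE)
training <- Boston[inTrain, ]

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Illustrative grid; the values are assumptions
gbmGrid <- expand.grid(
  interaction.depth = c(1, 3, 5),
  n.trees           = c(50, 100, 150),
  shrinkage         = c(0.01, 0.1),
  n.minobsinnode    = 10
)

gbmFit2 <- train(
  chas ~ .,
  data      = training,
  method    = "gbm",
  trControl = fitControl,
  tuneGrid  = gbmGrid,   # evaluate exactly these candidates
  verbose   = FALSE
)

gbmFit2$bestTune  # the grid row selected by resampling
```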
This is just a demonstration of how we can define a GBM, or any other model, using CARET to find the optimal model for training and predicting. We also learned how to define a grid for any model as per our requirements. In this way we can tailor models to best suit our application, purpose or use. The CARET library helps us do this for almost any model you can think of!
Visualizing our Model
Now we will demonstrate some visual techniques in CARET using caretTheme(). As humans, we understand a model better with visuals than with numbers.
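A sketch of plotting the resampling profile with the caret theme follows. The quick 5-fold control, the small grid, the seed and the object names are assumptions chosen to keep the example fast.

```r
library(caret)   # also attaches lattice, which provides trellis.par.set()
library(MASS)
library(gbm)

set.seed(123)  # arbitrary seed
inTrain  <- createDataPartition(factor(Boston$chas), p = 0.75, list = FALSE)
training <- Boston[inTrain, ]

gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = c(50, 100, 150),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

gbmFit2 <- train(chas ~ ., data = training, method = "gbm",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = gbmGrid, verbose = FALSE)

# Apply caret's lattice theme, then plot RMSE against the tuning parameters
trellis.par.set(caretTheme())
p <- plot(gbmFit2, metric = "RMSE")
print(p)  # explicit print so the plot is also drawn when run as a script
```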
We can also plot the model using our go-to plotting function, ggplot.
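caret provides a ggplot method for train objects, so the same profile can be drawn with ggplot2. As before, the quick 5-fold control, the grid, the seed and the object names in this sketch are assumptions.

```r
library(caret)   # also attaches ggplot2
library(MASS)
library(gbm)

set.seed(123)  # arbitrary seed
inTrain  <- createDataPartition(factor(Boston$chas), p = 0.75, list = FALSE)
training <- Boston[inTrain, ]

gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = c(50, 100, 150),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

gbmFit2 <- train(chas ~ ., data = training, method = "gbm",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = gbmGrid, verbose = FALSE)

# ggplot() on a train object plots the resampled metric per candidate model
g <- ggplot(gbmFit2)
print(g)
```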
As we conclude, we can confidently state that CARET is a very useful library for streamlining or automating model training, prediction and the visualization of models with CARET themes. And the best part is that we can do this for almost any model we can think of. There is a lot more to come in this library series, because if we know our libraries and the source code of our functions well, we can better define our models.
Previously on library series: Keras library can be found here:
Interested in such readings on Data Science and Machine Learning? Like, share and subscribe to The Datum for weekly algorithm updates. Currently, we are running the Library Series, in which we make the best use of R libraries and their documentation so that we better understand the source code and functions, and thus get the most out of our analytical algorithms!
Your Data Scientifically,