---
title: An R Pipeline for XGBoost (and a discussion about hyperparameters)
author: 'Orry Messer'
date: '2019-10-04'
slug: an-r-pipeline-for-xgboost-and-a-discussion-about-hyperparameters
draft: true
output:
  blogdown::html_page:
    toc: true
    number_sections: true
categories: [data-science]
tags:
  - R
  - data-science
  - xgboost
---

library(Matrix)
library(xgboost)
library(ggplot2)
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:xgboost':
## 
##     slice
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1 Introduction

XGBoost is an implementation of a machine learning technique known as gradient boosting, in which an ensemble of weak learners (here, shallow decision trees) is built up sequentially, with each new tree fitted to correct the errors of the trees that came before it. In this blog post, we discuss what XGBoost is and demonstrate a pipeline for working with it in R.
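
At a high level (this formula is my own addition, sketching the standard gradient-boosting update rather than anything XGBoost-specific), each boosting round m adds a new weak learner h_m to the current ensemble, scaled by a learning rate eta (the same eta that appears among the hyperparameters below):

$$ F_m(x) = F_{m-1}(x) + \eta \, h_m(x) $$

where h_m is fit to the negative gradient of the loss evaluated at the current predictions F_{m-1}(x).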

2 Load And Explore The Data

We will use the Titanic Dataset, which we get from Kaggle. Basically, we try to predict whether a passenger survived or not (so this is a binary classification problem).

Let’s load up the data:

titanic_train <- read_csv("./xg_boost_data/train.csv")
## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )
titanic_test <- read_csv("./xg_boost_data/test.csv")
## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )

For the sake of brevity, I’ll only keep some of the features.

I’ll use dplyr’s select() to do this:

titanic_train <- titanic_train %>%
  select(Survived,
         Pclass,
         Sex,
         Age,
         Embarked)

titanic_test <- titanic_test %>%
  select(Pclass,
         Sex,
         Age,
         Embarked)

Let’s have a look at our data after discarding a few features:

str(titanic_train, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 891 obs. of  5 variables:
##  $ Survived: num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : num  3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ Embarked: chr  "S" "C" "S" "S" ...

XGBoost will only take numeric data as input, so let’s convert our character features to factors and one-hot encode them. We will use sparse.model.matrix() to create a sparse matrix as input for our model. XGBoost has been written to take advantage of sparse matrices, so we want to make use of this feature.

# sparse.model.matrix() will drop rows containing NA's when the global option na.action is "na.omit" (the default); see https://stackoverflow.com/questions/29732720/sparse-model-matrix-loses-rows-in-r
summary(titanic_train) # seems like there are 177 NA's in the Age variable, and 2 in the Embarked variable
##     Survived          Pclass          Sex                 Age       
##  Min.   :0.0000   Min.   :1.000   Length:891         Min.   : 0.42  
##  1st Qu.:0.0000   1st Qu.:2.000   Class :character   1st Qu.:20.12  
##  Median :0.0000   Median :3.000   Mode  :character   Median :28.00  
##  Mean   :0.3838   Mean   :2.309                      Mean   :29.70  
##  3rd Qu.:1.0000   3rd Qu.:3.000                      3rd Qu.:38.00  
##  Max.   :1.0000   Max.   :3.000                      Max.   :80.00  
##                                                      NA's   :177    
##    Embarked        
##  Length:891        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
summary(titanic_test) # 86 NA's in the Age variable
##      Pclass          Sex                 Age          Embarked        
##  Min.   :1.000   Length:418         Min.   : 0.17   Length:418        
##  1st Qu.:1.000   Class :character   1st Qu.:21.00   Class :character  
##  Median :3.000   Mode  :character   Median :27.00   Mode  :character  
##  Mean   :2.266                      Mean   :30.27                     
##  3rd Qu.:3.000                      3rd Qu.:39.00                     
##  Max.   :3.000                      Max.   :76.00                     
##                                     NA's   :86
# We don't want to drop rows, so let's replace NA's with a sentinel value; how about -999?
titanic_train[is.na(titanic_train)] <- -999
titanic_test[is.na(titanic_test)] <- -999
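
An alternative to the sentinel value, sketched below, is to temporarily set the global na.action option to "na.pass", so that sparse.model.matrix() keeps rows containing NA’s (xgboost can then treat those entries as missing). We stick with the sentinel here.

# Sketch of the na.pass alternative (not used in this post's pipeline):
previous_na_action <- options('na.action')  # remember the current setting
options(na.action = 'na.pass')              # keep rows containing NA's

# ... build the sparse matrix here; NA's are passed through as missing ...

options(na.action = previous_na_action$na.action)  # restore the previous setting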

titanic_train$Sex <- as.factor(titanic_train$Sex)
titanic_train$Embarked <- as.factor(titanic_train$Embarked)
titanic_train$Pclass <- as.factor(titanic_train$Pclass) # Could be treated as ordinal, but we leave it as strictly categorical
titanic_test$Sex <- as.factor(titanic_test$Sex)
titanic_test$Embarked <- as.factor(titanic_test$Embarked)
titanic_test$Pclass <- as.factor(titanic_test$Pclass) # Could be treated as ordinal, but we leave it as strictly categorical

titanic_train_sparse <- sparse.model.matrix(Survived~., data = titanic_train)[,-1]
# Recall, xgboost takes advantage of the sparsity. Sparsity can be induced from 1-hot encoding.
class(titanic_train_sparse)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

2.1 Interacting with the Sparse Matrix Object

The data are now in the format of the dgCMatrix class - this is the Matrix package’s implementation of a sparse matrix.

Let’s have a look at the structure of the data a little closer:

str(titanic_train_sparse)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:3032] 9 15 17 20 21 33 41 43 53 56 ...
##   ..@ p       : int [1:8] 0 184 675 1252 2143 2311 2388 3032
##   ..@ Dim     : int [1:2] 891 7
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:891] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:7] "Pclass2" "Pclass3" "Sexmale" "Age" ...
##   ..@ x       : num [1:3032] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()

We can check the dimensions of the matrix directly:

dim(titanic_train_sparse)
## [1] 891   7
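
As a small aside (my own sketch, not part of the original pipeline), we can also check how sparse the one-hot encoded matrix actually is, since only non-zero entries are stored:

n_nonzero <- Matrix::nnzero(titanic_train_sparse)   # number of stored non-zero entries
n_nonzero / prod(dim(titanic_train_sparse))         # fraction of cells that are non-zero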

The names of the features are given by titanic_train_sparse@Dimnames[[2]]:

head(titanic_train_sparse@Dimnames[[2]])
## [1] "Pclass2"   "Pclass3"   "Sexmale"   "Age"       "EmbarkedC" "EmbarkedQ"

If needed, you can convert this data back into a data frame, thusly:

train_data_as_df <- as.data.frame(as.matrix(titanic_train_sparse))

3 Hyperparameters

Tuning hyperparameters is a vast topic. Without going into too much depth, I’ll outline some of the more commonly used hyperparameters:

# Full reference: https://xgboost.readthedocs.io/en/latest/parameter.html
#### Tree booster params...####
# eta:                              default = 0.3
#                                   learning rate / shrinkage. Scales the contribution of each tree by a factor of 0 < eta < 1
# gamma:                            default = 0
#                                   minimum loss reduction needed to make another partition in a given tree.
#                                   larger the value, the more conservative the tree will be (as it will need to make a bigger reduction to split)
#                                   So, conservative in the sense of willingness to split.
# max_depth:                        default = 6
#                                   max depth of each tree...
# subsample:                        default = 1 (ie, no subsampling)
#                                   fraction of training samples to use in each "boosting iteration"
# colsample_bytree:     default = 1 (ie, no sampling)
#                       Fraction of columns to be used when constructing each tree. This is an idea used in RandomForests
# min_child_weight:     default = 1
#                       The minimum sum of instance weights (hessian) needed in a child node - roughly, the minimum
#                       number of instances each leaf must contain. It's a regularization parameter.
#                       So, if it's set to 10, each leaf has to have (roughly) at least 10 instances assigned to it.
#                       The higher the value, the more conservative the tree will be.

(I’ve left it as commented code, as I like to paste this into my scripts as a quick reference.)

Let’s create the hyper-parameters list:

params_booster <- list(
  booster = 'gbtree', # Possible to also have linear boosters as your weak learners.
  eta = 1, 
  gamma = 0,
  max.depth = 2, 
  subsample = 1, 
  colsample_bytree = 1,
  min_child_weight = 1, 
  objective = "binary:logistic"
)

bstSparse <- xgboost(data = titanic_train_sparse, 
                     label = titanic_train$Survived, 
                     nrounds = 100,  
                     params = params_booster)

4 Training The Model

The xgb.train() and xgboost() functions are used to train the boosting model, and both return an object of class xgb.Booster. Before settling on a final model, though, let’s first use xgb.cv() to get some understanding of our performance before we evaluate against our final hold-out test set. It is important to note that xgb.cv() returns an object of type xgb.cv.synchronous, not xgb.Booster. So you won’t be able to call functions like xgb.importance() on it, as xgb.importance() takes an object of class xgb.Booster, not xgb.cv.synchronous.

# NB: keep in mind xgb.cv() is used to select the correct hyperparams.
# Once you have them, train using xgb.train() or xgboost() to get the final model.

bst.cv <- xgb.cv(data = titanic_train_sparse, 
              label = titanic_train$Survived, 
              params = params_booster,
              nrounds = 300, 
              nfold = 5,
              print_every_n = 20,
              verbose = 2)

Note that we can also implement early stopping by passing, for example, early_stopping_rounds = 3, so that if there has been no improvement in the cross-validated error for 3 consecutive rounds, training stops.
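
As a rough sketch (bst.cv.early is a hypothetical name, and I'm assuming the usual early-stopping fields on the returned object), that would look like:

bst.cv.early <- xgb.cv(data = titanic_train_sparse, 
                       label = titanic_train$Survived, 
                       params = params_booster,
                       nrounds = 300, 
                       nfold = 5,
                       early_stopping_rounds = 3, # stop if the test error hasn't improved for 3 rounds
                       print_every_n = 20,
                       verbose = 2)

# bst.cv.early$best_iteration should then hold the round with the best cross-validated error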

res_df <- data.frame(tr = bst.cv$evaluation_log$train_error_mean, 
                     val = bst.cv$evaluation_log$test_error_mean,
                     iter = bst.cv$evaluation_log$iter)

g <- ggplot(res_df, aes(x=iter)) +        # Look @ it overfit.
  geom_line(aes(y=tr), color = "blue") +
  geom_line(aes(y=val), colour = "green")

g
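
Finally, since xgb.importance() works on xgb.Booster objects (as noted above), here is a minimal sketch of inspecting feature importance for the bstSparse model we trained earlier:

importance_matrix <- xgb.importance(model = bstSparse)  # Gain, Cover and Frequency for each feature
importance_matrix
xgb.plot.importance(importance_matrix)                  # bar chart of relative importance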
