What Are Random Forests?
The random forest algorithm helps us classify items into groups. There are two kinds of random forest: one for regression and one for classification. Both share the same underlying philosophy.
Before getting into random forests, it is important to understand how the decision tree algorithm works. A decision tree searches for the features that split the data so that the resulting groups are as different from each other as possible, while the members within each group are as similar to each other as possible.
Classification
See the plot below separating the blue circles from the red triangles: the decision tree algorithm splits at point 3 and point 4, drawing straight lines that make the resulting clusters as different from each other as possible. Put simply, it takes the majority view within each region and classifies the data accordingly.
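As a minimal sketch of that splitting behaviour, here is a tiny classification tree fit with the rpart package on the built-in iris data (neither is used elsewhere in this post, so treat it as an illustrative assumption, not the data behind the plot):

library(rpart)
# Fit a classification tree on two numeric features; rpart searches for the
# feature/threshold cut points that make the resulting groups most homogeneous.
tree_clf <- rpart(Species ~ Petal.Length + Petal.Width, data = iris, method = "class")
# The printed splits are exactly the straight-line cut points described above.
print(tree_clf)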
Regression
When it comes to regression, the shape of the problem changes because the outcome now depends continuously on the features. The algorithm still divides the data into regions and, within each region, predicts a single constant value; this is why a regression tree is piecewise constant [insert link].
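To see the piecewise-constant behaviour concretely, here is a small sketch, again assuming the rpart package: a regression tree fit to a noisy sine curve can only predict one constant per leaf, so its fitted curve is a step function.

library(rpart)
set.seed(42)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.2)
# method = "anova" (the default for a numeric outcome) grows a regression tree
tree_reg <- rpart(y ~ x, data = data.frame(x, y))
# Each leaf predicts a single constant, so there are far fewer distinct
# fitted values than observations:
length(unique(predict(tree_reg)))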
Random Forest
A random forest then relies on the wisdom of the crowd: the data is resampled into many bootstrap samples, much like in bootstrapping, a separate decision tree is grown on each, and the trees' predictions are aggregated (majority vote for classification, averaging for regression).
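A hand-rolled sketch of that idea follows. It is illustrative only: it shows the bootstrap-and-vote part (bagging) on the built-in iris data, while a real random forest additionally tries only a random subset of features at each split, which this sketch omits.

library(rpart)
set.seed(101)
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  # Bootstrap: resample the rows with replacement, then fit one tree
  boot_rows <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[boot_rows, ], method = "class")
})
# Aggregate: each tree votes and the majority class wins
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(ensemble_pred == iris$Species)  # in-sample accuracy of the ensemble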
Example
library(tidyverse)
library(stringi)
library(styler)
library(caret)
library(klaR)
coffee_ratings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv")
# Drop two high-cardinality columns and remove the low-score outlier
mdata <- coffee_ratings %>%
  dplyr::select(-country_of_origin, -variety) %>%
  filter(total_cup_points > 25)
m1 <- lm(total_cup_points ~
           species +
           aroma +
           flavor +
           acidity +
           color +
           sweetness +
           altitude_mean_meters +
           processing_method,
         data = mdata)
# Keep the model frame: just the columns and complete cases the lm fit used
model_Data <- m1$model
# Convert the outcome into a binary classification problem
set.seed(101) # Set the seed so the same sample can be reproduced later
model_Data$total_cup_points <- ifelse(model_Data$total_cup_points > 82.09,
                                      "Good", "Bad")
# Select 50% of the rows as a training sample
sample <- sample.int(n = nrow(model_Data), size = floor(0.50 * nrow(model_Data)), replace = FALSE)
train <- model_Data[sample, ]
test <- model_Data[-sample, ]
# Define the train control: leave-group-out cross-validation (10 repeated train/test splits)
train_control <- trainControl(method = "LGOCV", number = 10)
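# Note: method = "LGOCV" is leave-group-out CV, not k-fold. If true k-fold
# cross-validation were wanted instead, the control object would be
# (hypothetical alternative, unused below):
# train_control_cv <- trainControl(method = "cv", number = 10)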
# Fit a random forest model
model <- train(total_cup_points ~ .,
               data = train,
               trControl = train_control,
               method = "rf",
               importance = TRUE)
# Summarise Results
print(model)
## Random Forest
##
## 467 samples
## 8 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 75%)
## Summary of sample sizes: 351, 351, 351, 351, 351, 351, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8663793 0.7164522
## 7 0.8646552 0.7168245
## 13 0.8620690 0.7121044
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
summary(model)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 467 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 934 matrix numeric
## oob.times 467 -none- numeric
## classes 2 -none- character
## importance 52 -none- numeric
## importanceSD 39 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 467 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 13 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 1 -none- list
model$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.8663793 0.7164522 0.02412767 0.05055639
## 2 7 0.8646552 0.7168245 0.02539489 0.05214482
## 3 13 0.8620690 0.7121044 0.02537862 0.05390591
model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.92%
## Confusion matrix:
## Bad Good class.error
## Bad 147 39 0.20967742
## Good 26 255 0.09252669
rfImp <- varImp(model)
rfImp
## rf variable importance
##
## Importance
## flavor 100.000
## acidity 85.754
## aroma 66.694
## sweetness 30.981
## altitude_mean_meters 18.404
## colorBluish-Green 14.379
## processing_methodOther 9.987
## colorGreen 9.783
## processing_methodSemi-washed / Semi-pulped 7.041
## processing_methodPulped natural / honey 6.773
## colorNone 5.189
## speciesRobusta 3.408
## processing_methodWashed / Wet 0.000
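The 50% holdout created earlier is never actually scored in this draft. A natural follow-up, sketched here without its output since it was not run above, is to predict on the test set and compare against the resampling estimate:

# Score the tuned forest on the held-out test set
test_pred <- predict(model, newdata = test)
confusionMatrix(test_pred, factor(test$total_cup_points))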