What Are Random Forests?
The random forest algorithm helps us classify items into groups. There are two kinds of random forest: one for regression and one for classification. Both share the same underlying philosophy.
Before getting into random forests, it is important to understand how the decision tree algorithm works. A decision tree searches for the features that split the data so that the resulting groups are as different from each other as possible, while the members within each group are as similar to each other as possible.
Classification
See the plot below separating the blue circles from the red triangles: the decision tree algorithm splits at point 3 and point 4, drawing straight lines that make the resulting clusters as different from each other as possible. Put simply, it takes the majority view within each region and classifies the data accordingly.
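As a minimal sketch of that splitting behaviour, here is a tiny classification tree fit with the rpart package on the built-in iris data (neither is used elsewhere in this post, so treat it as an illustrative assumption, not the data behind the plot):

library(rpart)
# Fit a classification tree on two numeric features; rpart searches for the
# feature/threshold cut points that make the resulting groups most homogeneous.
tree_clf <- rpart(Species ~ Petal.Length + Petal.Width, data = iris, method = "class")
# The printed splits are exactly the straight-line cut points described above.
print(tree_clf)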
Regression
When it comes to regression, the shape of the problem changes because the outcome now depends continuously on the features. The algorithm still divides the data into regions and, within each region, predicts a single constant value; this is why a regression tree is piecewise constant [insert link].
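To see the piecewise-constant behaviour concretely, here is a small sketch, again assuming the rpart package: a regression tree fit to a noisy sine curve can only predict one constant per leaf, so its fitted curve is a step function.

library(rpart)
set.seed(42)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.2)
# method = "anova" (the default for a numeric outcome) grows a regression tree
tree_reg <- rpart(y ~ x, data = data.frame(x, y))
# Each leaf predicts a single constant, so there are far fewer distinct
# fitted values than observations:
length(unique(predict(tree_reg)))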
Random Forest
A random forest then relies on the wisdom of the crowd: the data is resampled into many bootstrap samples, much like in bootstrapping, a separate decision tree is grown on each, and the trees' predictions are aggregated (majority vote for classification, averaging for regression).
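A hand-rolled sketch of that idea follows. It is illustrative only: it shows the bootstrap-and-vote part (bagging) on the built-in iris data, while a real random forest additionally tries only a random subset of features at each split, which this sketch omits.

library(rpart)
set.seed(101)
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  # Bootstrap: resample the rows with replacement, then fit one tree
  boot_rows <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[boot_rows, ], method = "class")
})
# Aggregate: each tree votes and the majority class wins
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(ensemble_pred == iris$Species)  # in-sample accuracy of the ensemble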
Example
library(tidyverse)
library(stringi)
library(styler)
library(caret)
library(klaR)
coffee_ratings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv")
# Drop two high-cardinality columns and remove the low-score outlier
mdata <- coffee_ratings %>%
  dplyr::select(-country_of_origin, -variety) %>%
  filter(total_cup_points > 25)
m1 <- lm(total_cup_points ~
           species +
           aroma +
           flavor +
           acidity +
           color +
           sweetness +
           altitude_mean_meters +
           processing_method,
         data = mdata)
# Keep the model frame: just the columns and complete cases the lm fit used
model_Data <- m1$model
# Convert the outcome into a binary classification problem
set.seed(101) # Set the seed so the same sample can be reproduced later
model_Data$total_cup_points <- ifelse(model_Data$total_cup_points > 82.09,
                                      "Good", "Bad")
# Select 50% of the rows as a training sample
sample <- sample.int(n = nrow(model_Data), size = floor(0.50 * nrow(model_Data)), replace = FALSE)
train <- model_Data[sample, ]
test <- model_Data[-sample, ]
# Define the train control: leave-group-out cross-validation (10 repeated train/test splits)
train_control <- trainControl(method = "LGOCV", number = 10)
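# Note: method = "LGOCV" is leave-group-out CV, not k-fold. If true k-fold
# cross-validation were wanted instead, the control object would be
# (hypothetical alternative, unused below):
# train_control_cv <- trainControl(method = "cv", number = 10)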
# Fit a random forest model
model <- train(total_cup_points ~ .,
               data = train,
               trControl = train_control,
               method = "rf",
               importance = TRUE)
# Summarise Results
print(model)
## Random Forest
##
## 467 samples
## 8 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 75%)
## Summary of sample sizes: 351, 351, 351, 351, 351, 351, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8663793 0.7164522
## 7 0.8646552 0.7168245
## 13 0.8620690 0.7121044
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
summary(model)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 467 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 934 matrix numeric
## oob.times 467 -none- numeric
## classes 2 -none- character
## importance 52 -none- numeric
## importanceSD 39 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 467 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 13 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 1 -none- list
model$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.8663793 0.7164522 0.02412767 0.05055639
## 2 7 0.8646552 0.7168245 0.02539489 0.05214482
## 3 13 0.8620690 0.7121044 0.02537862 0.05390591
model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.92%
## Confusion matrix:
## Bad Good class.error
## Bad 147 39 0.20967742
## Good 26 255 0.09252669
rfImp <- varImp(model)
rfImp
## rf variable importance
##
## Importance
## flavor 100.000
## acidity 85.754
## aroma 66.694
## sweetness 30.981
## altitude_mean_meters 18.404
## colorBluish-Green 14.379
## processing_methodOther 9.987
## colorGreen 9.783
## processing_methodSemi-washed / Semi-pulped 7.041
## processing_methodPulped natural / honey 6.773
## colorNone 5.189
## speciesRobusta 3.408
## processing_methodWashed / Wet 0.000
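The 50% holdout created earlier is never actually scored in this draft. A natural follow-up, sketched here without its output since it was not run above, is to predict on the test set and compare against the resampling estimate:

# Score the tuned forest on the held-out test set
test_pred <- predict(model, newdata = test)
confusionMatrix(test_pred, factor(test$total_cup_points))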