Random forest is one of the most popular algorithms for many machine learning tasks. This story looks into random forest regression in R, focusing on understanding the output and variable importance. If you prefer Python code, here you go.

## Decision Trees and Random Forest

When decision trees came to the scene in 1984, they were better than classic multiple regression. One of the reasons is that decision trees are easy on the eyes: people without a degree in statistics can easily interpret the results in the form of branches.

Additionally, decision trees help you avoid the synergy effects of interdependent predictors in multiple regression. A synergy (interaction/moderation) effect occurs when the effect of one predictor depends on another predictor. Here is a nice example from a business context.

On the other hand, regression trees are not very stable: a slight change in the training set can mean a great change in the structure of the whole tree. Simply put, they are not very accurate.

But what if you combine multiple trees? Randomly created decision trees make up a random forest, a type of ensemble modeling based on bootstrap aggregating, i.e. bagging. First, you create various decision trees on bootstrapped versions of your dataset, i.e. by random sampling with replacement (see the image below). Next, you aggregate (e.g. average) the individual predictions over the decision trees into the final random forest prediction.

Notice that we skipped some observations, namely Istanbul, Paris and Barcelona. These observations, i.e. rows, are called out-of-bag and are used for prediction error estimation.

## Random Forest Regression in R

Based on CRAN's list of packages, 63 R libraries mention random forest. I recommend you go over the options, as they range from Bayesian random forests to clinical and omics-specific libraries. You could potentially find a random forest regression that fits your use case better than the original version.
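To make the bagging idea concrete, here is a minimal by-hand sketch using single `rpart` trees. This is an illustration of bootstrap aggregating only, not how `randomForest` works internally (a real random forest also samples a random subset of predictors at each split):

```r
# Bagging by hand on mtcars (illustration only)
library(rpart)  # single regression trees

set.seed(42)
data(mtcars)
n_trees <- 100
preds <- matrix(NA, nrow = nrow(mtcars), ncol = n_trees)

for (b in 1:n_trees) {
  # bootstrap: sample rows with replacement
  idx  <- sample(nrow(mtcars), replace = TRUE)
  tree <- rpart(mpg ~ ., data = mtcars[idx, ])
  preds[, b] <- predict(tree, newdata = mtcars)
}

# aggregate: average the per-tree predictions
bagged_pred <- rowMeans(preds)
head(bagged_pred)
```

The rows left out of each bootstrap sample are exactly the out-of-bag observations mentioned above.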
Still, I wouldn't use one if you can't find the details of how exactly it improves on Breiman's and Cutler's implementation. If you have no idea, it's safer to go with the original: randomForest.

Code-wise, it's pretty simple, so I will stick to the example from the documentation using the 1974 Motor Trend data.

```r
### Import libraries
library(randomForest)
library(ggplot2)

set.seed(4543)
data(mtcars)
rf.fit <- randomForest(mpg ~ ., data=mtcars, ntree=1000,
                       keep.forest=FALSE, importance=TRUE)
```

I will specifically focus on understanding the performance and variable importance. So after we run the piece of code above, we can check out the results by simply running rf.fit.

```
> rf.fit

Call:
 randomForest(formula = mpg ~ ., data = mtcars, ntree = 1000, keep.forest = FALSE, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 1000
No. of variables tried at each split: 3

          Mean of squared residuals: 5.587022
                    % Var explained: 84.12
```

Notice that the function ran random forest regression, and we didn't need to specify that. It will perform nonlinear multiple regression as long as the target variable is numeric (in this example, it is miles per gallon, mpg). But, if it makes you feel better, you can add type="regression".

The mean of squared residuals and % variance explained indicate how well the model fits the data. Residuals are the difference between the prediction and the actual value. Because the residuals are squared, 5.6 corresponds to a typical error of about 2.4 miles/gallon (the square root of the mean squared residual). If you want a deep understanding of how this is calculated per decision tree, watch this video.

You can experiment with, i.e. increase or decrease, the number of trees (ntree) or the number of variables tried at each split (mtry) and see whether the residuals or % variance change.

If you also want to understand what the model has learnt, make sure that you set importance=TRUE as in the code above.
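One quick way to run that experiment (a sketch; the mtry values here are just illustrative) is to refit over a small grid and compare the out-of-bag mean of squared residuals, which randomForest stores per tree in the `mse` component:

```r
library(randomForest)

set.seed(4543)
data(mtcars)

# OOB MSE for a few mtry values, with ntree fixed at 1000
for (m in c(2, 3, 5, 10)) {
  fit <- randomForest(mpg ~ ., data = mtcars, ntree = 1000, mtry = m)
  # the last element of fit$mse is the OOB MSE using all 1000 trees
  cat("mtry =", m, " OOB MSE =", round(tail(fit$mse, 1), 2), "\n")
}
```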
Random forest regression in R provides two importance measures: the permutation-based increase in mean squared error (%IncMSE) and node purity (IncNodePurity).

The MSE-based measure is computed by permuting the out-of-bag section of the data for each tree and predictor; the resulting prediction errors are then averaged over all trees. Node purity, in the regression context, is the total decrease in residual sum of squares from splitting on a variable, averaged over all trees (i.e. how well a predictor decreases variance).

MSE is the more reliable measure of variable importance. If the two importance metrics show different results, listen to MSE. If all of your predictors are numerical, then it shouldn't be too much of an issue: read more here.

The built-in varImpPlot() will visualize the results, but we can do better. Here, we combine both importance measures into one plot, emphasizing the MSE results.

```r
### Visualize variable importance ----------------------------------------------

# Get variable importance from the model fit
ImpData <- as.data.frame(importance(rf.fit))
ImpData$Var.Names <- row.names(ImpData)

ggplot(ImpData, aes(x=Var.Names, y=`%IncMSE`)) +
  geom_segment(aes(x=Var.Names, xend=Var.Names, y=0, yend=`%IncMSE`), color="skyblue") +
  geom_point(aes(size = IncNodePurity), color="blue", alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    legend.position="bottom",
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank()
  )
```

## Conclusion

In terms of assessment, it always comes down to some theory or logic behind the data. Do the top predictors make sense? If not, investigate why.

"Rome was not built in one day, nor was any reliable model." (Good & Hardin, 2012)

Modeling is an iterative process. You can get a better idea of the predictive error of your random forest regression when you save some data for performance testing only. You might also want to try out other methods.
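As a final sketch, that performance check with held-out data might look like the following. Note that keep.forest must stay TRUE (its default) here, unlike in the earlier fit, so that predict() can score new rows; the split size is just illustrative:

```r
library(randomForest)

set.seed(4543)
data(mtcars)

# Hold out ~25% of rows for performance testing only
test_idx <- sample(nrow(mtcars), size = 8)
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

# keep.forest defaults to TRUE, which we need for prediction on new data
rf <- randomForest(mpg ~ ., data = train, ntree = 1000)

pred     <- predict(rf, newdata = test)
test_mse <- mean((test$mpg - pred)^2)
test_mse
```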