What Are the Advantages and Disadvantages of Random Forest?

Written by Coursera Staff • Updated on

Explore random forests, a popular machine learning algorithm, in more detail by delving into the advantages, disadvantages, and exciting industry applications of this technique.


Random forests are one of the most widely used machine learning algorithms because they can handle classification and regression problems effectively. However, despite their strengths, random forests can be resource-intensive, requiring significant computational power and memory, especially as the number of trees grows. As you explore machine learning techniques, understanding the advantages and disadvantages of random forest can help you make informed decisions about applying this algorithm and determining whether it might be right for your specific purposes, from predicting real estate housing prices to discovering biomarkers for disease.

What is a random forest in machine learning?

A random forest model is an ensemble machine learning technique that combines many decision trees to make predictions. An ensemble machine learning technique is a method that combines several individual models, improving the accuracy of predictions beyond what a single model could do. When you create a random forest, your model builds many trees using random samples of your data, drawn with replacement. Using these trees together, the algorithm combines the results to make an informed prediction. 
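The bootstrap sampling described above can be sketched in a few lines. This is an illustrative toy example (the ten-row index array is hypothetical), showing only the "random samples drawn with replacement" step that each tree trains on:

```python
import numpy as np

# Hypothetical toy dataset: 10 rows, indexed 0-9.
rng = np.random.default_rng(seed=42)
data_indices = np.arange(10)

# Each tree in a random forest trains on a bootstrap sample:
# the same number of rows, drawn *with replacement*.
bootstrap = rng.choice(data_indices, size=len(data_indices), replace=True)

# Because sampling is with replacement, some rows repeat and
# others are left out entirely ("out-of-bag" samples).
out_of_bag = set(data_indices) - set(bootstrap)
print(sorted(set(bootstrap)))
print(sorted(out_of_bag))
```

Each tree sees a slightly different slice of the data this way, which is what makes the trees' errors partly independent when their predictions are combined.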

Real-life example of random forest

To better understand how a random forest works, consider that you are predicting whether your friend will enjoy a new restaurant in town. Instead of asking just one friend (one tree) if they liked the restaurant and making a prediction based on that single data point, you gather information from as many people as you can find who have tried the restaurant. Each person (tree) gives you a “yes” or “no” answer based on their own criteria, such as the food, ambience, personal taste, and so on. At the end, you consider which answer had the most votes and use this as your prediction for whether your friend will enjoy the restaurant. 
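The restaurant analogy maps directly onto majority voting. Here is a minimal sketch of that aggregation step, with made-up "yes"/"no" votes standing in for individual trees' predictions:

```python
from collections import Counter

# Hypothetical votes from seven friends (trees): will your
# friend enjoy the restaurant?
votes = ["yes", "no", "yes", "yes", "no", "yes", "yes"]

# A random forest classifier aggregates its trees the same way:
# the class with the most votes becomes the prediction.
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)
```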


Benefits of random forest

Random forests are one of the most widely used machine learning algorithms thanks to their versatility, feature importance evaluation, reduced overfitting risk, tolerance for incomplete data, and customizable hyperparameters. Consider each benefit in more detail when deciding whether this algorithm is right for you.

Offers versatile applications

You can use random forests for classification and regression tasks, meaning once you take time to learn this skill, you may utilize it for a wide variety of professional applications.

For classification tasks, a random forest uses each tree to predict a class label (categorical outcome) for an input and then “counts” the votes for each class. The class with the most votes becomes the output. For regression, the random forest averages each tree’s numeric prediction to estimate the outcome variable (continuous variable).
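Both task types use the same API shape in scikit-learn. This sketch fits one classifier and one regressor on synthetic data (generated here only for illustration, not from the article):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data stands in for a real dataset.
X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)

# Classification: each tree votes, the majority class wins.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_cls, y_cls)

# Regression: the forest averages each tree's numeric prediction.
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_reg, y_reg)

print(clf.predict(X_cls[:1]), reg.predict(X_reg[:1]))
```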

Appropriate for data with missing values

Data in the real world is not always clean or complete, which can pose challenges during your analysis. However, random forests handle incomplete data well: feature bagging (training each tree on a random subset of features) limits any single feature’s influence, and missing values can be estimated or imputed before training. That means you can often still use this model when your data isn’t complete, with little loss of accuracy.
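One common way to handle gaps in practice is to pair the forest with an imputation step. This is a sketch of that approach, not the only one; the tiny dataset and the median-fill strategy are illustrative choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset with missing entries (np.nan).
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan],
              [5.0, 6.0], [2.0, 2.5], [np.nan, 5.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# Fill missing values (here with the column median), then fit
# the forest on the completed data.
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=50, random_state=0))
model.fit(X, y)
print(model.predict([[np.nan, 2.0]]))
```

The pipeline applies the same imputation to new inputs at prediction time, so rows with missing values can still be scored.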

Reduces risk of overfitting

When choosing random forests, you reduce the risk of overfitting, a common issue with single decision trees. By averaging the predictions of multiple trees, random forests aim to prevent the model from becoming too closely aligned with the training data. This helps your model more effectively generalize to new, unseen data. 
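You can see this effect directly by comparing a lone decision tree with a forest on held-out data. The synthetic dataset below is only a stand-in; the pattern, not the exact scores, is the point:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# An unpruned single tree typically fits the training data perfectly
# (a sign of overfitting) but generalizes worse than the averaged forest.
print("tree   train/test:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest train/test:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```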

Estimates feature importance

With random forest, you can measure the relative importance of each feature in your model. As an analyst, you might want to know which variables are the most influential when predicting or classifying an outcome. By identifying these key features, you can target your efforts toward the most impactful aspects of your data. This can help you drop low-impact variables and create a more efficient model.
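Fitted scikit-learn forests expose this measure through the `feature_importances_` attribute. In the sketch below, the synthetic data is built so that only the first three features carry signal, which the importances should reflect:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Features 0-2 are informative; the remaining three are noise.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ sums to 1; higher values mean more influence.
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

Low-importance features identified this way are candidates for dropping when you want a leaner model.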

Allows for customized hyperparameters

You can customize your random forest model through hyperparameter tuning, which allows you to optimize it based on your priorities by tweaking different variables. For example, you might focus on increasing predictive power or optimizing for speed.
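A standard way to tune those hyperparameters is a cross-validated grid search. The grid below is a small illustrative one (the values are not recommendations), trading predictive power against speed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Illustrative grid: more/deeper trees favor accuracy,
# fewer/shallower trees favor speed.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```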

Limitations of random forest

As with any method, considering the limitations alongside the benefits can help you form a comprehensive view of the algorithm. With random forests, a few of the limitations to be aware of include the following:

Larger forests increase run time.

While random forests provide higher predictive power by averaging multiple decision trees, having a high volume of trees can slow the algorithm by increasing run time. The more trees involved, the longer it takes for the model to process new data and generate predictions, which can be a limitation for tasks requiring real-time results. 
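Training cost grows roughly linearly with the number of trees, which you can observe with a quick timing sketch (absolute times will vary by machine; `n_jobs=-1` can spread the work across CPU cores):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

def fit_time(n_trees):
    # Time a single fit with the given number of trees.
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, y)
    return time.perf_counter() - start

print(f"10 trees:  {fit_time(10):.2f}s")
print(f"200 trees: {fit_time(200):.2f}s")
```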

Random forests can be resource-intensive.

Random forests often work with large, complex datasets, handling thousands of input variables without sacrificing accuracy. This means that as your data grows, the algorithm can keep up without your needing to reduce the number of variables. The trade-off is that random forest models require substantial memory and storage, which can be resource-intensive. If the number of decision trees in your model is exceptionally high, you may have difficulty running the algorithm on devices with limited capacity, or the process may take longer to complete.

They’re more complex than standard decision trees.

Random forests are inherently more complex than a single decision tree. This added complexity can make them harder to interpret and explain, especially for those without a technical background. 

Common uses of random forest

You can use random forests in various professional fields for classification and regression tasks. Some common ones include:

Finance 

Professionals in finance use random forest algorithms to evaluate credit risk for customers, detect fraudulent transactions, and predict stock price movements. For example, as a risk analyst, you might use machine learning and random forest algorithms to develop credit risk models, calculating the probability of a borrower defaulting on a loan. As a fraud analyst, you might use random forest models to flag fraudulent activity in transaction data.

Health care 

You can use random forest models in health care to classify gene expression data, discover disease biomarkers, identify ways to improve patient outcomes, and classify medical images. As a health researcher, you might use random forest models to help predict medical events before they occur. As a radiologist, you might use random forest classification algorithms to help with diagnostic and prognostic tasks in the medical field.

E-commerce 

Random forest models can help you power recommendation engines, identify new marketing opportunities, and suggest products for cross-selling or upselling based on user behavior and preferences. If you work on the marketing team for a certain product, you might input customer information such as geographic location, demographic data, and pages visited into a random forest algorithm. From this, you can predict products a customer will like and create more effective recommendation strategies.

Real estate 

You can use random forest algorithms to predict housing prices based on location, population, and other predictors. As a real estate agent, you might use algorithms like this to understand how a house should be priced. As an investor, you can make informed decisions about which properties might bring the most profit.

Energy 

Random forest models help professionals enhance building energy efficiency, predict energy consumption, and optimize electric grid use. As an industrial engineer, your role may be to help minimize economic and environmental waste by designing efficient energy systems. Random forest models can provide recommendations for how to do this, validating or assisting your designs.

How to start learning to use random forests

Learning an appropriate software programming language is an important first step in your machine learning journey. Python, which many consider the most popular language for machine learning and artificial intelligence, is a good place to start. Python’s scikit-learn package is one of the best libraries for implementing random forests thanks to its built-in RandomForestClassifier and RandomForestRegressor classes. These classes and the library’s other tools allow you to build your model, visualize it, and improve your prediction accuracy with built-in controls.
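A complete first experiment with those classes fits in a dozen lines. This sketch uses the iris dataset bundled with scikit-learn, scores the model on held-out data, and peeks at one of the fitted trees:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text

# The classic iris dataset ships with scikit-learn.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Each fitted tree is inspectable; here is the first one as text.
print(export_text(clf.estimators_[0], max_depth=2))
```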

Keep exploring on Coursera

Random forests offer increased versatility while reducing overfitting risk, providing feature importance measures and enhanced hyperparameter control. However, while their tolerance for high data volumes can improve output accuracy, it can also increase resource requirements and slow training and prediction. To start building exciting programming skills and trying random forests, you can take top courses, Specializations, and guided projects on Coursera. Consider the Machine Learning Specialization from Stanford and DeepLearning.AI for a beginner-level introduction to machine learning models.



This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.