Reinforcement learning algorithms allow artificial intelligence agents to learn the optimal way to perform a task through trial and error without human intervention. Explore reinforcement learning algorithms such as Q-learning and actor-critic.
Reinforcement learning is a machine learning method in which computers, robots, or other AI models find the best way to accomplish a goal through trial and error, without a computer scientist or other person showing them what to do. Reinforcement learning allows the AI to evaluate its own decisions and estimate how much each one contributes toward reaching the goal, ultimately finding the solution with the highest value. Different reinforcement learning algorithms take different approaches to this process, primarily in how they determine the value of actions within the decision-making process and learn through trial and error. Finally, if working in artificial intelligence sounds interesting, you might like to know that you can earn an excellent salary: according to Glassdoor, the median annual salary for a machine learning engineer is $121,324 [1].
Explore how reinforcement learning algorithms work, along with examples such as Q-learning, SARSA, and DDPG.
Reinforcement learning algorithms learn in a way that might remind you of how humans learn—through trying things and determining whether that attempt was good. If you want to learn how to do something, such as how to play chess, one way to learn is to sit down in front of a chess board and start playing. As a beginner, you’re guaranteed to make a lot of mistakes. Every time you make a mistake, you can learn from what went wrong and be a stronger player in the next game. After playing many, many games of chess, you’ll start to understand the best way to dominate your opponent no matter what maneuvers they attempt.
Reinforcement learning works similarly. The algorithm attempts to accomplish a goal and then evaluates its own performance, adjusting its decision-making process based on the feedback it gives itself about its actions. In this way, it uses rewards and punishments to learn the best way to accomplish a goal, much as you might. Chess offers a real-world example of this process: DeepMind's AlphaZero, an AI model that plays chess, shogi, and Go, learned through exactly this kind of self-directed trial and error.
Reinforcement learning uses the Markov decision process, a sequential decision-making process based on mathematics, to evaluate the immediate and cumulative rewards of certain actions. The AI model will first explore its environment by trying different actions and then evaluate whether those actions move the state toward the final goal. By looking at both the immediate and long-term rewards of certain decisions, the AI model can choose the solution with the most value.
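To make the idea of weighing immediate against cumulative rewards concrete, here is a minimal Python sketch of a discounted return, the quantity an agent in a Markov decision process tries to maximize. The reward sequences and the discount factor are illustrative assumptions, not values from any particular system:

```python
# Minimal sketch of cumulative (discounted) reward in a Markov decision
# process. The reward sequences and discount factor are illustrative.

def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, weighting a reward t steps away by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A small immediate reward versus a larger reward several steps later:
print(discounted_return([1, 0, 0, 0]))   # 1.0
print(discounted_return([0, 0, 0, 10]))  # 7.29: the delayed reward still wins
```

Because the discount factor shrinks distant rewards without erasing them, the agent can trade a small immediate payoff for a larger long-term one.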
Reinforcement learning algorithms can be differentiated as model-based or model-free, which describes whether the AI model builds an internal model of its environment or not. In a controlled, unchanging environment, an AI model may build a map or model of its environment in order to determine the optimum way to navigate the space. For example, a robot that serves drinks at a restaurant may create a map of the area to determine the best path to each table. Using the model, the AI can make predictions about the best action without actually moving through the space to see what happens. This is a model-based reinforcement learning algorithm.
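As a rough sketch of what planning with a model looks like, the following Python example runs value iteration on a made-up four-room hallway whose dynamics are fully known. The environment, rewards, and discount factor are all assumptions for illustration:

```python
# Sketch of model-based planning: with a known model of a tiny four-room
# hallway (the robot can only move right; the last room holds the goal),
# the agent can compute state values without taking a single real step.
gamma = 0.9
n_rooms = 4
values = [0.0] * n_rooms

for _ in range(50):                      # repeat Bellman backups until stable
    for room in range(n_rooms - 1):
        reward = 1.0 if room + 1 == n_rooms - 1 else 0.0
        values[room] = reward + gamma * values[room + 1]

print(values)  # roughly [0.81, 0.9, 1.0, 0.0]: values rise toward the goal
```

Because the model tells the agent where each move leads, these values can be computed entirely "in its head," before the robot moves at all.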
In more complex or dynamic environments, the model-free agent will learn directly through trial and error because it cannot build an internal model in the same way. For example, a self-driving car can’t map out the space it will move through because of the variability of other drivers, pedestrians, road conditions, and other factors. The AI must instead learn through trying different things and seeing what works, although in this case learning happens within a virtual environment, so the agent can experiment freely without endangering anyone.
Some of the most common reinforcement learning algorithms are Q-learning, SARSA, REINFORCE, PPO, TRPO, A2C and A3C, and DDPG. These algorithms differ in how they allow the main components of reinforcement learning to interact (i.e., the agent, environment, policy, and reward). The agent is the AI model, the environment is everything the AI model interacts with, the policy is the programming or instructions the AI model has, and the reward is a score representing the value of an action. Each reinforcement learning algorithm has a different approach to implementing these four primary components.
Q-learning or Deep Q-Networks (DQNs): Q-learning is a model-free, off-policy algorithm, meaning an AI model can learn without any prior knowledge of its environment and can deviate from the policy it follows while exploring. A Q-learning agent builds its own set of rules by predicting the reward (Q-value) of each action in each state and choosing accordingly. Because of this, you can use Q-learning in uncontrolled or unpredictable environments. Combining Q-learning with a neural network that approximates the Q-values, rather than storing them in a table, gives you a DQN.
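Here is a minimal sketch of the core tabular Q-learning update, assuming a toy setup with two actions and made-up hyperparameters and state names:

```python
# Sketch of the tabular Q-learning update. The two-action setup, the
# hyperparameters, and the example transition are illustrative assumptions.
import random
from collections import defaultdict

Q = defaultdict(float)                  # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
actions = ["left", "right"]

def choose_action(state):
    """Epsilon-greedy: usually exploit the best-known action, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # Off-policy: the target assumes the *best* next action will be taken,
    # even if the exploratory behavior policy ends up picking something else.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

q_update("hallway", "right", 1.0, "goal")   # one hypothetical experience
```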
SARSA (State, Action, Reward, State, Action): SARSA is also a model-free algorithm and closely resembles Q-learning, but it is on-policy: it updates its value estimates using the action the agent actually takes next, rather than the best possible next action, so it learns the value of the policy it is really following.
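For contrast with the Q-learning sketch above, here is the corresponding SARSA update under the same illustrative assumptions. The only change is that the target uses the action the agent actually chose next:

```python
# Sketch of the tabular SARSA update, for contrast with Q-learning above.
# The Q table, hyperparameters, and example transition are illustrative.
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.9

def sarsa_update(state, action, reward, next_state, next_action):
    # On-policy: the target uses the action the agent actually takes next,
    # the second "A" in State-Action-Reward-State-Action.
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])

sarsa_update("hallway", "right", 1.0, "goal", "right")
```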
REINFORCE: The REINFORCE algorithm is a type of policy-gradient algorithm, which means it adjusts its policy directly as it learns, making actions that led to high returns more likely rather than learning a separate value for each action. Because REINFORCE improves the same policy that generates its experience, it's considered an on-policy algorithm.
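The following is a minimal REINFORCE sketch on a one-step task with two actions. The softmax policy, reward function, learning rate, and random seed are all illustrative assumptions:

```python
# Minimal REINFORCE sketch: a softmax policy over two actions on a one-step
# task, updated with the log-probability gradient scaled by the return.
import numpy as np

theta = np.zeros(2)                   # one preference score per action
alpha = 0.1                           # learning rate
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(500):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    ret = 1.0 if action == 1 else 0.0    # action 1 is secretly better
    grad_log = -probs                    # gradient of log pi(action):
    grad_log[action] += 1.0              #   one-hot(action) - probs
    theta += alpha * ret * grad_log      # reinforce high-return actions

print(softmax(theta))  # probability of action 1 approaches 1
```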
Actor-critic and A2C: Actor-critic algorithms use two neural networks, one as the actor to select actions and the other as the critic to evaluate them. The actor follows the current policy while the critic's evaluations adjust that policy after each iteration. This architecture can help you get the best of both value-based and policy-based algorithms. Advantage actor-critic (A2C) is a popular synchronous version of this approach, and A3C is its asynchronous counterpart.
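To show how the two roles interact, here is a stripped-down actor-critic loop on a single-state task, with the "networks" reduced to a preference vector and a single value estimate. Everything here is an illustrative assumption:

```python
# Sketch of the actor-critic loop: the critic's TD error measures whether an
# action turned out better or worse than expected, and the actor shifts its
# policy in that direction. All values below are illustrative.
import numpy as np

theta = np.zeros(2)                 # actor: softmax preferences over actions
value = 0.0                         # critic: value estimate for the one state
alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.9
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(500):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0
    td_error = reward + gamma * value - value    # better or worse than expected?
    value += alpha_critic * td_error             # critic refines its estimate
    grad_log = -probs
    grad_log[action] += 1.0
    theta += alpha_actor * td_error * grad_log   # actor follows the critic's signal
```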
Trust-region policy optimization (TRPO): TRPO algorithms help solve a common problem with policy-gradient algorithms: an update can change the policy so drastically that performance collapses, or so timidly that learning stalls. TRPO prevents policy changes from being too drastic by constraining how far the policy can move, keeping each update inside a "trust region."
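Here is a toy sketch of the trust-region idea, using KL divergence (a standard measure of how different two probability distributions are) to decide whether a proposed update is too big. The distributions and threshold are illustrative assumptions:

```python
# Sketch of the trust-region check: measure how far a proposed update moves
# the policy's action distribution and reject steps that move it too far.
import numpy as np

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

old_policy = np.array([0.5, 0.5])   # action probabilities before the update
new_policy = np.array([0.9, 0.1])   # a proposed (aggressive) update
max_kl = 0.01                       # size of the trust region

if kl_divergence(old_policy, new_policy) > max_kl:
    print("Step too large: shrink the update and try again.")
```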
Proximal policy optimization (PPO): PPO is an on-policy algorithm developed as a simpler and just as effective solution to the problems that TRPO solves. Instead of an explicit constraint, it uses a clipped objective function that caps how much any single update can change the policy, and it applies updates over minibatches of experience.
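Here is a minimal sketch of PPO's clipped objective for a single action, assuming made-up values for the probability ratio and advantage:

```python
# Sketch of PPO's clipped objective for one action. "ratio" is the new
# policy's probability of the action divided by the old policy's; clipping
# caps the reward for moving that ratio. Values are illustrative.
import numpy as np

def ppo_objective(ratio, advantage, epsilon=0.2):
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)
    return float(np.minimum(ratio * advantage, clipped * advantage))

print(ppo_objective(ratio=1.5, advantage=1.0))  # 1.2: the gain is capped
print(ppo_objective(ratio=0.5, advantage=1.0))  # 0.5: losses are not hidden
```

Note the asymmetry: gains from moving the policy further are capped, but losses are not, which keeps each update conservative.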
Deep deterministic policy gradient (DDPG): A DDPG algorithm combines many of the qualities of the algorithms above. It is an off-policy, actor-critic model that uses a value-based critic to learn a deterministic policy, or a policy that always returns the same action for a given input, which makes it a natural fit for continuous control tasks.
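As a final sketch, here is the "deterministic" part of DDPG in miniature, with a linear function standing in for the actor network and Gaussian noise added only during training. The weights and noise scale are illustrative assumptions:

```python
# Sketch of DDPG's deterministic actor: the policy maps a state directly to
# one continuous action, with noise added during training for exploration.
import numpy as np

actor_weights = np.array([0.5, -0.2])   # stand-in for learned network weights
rng = np.random.default_rng(0)

def act(state, explore=True):
    action = float(actor_weights @ state)   # deterministic: same state, same action
    if explore:
        action += rng.normal(scale=0.1)     # exploration noise during training
    return action

print(act(np.array([1.0, 2.0]), explore=False))  # always 0.1 for this state
```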
You can use reinforcement learning in many different industries for various applications. A few examples of the many ways you could use reinforcement learning include gaming, self-driving vehicles, and health care:
Gaming: Reinforcement learning algorithms can learn to play games, allowing you to play against an opponent who is able to adapt to your moves. You can also use reinforcement learning algorithms for game testing.
Self-driving cars: You can use reinforcement learning to control a self-driving car that can learn to maneuver in a complex and unpredictable environment. Reinforcement learning allows the AI model to manage complex variables, such as speed, multiple lanes, and other drivers.
Health care: Health care professionals use reinforcement learning to help guide treatment decisions for patients, an application known as dynamic treatment regimes.
Reinforcement learning algorithms allow AI models to learn the best way to accomplish a goal with little to no guidance from a human agent. If you’d like to learn more about reinforcement learning algorithms, consider taking a course online. You can begin today on Coursera with Fundamentals of Reinforcement Learning offered by the University of Alberta as part of the Reinforcement Learning Specialization.
Glassdoor. "How Much Does a Machine Learning Engineer Make?" https://www.glassdoor.com/Salaries/machine-learning-engineer-salary-SRCH_KO0,25.htm. Accessed October 29, 2024.