# TrackMaster - Reinforcement Learning for Race Cars

The work presented here is a school project with Thomas Gilles, for a course on Reinforcement Learning ^{[1]}. We present a new Reinforcement Learning environment, TrackEnv, to teach cars how to drive. We then use this environment to compare the performances of several Reinforcement Learning Agents : *SARSA*, Q-learning and deep Q-Learning.

## Introduction

Currently, self-driving cars employ a great deal of expensive and complex hardware to achieve autonomous motion. They use a combination of ultrasonic, radar and Lidar sensors to map the road. Here, we model a very simple case of autonomous driving where we provide simple sensors to a car to let it learn by itself how to drive on a track. Our main question was whether the car could learn simple driving policies using reinforcement learning algorithms to stay on the track. We implemented three popular reinforcement algorithms (*SARSA*, Q-learning and Deep Deterministic Policy Gradient ^{[2]}) and compared the performance of those algorithms on the driving learning problem.

We used, along standard python libraries, the Reinforcement Learning library keras-rl.

## Background and Related Work

#### Markov Decision Processes and Reinforcement Learning

Reinforcement learning algorithms are modelled through the framework of Markov Decision Processes. Given a set of states $S$, a set of actions $A$, and a discount factor $\gamma \in [0,1]$, we define:

- $P_a(s, s')$ is the probability to be in state $s'$ when taking action $a$ while being in state $s$.
- $R_a(s, s')$ is the reward given to an agent that takes action $a$ while being in $s$ and ends up in state $s'$.

In our case, the environment is fully deterministic, and the reward given depends only on the state that the agent is in, which means that $R_a(s, s')$ depends only on $s'$ and $P_a(s, s') \in \{0,1\}$.

Suppose that our agent has taken actions $a_1, a_2, ... a_n$ and followed a state path $s_1, ... s_n$.

Then the discounted reward during $T$ steps is:

$D^{\pi}(s_0) = \sum_{1 \leq i \leq T} \gamma^i R_{a_i}(s_i, s_{i+1})$

The aim of a reinforcement learning algorithm is, when being in a state $s$, to maximize the reward it could get in expectation in this state $\mathbb{E}(D^{\pi}(s))$.

Therefore, the policy is chosen in accordance with this goal:

$\pi(s) = \operatorname{argmax}_{\pi} \mathbb{E}(D^{\pi}(s)) \quad \forall s \in S$

The challenges of a reinforcement algorithm are the following: how to explore the policy space, how to estimate the expected reward in a state ^{[3]}, how to balance exploitation of a learned behavior and exploration of new behaviors.

#### Q-learning and *SARSA*

The key idea in the Q-learning method is to introduce a state-action value function $Q(s,a)$ that represents the expected reward received when taking action $a$ in state $s$. The optimal policy can thereby be computed as

$\pi(s) = \operatorname{argmax}_{a \in A} Q(s,a)$

The Q-learning algorithm introduced by Watkins in 1989 learns this function iteratively with the following update rule :

$Q^\pi(s, a) \gets (1 - \alpha) \cdot Q^\pi(s, a) + \alpha \cdot (r + \gamma \cdot \max_{a'} Q^\pi(s', a'))$ and picks its actions with an epsilon-greedy policy: The action is chosen randomly with probability $\epsilon$, and greedily with probability $1-\epsilon$.

The *SARSA* (State-Action-Reward-State-Action) ^{[4]} algorithm is about the same except for the update rule :

$Q^\pi(s_t, a_t) \gets (1 - \alpha) \cdot Q^\pi(s_t, a_t) + \alpha \cdot (r + \gamma Q^\pi(s_{t+1}, a_{t+1}))$

The *SARSA* algorithm updates the Q-function according to the policy it follows while the Q-learning algorithm updates it according to the optimal policy given this Q-function.

Q-learning and the *SARSA* algorithm can be applied to a continuous environment space by discretizing the state-action space.

#### REINFORCE algorithm and deep reinforcement learning

In the previous algorithms, the policy is computed for each state $s$ in a tabular way. There are other ways to explore the policy state. For instance, it is possible to implement a policy $\pi_{\theta}$ that depends on some parameter $\theta$. The objective is to maximize the expected cumulated reward $J_e(\theta)$, with

$J_e(\theta)= \mathbb{E_{\pi_{\theta}}}(\sum_{t=1}^H r(s_t,a_t))$

The reinforce algorithm (with single rollout) allows us to estimate the gradient of $J_e$ with respect to the parameter $\theta$

$\hat{\nabla}_{\theta}J_e (\theta)= \sum_{t=1}^H{\left(\nabla_{\theta} log \; \pi_{\theta}(s_t,a_t)\right)(\sum_{t=1}^H{r_t})}$

We can then use any gradient algorithm (e.g SGD) to optimize the parameters of the policy model ^{[5]} ^{[6]} to improve the expected reward. The policy model can be for example a softmax regression or a deep neural network ^{[7]}. In the Deep Deterministic Policy Gradient algorithm, two neural networks are implemented, and the weights of these networks are updated with these rules. We will cover the principle of the DDPG algorithm the "Agent" section.

## The Environment

The environment we designed aims at modelling the behavior of real-world cars driving on the road. The environment consists in a single track (see figs 1 and 2), and the physical part of the agent is a car that must learn how to drive without going off track.

We designed a basic track with simple 90 degrees turns in both directions. We then designed a complex track with many different behaviors to learn.

Basic Track

Complex Track

#### The Action Space

The action space is very simple. The car can accelerate or brake, and turn at a given deviation angle. Two parameters are implemented in the environment, that reflect the physics of the car:

- $a_{max}$ is the maximum absolute acceleration rate.
- The car can deviate from its trajectory by a maximum angle of $\theta_{max}$.

Therefore, the action space is

$A = [-a_{max}, a_{max}] \times [-\theta_{max}, \theta_{max}]$

#### The State Space

To simulate a real car, the state space reflect what a real driver could see. That is why we implemented car sensors, located at the front of the car, that point in many directions: each sensor measure how long the car could drive in that direction without going off track. This would be equivalent to infra-red sensors in real-life. There is a sensor limit: if the distance measured by the sensor is too large, it is set to a maximum value $s_{max}$ to simulate the fact that a real-life sensor cannot measure large distances without good accuracy.

The agent also knows the absolute speed, as a speedometer in a real car. We assume that the car respects the Highway code and that it cannot go faster than a given speed limit S.

Finally, there is a bit in the state space that warns the agent whether or not he is going in the good direction. We added this bit because if the car makes a turn-around at 180 degrees, it has no way of knowing it is going in the wrong direction with the values of sensors and speed only. This would be equivalent to the 'Wrong direction' Warning in video games.

The State Space in therefore

$S = [0, s_{max}]^N \times [0, S] \times \{-1,1\}$

where N is the number of sensors of the car. In our experiments, we took $N = 9$.

#### The Reward function

We want our car to drive as fast as possible, but also with good accuracy. There are checkpoints on the track to measure the distance driven by the car.

The reward function is the following:

- -20 when the car goes off track
- 10 when the car passes a checkpoint
- -10 when the car passes a checkpoint backwards
- -1 when nothing happens.

The -1 reward each time nothing happens is to encourage the car to move. Otherwise, the strategy to stay motionless is quite interesting for the car. With a too high punishment for going offtrack, the car stops or drives at a very low speed. -20 turned out to be a good balance for our purpose.

It is worth noting that a positive reward is given only when the car drives a sufficiently long distance on the track. If the car was given a positive reward each time it goes in the direction of the track, the driving behavior would be much easier to learn. We can set the distance the car must drive between two positive rewards at different values, having thereby a way to make our Reinforcement learning problem easier or more complicated as we wish. However, we did not study this parameter.

#### Solving the environment

We say that an agent "solves the environment" if it is able to make a complete lap without going offtrack 10 times in a row. We could also measure the performance of an agent by looking at the percentage of episodes during which a complete lap has been performed.

#### Environment challenges

This environment is particularly challenging because the agent has many things to learn : how to stay ontrack, when to accelerate, when to brake, and how to efficiently turn the car wheel. The precision needed in control makes it difficult for a discrete-state based method (such as Q-learning) to work efficiently.

The environment is also particulary suited to see the evolution of driving policies through time: For instance, when learning not to go off-track, the agent starts by learning emergency braking, and then slowly understands how to anticipate its trajectory.

## The Agents

We implemented a Q-learning Agent, a *SARSA* agent, and a Deep Deterministic Policy Gradient (DDPG) Agent ^{[8]}.

The Q-learning and *SARSA* algorithm are based on a discrete state-action space. The DDPG agent works with continous space, so should be more adapted to this environment.

#### Q-Learning and *SARSA* algorithm agents

The Q-learning and *SARSA* agents have been used with the following parameters: $\alpha = 0.1, \gamma = 0.95, \epsilon = 0.1, \epsilon_{decay} = 0.997$.

Each interval of the space-state action is discretized by a factor 3. As a result, the dimension of our state-action space is $3^{12}$.

We will now explain the principle of a DDPG Agent.

#### Actor-critic algorithm

The actor-critic algorithm represents the policy function, the $\mathbf{Actor}$ independently of the value function, the $\mathbf{Critic}$. The Actor produces actions given the states of the environment, and the Critic approximates the action-value function given the state, the output of the actor and the resultant reward.

In order to accelerate the training and to have a more stable learning, the networks are trained offline, using a replaye buffer to store the experiments and sampling them in minibatches before inputing them in the networks.

The critic network $Q(s,a)$ is trained using the Bellman equation as in Q-learning. The actor $\mu(s|\theta)$ is updated with the Deterministic Policy Gradient :

$\nabla_{\theta^{\mu}} \mu \approx \mathbb{E}_{\mu'} \big [ \nabla_{a} Q(s, a|\theta^{Q})|_{s=s_t,a=\mu(s_t)} \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})| _{s=s_t} \big ]$

#### Target networks

Since the critic being updated is also used to train the actor, the Q update is prone to divergence. Hence the use of target networks for both actor and critic. These targets are copies $\mu'$, $Q'$ of the actor and critic $\mu$, $Q$, and are used to calculate the target values. The weigths of these networks are then updated to slowly track the learned networks : $\theta ' \xleftarrow{} \tau \theta + (1-\tau )\theta '$

This means that the targets change slowly, and improves greatly the stability.

##### DDPG Algorithm

- Randomly initialize critic network $Q(s, a|\theta^Q)$ and actor $\mu(s|\theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$;
- Initialize target network $Q'$ and $\mu'$ with weights $\theta^{Q'} \xleftarrow{} \theta^Q, \theta^{\mu'} \xleftarrow{} \theta^\mu$
- Initialize replay buffer $R$
- for $episode = 1..M$ do
- Initialize a random process $\mathcal N$ for action exploration
- Receive initial observation state $s_1$
- for $t = 1..T$ do
- Select action $a_t = \mu(s_t|θ^\mu) + \mathcal N_t$ according to the current policy and exploration noise
- Execute action at and observe reward $r_t$ and observe new state $s_{t+1}$
- Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
- Sample a random minibatch of N transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
- Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$
- Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i|\theta^Q))^2$
- Update the actor policy using the sampled policy gradient:

$\nabla_{\theta_\mu} J = \frac{1}{N} \sum_i \nabla_a Q(s, a|\theta^Q)_{|s=s_i, a=\mu(s_i)} \nabla_{\theta_\mu} \mu(s|\theta^\mu)_{|s_i}$ - Update the target networks:

$\theta^{Q'} \xleftarrow{} \tau \theta^Q + (1 - \tau )\theta^{Q'}$ $\theta^{\mu'} \xleftarrow{} \tau \theta^{\mu} + (1 - \tau )\theta^{\mu'}$

Instead of an epsilon-greedy exploration, we use an Ornstein-Uhlenbeck process that applies noise on the different actions using the following equation:

$da_t=\theta(\mu - a_t)dt+\sigma dW_t$

Where $dW_t$ is a Brownian noise and $\mu$ is the expected mean value of the action.

#### Implementation

We implemented DDPG using keras-rl. The actor is a simple dense network with 3 hidden of 16 units each. The critic is a bit more complex (inspired from details in the DDPG paper) : a first hidden layer of 128 units takes the observation as input, and is then concatenated with the action input. The concatenation is then given as input to two hidden layers of 128 units. ReLU activation was used in all hidden layers. The output of the actor has an hyperbolic tangent activation (all actions are normalized between -1 and 1), and the output of the critic is linear.

The networks are trained only every 16 action, to speed up the training and let time for the buffer to fill with new experiments, with mini-batches of size 64.

## Results and Discussion

In the first two sections, we compare the performance of our algorithms on the simple and complex tracks. They are trained for 5000 episodes (except the DDPG which takes a much longer training time) and tested over 100 episodes.

#### Comparison of Algorithms on the Simple Track

Here, we defined "solving an episode" as having a reward of more than 150, which roughly corresponds to driving a little more than a complete lap.

Episodic Rewards on the Simple Track

Algorithm | Q-learning | SARSA |
DDPG |
---|---|---|---|

Mean Reward | 101 | 256 | 496 |

% Episodes Solved | 28% | 76% | 100% |

Mean reward averaged over 100 testing episodes

The DDPG algorithm is the only to fully master this simple track. Whereas the car still go offtrack from time to time with the Q-learning and *SARSA* algorithm, the DDPG never go offtrack in 100 episodes, and always get the maximum possible reward, which is around 500. We can see that the *SARSA* algorithm, while being conceptually close to the standard Q-learning algorithm, performs twice as good as the Q-learning method.

#### Comparison of Algorithms on the Complex Track

A max clock time has been set to train the algorithms (to stop a motionless car), so the cars are stopped before they could finish a lap on this track. Therefore, we define "solving an episode" as getting a reward of more than 100 on this episode.

Episodic Rewards on the Complex Track

Algorithm | Q-learning | SARSA |
DDPG |
---|---|---|---|

Mean Reward | 70 | 97 | 163 |

% Episodes Solved | 12% | 45% | 90% |

On the complex track, the *SARSA* algorithm is still better than the Q-learning method, while the performance gap is smaller than on the simple track. The DDPG agent is still the best and reach a better performance with less episodes, around 1000 against 5000 for the discrete algorithms.

We believe that with more training and a higher max clock time, the DDPG agent would be able to reach a high accuracy on this complex track. However, this training would require more computational power.

We now explain our implementation choice for car spawning: we chose to spawn to car at a random position on the track, and not always at the same position. This seemed more relevant to us because it makes the learning procedure more homogeneous: the car first develops a basic understanding of all key places on the track, and then can improve this understanding step by step. We indeed observed that the 3 algorithms did have a steeper learning curve with a random spawn compared to a fixed initial spawn.

Below is a gif image of the DDPG algorithm's performance after 3 minutes of training on the simple track (loop: one lap) :

#### Conclusion and Future Work

We designed an environment inspired from real-life perspectives : self-driving cars. A car must learn to drive on a track, and it receives reward when it passes checkpoints. We compared the results obtained by using a standard Q-learner, the *SARSA* algorithm and a more advanced Deep Deterministic Policy Gradient (DDPG). The structure of the environment makes the DDPG agent more adapted to the reinforcement learning task, and it performs quite well, performing many laps on the simple track and beating the other agents on the complex track.

The DDPG proved to be the best agent on this environment, because it is inherently more suited to continuous state-action space.

To continue our study, we would decrease the number of checkpoints on the track and study how it affects the learning time of this DDPG agent. The TrackMaster environment is indeed well suited for handling the sparsity of rewards, by putting more or less checkpoints. It would be a good test for reinforcement learning algorithms to see until what point they are able to deal with sparse rewards.

J.Read. Lecture IV - Introduction to Reinforcement Learning. INF581 Advanced Topics in Artificial Intelligence, 2018. ↩︎

T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning, 2015. Available http://arxiv.org/1509.02971. ↩︎

Watkins, Christopher John Cornish Hellaby, Learning from delayed rewards, PhD Thesis, King's College, Cambridge, 1989. ↩︎

Rummery, Gavin A and Niranjan, Mahesan, On-line Q-learning using connectionist systems, University of Cambridge, Department of Engineering Cambridge, England, 1994. ↩︎

Sutton, Richard S and McAllester, David A and Singh, Satinder P and Mansour, Yishay, Policy gradient methods for reinforcement learning with function approximation, in Advances in neural information processing systems, pages 1057-1063, 2000. ↩︎

Silver, David and Lever, Guy and Heess, Nicolas and Degris, Thomas and Wierstra, Daan and Riedmiller, Martin, Deterministic policy gradient algorithms, ICML, 2014. ↩︎

Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin, Playing atari with deep reinforcement learning, 2013, Available http://arxiv.org/pdf/1312.5602/ ↩︎

Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Rusu, Andrei A and Veness, Joel and Bellemare, Marc G and Graves, Alex and Riedmiller, Martin and Fidjeland, Andreas K and Ostrovski, Georg and others, Human-level control through deep reinforcement learning, Nature, 2015. ↩︎