# Reinforcement learning: First steps with policy gradients

An agent lives in some environment. He is able to perform a certain number of actions. The agent may go left or right, or may jump on the weird mushrooms growing next to him. We call this action $\mathbf{a}$. The environment, on the other hand, reacts to $\mathbf{a}$ with an observation $\mathbf{o}$. Depending on the action taken, the mushroom may die, or the overall score will get raised. In any case, the agent receives some reward $r$ if the taken action was a good one.

The agent now has only a single goal. He wants to maximize the total reward $R = \sum_{t=1}^{T} r_{t}$ while interacting with the environment at time $t$. So he intends to perform only good actions. Actions which allow him to maximize the long term reward. This is the general setup for a reinforcement learning task. An agent wants to optimize a mechanism to choose actions (called the policy ${\pi}$), which maximizes the future long term reward. Typically the agent lives in a stochastic world, where the observations are sampled from some probability distribution $\mathbf{o}_{t+1} \sim p\left( \mathbf{o} _{t+1}\left\vert\mathbf{o}_{t},\mathbf{a}_{t}\right.\right)$. We further assume that the agent chooses its actions from a stochastic policy $\mathbf{a}_{t}\sim\pi_{\mathbf{\theta}}\left( \mathbf{a}_{t}\left\vert \mathbf{o}_{t}\right. \right)$ to allow some exploration of the environment. The policy is parameterized by $K$ policy parameters $\mathbf{\theta} \in\mathbb{R}^{K}\$.

This post will cover a well known but rather simple algorithm for finding a a good policy, called the policy gradient method. The general idea is that the expected reward $J(\mathbf{\theta})$ is maximized by adjusting the policy parameters in direction of better reward. Unfortunately $J$ is generally unknown, and thus also its gradient.

However, the policy gradient theorem allows to estimate the gradient by interacting with the environment on the current policy. The gradient is then just the sample average of a few runs.

$\nabla_{\theta} J(\theta) = E \bigg\lbrack R \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\mathbf{\theta}}(\mathbf{a}_t \vert \mathbf{o}_t) \bigg\rbrack$

Now, having an estimate of the gradient with respect to $\mathbf{\theta}$ we can adapt the policy parameters by applying gradient ascend and get an increasingly better set of parameters $\mathbf{\theta}$ while iterating. I will demonstrate the principle with a simple but artificial setup.

Consider an agent whose actions $\mathbf{a}\in\mathbb{R}^{2}\$ are drawn from a two dimensional continuous space. At each time-step he gives a pair of numbers to the environment, which then reacts with an observation $\mathbf{o}$ and a reward $\mathbf{r}$. The observation happens to be just 50 random numbers sampled from a normal distribution with the agents action interpreted as the mean and variance. The reward of the action is calculated according to these random numbers, by applying a Tent-function to each individually. Importantly, the agent has no knowledge about the interns of the environment, he just sees the observation and its corresponding reward.
The following R-code simulates the environment.

# The resulting observations are drawn from a normal distribution parameterized by the given action
getNewObservation <- function(action) {
# the variance has to be positiv
action[2] <- ifelse(action[2] < 0.00001, 0.00001, action[2])
observation = rnorm(nrRandomNumbers, mean=action[1]*max(rnorm(1,1,0.02),0.0001), sd=action[2]*max(rnorm(1,1,0.02),0.0001))
reward = getReward(observation)
return(list(observation, reward))
}

#each random number of the observation is rewarded according to a tent function
#only numbers which are close to the target value get a reward.
getReward <- function(x) {
reward = 1-abs(target-x)
return(ifelse(reward < 0, 0, reward))
}


The agent now interacts with this environment and needs to calculate appropriate actions through its policy. The policy could be anything (even some big convolutional neural network), but for the first steps lets choose a simple one. Lets model the policy as a bivariate normal distribution with policy-parameters $\mathbf{\theta}$ corresponding to the means $\mathbf{\mu}$ in the respective direction and a constant covariance matrix $\Sigma$.

covarMat = matrix(c(0.19,0,0.0,0.19), nrow = 2, ncol = 2)

sampleActionFromPolicy <- function(policy) {
require(MASS)
return(mvrnorm(n = 1, mu = policy, Sigma = covarMat))
}


The agent can now draw actions from a distribution $\mathcal{N}(\mathbf{\theta}, \Sigma)$.

The important part is the gradient estimation process. Here the magic happens. For a normal distributed policy the needed term is:

$\nabla_{\theta} \log \pi_{\mathbf{\theta}}(\mathbf{a} \vert \mathbf{\theta}) = (\mathbf{a}- \mathbf{\mu})^T \Sigma^{-1}$

It can be obtained by deriving the logarithm of the multivariate normal distributions density function with respect to $\mu$, and comes down to this code.

#return the gradient of the log policy wrt to the drawn action
return(t(as.matrix(action - policy)) %*% solve(covarMat))
}


The Monte Carlo estimation of the true gradient can now be implemented straight forward using the policy gradient theorem.

estimateGradient = function(policy) {
gradient_mu = vector(mode="numeric", length = 2)
#We play #nrSamples games with the current policy
#Each sample gives us a hint on the gradient.
#the average of the hints (=the expectation) is used as our gradient estimate
for (j in 1:nrSamples) {
action = sampleActionFromPolicy(policy)
observation = getNewObservation(action)
#observation[[2]] holds the reward for the observation
#the reward is the driving force for the optimization and scales the gradient
}
}


Having the gradient we can now apply a gradient ascent method to get a better policy.

for(i in 1:iterations) {
policy <- policy + learningRate * estimateGradient(policy)
}


I’ve plotted the environment and the actions of our agent in this environment. The drawn normal distribution is parameterized by the agents actions at each iteration. From this distribution the environment draws random numbers which get rewarded by the blue tent-function.

View post on imgur.com

Because we have a stochastic policy with constant covariance, the optimization process never stops and the system keeps wiggling around the optimal value.

However, the system even works in a dynamic environment with a changing target. This animation shows more clearly how the system works. It is truly reward driven, as the magnitude of the gradient is proportional to the overall reward during the estimation process. Hence, the convergence speed is proportional to the available reward. This makes learning slow if the reward is only sparsely granted. More generally, in a task with a bad initial policy, learning is slow at the beginning and gets increasingly faster the better the policy gets.

View post on imgur.com