A GAN is a system where two different neural networks play a game against each other. The first tries to generate data, while the second network judges this generated data. The latter ones job is to distinguish whether a given datum was generated by the former network or whether it came from a ground truth distribution (i.e. training data-set). The goal of the first network is to generate data in a way which makes it hard for the second network to do its job. At the beginning of the game both networks (i.e. both Players) do not have knowledge about the desired data distribution nor their opponents. During training the first network learns to generate data similar to the given distribution as this makes the job for the opponent harder. And the second network gets better at discriminating between real- and fake-data. Both players increase their score by adapting their respective strategy in a systematic manner.

However, after a while of playing and learning both players might find themselves in a situation where it doesn’t make sense any more to make strategy adjustments without losing ground against their opponent. They reached a Nash-Equilibrium (NE).

In a more abstract setting you could picture such a situation like in the image below. The first player tries to minimize the reward while the other player wants to maximize it. If they end up in one of the two marked spots none of them has an incentive to change its strategy as both would get a lower score after the opponent gets a chance to adapts his strategy.

However, neural networks typically are trained by gradient descent and hence both players can only slowly adjust their strategy. In fact, both players follow (almost) continuous trajectories in this combined strategy space. This restriction turns out to be a major burden for both players as they are easily captured in so called *local* Nash-Equilibria. Although they are not a problem in the above sample *min-max-*game, they are in similar *max-*game. In the *max*-game both players need to work together in order to maximize the reward.

If the combined strategy happens to be the marked left point, both players got trapped in a sub-optimal NE. Although, Player B has reached the optimal strategy, Player A could choose the optimal strategy on the right side from which both would benefit. However, due to the restriction that only local changes are allowed the system is trapped and doomed to be sub-optimal. There is simply no trajectory out of the left local minima, from which both players would benefit directly.

The problem is even more severe, because there infinitely many local NE in the *max*-game which are even worse then the previously described ones.

Additionally, they form a barrier which prevents the players from reaching the top right and bottom left parts of the strategy-space. Hence, generator and discriminator seems to get sucked into these sub-optimal solutions.

Two player games are hard.

]]>As an owner of a RaspberryPi I’d like to measure the frequency myself. However, for reasons of simplification I didn’t like to use a micro-controller. Despite the fact that the RaspberryPi is unable to reliably read PWM signals I gave it a shot. Knowing that using the I/O pins was not possible I tried using a cheap USB-Soundcard with an microphone input and then record the hum directly just like standard audio.

The only hardware requirements remaining were a 230V -> 9V transformer, a voltage divider and a Phone connector.

On my first test on my desktop PC I got an ugly signal. Surprisingly the displayed frequency was quite exact.

In a test-recording I heard a truly familiar sound. Even though I’ve listened to the hum earlier in life, this was the first time I enjoyed it.

Despite that, I noticed a periodic interruption approximate every second which lasts for ~1ms. It seemed like parts of the signals got set to zero. I haven’t tested it in detail, but I assume this effect has great influence on frequency filters, and hence I didn’t want to use them.As I could not find the source of this delay, I ignored the problem.

So, how do I get the local frequency of the signal? Coming from the machine learning community I love building models and approximating things. Why not approximate a function like on a 10 second stretch of the data? This would allow to directly read the frequency by looking at .

Assuming harmonics and other disturbances are normal distributed I can use standard Least-Square regression. Hence, my Loss function is:

If we find the set of parmeters which minimize we have a solution. Minimizing a function can be done by setting the gradient with respect to the parameters to zero.

Which leads to the gradients

with . As systems of nonlinear equations are quite nasty, I solved them by gradient descent. I initialized the parameter to 50Hz and to 0.5 which was the amplitude of my signal. It only took about 100 iterations for the parameters to converge.

I trained a model for each second of my data using a moving data window of 10 seconds width. Hence, I still get a measurement every second.

The results are quite satisfying.

The good thing is, my measurements are comparable to the ones obtained by the micro-controller hardware of netzsin.us. Especially because we use completely different measurement techniques.

Interestingly the measurements of the University of Erlangen are quite off from ours.

What remains? Porting my approach to the Raspberry-Pi.

**Update:** Corrected the last Plot. It had a wrong year. I also added more math.

**Update2: **Smoothed the measurements with an moving average. Looks like the clocks of our PCs are a few seconds off.

The agent now has only a single goal. He wants to maximize the total reward while interacting with the environment at time . So he intends to perform only good actions. Actions which allow him to maximize the long term reward. This is the general setup for a reinforcement learning task. An agent wants to optimize a mechanism to choose actions (called the policy ), which maximizes the future long term reward. Typically the agent lives in a stochastic world, where the observations are sampled from some probability distribution . We further assume that the agent chooses its actions from a stochastic policy to allow some exploration of the environment. The policy is parameterized by policy parameters .

This post will cover a well known but rather simple algorithm for finding a a good policy, called the policy gradient method. The general idea is that the expected reward is maximized by adjusting the policy parameters in direction of better reward. Unfortunately is generally unknown, and thus also its gradient.

However, the policy gradient theorem allows to estimate the gradient by interacting with the environment on the current policy. The gradient is then just the sample average of a few runs.

Now, having an estimate of the gradient with respect to we can adapt the policy parameters by applying gradient ascend and get an increasingly better set of parameters while iterating. I will demonstrate the principle with a simple but artificial setup.

Consider an agent whose actions are drawn from a two dimensional continuous space. At each time-step he gives a pair of numbers to the environment, which then reacts with an observation and a reward . The observation happens to be just 50 random numbers sampled from a normal distribution with the agents action interpreted as the mean and variance. The reward of the action is calculated according to these random numbers, by applying a Tent-function to each individually. Importantly, the agent has no knowledge about the interns of the environment, he just sees the observation and its corresponding reward.

The following R-code simulates the environment.

# The resulting observations are drawn from a normal distribution parameterized by the given action # As a bonus some additional noise is added. getNewObservation <- function(action) { # the variance has to be positiv action[2] <- ifelse(action[2] < 0.00001, 0.00001, action[2]) observation = rnorm(nrRandomNumbers, mean=action[1]*max(rnorm(1,1,0.02),0.0001), sd=action[2]*max(rnorm(1,1,0.02),0.0001)) reward = getReward(observation) return(list(observation, reward)) } #each random number of the observation is rewarded according to a tent function #only numbers which are close to the target value get a reward. getReward <- function(x) { reward = 1-abs(target-x) return(ifelse(reward < 0, 0, reward)) }

The agent now interacts with this environment and needs to calculate appropriate actions through its policy. The policy could be anything (even some big convolutional neural network), but for the first steps lets choose a simple one. Lets model the policy as a bivariate normal distribution with policy-parameters corresponding to the means in the respective direction and a constant covariance matrix .

covarMat = matrix(c(0.19,0,0.0,0.19), nrow = 2, ncol = 2) sampleActionFromPolicy <- function(policy) { require(MASS) return(mvrnorm(n = 1, mu = policy, Sigma = covarMat)) }

The agent can now draw actions from a distribution .

The important part is the gradient estimation process. Here the magic happens. For a normal distributed policy the needed term is:

It can be obtained by deriving the logarithm of the multivariate normal distributions density function with respect to , and comes down to this code.

#return the gradient of the log policy wrt to the drawn action sampleFromGradientLogPolicy = function(policy, action) { return(t(as.matrix(action - policy)) %*% solve(covarMat)) }

The Monte Carlo estimation of the true gradient can now be implemented straight forward using the policy gradient theorem.

estimateGradient = function(policy) { gradient_mu = vector(mode="numeric", length = 2) #We play #nrSamples games with the current policy #Each sample gives us a hint on the gradient. #the average of the hints (=the expectation) is used as our gradient estimate for (j in 1:nrSamples) { action = sampleActionFromPolicy(policy) observation = getNewObservation(action) g = sampleFromGradientLogPolicy(policy, action) #observation[[2]] holds the reward for the observation #the reward is the driving force for the optimization and scales the gradient gradient_mu = gradient_mu + g * sum(observation[[2]]) } gradient_mu = gradient_mu / nrSamples }

Having the gradient we can now apply a gradient ascent method to get a better policy.

for(i in 1:iterations) { policy <- policy + learningRate * estimateGradient(policy) }

I’ve plotted the environment and the actions of our agent in this environment. The drawn normal distribution is parameterized by the agents actions at each iteration. From this distribution the environment draws random numbers which get rewarded by the blue tent-function.

Because we have a stochastic policy with constant covariance, the optimization process never stops and the system keeps wiggling around the optimal value.

However, the system even works in a dynamic environment with a changing target. This animation shows more clearly how the system works. It is truly reward driven, as the magnitude of the gradient is proportional to the overall reward during the estimation process. Hence, the convergence speed is proportional to the available reward. This makes learning slow if the reward is only sparsely granted. More generally, in a task with a bad initial policy, learning is slow at the beginning and gets increasingly faster the better the policy gets.

]]>