At a very high level, training runs for a configurable number of episodes. In each episode, the agent plays through the game, and the rewards it collects are used to update the policy network.
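To make the high-level loop concrete, here is an illustrative sketch of a single episode. This is not the tutorial's actual `train()` implementation; the `env`, `policy`, and `rng` names and interfaces are assumptions for illustration.

```python
import numpy as np

def run_episode(env, policy, rng):
    """Play one episode, recording transitions and the total reward.

    `policy(state)` is assumed to return action probabilities, and
    `env.step(action)` to return (next_state, reward, done).
    """
    state, total_reward, done = env.reset(), 0.0, False
    transitions = []
    while not done:
        # Sample an action from the policy's probabilities (rather than
        # always taking the argmax) to keep some exploration going.
        probs = policy(state)
        action = rng.choice(len(probs), p=probs)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state, total_reward = next_state, total_reward + reward
    return transitions, total_reward
```

The recorded transitions are what gets fed into the network update step described below.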
The policy neural network consists of two linear layers with a ReLU activation after the first layer and a softmax activation after the second layer. The softmax layer outputs an array of size four (one element for each of the four actions). The best action should correspond with the highest output. However, during training and simulating game play, it’s best to introduce some randomness to prevent the agent from getting stuck.
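The architecture just described could be sketched in PyTorch roughly as follows. This is a hypothetical stand-in for the real definition in frozen_lake/model.py, which may differ; the hidden size and the 16-element observation (a one-hot encoding of a 4x4 grid) are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Two linear layers: ReLU after the first, softmax after the second."""
    def __init__(self, n_observations: int, n_actions: int = 4, hidden: int = 32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_observations, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),  # probabilities over the four actions
        )

    def forward(self, x):
        return self.layers(x)

net = PolicyNetwork(n_observations=16)
probs = net(torch.rand(16))
# Sampling from the softmax output, instead of taking the argmax,
# is one way to introduce the randomness mentioned above.
action = torch.multinomial(probs, num_samples=1).item()
```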
The policy network is defined in frozen_lake/model.py – take a look to see how it works. It is of course possible to make the network much deeper and more complex, but for a simple game like this it's better to keep the network small: there is little structure to learn, and small networks are faster to train.
If you train this network with only the high-level procedure described above, it will struggle to learn a policy that plays Frozen Lake well. Reinforcement learning tends to be unstable, and there are a couple of modifications that are important for it to work well in practice: experience replay and target networks.
Experience replay tracks the reward for several steps before actually updating the network. A target network is a clone of the policy network that updates less frequently. Check out this blog post for an accessible explanation of both of these tricks.
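As a rough illustration of the first trick, here is a minimal replay buffer, modeled loosely on the kind used in the PyTorch DQN tutorial. This is a sketch, not the code in frozen_lake/memory.py; the `Transition` fields are assumptions.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple(
    "Transition", ("state", "action", "reward", "next_state", "done")
)

class ReplayMemory:
    """Fixed-size buffer of past transitions.

    Sampling random minibatches from the buffer, instead of learning
    from each step as it happens, breaks the correlation between
    consecutive steps and stabilizes training.
    """
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The target network, by contrast, needs no extra data structure: it is simply a second copy of the policy network whose weights are synchronized only occasionally (e.g. via `target_net.load_state_dict(policy_net.state_dict())` in PyTorch).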
The implementation of both modifications is based on a tutorial from the PyTorch site. Take a look at the experience replay memory in frozen_lake/memory.py, and the target network and update-step logic in the training code.
Now it’s time to train a policy network locally:
In a new cell in the notebook, create a config object that uses all the defaults (you can see how the config is defined in the source):
config = DeepQConfig()
To run the training loop locally, call the train() function with the config object as its only input. The loop takes about a minute to run and returns a tuple containing the policy network and the reward for each episode:
local_policy, reward = train(config)
Plot the moving average of the reward to see how it varies during training:
plt.plot(moving_average(reward))
plt.ylabel('reward')
plt.xlabel('episode');
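The moving_average helper here comes from the tutorial's code; a minimal version might look like the following, where the window size of 50 is an assumption.

```python
import numpy as np

def moving_average(values, window: int = 50):
    """Trailing-window mean, computed as a convolution.

    Smooths the noisy per-episode reward so the training trend
    is easier to see.
    """
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")
```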
You can see that even with experience replay and a target network, the reward is still somewhat unstable. This is a good reason to scale up training with a hyperparameter search to find the policy that performs best.
Now try the policy network on the standard test level from earlier and print the results:
np.random.seed(1)
n_attempts = 10000
test_level = get_test_level()
rewards = [
    play_level(test_level, local_policy.learned_action)
    for _ in range(n_attempts)
]
sum(rewards) / n_attempts
The win percentage should be higher than that of the random agent from earlier.