Train a policy network locally

High level procedure

At a very high level, the procedure for training the policy network will look like this for each episode:

  • Start a new game
  • For each step in the game:
    • Run game state through the policy network to get an action
    • Perform the action and get new state
    • Use the current reward (or lack thereof) to update the policy network
    • When the game is done, break out of the loop

Training runs for a configurable number of episodes.
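The loop above can be sketched in plain Python. The environment here is a stand-in with a Gymnasium-style reset/step API, and select_action stands in for the policy network forward pass; both names are illustrative, not the actual code in frozen_lake/.

```python
import random

# Minimal stand-in environment (an assumption for illustration; the real
# code would use the Frozen Lake environment).
class StubEnv:
    def reset(self):
        # Start a new game and return the initial state.
        self.t = 0
        return 0

    def step(self, action):
        # Perform the action; the episode ends after five steps here,
        # with a reward of 1.0 on the final step.
        self.t += 1
        done = self.t >= 5
        reward = 1.0 if done else 0.0
        return self.t, reward, done

def select_action(state):
    # Stand-in for running the game state through the policy network.
    return random.randrange(4)

def run_episodes(env, n_episodes):
    episode_rewards = []
    for _ in range(n_episodes):
        state = env.reset()                         # start a new game
        total, done = 0.0, False
        while not done:                             # loop until the game is done
            action = select_action(state)           # policy network forward pass
            state, reward, done = env.step(action)  # perform the action
            total += reward
            # The real loop would use `reward` here to update the policy network.
        episode_rewards.append(total)
    return episode_rewards

rewards = run_episodes(StubEnv(), 3)
```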

The policy neural network consists of two linear layers with a ReLU activation after the first layer and a softmax activation after the second layer. The softmax layer outputs an array of size four (one element for each of the four actions). The best action should correspond to the highest output. However, during training and simulated game play, it’s best to introduce some randomness to prevent the agent from getting stuck.
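One common way to introduce that randomness is to sample an action in proportion to the softmax probabilities instead of always taking the argmax. A sketch in plain Python, where the four logit values are made up for illustration:

```python
import math
import random

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the second linear layer, one per action.
logits = [0.5, 2.0, -1.0, 0.1]
probs = softmax(logits)

# Greedy choice: always pick the action with the highest probability.
greedy_action = max(range(4), key=lambda a: probs[a])

# Exploratory choice: sample an action weighted by its probability,
# so lower-probability actions are still tried occasionally.
sampled_action = random.choices(range(4), weights=probs, k=1)[0]
```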

The policy network is defined in frozen_lake/ – take a look to see how it works. It is of course possible to make the network much deeper and more complex. But for a simple game, it’s a good idea to keep the network small because the game is simple and small networks are faster to train.

Training modifications

If you train this network with the high-level procedure as described, it will be difficult to learn a policy that plays Frozen Lake well. Reinforcement learning tends to be unstable, and there are a couple of modifications that are important for it to work well in practice: experience replay and target networks.

Experience replay stores the experience from several steps and samples from it when updating the network, rather than updating from the most recent step alone. A target network is a clone of the policy network that updates less frequently. Check out this blog post for an accessible explanation of both of these tricks.

The implementation for both of these modifications is based on a tutorial from the PyTorch site. Take a look at the memory for experience replay in frozen_lake/ and the target network and update step logic in frozen_lake/.
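A replay memory along the lines of the PyTorch tutorial can be sketched as a fixed-size buffer of transitions. The class and field names below are illustrative, not the ones in frozen_lake/:

```python
import random
from collections import deque, namedtuple

# One step of experience; the fields follow the layout commonly used
# in DQN implementations (an assumption, not the repo's exact fields).
Transition = namedtuple("Transition", ("state", "action", "reward", "next_state"))

class ReplayMemory:
    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest transition
        # once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        # Record one transition.
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniformly sample a batch of past transitions to train on.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=100)
for step in range(10):
    memory.push(step, 0, 0.0, step + 1)
batch = memory.sample(4)
```

The target network half of the trick is usually just a periodic (or slowly blended) copy of the policy network's weights, so that the training targets stay stable between updates.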

Train the policy network

Now it’s time to train a policy network locally:

  1. In a new cell in the notebook, create a config object that uses all the defaults (you can see how config is defined in frozen_lake/):

    config = DeepQConfig()
  2. To run the training loop locally, call the train() function with the config object as the only input. The loop will take about a minute to run and will return a tuple with the policy network and the reward for each episode:

    local_policy, reward = train(config)

  3. Plot the moving average of the reward to see how it varies during training:
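The plotting code itself isn't shown here; one way to compute the moving average is with a small helper like this (the window size is a choice for illustration, not a value from the repo), whose output could then be passed to a plotting library such as matplotlib:

```python
def moving_average(values, window):
    # Average each consecutive length-`window` slice of the reward history.
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical per-episode rewards (wins are 1, losses are 0).
rewards = [0, 0, 1, 0, 1, 1, 1]
smoothed = moving_average(rewards, window=3)
```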


    You can see that even with the experience replay and target network modifications, reward is still somewhat unstable. This is a good reason to scale up training with a hyperparameter search in order to find the policy that performs best.

  4. Now try the policy network on the standard test level from earlier and print the results:

    n_attempts = 10000
    test_level = get_test_level()
    rewards = [
        play_level(test_level, local_policy.learned_action)
        for _ in range(n_attempts)
    ]
    sum(rewards) / n_attempts

    The win percentage should be higher than the random agent from earlier.