Train a policy network

In typical machine learning problems, the goal is to learn a function that maps an input to a known output. In reinforcement learning, the goal is to learn a policy that maximizes a reward using trial and error. The policy looks at the current state of the system and decides what action to take. The reward may not occur right away, but as the agent runs the simulation over and over again, it learns to make decisions that maximize the chance of getting a reward.

In this module, you will train a small neural network that will take the state of the Frozen Lake environment and generate a score for each of the four actions, where a high score indicates a good chance that the given action will lead to a reward. The idea is that an agent using this trained policy network will do a good job of simulating human performance in the Frozen Lake game.