Collecting bananas with a Deep Reinforcement Learning agent
Today, I’m going to share with you an amazing algorithm that learns from scratch (no labeled data needed) to collect yellow bananas while avoiding blue bananas. Nice, isn’t it?
Before we talk about the algorithm and the agent, let’s understand how reinforcement learning works.
How does Reinforcement Learning work?
Reinforcement Learning is a subfield of Machine Learning.
We have an environment and an agent … cool
The agent provides actions to the environment.
The environment returns the state and a reward to the agent.
The objective of the agent is to maximize the cumulative reward.
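That loop can be sketched in a few lines of Python (`ToyEnv` and `run_episode` are made-up names, and the environment here is a trivial stand-in, not the banana collector):

```python
class ToyEnv:
    """A tiny stand-in environment: action 1 earns +1, anything else -1."""
    def reset(self):
        return 0  # initial state

    def step(self, action):
        reward = 1 if action == 1 else -1
        next_state, done = 0, False
        return next_state, reward, done

def run_episode(env, policy, max_steps=10):
    """One pass of the loop: the agent provides actions, the environment
    returns states and rewards, and we accumulate the cumulative reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # agent -> environment
        state, reward, done = env.step(action)  # environment -> agent
        total_reward += reward
        if done:
            break
    return total_reward
```

A policy that always picks action 1 maximizes the cumulative reward in this toy case; learning such a policy from experience is exactly what the algorithms below do.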
Some reinforcement learning algorithms work by building a mapping from each state to the action that leads to the best cumulative reward.
This can work for simple problems (a small state representation and a small number of actions), but the trouble starts when we have a large, or even infinite, number of possible states and actions („• ֊ •„)
Then… this gave rise to deep reinforcement learning algorithms!!
Basically, we add a Deep Learning model to approximate that mapping.
For this environment, I used Deep Q-Learning; see the architecture of this algorithm below.
Now that you know the basics of reinforcement learning and deep reinforcement learning, let’s talk about bananas!
The challenge: Bananas Collector
This environment is provided by Unity.
Unity is a leader in this area and has other environments as well; see more about Unity here.
In the video below, they explain how this environment works.
State space
The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.
Actions
We have 4 actions that the agent can take: move forward, move backward, turn left, and turn right.
Reward
A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana.
Agent architecture
The agent created in this project consists of the agent itself, a Deep Q-Learning model, and a Memory Unit.
Agent
The agent has the methods that interact with the environment: step(), act(), learn(), and some others.
The Deep Q-Learning and Memory Unit will be part of the agent (as attributes).
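To make that structure concrete, here is a minimal sketch of such an agent. All names and method bodies are illustrative, not the project's actual code; `best_action` and `update` are hypothetical methods of the injected model:

```python
import random

class Agent:
    """Sketch of an agent holding a model and a memory as attributes."""
    def __init__(self, action_size, model, memory, batch_size=64):
        self.action_size = action_size
        self.model = model      # Deep Q-Learning model (attribute)
        self.memory = memory    # Memory Unit (attribute)
        self.batch_size = batch_size

    def act(self, state, eps=0.1):
        """Epsilon-greedy: mostly exploit the model, sometimes explore."""
        if random.random() < eps:
            return random.randrange(self.action_size)
        return self.model.best_action(state)  # hypothetical model method

    def step(self, state, action, reward, next_state, done):
        """Store the experience; learn once enough samples are buffered."""
        self.memory.add(state, action, reward, next_state, done)
        if len(self.memory) >= self.batch_size:
            self.learn(self.memory.sample(self.batch_size))

    def learn(self, experiences):
        """Update the model from a batch of experiences."""
        self.model.update(experiences)  # hypothetical model method
```

The design choice here is composition: the agent doesn't care which memory it was given, so ReplayMemory and PrioritizedMemory can be swapped without touching the agent code.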
Deep Q-Learning
The function of this model is to understand the states and provide better actions as the agent learns more about the environment.
The architecture of the model is quite simple: it has an input layer, a hidden layer, and then an output layer.
This neural network was developed with PyTorch (https://pytorch.org/).
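A sketch of what such a network can look like in PyTorch — the hidden-layer size of 64 is an assumption, and `QNetwork` is an illustrative name, not the project's actual class:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state (37 values) to one Q-value per action (4 values)."""
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden),  # input -> hidden
            nn.ReLU(),
            nn.Linear(hidden, action_size),  # hidden -> output (Q-values)
        )

    def forward(self, state):
        return self.net(state)
```

The agent picks the action with the highest predicted Q-value, e.g. `QNetwork()(state).argmax()`.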
Memory Unit
For the Memory Unit we have two options that can be used:
- ReplayMemory
With two simple methods, add() and sample(), this memory can store experiences and return random samples of them to be used in agent training.
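A minimal sketch of such a memory (assumed names and a simplified experience format, not the project's actual implementation):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores experiences and returns uniformly random samples of them."""
    def __init__(self, capacity):
        # deque with maxlen drops the oldest experience when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly at random is what makes this "replay": old experiences keep being reused for training instead of being seen once and thrown away.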
- PrioritizedMemory
This memory is a little more complex, because the sample() method doesn't return experiences uniformly at random. It tries to work out which experiences are most useful for training the agent by applying a weight to each experience.
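The core idea can be sketched as a weighted sample. `prioritized_sample` and the priority values are illustrative; the `small_eps` offset (mirroring the `small_eps` hyperparameter used later in the post) keeps every experience sampleable even when its priority is zero:

```python
import random

def prioritized_sample(experiences, priorities, batch_size, small_eps=0.03):
    """Sample with probability proportional to each experience's priority
    (e.g. its TD error), instead of uniformly at random."""
    weights = [p + small_eps for p in priorities]  # small_eps avoids zero weight
    return random.choices(experiences, weights=weights, k=batch_size)
```

Experiences with high priority are drawn far more often, so the agent spends its training updates on the experiences it can learn the most from.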
Training and Testing the agent
To solve this environment, the agent must achieve an average score of +13.
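In code, that stopping condition might look like this (the 100-episode averaging window is my assumption about how the score is computed, and `is_solved` is an illustrative name):

```python
from collections import deque

def is_solved(scores, target=13.0, window=100):
    """True once the average score over the last `window` episodes
    reaches the target."""
    recent = deque(scores, maxlen=window)  # keep only the last `window` scores
    return len(recent) > 0 and sum(recent) / len(recent) >= target
```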
The agent has a lot of hyperparameters; some of these are used in the Deep Q-Learning model, others in the Memory Unit, etc.
agent = Agent(state_size=37, action_size=4, seed=199, nb_hidden=(64, 64), learning_rate=0.001, memory_size=int(1e6), prioritized_memory=False, batch_size=64, gamma=0.9, tau=0.03, small_eps=0.03, update_every=4)
We have two trained agents (with ReplayMemory and PrioritizedMemory).
See below the training session results for each agent.
With Replay Memory, the agent solved this environment in 389 episodes and took about 4 minutes to do so.
With Prioritized Memory, the agent solved this environment in 370 episodes, but took about 11 hours and 30 minutes to learn.
We’ll test the trained agent with Prioritized Memory in the environment.
You can find the code in my GitHub repository.
Considerations
I had a lot of fun working with this algorithm; it’s somewhat complex at first, but I think you’ll enjoy it as well.
If you’ve studied some neuroscience, you’ll see some relationships here. Our memory works similarly to Prioritized Memory: we “select” certain experiences to evaluate the options and decisions we make every day.
For future work, I think we could test a dueling DQN to improve this model. See more about it here.
That’s all, folks! Thank you for reading, and I’d appreciate it if you share your comments with me below.