Collecting bananas with a Deep Reinforcement Learning agent

Douglas Trajano · Published in Analytics Vidhya · May 28, 2020

Today, I’m going to share with you an amazing algorithm that learns from scratch (no labeled data needed) to collect yellow bananas while avoiding blue bananas. Pretty nice, isn’t it?

Before we talk about the algorithm and the agent, let’s understand how reinforcement learning works.

You, waiting for my quick overview of reinforcement learning

How does Reinforcement Learning work?

Reinforcement Learning is a subclass of Machine Learning.

Types of Machine Learning

We have an environment and an agent… cool.

The agent sends actions to the environment.

The environment returns the new state and a reward to the agent.

The objective of the agent is to maximize the cumulative reward.

Some reinforcement learning algorithms work by mapping each state to the action that leads to the best cumulative reward.

This can work for simple problems (a small state space and a small number of actions), but the trouble starts when we have a large or infinite number of possible states and actions („• ֊ •„)
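To make that concrete, here is a minimal sketch of tabular Q-learning, the classic non-deep approach that keeps one value per (state, action) pair; the update rule and hyperparameters are illustrative, not taken from this project.

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to an estimate of the cumulative reward.
Q = defaultdict(float)

def q_learning_update(state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One tabular Q-learning update: move Q(s, a) toward the TD target."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```

With a 37-dimensional continuous state (like the one in this post), a table like this is no longer feasible, which is exactly the limitation described above.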

And that’s what gave rise to deep reinforcement learning algorithms!

Basically, we add a deep learning model on top of it.

Reinforcement Learning + Deep Learning = Deep Reinforcement Learning

For this environment, I used Deep Q-Learning; see below the architecture of this algorithm.

Deep Q-Learning algorithm
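As a rough illustration of the learning step (not the exact code from this project), the heart of Deep Q-Learning in PyTorch looks like this: compute a TD target with a separate target network and regress the local network’s Q-values toward it.

```python
import torch
import torch.nn.functional as F

def dqn_learn_step(local_net, target_net, optimizer, batch, gamma=0.99):
    """One Deep Q-Learning update on a batch of experiences.

    batch is a tuple of tensors: (states, actions, rewards, next_states, dones),
    where actions is a long tensor of shape (batch_size, 1).
    """
    states, actions, rewards, next_states, dones = batch

    # Max predicted Q-value for the next states, taken from the target network.
    q_targets_next = target_net(next_states).detach().max(1)[0].unsqueeze(1)
    # TD target: reward + gamma * max_a' Q_target(s', a') for non-terminal states.
    q_targets = rewards + gamma * q_targets_next * (1 - dones)

    # Q-values predicted by the local network for the actions actually taken.
    q_expected = local_net(states).gather(1, actions)

    loss = F.mse_loss(q_expected, q_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```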

Now that you know the basics of reinforcement learning and deep reinforcement learning, let’s talk about bananas!

The challenge: Bananas Collector

This environment is provided by Unity.

Unity is a leader in this area and they have other environments as well; see more about Unity here.

In the video below, they explain how this environment works.

Unity ML-Agents Environment — Banana Collectors

State space

The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.

Actions

The agent can take 4 discrete actions:

  • move forward

  • move backward

  • turn left

  • turn right

Reward

A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana.
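To see how these pieces fit together, here is a sketch of one episode of interaction, assuming the unityagents Python wrapper; the file name and the purely random policy are assumptions made for illustration, not the project’s actual code.

```python
from unityagents import UnityEnvironment
import numpy as np

# Path to the pre-built Banana environment (adjust for your OS/setup).
env = UnityEnvironment(file_name="Banana.app")
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=False)[brain_name]
state = env_info.vector_observations[0]   # 37-dimensional state
score = 0

while True:
    action = np.random.randint(4)          # random action, just to show the loop
    env_info = env.step(action)[brain_name]
    state = env_info.vector_observations[0]
    reward = env_info.rewards[0]           # +1 yellow banana, -1 blue banana
    done = env_info.local_done[0]
    score += reward
    if done:
        break

print("Score:", score)
env.close()
```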

Agent architecture

The agent created in this project consists of three parts: the agent itself, a Deep Q-Learning model, and a Memory Unit.

Agent

The agent has the methods that interact with the environment: step(), act(), learn(), and some others.

The Deep Q-Learning model and the Memory Unit are part of the agent (as attributes).
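A rough skeleton of such an agent could look like the following; everything beyond the step(), act() and learn() names mentioned above is illustrative rather than the repository’s exact implementation.

```python
import random
import numpy as np
import torch

class Agent:
    """Sketch of a DQN agent: holds the Q-networks and the memory unit as attributes."""

    def __init__(self, action_size, memory, qnetwork_local, qnetwork_target,
                 optimizer, batch_size=64, gamma=0.99, update_every=4):
        self.action_size = action_size
        self.memory = memory                      # ReplayMemory or PrioritizedMemory
        self.qnetwork_local = qnetwork_local      # network being trained
        self.qnetwork_target = qnetwork_target    # target network
        self.optimizer = optimizer
        self.batch_size = batch_size
        self.gamma = gamma
        self.update_every = update_every
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        """Store the experience and learn every `update_every` steps."""
        self.memory.add(state, action, reward, next_state, done)
        self.t_step = (self.t_step + 1) % self.update_every
        if self.t_step == 0 and len(self.memory) > self.batch_size:
            self.learn(self.memory.sample(self.batch_size))

    def act(self, state, eps=0.0):
        """Epsilon-greedy action selection from the local Q-network."""
        if random.random() < eps:
            return random.randrange(self.action_size)
        state_t = torch.from_numpy(np.asarray(state, dtype=np.float32)).unsqueeze(0)
        with torch.no_grad():
            action_values = self.qnetwork_local(state_t)
        return int(action_values.argmax(dim=1).item())

    def learn(self, experiences):
        """Update the local network from a batch of experiences (see the DQN step above)."""
        ...
```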

Deep Q-Learning

The function of this model is to understand the states and provide better and better actions as the agent learns more about the environment.

The architecture of the model is quite simple: an input layer, hidden layers, and an output layer.

This neural network was developed with PyTorch (https://pytorch.org/).
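Something along these lines; a minimal sketch, where the hidden sizes follow the nb_hidden=(64, 64) hyperparameter shown later rather than the repository’s exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Maps a 37-dimensional state to one Q-value per action (4 actions)."""

    def __init__(self, state_size=37, action_size=4, hidden=(64, 64)):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden[0])
        self.fc2 = nn.Linear(hidden[0], hidden[1])
        self.out = nn.Linear(hidden[1], action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.out(x)
```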

Memory Unit

For the Memory Unit we have two options that can be used:

  • ReplayMemory

With two simple methods, add() and sample(), this memory stores the experiences and returns random samples of them to be used in agent training (see the sketch after this list).

  • PrioritizedMemory

This memory is a bit more complex because its sample() method doesn't return experiences at random: it assigns a weight to each experience, trying to pick the ones that lead to better training of the agent.
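For intuition, a plain ReplayMemory can be sketched like this (the repository’s implementation may differ); the prioritized variant would additionally store a priority per experience and sample proportionally to it.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state", "done"])

class ReplayMemory:
    """Fixed-size buffer that stores experiences and returns random samples of them."""

    def __init__(self, memory_size=int(1e6)):
        self.memory = deque(maxlen=memory_size)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(self.memory, k=batch_size)

    def __len__(self):
        return len(self.memory)
```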

Training and Testing the agent

To solve this environment, the agent must get an average score of +13 over 100 consecutive episodes.

The agent has a lot of hyperparameters; some of these are used in the Deep Q-Learning model, others in the Memory Unit, and so on.

agent = Agent(state_size=37, action_size=4, seed=199, nb_hidden=(64, 64),
              learning_rate=0.001, memory_size=int(1e6), prioritized_memory=False,
              batch_size=64, gamma=0.9, tau=0.03, small_eps=0.03, update_every=4)
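A typical training loop for this setup looks roughly like the following; the environment calls follow the unityagents sketch from earlier, and the epsilon schedule values are assumptions rather than the exact ones used in this project.

```python
from collections import deque
import numpy as np

def train(agent, env, brain_name, n_episodes=1000, eps_start=1.0,
          eps_end=0.01, eps_decay=0.995, target_score=13.0):
    """Train until the average score over the last 100 episodes reaches target_score."""
    scores_window = deque(maxlen=100)
    eps = eps_start
    for episode in range(1, n_episodes + 1):
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]
        score = 0
        while True:
            action = agent.act(state, eps)
            env_info = env.step(action)[brain_name]
            next_state = env_info.vector_observations[0]
            reward = env_info.rewards[0]
            done = env_info.local_done[0]
            agent.step(state, action, reward, next_state, done)
            state, score = next_state, score + reward
            if done:
                break
        scores_window.append(score)
        eps = max(eps_end, eps * eps_decay)
        if np.mean(scores_window) >= target_score:
            print(f"Environment solved in {episode} episodes!")
            break
```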

We have two trained agents (with ReplayMemory and PrioritizedMemory).

See below the training session results for each agent.

Replay Memory

With Replay Memory, the agent solved this environment in 389 episodes, taking about 4 minutes to do it.

Prioritized Memory

With Prioritized Memory, the agent solved this environment in 370 episodes, but took about 11 hours and 30 minutes to learn.

We’ll test the trained agent with Prioritized Memory in the environment.

Testing the agent with Prioritized Memory

You can find the code in my GitHub repository.

Considerations

I had a lot of fun working with this algorithm; it’s somewhat complex at first, but I think that you’ll enjoy it as well.

If you’ve studied a bit of neuroscience, you’ll notice some parallels here. Our memory works similarly to Prioritized Memory: we “select” certain experiences to evaluate the options and decisions that we make every day.

As future work, I think we could test a Dueling DQN to improve this model. See more about it here.
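For reference, the dueling architecture splits the network into a state-value stream and an advantage stream; a minimal sketch (not implemented in this project) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNetwork(nn.Module):
    """Dueling DQN head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.feature = nn.Linear(state_size, hidden)
        self.value = nn.Linear(hidden, 1)                 # state-value stream V(s)
        self.advantage = nn.Linear(hidden, action_size)   # advantage stream A(s, a)

    def forward(self, state):
        x = F.relu(self.feature(state))
        v = self.value(x)
        a = self.advantage(x)
        return v + a - a.mean(dim=1, keepdim=True)
```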

That’s all, folks! Thank you for reading, and I’d appreciate it if you share your comments with me below.
