Training a double-jointed arm with DDPG — Deep Reinforcement Learning

Douglas Trajano
Published in Analytics Vidhya · Jul 18, 2020


Today we’ll see another powerful Deep Reinforcement Learning algorithm.

It’s important to know the basics of Deep Reinforcement Learning first; you can see my last article here (Collecting bananas with Deep Reinforcement Learning).

Before we start, I’ll share with you something more about Deep Reinforcement Learning. It’s fast, I promise!

Theory first

Reinforcement Learning is a subfield of Artificial Intelligence, and it has different types of algorithms.

Types of reinforcement learning

It’s important to know about these types because each type has its own way of learning.

Value-based

Estimate the optimal value function Q*(s, a).

This is the maximum value achievable under any policy.

Policy-based

Search directly for the optimal policy π*.

This is the policy achieving the maximum future reward.

Actor-Critic

We combine Value-based and Policy-based methods.

The Actor is policy-based and the Critic is value-based.

An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

For this project I used an Actor-Critic method. The algorithm chosen is DDPG (Deep Deterministic Policy Gradient).

DDPG works a little differently from the basic Reinforcement Learning approach: we’ll have two neural networks inside the agent, an Actor and a Critic.

Now, let’s learn more about the environment and the double-jointed arm!

Well, the environment that we have here is better than the one in the GIF above. I used the Reacher environment from Unity.

Environment

In this environment, a double-jointed arm can move to target locations.

The first version contains a single agent.

The second version contains 20 identical agents, each with its own copy of the environment.

DDPG is a good approach for this environment because we can have more than one agent: we can learn from the experiences of all agents and share that knowledge across all of them as well.
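To make this concrete, here is a minimal sketch (not the exact project code) of one time step with the 20 agents: every transition goes into a single shared replay buffer, and one pair of networks learns for all arms. The names agent, memory, and env_step are hypothetical stand-ins for the real objects.

def collect_experiences(agent, memory, states, env_step):
    """One time step for all 20 arms: act, store every transition in the
    shared replay buffer, and return the next states.

    env_step is a hypothetical wrapper around the Unity environment that
    takes a (20, 4) action array and returns (next_states, rewards, dones).
    """
    actions = agent.act(states)                     # one action vector per arm
    next_states, rewards, dones = env_step(actions)
    for s, a, r, ns, d in zip(states, actions, rewards, next_states, dones):
        memory.add(s, a, r, ns, d)                  # all 20 agents share one buffer
    return next_states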

Deep Reinforcement Learning for Multi-Agent Systems

State Space

The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm.

Actions

Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.

Reward

A reward of +0.1 is provided for each step that the agent’s hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

To solve this environment, we need an average score of +30 over the last 100 episodes.
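In code, the “solved” check looks roughly like this (a sketch, assuming each entry in scores_window is the episode score averaged over the agents):

from collections import deque
import numpy as np

scores_window = deque(maxlen=100)   # scores of the last 100 episodes

def is_solved(scores_window):
    """Solved when the average score over the last 100 episodes reaches +30."""
    return len(scores_window) == scores_window.maxlen and np.mean(scores_window) >= 30.0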

Agent Architecture

The agent created in this project consists of an Actor model, a Critic model, an Exploration Policy, and a Memory Unit.

Agent

The agent has the methods that interact with the environment: step(), act(), learn(), and some others.

The Actor model, Critic model, Exploration Policy, and Memory Unit are part of the agent (as attributes).
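The overall shape of the agent looks something like the sketch below. This is not the exact project code; the constructor arguments and hyperparameter values are assumptions, and the method bodies are only summarized.

class Agent:
    """Rough skeleton of the DDPG agent (illustrative, not the exact code)."""

    def __init__(self, state_size=33, action_size=4):
        self.actor_local = Actor(state_size, action_size)      # Actor model (defined below)
        self.actor_target = Actor(state_size, action_size)
        self.critic_local = Critic(state_size, action_size)    # Critic model (defined below)
        self.critic_target = Critic(state_size, action_size)
        self.noise = OUNoise(action_size)                      # Exploration Policy
        self.memory = ReplayBuffer(buffer_size=int(1e6), batch_size=128)  # Memory Unit

    def act(self, state):
        """Return the Actor's action for this state, plus exploration noise,
        clipped to [-1, 1]."""
        ...

    def step(self, state, action, reward, next_state, done):
        """Store the experience in memory and learn once enough samples exist."""
        ...

    def learn(self, experiences, gamma=0.99):
        """Critic: minimize the TD error against r + gamma * Q_target(s', actor_target(s'))
        for non-terminal s'. Actor: adjust the policy to maximize Q_local(s, actor_local(s)).
        Then soft-update the target networks toward the local ones."""
        ...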

Actor model

We have 3 layers in this neural network.

Input layer will receive all the 33 features in the state space.

Output layer will provide an array with 4 elements, as required by the action space.

Actor(
  (fc1): Linear(in_features=33, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=4, bias=True)
)
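In PyTorch, the Actor model above could look like the following sketch. The linear layers match the printout; the ReLU activations on the hidden layers and the tanh on the output (which keeps actions in [-1, 1]) are my assumptions, since the printout only shows the Linear modules.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Actor (policy) network: maps a state to a deterministic action."""

    def __init__(self, state_size=33, action_size=4, fc1_units=256, fc2_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # tanh keeps every action entry between -1 and 1, as the environment requires
        return torch.tanh(self.fc3(x))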

Critic model

For the Critic Neural Network we also have 3 layers.

Input layer will receive all the 33 features in the state space.

Output layer will provide just one value to guide the learning of the Actor model.

Critic(
  (fc1): Linear(in_features=33, out_features=256, bias=True)
  (fc2): Linear(in_features=260, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=1, bias=True)
)
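A matching sketch for the Critic is below. Note the second layer: its 260 input features are the 256 outputs of the first layer concatenated with the 4 action values, which is why the action enters the network only after the first hidden layer. The activations are again my assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Critic (value) network: maps a (state, action) pair to a single Q-value."""

    def __init__(self, state_size=33, action_size=4, fc1_units=256, fc2_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        # 256 + 4 = 260: the action is concatenated after the first hidden layer
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        xs = F.relu(self.fc1(state))
        x = torch.cat((xs, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)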

These neural networks were developed with PyTorch (https://pytorch.org/).

Memory Unit

With two simple methods, add() and sample(), this memory can store experiences and return random samples of experiences to be used in agent training.
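A minimal version of such a memory unit could look like this (a sketch: the buffer and batch sizes are assumed values, and the real code most likely converts the sampled batch to PyTorch tensors):

import random
from collections import deque, namedtuple
import numpy as np

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size memory unit with the two methods described above."""

    def __init__(self, buffer_size=int(1e6), batch_size=128):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        """Store a single experience tuple."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return a random batch of experiences for training."""
        batch = random.sample(self.memory, k=self.batch_size)
        states = np.vstack([e.state for e in batch])
        actions = np.vstack([e.action for e in batch])
        rewards = np.vstack([e.reward for e in batch])
        next_states = np.vstack([e.next_state for e in batch])
        dones = np.vstack([e.done for e in batch]).astype(np.uint8)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)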

Exploration Policy

The Exploration Policy is important for our agent because it helps the agent try different actions and avoid getting stuck in a local minimum.

My Exploration Policy is based on the Ornstein-Uhlenbeck process.
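A common implementation of this process looks like the sketch below; theta=0.15 and sigma=0.2 are the values suggested in the original DDPG paper, not necessarily the exact ones tuned for this project.

import copy
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise added to the actions."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean at the start of each episode."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Drift back toward the mean, plus random exploration noise."""
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.state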

Now, let’s see how our DDPG algorithm with multiple agents solves this environment.

My arm training to be the Sly!

The multi-agent approach takes a long time per episode (we have several calculations to do and two neural networks to train), but we solved this environment in 56 episodes!

My agent after solving the environment

You can find the code in my GitHub repository.

Ideas for future work

New combinations of hyperparameters

This algorithm takes a lot of time to learn, so it’s complicated to test many combinations of hyperparameters, but we can continue the training process to find better hyperparameters.

Implement Prioritized Experience Replay

If you read my last article, you know that I used Prioritized Experience Replay there. This memory is very good at providing the agent with the experiences that benefit the training process the most, but it will increase the agent’s training time.

This prioritized memory is described here.

Change the Exploration Policy

I used the Ornstein-Uhlenbeck process as recommended by the original DDPG authors, but more recent papers suggest there is no benefit to using the Ornstein-Uhlenbeck process instead of Gaussian noise. Switching could improve the agent’s performance.
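For reference, the swap could be as simple as replacing the noise class with a Gaussian one like the sketch below (sigma=0.1 is just an assumed scale); actions would still be clipped to [-1, 1] after the noise is added.

import numpy as np

class GaussianNoise:
    """Simpler alternative to the Ornstein-Uhlenbeck process:
    independent Gaussian noise at every step."""

    def __init__(self, size, sigma=0.1, seed=None):
        self.size = size
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def reset(self):
        pass  # no internal state to reset, unlike the Ornstein-Uhlenbeck process

    def sample(self):
        return self.rng.normal(0.0, self.sigma, self.size)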

That’s all, folks! Thank you so much for reading, and don’t forget to leave your comments below!
