Package RL¶
This package contains three classes of RL agent:

policyLearning
- Trains a neural network to predict the Q-values of the discrete actions taken by a predefined policy for the boat behavior.
dqn
- Trains a neural network to find the optimal policy to avoid stall with discrete actions.
DDPG
- Trains a neural network to find the optimal policy to avoid stall with a continuous set of actions.
You’ll find a tutorial at the bottom of the page explaining how to generate a training scenario.
Policy Learner¶
class policyLearning.PolicyLearner(state_size, action_size, batch_size)¶
Bases: object
The aim of this class is to learn the Q-value of the action defined by a policy.
Tip
Please note that the policy to learn has to be defined in the methods actUnderPolicy() and actDeterministicallyUnderPolicy().
Variables: - state_size (int) – shape of the input (for convolutional layers).
- action_size (int) – number of actions output by the network.
- memory (deque) – last-in first-out list of the batch.
- gamma (float) – discount factor.
- epsilon (float) – exploration rate.
- epsilon_min (float) – smallest exploration rate that we want to converge to.
- epsilon_decay (float) – decay factor that we apply after each replay.
- learning_rate (float) – the learning rate of the NN.
- model (keras.model) – the NN, i.e. the model containing the weights of the value estimator.
act(state)¶
Calculate the action that yields the maximum Q-value.
Parameters: state – state in which we want to choose an action.
Returns: the greedy action.
actDeterministicallyUnderPolicy(state)¶
Policy that reattaches the flow when the angle of attack goes higher than 16 degrees.
Parameters: state (np.array) – state for which we want to know the policy action.
Returns: the policy action.
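For reference, a threshold policy of this kind can be sketched as follows; the state layout (most recent angle of attack stored in the last column of the history) and the action indices are assumptions for illustration only.

import numpy as np

TORAD = np.pi / 180  # degrees-to-radians conversion factor

def thresholdPolicy(state, policy_angle=16):
    # state[0, -1] is assumed to hold the most recent angle of attack [rad].
    if state[0, -1] > policy_angle * TORAD:
        return 1  # action that bears away to reattach the flow
    return 0      # otherwise keep the current behavior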
actRandomly()¶
Choose an action at random.
actUnderPolicy(state)¶
Does the same as actDeterministicallyUnderPolicy(), except that the returned action is sometimes taken randomly.
evaluate(state)¶
Evaluate the Q-values of the two actions in a given state using the neural network.
Parameters: state (np.array) – state that we want to evaluate.
Returns: the action values as a vector.
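As an illustrative usage, assuming agent and state are set up as in the tutorial at the bottom of the page:

import numpy as np

q_values = agent.evaluate(state)          # Q-value vector, one entry per action
greedy_action = int(np.argmax(q_values))  # same choice as agent.act(state)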
evaluateNextAction(stall)¶
Evaluate the next action without updating the stall state, in order to use it during the experience replay.
Parameters: state (np.array) – state for which we want to know the policy action.
Returns: the policy action.
get_stall()¶
init_stall(mean, mdp)¶
Parameters: - mean –
- mdp –
Returns:
load(name)¶
Load the network weights saved in a file into the model attribute.
Parameters: name – name of the file containing the weights to load.
remember(state, action, reward, next_state, stall)¶
Remember a transition [s, a, r, s’] in which action action, taken from state state, leads to the next state next_state with reward reward.
Parameters: - state (np.array) – initial state (s).
- action (int) – action (a).
- reward (float) – reward received from transition (r).
- next_state (np.array) – final state (s’).
- stall (int) – flow state in the final state (s’).
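Internally the replay memory is a bounded deque; a minimal sketch of what remember does, with the capacity chosen arbitrarily for illustration:

from collections import deque

memory = deque(maxlen=2000)  # example capacity; oldest transitions are dropped when full

def remember(state, action, reward, next_state, stall):
    # Each transition is stored as a tuple for later sampling in replay().
    memory.append((state, action, reward, next_state, stall))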
replay(batch_size)¶
Perform the learning on a minibatch sampled from the experience replay memory.
Parameters: batch_size – number of samples taken from the experience replay memory for the fit.
Returns: the average loss over the replay batch.
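A rough sketch of the target computation behind this kind of update, assuming the network is fitted on one Q-value vector per sampled state; the discount factor and all variable names below are illustrative:

import random
import numpy as np

def replay_targets(agent, batch_size, gamma=0.95):
    # Sample a minibatch of stored transitions and build regression targets.
    minibatch = random.sample(agent.memory, batch_size)
    states, targets = [], []
    for state, action, reward, next_state, stall in minibatch:
        next_action = agent.evaluateNextAction(stall)           # action the fixed policy takes next
        next_q = np.ravel(agent.evaluate(next_state))[next_action]
        target_vector = np.ravel(agent.evaluate(state)).copy()  # keep the other action value unchanged
        target_vector[action] = reward + gamma * next_q         # bootstrapped policy-evaluation target
        states.append(state)
        targets.append(target_vector)
    return states, targets  # these pairs are then used to fit the Keras model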
save(name)¶
Save the weights of the network.
Parameters: name – name of the file where the weights are saved.
DQN¶
class dqn.DQNAgent(state_size, action_size)¶
Bases: object
DQN agent that aims at learning the optimal policy for stall avoidance with a discrete set of available actions.
Variables: - state_size (np.shape()) – shape of the input.
- action_size (int) – number of actions.
- memory (deque()) – memory as a list.
- gamma (float) – Discount rate.
- epsilon (float) – exploration rate.
- epsilon_min (float) – minimum exploration rate.
- epsilon_decay (float) – decay of the exploration rate.
- learning_rate (float) – initial learning rate for the gradient descent
- model (keras.model) – neural network model
act(state)¶
Act ε-greedily with respect to the current Q-values output by the network.
Parameters: state – state from which we want to use the network to compute the action to take.
Returns: a random action with probability ε, or the greedy action with probability 1-ε.
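A minimal sketch of ε-greedy selection over a discrete action set; the function below is illustrative and not part of the package API:

import random
import numpy as np

def epsilon_greedy(q_values, epsilon, action_size):
    # Explore with probability epsilon, otherwise exploit the current estimate.
    if random.random() < epsilon:
        return random.randrange(action_size)
    return int(np.argmax(q_values))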
actDeterministically(state)¶
Predict the action with the highest Q-value in a given state.
Parameters: state – state from which we want to know the action to take.
Returns: the greedy action.
load(name)¶
Load the weights for a defined architecture.
Parameters: name – name of the source file.
loadModel(name)¶
Load an architecture from a source file.
Parameters: name – name of the source file.
remember(state, action, reward, next_state)¶
Remember a transition [s, a, r, s’] in which action action, taken from state state, leads to the next state next_state with reward reward.
Parameters: - state (np.array) – initial state (s).
- action (int) – action (a).
- reward (float) – reward received from transition (r).
- next_state (np.array) – final state (s’).
replay(batch_size)¶
Core of the algorithm: Q-update according to the current weights of the network.
Parameters: batch_size (int) – batch size for the batch gradient descent.
Returns: the loss after the batch gradient descent.
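The update follows the standard Q-learning rule on a replay minibatch; as a reminder, the regression target for a sampled transition can be sketched as below (the discount factor is an illustrative value):

import numpy as np

def q_learning_target(reward, next_q_values, gamma=0.95):
    # Bellman target used to fit the network: y = r + gamma * max_a' Q(s', a').
    return reward + gamma * np.max(next_q_values)

Typically this target replaces only the entry of the predicted Q-vector corresponding to the action actually taken, and the network is then fitted on that modified vector.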
save(name)¶
Save the weights for a defined architecture.
Parameters: name – name of the output file.
saveModel(name)¶
Save the model’s weights and architecture.
Parameters: name – name of the output file.
DDPG¶
class DDPG.DDPGAgent(state_size, action_size, lower_bound, upper_bound, sess)¶
Bases: object
The aim of this class is to learn an optimal policy via an actor-critic structure with two separate convolutional neural networks. It uses the Deep Deterministic Policy Gradient (DDPG) to update the actor network. The model deals with a continuous space of actions on the rudder, chosen between lower_bound and upper_bound (see the sketch after the variable list below).
Parameters: - state_size (int) – length of the state input (for convolutional layers).
- action_size (int) – number of continuous actions output by the network.
- lower_bound (float) – minimum value for rudder action.
- upper_bound (float) – maximum value for rudder action.
- sess (tensorflow.session) – initialized tensorflow session within which the agent will be trained.
Variables: - memory (deque) – last-in first-out list of the batch buffer.
- gamma (float) – discount factor.
- epsilon (float) – exploration rate.
- epsilon_min (float) – smallest exploration rate that we want to converge to.
- epsilon_decay (float) – decay factor that we apply after each replay.
- actor_learning_rate (float) – the learning rate of the actor NN.
- critic_learning_rate (float) – the learning rate of the critic NN.
- update_target (float) – soft-update factor of the target networks at each fit.
- network (DDPGNetworks.Network) – tensorflow model which defines the actor and critic convolutional neural networks.
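For orientation, DDPG keeps slowly-tracking target copies of the actor and critic; a schematic sketch of the soft update that a factor like update_target controls (the value of tau and the weight representation are assumptions):

def soft_update(weights, target_weights, tau=0.001):
    # theta_target <- tau * theta + (1 - tau) * theta_target, applied layer by layer.
    return [tau * w + (1.0 - tau) * tw for w, tw in zip(weights, target_weights)]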
act(state)¶
Calculate the action given by the actor network’s current weights.
Parameters: state – state in which we want to choose an action.
Returns: the greedy action according to the actor network.
act_epsilon_greedy(state)¶
With probability epsilon, returns a random action between the bounds; with probability 1 - epsilon, returns the action given by the actor network’s current weights.
Parameters: state – state in which we want to choose an action.
Returns: a random action or the action given by the actor.
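A minimal sketch of this exploration scheme for a continuous, bounded action (the function below is illustrative and not part of the package API):

import random

def continuous_epsilon_greedy(actor_action, epsilon, lower_bound, upper_bound):
    # Explore uniformly within the rudder bounds, otherwise trust the actor.
    if random.random() < epsilon:
        return random.uniform(lower_bound, upper_bound)
    return actor_action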
evaluate(state, action)¶
Evaluate the Q-value of a state-action pair using the critic neural network.
Parameters: - state (np.array) – state that we want to evaluate.
- action (float) – action that we want to evaluate (has to be between permitted bounds)
Returns: The continuous action value.
load(name)¶
Load the weights of the two networks saved in a file into the network attribute.
Parameters: name – name of the file containing the weights to load.
noise_decay(e)¶
Applies decay to the noisy epsilon-greedy actions.
Parameters: e – index of the current episode during learning.
remember(state, action, reward, next_state)¶
Remember a transition [s, a, r, s’] in which action action, taken from state state, leads to the next state next_state with reward reward.
Parameters: - state (np.array) – initial state (s).
- action (int) – action (a).
- reward (float) – reward received from transition (r).
- next_state (np.array) – final state (s’).
replay(batch_size)¶
Performs an update of both the actor and critic networks on a minibatch sampled from the experience replay memory.
Parameters: batch_size – number of samples used from the experience replay memory for the fit.
Returns: the average losses of the actor and critic over the replay batch.
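As a reminder of what such an update involves, here is a schematic sketch of the critic targets only (not of the repository’s tensorflow graph; the discount factor and the names are illustrative):

import numpy as np

def ddpg_critic_targets(rewards, next_target_q, gamma=0.99):
    # y_i = r_i + gamma * Q'(s'_i, mu'(s'_i)), both evaluated with the target networks.
    return np.asarray(rewards) + gamma * np.asarray(next_target_q)

The actor is then improved by gradient ascent on the critic’s estimate of the value of its own actions.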
save(name)¶
Save the weights of both networks into a .ckpt tensorflow checkpoint file.
Parameters: name – name of the file where the weights are saved.
update_target = None¶
Definition of the neural networks.
Visualization¶
We also provide a module to generate plots for visualizing the results.
class visualization.Visualization(hist_duration, mdp_step, time_step, action_size, batch_size, mean, std, hdg0, src_file, sim_time)¶
Bases: object
Class to generate different plots for result visualization.
Parameters: - hist_duration – Size of the history buffer.
- mdp_step – mdp step (frequency of decision).
- time_step – time step of the mdp.
- action_size – size of the action space of the model.
- batch_size – size of the batch used to train the model.
- mean – average wind heading.
- std – noise on wind heading.
- hdg0 – initial heading of the simulation.
- src_file – source file containing the weights of the model used for the simulation.
- sim_time – duration of the simulation.
generateAnimation(hdg0)¶
Generate an animation showing the two Q-values during an interesting control simulation including gusts.
Parameters: hdg0 – initial heading of the boat for the simulation.
generateDeltaAnimation(hdg0)¶
Generate an animation showing the difference between the two Q-values during an interesting control simulation including gusts.
Parameters: hdg0 – initial heading of the boat for the simulation.
generateQplots()¶
Creates the comparison between the Q-values predicted by the network and the Monte-Carlo return computed over the simulation time.
Returns: two plots of the comparison.
simulateDQNControl(hdg0)¶
Plots the control law of the network over a simulation.
Parameters: hdg0 – initial heading of the boat for the simulation.
Returns: a plot of the angle of attack and velocity during the control.
simulateGustsControl()¶
Simulate the response of the controller to gusts.
Returns: a plot of the simulation.
visualization.rollOut(time, SIMULATION_TIME, agent, mdp, action, WH)¶
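A hedged example of how the class might be driven once a network has been trained; every constructor value and the file name below are placeholders chosen for illustration, not values taken from the repository:

from visualization import Visualization

TORAD = 3.141592653589793 / 180  # degrees to radians

viz = Visualization(hist_duration=3, mdp_step=1, time_step=0.1,
                    action_size=2, batch_size=32,
                    mean=45 * TORAD, std=0.1 * TORAD,
                    hdg0=0, src_file="example_saved_weights",
                    sim_time=100)
viz.simulateDQNControl(hdg0=2 * TORAD)  # control-law plot for a given initial heading
viz.simulateGustsControl()              # response to gusts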
Tutorial¶
import random

import numpy as np

# MDP, TORAD and the wind generator `w` used below are assumed to be provided
# by the simulator part of this repository.
from policyLearning import PolicyLearner

history_duration = 3  # Duration of state history [s]
mdp_step = 1  # Step between each state transition [s]
time_step = 0.1  # time step [s] <-> 10 Hz frequency of data acquisition
mdp = MDP(history_duration, mdp_step, time_step)
mean = 45 * TORAD
std = 0 * TORAD
wind_samples = 10
WH = np.random.uniform(mean - std, mean + std, size=wind_samples)
hdg0 = 0 * np.ones(10)
mdp.initializeMDP(hdg0, WH)
hdg0_rand_vec = (-4, 0, 2, 4, 6, 8, 18, 20, 21, 22, 24)
action_size = 2
policy_angle = 18
agent = PolicyLearner(mdp.size, action_size, policy_angle)
#agent.load("policy_learning_i18_test_long_history")
batch_size = 120
EPISODES = 500
loss_of_episode = []  # average replay loss of each episode
for e in range(EPISODES):
    WH = w.generateWind()
    hdg0_rand = random.choice(hdg0_rand_vec) * TORAD
    hdg0 = hdg0_rand * np.ones(10)

    mdp.simulator.hyst.reset()  # we reinitialize the memory of the flow
    state = mdp.initializeMDP(hdg0, WH)
    loss_sim_list = []
    for time in range(80):
        WH = w.generateWind()
        action = agent.act(state)
        next_state, reward = mdp.transition(action, WH)
        # store the transition (together with the flow state in the final state)
        agent.remember(state, action, reward, next_state)
        state = next_state
        if len(agent.memory) >= batch_size:
            loss_sim_list.append(agent.replay(batch_size))
            print("time: {}, Loss = {}".format(time, loss_sim_list[-1]))
            print("i : {}".format(mdp.s[0, -1] / TORAD))
    # For data visualisation
    if loss_sim_list:
        loss_over_simulation_time = np.sum(np.array([loss_sim_list])[0]) / len(np.array([loss_sim_list])[0])
        loss_of_episode.append(loss_over_simulation_time)
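Once training is finished, you would typically save the learned weights and inspect the loss curve; a minimal follow-up sketch, assuming matplotlib is available and using an arbitrary output file name:

import matplotlib.pyplot as plt

agent.save("policy_learner_example_weights")  # arbitrary example file name

plt.plot(loss_of_episode)
plt.xlabel("Episode")
plt.ylabel("Average replay loss")
plt.show()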