Package RL

This package contains three classes of RL agents:

  • policyLearning - trains a neural network to predict the Q-values of the discrete actions of a policy governing the boat's behavior.
  • dqn - trains a neural network to find the optimal stall-avoidance policy with discrete actions.
  • DDPG - trains a neural network to find the optimal stall-avoidance policy with a continuous set of actions.

A tutorial explaining how to generate a training scenario can be found at the bottom of the page.

Policy Learner

class policyLearning.PolicyLearner(state_size, action_size, batch_size)

Bases: object

The aim of this class is to learn the Q-values of the actions taken under a given policy.

Tip

Please note that the policy to be learned has to be defined in the methods actUnderPolicy() and actDeterministicallyUnderPolicy().
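
For illustration only, here is a minimal sketch of what such a policy method might look like. It assumes that the most recent angle of attack is stored in the last column of the first row of the state, and that action 1 means "bear away to reattach the flow" while action 0 means "do nothing"; both the state layout and the action indices are assumptions, not taken from the source.

 import numpy as np

 TORAD = np.pi / 180  # degrees-to-radians factor, assumed to match the project's constant

 def actDeterministicallyUnderPolicy(self, state):
     # Hypothetical policy: reattach the flow (action 1) as soon as the
     # angle of attack exceeds 16 degrees, otherwise keep action 0.
     angle_of_attack = state[0, -1]   # assumed: first row of the state, most recent sample
     if angle_of_attack > 16 * TORAD:
         return 1                     # assumed "reattach" action
     return 0                         # assumed "do nothing" action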

Variables:
  • state_size (int) – shape of the input (for convolutional layers).
  • action_size (int) – number of actions output by the network.
  • memory (deque) – replay memory buffer storing past transitions.
  • gamma (float) – discount factor.
  • epsilon (float) – exploration rate.
  • epsilon_min (float) – smallest exploration rate that we want to converge to.
  • epsilon_decay (float) – decay factor applied to epsilon after each replay.
  • learning_rate (float) – learning rate of the neural network.
  • model (keras.model) – the neural network, i.e. the model containing the weights of the value estimator.
act(state)

Calculate the action that yields the maximum Q-value.

Parameters:state – state in which we want to choose an action.
Returns:the greedy action.
actDeterministicallyUnderPolicy(state)

Policy that reattaches the flow when the angle of attack exceeds 16 degrees.

Parameters:state (np.array) – state for which we want to know the policy action.
Returns:the policy action.
actRandomly()
actUnderPolicy(state)

Does the same as actDeterministicallyUnderPolicy(), except that the returned action is sometimes taken randomly.

evaluate(state)

Evaluate the Q-values of the two actions in a given state using the neural network.

Parameters:state (np.array) – state that we want to evaluate.
Returns:the action values as a vector.
evaluateNextAction(stall)

Evaluate the next action without updating the stall state, in order to use it during the experience replay.

Parameters:stall – stall state for which we want to know the policy action.
Returns:the policy action.

get_stall()
init_stall(mean, mdp)
Parameters:
  • mean
  • mdp
Returns:

load(name)

Load the weights of the network saved in the file into the model attribute.

Parameters:name – name of the file containing the weights to load.

remember(state, action, reward, next_state, stall)

Remember a transition [s, a, r, s’]: an action action taken from a state state, yielding a reward reward and a next state next_state.

Parameters:
  • state (np.array) – initial state (s).
  • action (int) – action (a).
  • reward (float) – reward received from transition (r).
  • next_state (np.array) – final state (s’).
  • stall (int) – flow state in the final state (s’).
replay(batch_size)

Perform the learning on the experience replay memory.

Parameters:batch_size – number of samples used in the experience replay memory for the fit.
Returns:the average loss over the replay batch.
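
For reference, here is a minimal sketch of how remember() and replay() are typically combined during training. The variables agent, mdp, state, WH and batch_size are assumed to exist already, and the stall argument is assumed to come from get_stall(); a complete, working setup is given in the tutorial at the bottom of the page.

 # Illustrative training loop (see the tutorial below for a full setup).
 for t in range(80):
     action = agent.act(state)                        # greedy action w.r.t. the current Q estimate
     next_state, reward = mdp.transition(action, WH)  # one MDP step
     stall = agent.get_stall()                        # assumed source of the flow state
     agent.remember(state, action, reward, next_state, stall)
     state = next_state
     if len(agent.memory) >= batch_size:
         loss = agent.replay(batch_size)              # fit on a batch; epsilon decays after each replay
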
save(name)

Save the weights of the network.

Parameters:name – name of the file where the weights are saved.

DQN

class dqn.DQNAgent(state_size, action_size)

Bases: object

DQN agent that aims at learning the optimal policy for stall avoidance with a discrete set of available actions.

Variables:
  • state_size (np.shape()) – shape of the input.
  • action_size (int) – number of actions.
  • memory (deque) – replay memory buffer storing past transitions.
  • gamma (float) – discount rate.
  • epsilon (float) – exploration rate.
  • epsilon_min (float) – minimum exploration rate.
  • epsilon_decay (float) – decay of the exploration rate.
  • learning_rate (float) – initial learning rate for the gradient descent.
  • model (keras.model) – neural network model.
act(state)

Act ε-greedy with respect to the current Q-values output by the network.

Parameters:state – state from which we want to use the network to compute the action to take.
Returns:a random action with probability ε, or the greedy action with probability 1-ε.

actDeterministically(state)

Predict the action with the highest Q-value in a given state.

Parameters:state – state from which we want to know the action to take.
Returns:the greedy action.

load(name)

Load the weights for a defined architecture.

Parameters:name – name of the source file.

loadModel(name)

Load an architecture from a source file.

Parameters:name – name of the source file.

remember(state, action, reward, next_state)

Remember a transition [s, a, r, s’]: an action action taken from a state state, yielding a reward reward and a next state next_state.

Parameters:
  • state (np.array) – initial state (s).
  • action (int) – action (a).
  • reward (float) – reward received from transition (r).
  • next_state (np.array) – final state (s’).
replay(batch_size)

Core of the algorithm: Q-value update according to the current weights of the network.

Parameters:batch_size (int) – batch size for the batch gradient descent.
Returns:the loss after the batch gradient descent.
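
For context, the Q-update that replay() presumably builds on is the standard DQN target; the sketch below shows it in the usual Keras style, assuming an agent and a batch_size already exist, and is an illustration rather than the exact code of dqn.py.

 # Standard DQN replay sketch (illustrative; dqn.py may differ in details).
 import random
 import numpy as np

 minibatch = random.sample(agent.memory, batch_size)
 for state, action, reward, next_state in minibatch:
     # Bellman target: r + gamma * max_a' Q(s', a')
     target = reward + agent.gamma * np.amax(agent.model.predict(next_state)[0])
     target_f = agent.model.predict(state)   # current Q-values for every action
     target_f[0][action] = target            # only the taken action's value is moved
     agent.model.fit(state, target_f, epochs=1, verbose=0)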

save(name)

Save the weights for a defined architecture.

Parameters:name – name of the output file.

saveModel(name)

Save the model’s weights and architecture.

Parameters:name – name of the output file.
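
To make the difference with save()/load() concrete, here is a small sketch of the two persistence paths; the input shape and file names are placeholders.

 from dqn import DQNAgent

 agent = DQNAgent(state_size=(3, 30), action_size=2)  # placeholder shape

 agent.save("dqn_weights")      # weights only; the architecture must already be defined in code
 agent.load("dqn_weights")

 agent.saveModel("dqn_model")   # weights and architecture together
 agent.loadModel("dqn_model")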

DDPG

class DDPG.DDPGAgent(state_size, action_size, lower_bound, upper_bound, sess)

Bases: object

The aim of this class is to learn an optimal policy via an actor-critic structure with two separate convolutional neural networks. It uses the Deep Deterministic Policy Gradient (DDPG) to update the actor network. This model deals with a continuous space of rudder actions, chosen between lower_bound and upper_bound.

Parameters:
  • state_size (int) – length of the state input (for convolutional layers).
  • action_size (int) – number of continuous actions output by the network.
  • lower_bound (float) – minimum value of the rudder action.
  • upper_bound (float) – maximum value of the rudder action.
  • sess (tensorflow.Session) – initialized TensorFlow session within which the agent will be trained.
Variables:
  • memory (deque) – replay memory buffer storing past transitions.
  • gamma (float) – discount factor.
  • epsilon (float) – exploration rate.
  • epsilon_min (float) – smallest exploration rate that we want to converge to.
  • epsilon_decay (float) – decay factor that we apply after each replay.
  • actor_learning_rate (float) – learning rate of the actor network.
  • critic_learning_rate (float) – learning rate of the critic network.
  • update_target (float) – update factor applied to the target networks at each fit.
  • network (DDPGNetworks.Network) – TensorFlow model defining the actor and critic convolutional neural networks.
act(state)

Calculate the action given by the actor network’s current weights.

Parameters:state – state in which we want to choose an action.
Returns:the greedy action according to the actor network.
act_epsilon_greedy(state)

With probability epsilon, returns a random action between the bounds; with probability 1 - epsilon, returns the action given by the actor network’s current weights.

Parameters:state – state in which we want to choose an action.
Returns:a random action, or the action given by the actor.
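
A sketch of that selection rule is given below; it assumes lower_bound and upper_bound are stored on the agent, which is an assumption since only the constructor parameters are documented above.

 import numpy as np

 def act_epsilon_greedy_sketch(agent, state):
     # Illustrative only: continuous epsilon-greedy action selection.
     if np.random.rand() <= agent.epsilon:
         # random rudder action drawn between the permitted bounds
         return np.random.uniform(agent.lower_bound, agent.upper_bound)
     return agent.act(state)  # action given by the actor network's current weights
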
evaluate(state, action)

Evaluate the Q-value of a state-action pair using the critic neural network.

Parameters:
  • state (np.array) – state that we want to evaluate.
  • action (float) – action that we want to evaluate (has to be between the permitted bounds).
Returns:the value of the continuous action, i.e. its Q-value.

load(name)

Load the weights of the two networks saved in the file into the network attribute.

Parameters:name – name of the file containing the weights to load.

noise_decay(e)

Applies decay to the noisy epsilon-greedy actions.

Parameters:e – index of the current episode during learning.
remember(state, action, reward, next_state)

Remember a transition [s, a, r, s’]: an action action taken from a state state, yielding a reward reward and a next state next_state.

Parameters:
  • state (np.array) – initial state (s).
  • action (int) – action (a).
  • reward (float) – reward received from transition (r).
  • next_state (np.array) – final state (s’).
replay(batch_size)

Performs an update of both the actor and critic networks on a minibatch sampled from the experience replay memory.

Parameters:batch_size – number of samples used in the experience replay memory for the fit.
Returns:the average losses for actor and critic over the replay batch.
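
For reference, the quantities computed by a textbook DDPG minibatch update are sketched below with dummy numpy arrays, so that the roles of gamma and update_target are explicit; this is the standard formulation, not necessarily the exact code of the class.

 # Textbook DDPG minibatch update, sketched with dummy numpy arrays
 # (the real update runs in TensorFlow through DDPGNetworks.Network).
 import numpy as np

 N = 32                                        # minibatch size (arbitrary here)
 rewards = np.zeros(N)                         # r_i for the sampled transitions
 q_next = np.zeros(N)                          # Q'(s'_i, mu'(s'_i)) from the target critic
 q_current = np.zeros(N)                       # Q(s_i, a_i) from the critic
 gamma, tau = 0.99, 0.001                      # discount factor and update_target factor (example values)

 y = rewards + gamma * q_next                  # critic regression target
 critic_loss = np.mean((y - q_current) ** 2)   # the critic is fit to minimize this
 # The actor is then moved along the gradient of Q(s_i, mu(s_i)) with respect to its
 # own weights, and both target networks are softly updated:
 #     theta_target <- tau * theta + (1 - tau) * theta_target
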
save(name)

Save the weights of both networks into a .ckpt TensorFlow session file.

Parameters:name – name of the file where the weights are saved.

update_target = None

Definition of the neural networks

Visualization

We also provide a module for generating plots to visualize the results.

class visualization.Visualization(hist_duration, mdp_step, time_step, action_size, batch_size, mean, std, hdg0, src_file, sim_time)

Bases: object

Class to generate different plots for result visualization.

Parameters:
  • hist_duration – Size of the history buffer.
  • mdp_step – mdp step (frequency of decision).
  • time_step – time step of the mdp.
  • action_size – size of the action space of the model.
  • batch_size – size of the batch used to train the model.
  • mean – average wind heading.
  • std – noise on wind heading.
  • hdg0 – initial heading of the simulation.
  • src_file – source file containing the weights of the model used for the simulation.
  • sim_time – duration of the simulation.
generateAnimation(hdg0)

Generate an animation showing the two Q-values during an interesting control simulation including gusts.

Parameters:hdg0 – Initial heading of the boat for the simulation
generateDeltaAnimation(hdg0)

Generate an animation showing the differences between the two Q-values during an interesting control simulation including gusts.

Parameters:hdg0 – Initial heading of the boat for the simulation
generateQplots()

Creates the comparison between the Q-values predicted by the network and the Monte-Carlo return computed over the simulation time.

Returns:two plots of the comparison.

simulateDQNControl(hdg0)

Plots the control law of the network over a simulation.

Parameters:hdg0 – Initial heading of the boat for the simulation.
Returns:A plot of the angle of attack and velocity during the control.
simulateGustsControl()

Simulate the response of the controller to gusts.

Returns:A plot of the simulation.
visualization.rollOut(time, SIMULATION_TIME, agent, mdp, action, WH)
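
For illustration, a Visualization object could be driven as follows; the numeric values and the weight-file path are placeholders chosen for this sketch, not values from the source.

 import numpy as np
 from visualization import Visualization

 TORAD = np.pi / 180   # assumed degrees-to-radians constant
 viz = Visualization(hist_duration=3, mdp_step=1, time_step=0.1,
                     action_size=2, batch_size=120,
                     mean=45 * TORAD, std=0.1 * TORAD,
                     hdg0=0 * np.ones(10),
                     src_file="path/to/saved_weights",
                     sim_time=100)
 viz.simulateDQNControl(hdg0=2 * TORAD * np.ones(10))  # control law over a simulation
 viz.generateQplots()                                  # predicted Q-values vs Monte-Carlo returns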

Tutorial

 import random
 import numpy as np
 from policyLearning import PolicyLearner
 # The MDP class, the TORAD constant and the wind generator `w` used below come
 # from this project's simulator package; adjust these imports to your setup.

 history_duration = 3  # Duration of state history [s]
 mdp_step = 1  # Step between each state transition [s]
 time_step = 0.1  # time step [s] <-> 10Hz frequency of data acquisition
 mdp = MDP(history_duration, mdp_step, time_step)

 mean = 45 * TORAD
 std = 0 * TORAD
 wind_samples = 10
 WH = np.random.uniform(mean - std, mean + std, size=10)

 hdg0=0*np.ones(10)
 mdp.initializeMDP(hdg0,WH)

 hdg0_rand_vec=(-4,0,2,4,6,8,18,20,21,22,24)

 action_size = 2
 policy_angle = 18
 agent = PolicyLearner(mdp.size, action_size, policy_angle)
 #agent.load("policy_learning_i18_test_long_history")
 batch_size = 120

 EPISODES = 500
 loss_of_episode = []  # average replay loss for each episode

 for e in range(EPISODES):
     WH = w.generateWind()
     hdg0_rand = random.choice(hdg0_rand_vec) * TORAD
     hdg0 = hdg0_rand * np.ones(10)

     mdp.simulator.hyst.reset()

     #  We reinitialize the memory of the flow
     state = mdp.initializeMDP(hdg0, WH)
     loss_sim_list = []
     for time in range(80):
         WH = w.generateWind()
         action = agent.act(state)
         next_state, reward = mdp.transition(action, WH)
         # store the transition (together with the flow state in the final state)
         agent.remember(state, action, reward, next_state)
         state = next_state
         if len(agent.memory) >= batch_size:
             loss_sim_list.append(agent.replay(batch_size))
             print("time: {}, Loss = {}".format(time, loss_sim_list[-1]))
             print("i : {}".format(mdp.s[0, -1] / TORAD))
         # For data visualisation
     loss_over_simulation_time = np.sum(np.array([loss_sim_list])[0]) / len(np.array([loss_sim_list])[0])
     loss_of_episode.append(loss_over_simulation_time)
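
After training, the learned weights can be saved and the per-episode loss inspected. The short follow-up below is illustrative: the file name and the use of matplotlib are choices made for this sketch.

 agent.save("policy_learning_example")   # file name is a placeholder

 import matplotlib.pyplot as plt
 plt.plot(loss_of_episode)               # average replay loss for each episode
 plt.xlabel("episode")
 plt.ylabel("average loss")
 plt.show()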