Package RL¶
This package contains three classes of RL agent:

policyLearning
- Trains a neural network to predict the Q-values of the discrete actions taken by a predefined policy for the boat behavior.
dqn
- Trains a neural network to find the optimal policy to avoid stall with discrete actions.
DDPG
- Trains a neural network to find the optimal policy to avoid stall with a continuous set of actions.
You’ll find a tutorial at the bottom of the page explaining how to generate a training scenario.
Policy Learner¶
class policyLearning.PolicyLearner(state_size, action_size, batch_size)¶
Bases: object
The aim of this class is to learn the Q-value of the action defined by a policy.
Tip
Please note that the policy to learn has to be defined in the methods actUnderPolicy() and actDeterministicallyUnderPolicy().
Variables: - state_size (int) – shape of the input (for convolutional layers).
- action_size (int) – number of actions output by the network.
- memory (deque) – last-in first-out list of the batch.
- gamma (float) – discount factor.
- epsilon (float) – exploration rate.
- epsilon_min (float) – smallest exploration rate that we want to converge to.
- epsilon_decay (float) – decay factor that we apply after each replay.
- learning_rate (float) – the learning rate of the NN.
- model (keras.model) – the NN, i.e. the model containing the weights of the value estimator.
act(state)¶
Calculate the action that yields the maximum Q-value.
Parameters: state – state in which we want to choose an action.
Returns: the greedy action.
actDeterministicallyUnderPolicy(state)¶
Policy that reattaches the flow when the angle of attack goes higher than 16 degrees.
Parameters: state (np.array) – state for which we want to know the policy action.
Returns: the policy action.
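For reference, a threshold policy of this kind can be sketched as follows; the state layout (most recent angle of attack stored in the last column of the history) and the action indices are assumptions for illustration only.

import numpy as np

TORAD = np.pi / 180  # degrees-to-radians conversion factor

def thresholdPolicy(state, policy_angle=16):
    # state[0, -1] is assumed to hold the most recent angle of attack [rad].
    if state[0, -1] > policy_angle * TORAD:
        return 1  # action that bears away to reattach the flow
    return 0      # otherwise keep the current behavior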
actRandomly()¶
Choose an action at random.
actUnderPolicy(state)¶
Does the same as actDeterministicallyUnderPolicy(), except that the returned action is sometimes taken randomly.
evaluate(state)¶
Evaluate the Q-values of the two actions in a given state using the neural network.
Parameters: state (np.array) – state that we want to evaluate.
Returns: the action values as a vector.
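As an illustrative usage, assuming agent and state are set up as in the tutorial at the bottom of the page:

import numpy as np

q_values = agent.evaluate(state)          # Q-value vector, one entry per action
greedy_action = int(np.argmax(q_values))  # same choice as agent.act(state)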
evaluateNextAction(stall)¶
Evaluate the next action without updating the stall state, in order to use it during the experience replay.
Parameters: state (np.array) – state for which we want to know the policy action.
Returns: the policy action.
get_stall()¶
init_stall(mean, mdp)¶
Parameters: - mean –
- mdp –
Returns:
load(name)¶
Load the network weights saved in a file into the model attribute.
Parameters: name – name of the file containing the weights to load.
remember(state, action, reward, next_state, stall)¶
Remember a transition [s, a, r, s’] in which action action, taken from state state, leads to the next state next_state with reward reward.
Parameters: - state (np.array) – initial state (s).
- action (int) – action (a).
- reward (float) – reward received from transition (r).
- next_state (np.array) – final state (s’).
- stall (int) – flow state in the final state (s’).
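Internally the replay memory is a bounded deque; a minimal sketch of what remember does, with the capacity chosen arbitrarily for illustration:

from collections import deque

memory = deque(maxlen=2000)  # example capacity; oldest transitions are dropped when full

def remember(state, action, reward, next_state, stall):
    # Each transition is stored as a tuple for later sampling in replay().
    memory.append((state, action, reward, next_state, stall))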
replay(batch_size)¶
Perform the learning on a minibatch sampled from the experience replay memory.
Parameters: batch_size – number of samples taken from the experience replay memory for the fit.
Returns: the average loss over the replay batch.
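A rough sketch of the target computation behind this kind of update, assuming the network is fitted on one Q-value vector per sampled state; the discount factor and all variable names below are illustrative:

import random
import numpy as np

def replay_targets(agent, batch_size, gamma=0.95):
    # Sample a minibatch of stored transitions and build regression targets.
    minibatch = random.sample(agent.memory, batch_size)
    states, targets = [], []
    for state, action, reward, next_state, stall in minibatch:
        next_action = agent.evaluateNextAction(stall)           # action the fixed policy takes next
        next_q = np.ravel(agent.evaluate(next_state))[next_action]
        target_vector = np.ravel(agent.evaluate(state)).copy()  # keep the other action value unchanged
        target_vector[action] = reward + gamma * next_q         # bootstrapped policy-evaluation target
        states.append(state)
        targets.append(target_vector)
    return states, targets  # these pairs are then used to fit the Keras model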
save(name)¶
Save the weights of the network.
Parameters: name – name of the file where the weights are saved.
DQN¶
class dqn.DQNAgent(state_size, action_size)¶
Bases: object
DQN agent that aims at learning the optimal policy for stall avoidance with a discrete set of available actions.
Variables: - state_size (np.shape()) – shape of the input.
- action_size (int) – number of actions.
- memory (deque()) – memory as a list.
- gamma (float) – Discount rate.
- epsilon (float) – exploration rate.
- epsilon_min (float) – minimum exploration rate.
- epsilon_decay (float) – decay of the exploration rate.
- learning_rate (float) – initial learning rate for the gradient descent
- model (keras.model) – neural network model
act(state)¶
Act ε-greedily with respect to the current Q-values output by the network.
Parameters: state – state from which we want to use the network to compute the action to take.
Returns: a random action with probability ε, or the greedy action with probability 1-ε.
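A minimal sketch of ε-greedy selection over a discrete action set; the function below is illustrative and not part of the package API:

import random
import numpy as np

def epsilon_greedy(q_values, epsilon, action_size):
    # Explore with probability epsilon, otherwise exploit the current estimate.
    if random.random() < epsilon:
        return random.randrange(action_size)
    return int(np.argmax(q_values))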
actDeterministically(state)¶
Predict the action with the highest Q-value in a given state.
Parameters: state – state from which we want to know the action to take.
Returns: the greedy action.
load(name)¶
Load the weights for a defined architecture.
Parameters: name – name of the source file.
loadModel(name)¶
Load an architecture from a source file.
Parameters: name – name of the source file.
remember(state, action, reward, next_state)¶
Remember a transition [s, a, r, s’] in which action action, taken from state state, leads to the next state next_state with reward reward.
Parameters: - state (np.array) – initial state (s).
- action (int) – action (a).
- reward (float) – reward received from transition (r).
- next_state (np.array) – final state (s’).
replay(batch_size)¶
Core of the algorithm: Q-update according to the current weights of the network.
Parameters: batch_size (int) – batch size for the batch gradient descent.
Returns: the loss after the batch gradient descent.
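The update follows the standard Q-learning rule on a replay minibatch; as a reminder, the regression target for a sampled transition can be sketched as below (the discount factor is an illustrative value):

import numpy as np

def q_learning_target(reward, next_q_values, gamma=0.95):
    # Bellman target used to fit the network: y = r + gamma * max_a' Q(s', a').
    return reward + gamma * np.max(next_q_values)

Typically this target replaces only the entry of the predicted Q-vector corresponding to the action actually taken, and the network is then fitted on that modified vector.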
save(name)¶
Save the weights for a defined architecture.
Parameters: name – name of the output file.
saveModel(name)¶
Save the model’s weights and architecture.
Parameters: name – name of the output file.
DDPG¶
class DDPG.DDPGAgent(state_size, action_size, lower_bound, upper_bound, sess)¶
Bases: object
The aim of this class is to learn an optimal policy via an actor-critic structure with two separate convolutional neural networks. It uses the Deep Deterministic Policy Gradient (DDPG) to update the actor network. The model deals with a continuous space of actions on the rudder, chosen between lower_bound and upper_bound (see the sketch after the variable list below).
Parameters: - state_size (int) – length of the state input (for convolutional layers).
- action_size (int) – number of continuous actions output by the network.
- lower_bound (float) – minimum value for rudder action.
- upper_bound (float) – maximum value for rudder action.
- sess (tensorflow.session) – initialized tensorflow session within which the agent will be trained.
Variables: - memory (deque) – last-in first-out list of the batch buffer.
- gamma (float) – discount factor.
- epsilon (float) – exploration rate.
- epsilon_min (float) – smallest exploration rate that we want to converge to.
- epsilon_decay (float) – decay factor that we apply after each replay.
- actor_learning_rate (float) – the learning rate of the actor NN.
- critic_learning_rate (float) – the learning rate of the critic NN.
- update_target (float) – soft-update factor of the target networks at each fit.
- network (DDPGNetworks.Network) – tensorflow model which defines the actor and critic convolutional neural networks.
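For orientation, DDPG keeps slowly-tracking target copies of the actor and critic; a schematic sketch of the soft update that a factor like update_target controls (the value of tau and the weight representation are assumptions):

def soft_update(weights, target_weights, tau=0.001):
    # theta_target <- tau * theta + (1 - tau) * theta_target, applied layer by layer.
    return [tau * w + (1.0 - tau) * tw for w, tw in zip(weights, target_weights)]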
act(state)¶
Calculate the action given by the actor network’s current weights.
Parameters: state – state in which we want to choose an action.
Returns: the greedy action according to the actor network.
act_epsilon_greedy(state)¶
With probability epsilon, returns a random action between the bounds; with probability 1 - epsilon, returns the action given by the actor network’s current weights.
Parameters: state – state in which we want to choose an action.
Returns: a random action or the action given by the actor.
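A minimal sketch of this exploration scheme for a continuous, bounded action (the function below is illustrative and not part of the package API):

import random

def continuous_epsilon_greedy(actor_action, epsilon, lower_bound, upper_bound):
    # Explore uniformly within the rudder bounds, otherwise trust the actor.
    if random.random() < epsilon:
        return random.uniform(lower_bound, upper_bound)
    return actor_action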
evaluate(state, action)¶
Evaluate the Q-value of a state-action pair using the critic neural network.
Parameters: - state (np.array) – state that we want to evaluate.
- action (float) – action that we want to evaluate (has to be between permitted bounds)
Returns: The continuous action value.
load(name)¶
Load the weights of the two networks saved in a file into the network attribute.
Parameters: name – name of the file containing the weights to load.
noise_decay(e)¶
Applies decay to the noisy epsilon-greedy actions.
Parameters: e – index of the current episode during learning.
remember(state, action, reward, next_state)¶
Remember a transition [s, a, r, s’] in which action action, taken from state state, leads to the next state next_state with reward reward.
Parameters: - state (np.array) – initial state (s).
- action (int) – action (a).
- reward (float) – reward received from transition (r).
- next_state (np.array) – final state (s’).
replay(batch_size)¶
Performs an update of both the actor and critic networks on a minibatch sampled from the experience replay memory.
Parameters: batch_size – number of samples used from the experience replay memory for the fit.
Returns: the average losses of the actor and critic over the replay batch.
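As a reminder of what such an update involves, here is a schematic sketch of the critic targets only (not of the repository’s tensorflow graph; the discount factor and the names are illustrative):

import numpy as np

def ddpg_critic_targets(rewards, next_target_q, gamma=0.99):
    # y_i = r_i + gamma * Q'(s'_i, mu'(s'_i)), both evaluated with the target networks.
    return np.asarray(rewards) + gamma * np.asarray(next_target_q)

The actor is then improved by gradient ascent on the critic’s estimate of the value of its own actions.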
save(name)¶
Save the weights of both networks into a .ckpt tensorflow checkpoint file.
Parameters: name – name of the file where the weights are saved.
update_target = None¶
Definition of the neural networks.
Visualization¶
We also provide a module to generate plots for visualizing the results.
class visualization.Visualization(hist_duration, mdp_step, time_step, action_size, batch_size, mean, std, hdg0, src_file, sim_time)¶
Bases: object
Class to generate different plots for result visualization.
Parameters: - hist_duration – Size of the history buffer.
- mdp_step – mdp step (frequency of decision).
- time_step – time step of the mdp.
- action_size – size of the action space of the model.
- batch_size – size of the batch used to train the model.
- mean – average wind heading.
- std – noise on wind heading.
- hdg0 – initial heading of the simulation.
- src_file – source file containing the weights of the model used for the simulation.
- sim_time – duration of the simulation.
generateAnimation(hdg0)¶
Generate an animation showing the two Q-values during an interesting control simulation including gusts.
Parameters: hdg0 – initial heading of the boat for the simulation.
generateDeltaAnimation(hdg0)¶
Generate an animation showing the difference between the two Q-values during an interesting control simulation including gusts.
Parameters: hdg0 – initial heading of the boat for the simulation.
generateQplots()¶
Creates the comparison between the Q-values predicted by the network and the Monte-Carlo return computed over the simulation time.
Returns: two plots of the comparison.
simulateDQNControl(hdg0)¶
Plots the control law of the network over a simulation.
Parameters: hdg0 – initial heading of the boat for the simulation.
Returns: a plot of the angle of attack and velocity during the control.
simulateGustsControl()¶
Simulate the response of the controller to gusts.
Returns: a plot of the simulation.
visualization.rollOut(time, SIMULATION_TIME, agent, mdp, action, WH)¶
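A hedged example of how the class might be driven once a network has been trained; every constructor value and the file name below are placeholders chosen for illustration, not values taken from the repository:

from visualization import Visualization

TORAD = 3.141592653589793 / 180  # degrees to radians

viz = Visualization(hist_duration=3, mdp_step=1, time_step=0.1,
                    action_size=2, batch_size=32,
                    mean=45 * TORAD, std=0.1 * TORAD,
                    hdg0=0, src_file="example_saved_weights",
                    sim_time=100)
viz.simulateDQNControl(hdg0=2 * TORAD)  # control-law plot for a given initial heading
viz.simulateGustsControl()              # response to gusts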
Tutorial¶
import random

import numpy as np

# MDP, TORAD and the wind generator `w` used below are assumed to be provided
# by the simulator part of this repository.
from policyLearning import PolicyLearner

history_duration = 3  # Duration of state history [s]
mdp_step = 1  # Step between each state transition [s]
time_step = 0.1  # time step [s] <-> 10 Hz frequency of data acquisition
mdp = MDP(history_duration, mdp_step, time_step)
mean = 45 * TORAD
std = 0 * TORAD
wind_samples = 10
WH = np.random.uniform(mean - std, mean + std, size=wind_samples)
hdg0 = 0 * np.ones(10)
mdp.initializeMDP(hdg0, WH)
hdg0_rand_vec = (-4, 0, 2, 4, 6, 8, 18, 20, 21, 22, 24)
action_size = 2
policy_angle = 18
agent = PolicyLearner(mdp.size, action_size, policy_angle)
#agent.load("policy_learning_i18_test_long_history")
batch_size = 120
EPISODES = 500
loss_of_episode = []  # average replay loss of each episode
for e in range(EPISODES):
    WH = w.generateWind()
    hdg0_rand = random.choice(hdg0_rand_vec) * TORAD
    hdg0 = hdg0_rand * np.ones(10)

    mdp.simulator.hyst.reset()  # we reinitialize the memory of the flow
    state = mdp.initializeMDP(hdg0, WH)
    loss_sim_list = []
    for time in range(80):
        WH = w.generateWind()
        action = agent.act(state)
        next_state, reward = mdp.transition(action, WH)
        # store the transition (together with the flow state in the final state)
        agent.remember(state, action, reward, next_state)
        state = next_state
        if len(agent.memory) >= batch_size:
            loss_sim_list.append(agent.replay(batch_size))
            print("time: {}, Loss = {}".format(time, loss_sim_list[-1]))
            print("i : {}".format(mdp.s[0, -1] / TORAD))
    # For data visualisation
    if loss_sim_list:
        loss_over_simulation_time = np.sum(np.array([loss_sim_list])[0]) / len(np.array([loss_sim_list])[0])
        loss_of_episode.append(loss_over_simulation_time)
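Once training is finished, you would typically save the learned weights and inspect the loss curve; a minimal follow-up sketch, assuming matplotlib is available and using an arbitrary output file name:

import matplotlib.pyplot as plt

agent.save("policy_learner_example_weights")  # arbitrary example file name

plt.plot(loss_of_episode)
plt.xlabel("Episode")
plt.ylabel("Average replay loss")
plt.show()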