
A computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex and uncertain environment. - Sutton and Barto

Action: move LEFT or RIGHT




https://www.youtube.com/watch?v=WXuK6gekU1Y
A chess player makes a move: the choice is informed by planning, anticipating possible replies and counterreplies.

A gazelle calf struggles to stand; 30 minutes later it runs at 36 kilometers per hour.

Portfolio management.

Playing Atari games

Action: move UP or DOWN

From Andrej Karpathy blog: http://karpathy.github.io/2016/05/31/rl/











https://www.youtube.com/watch?v=gn4nRCC9TwQ

https://ai.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html

https://www.youtube.com/watch?v=jwSbzNHGflM

https://www.youtube.com/watch?v=ixmE5nt2o88
The agent learns to interact with the environment

All goals of the agent can be described by the maximization of expected cumulative reward; a short sketch of the (discounted) return follows this list.
+/- reward for winning or losing a game
+/- reward for running with its mom or being eaten
+/- reward for each profit or loss in $
+/- reward for increasing or decreasing scores
Objective of the agent: select a series of actions to maximize total future rewards
Actions may have long term consequences
Reward may be delayed
Trade-off between immediate reward and long-term reward
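To make "maximize total future rewards" concrete, here is a minimal sketch (not from the slides, in plain Python) of the discounted return G_t = sum_k γ^k R_{t+k+1} that the agent tries to maximize; the discount factor γ reappears below in the value function and the MDP definition.

def discounted_return(rewards, gamma=0.99):
    # rewards is the list [R_{t+1}, R_{t+2}, ...]; gamma = 1.0 gives the plain cumulative sum
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: a reward of +1 that arrives ten steps from now is worth less today.
print(discounted_return([0.0] * 9 + [1.0], gamma=0.9))  # 0.9**9 ≈ 0.387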
The history is the sequence of observations, actions, and rewards.
H_{t} = O_{1}, R_{1}, A_{1}, \ldots, A_{t - 1}, O_{t}, R_{t}
What happens next depends on the history
The state is a function of the history used to determine what happens next
S_{t} = f(H_{t})
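As an illustration (hypothetical, not from the slides), two common choices of the state function f: keep only the latest observation, or stack the last few observations, as is often done for Atari frames.

# The history is stored as a list of (observation, reward, action) tuples,
# mirroring H_t = O_1, R_1, A_1, ..., O_t, R_t above.

def last_observation_state(history):
    # S_t = f(H_t) = O_t: use only the most recent observation
    observation, reward, action = history[-1]
    return observation

def stacked_state(history, k=4):
    # S_t = f(H_t) = (O_{t-k+1}, ..., O_t): the last k observations
    return tuple(obs for obs, reward, action in history[-k:])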

Environment state and agent state
S_{t}^{e} = f^{e}(H_{t}), \quad S_{t}^{a} = f^{a}(H_{t})
Full observability: the agent directly observes the environment state; formally modeled as a Markov decision process (MDP)
O_{t} = S_{t}^{e} = S_{t}^{a}
Partial observability: the agent indirectly observes the environment; formally modeled as a partially observable Markov decision process (POMDP)
An RL agent may include one or more of these components: a policy (the agent's behavior function), a value function, and a model.

Value function: expected discounted sum of future rewards under a particular policy π
Discount factor weights immediate vs future rewards
Used to quantify goodness/badness of states and actions
v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s] = \mathbb{E}_{\pi}\left[\sum_{k = 0}^{\infty} \gamma^{k} R_{t + k + 1} \mid S_{t} = s\right], \text{ for all } s \in \mathcal{S}
Q-function (could be used to select among actions)
q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s, A_{t} = a] = \mathbb{E}_{\pi}\left[\sum_{k = 0}^{\infty} \gamma^{k} R_{t + k + 1} \mid S_{t} = s, A_{t} = a\right].
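As a small illustration (hypothetical, not from the slides): with a tabular Q-function, q_π(s, a) can be estimated by averaging sampled returns G_t observed after taking a in s while following π, and the estimate can then be used to select among actions greedily.

import numpy as np

def mc_q_estimate(sampled_returns):
    # sampled_returns[(s, a)] is a list of returns G_t observed after taking a in s under pi
    return {sa: float(np.mean(gs)) for sa, gs in sampled_returns.items()}

def greedy_action(q, state, actions):
    # pick the action with the highest estimated q-value in this state
    return max(actions, key=lambda a: q.get((state, a), 0.0))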
A model predicts what the environment will do next
Predict the next state: P_{SS'}^{a} = \mathbb{P}[S_{t + 1} = s' \mid S_{t} = s, A_{t} = a]
Definition of an MDP (a small worked example follows the definition):
P^{a} is the dynamics/transition model for each action:
P(S_{t + 1} = s' \mid S_{t} = s, A_{t} = a)
R is the reward function: R(S_{t} = s, A_{t} = a) = \mathbb{E}[R_{t} \mid S_{t} = s, A_{t} = a]
Discount factor γ ∈ [0, 1]
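For concreteness, here is a tiny hypothetical MDP written out explicitly with the three ingredients above: a transition model P^a, a reward function R, and a discount factor γ (the states, actions, and numbers are made up).

states = ["s0", "s1"]
actions = ["a0", "a1"]

# P[(s, a)][s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
P = {
    ("s0", "a0"): {"s0": 0.9, "s1": 0.1},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.0, "s1": 1.0},
}

# R[(s, a)] = E[R_t | S_t = s, A_t = a]
R = {
    ("s0", "a0"): 0.0, ("s0", "a1"): 1.0,
    ("s1", "a0"): 0.0, ("s1", "a1"): 5.0,
}

gamma = 0.9  # discount factor in [0, 1]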


From David Silver's slides



Credit: David Silver’s slide
The agent only experiences what happens for the actions it tries!
How should an RL agent balance its actions?
There is often an exploration-exploitation trade-off, as in the examples below; a simple ε-greedy rule is sketched after them.
Restaurant Selection
Online Banner Advertisements
Oil Drilling
Game Playing
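A minimal, hypothetical ε-greedy rule for this trade-off (e.g. choosing among restaurants or banner ads): with probability ε explore a random option, otherwise exploit the best current estimate.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: 1-D array of estimated values, one entry per action
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))                   # exploit: best known action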
https://github.com/metalbubble/RLexample


https://github.com/openai/retro


import gym

env = gym.make("Taxi-v2")
observation = env.reset()
agent = load_agent()  # placeholder: load_agent() stands for your own trained policy
for step in range(100):
    action = agent(observation)                         # agent picks an action
    observation, reward, done, info = env.step(action)  # environment returns next observation and reward

https://gym.openai.com/envs/#classic_control

https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py
import gym

env = gym.make("CartPole-v0")
observation = env.reset()
env.render()  # display the rendered scene
action = env.action_space.sample()  # sample a random action
observation, reward, done, info = env.step(action)
Cross-Entropy Method (CEM)
https://gist.github.com/kashif/5dfa12d80402c559e060d567ea352c06
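A minimal sketch of the cross-entropy method on CartPole with a linear policy (an illustration, not the gist's exact code; the population size, elite fraction, and iteration count are made-up hyperparameters). Candidate policy weights are sampled from a Gaussian, scored by episode return, and the Gaussian is refit to the best candidates.

import gym
import numpy as np

env = gym.make("CartPole-v0")
n_obs = env.observation_space.shape[0]       # 4 observation dimensions
mean, std = np.zeros(n_obs), np.ones(n_obs)  # Gaussian over policy weights

def run_episode(w):
    observation = env.reset()
    total_reward = 0.0
    for _ in range(200):
        action = 1 if np.dot(w, observation) > 0 else 0  # linear threshold policy
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

for iteration in range(20):
    ws = np.random.randn(50, n_obs) * std + mean              # sample 50 candidate policies
    returns = np.array([run_episode(w) for w in ws])          # evaluate each candidate
    elite = ws[returns.argsort()[-10:]]                       # keep the top 10 ("elite") candidates
    mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3  # refit the Gaussian
    print(iteration, returns.mean())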
import gym
env = gym.make("Pong-v0")
env.reset()
env.render() # display the rendered scene
python my_random_agent.py Pong-v0

python pg-pong.py
Loading weights: pong_bolei.p (model trained overnight)
observation = env.reset()
cur_x = prepro(observation)   # preprocess the raw frame (a prepro sketch follows below)
x = cur_x - pre_x             # difference frame captures motion between frames
pre_x = cur_x
aprob, h = policy_forward(x)  # probability of moving UP, plus the hidden state
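The prepro() call above is a preprocessing step along the lines of the one in Karpathy's pg-pong code (sketched here from that blog post, so treat the details as approximate): crop, downsample, and binarize the 210x160x3 Pong frame into a flat 80x80 = 6400-dimensional vector.

import numpy as np

def prepro(I):
    # turn a 210x160x3 uint8 Pong frame into a 6400-dim float vector
    I = I[35:195]         # crop to the playing field
    I = I[::2, ::2, 0]    # downsample by a factor of 2, keep one color channel
    I[I == 144] = 0       # erase background (type 1)
    I[I == 109] = 0       # erase background (type 2)
    I[I != 0] = 1         # paddles and ball become 1
    return I.astype(float).ravel()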
Randomized action:
action = 2 if np.random.uniform() < aprob else 3 # roll the dice!
h = np.dot(W1, x)
h[h < 0] = 0  # ReLU nonlinearity: threshold at zero
logp = np.dot(W2, h)  # log-odds of going up
p = 1.0 / (1.0 + np.exp(-logp))  # sigmoid gives probability of going up
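Pulling the fragments above into one runnable piece, here is a sketch of the policy network, assuming the shapes used in Karpathy's pg-pong example (D = 80*80 input pixels, H = 200 hidden units, one "go UP" probability as output); the initialization scheme is illustrative.

import numpy as np

D = 80 * 80   # input dimensionality: flattened 80x80 difference frame
H = 200       # number of hidden-layer neurons
W1 = np.random.randn(H, D) / np.sqrt(D)  # scaled random init
W2 = np.random.randn(H) / np.sqrt(H)

def policy_forward(x):
    # forward pass: return probability of action 2 (UP) and the hidden state
    h = np.dot(W1, x)
    h[h < 0] = 0                     # ReLU nonlinearity
    logp = np.dot(W2, h)             # log-odds of going up
    p = 1.0 / (1.0 + np.exp(-logp))  # sigmoid squashes to a probability
    return p, h                      # h is cached for the backward pass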
How do we optimize W1 and W2?
Policy gradient! (To be introduced in a future lecture.)

http://karpathy.github.io/2016/05/31/rl
https://github.com/cuhkrlcourse/RLexample