Reinforcement learning

ME 343; March 13th, 2019

Prof. Eric Darve

Recap on reinforcement learning

  • Case 1: a model for the environment is known.

  • The Bellman equation can be used to calculate various quantities, such as the value function:

v(s) = E[R_t \mid S_t = s] + \gamma \sum_{s'} P_{ss'} \, v(s')
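
Below is a minimal sketch of using this equation when the model is known. It assumes a small, made-up MDP with transition matrix P and expected rewards R (the toy numbers are illustrative, not from the lecture); since the Bellman equation is linear in v, it can be solved directly.

    import numpy as np

    # Toy 3-state chain (illustrative numbers only).
    # P[s, s2] = transition probability, R[s] = expected immediate reward.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])   # last state is absorbing
    R = np.array([1.0, 2.0, 0.0])
    gamma = 0.9

    # v = R + gamma * P v   =>   (I - gamma * P) v = R
    v = np.linalg.solve(np.eye(3) - gamma * P, R)
    print(v)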

Control! Optimal value function

Iterate (a code sketch follows the list):

  • Policy evaluation: update value function based on policy
    v_{k+1}(s) = \sum_{a} \pi_k(a \mid s) \Big( R^a_s + \gamma \sum_{s'} P^a_{ss'} \, v_k(s') \Big)
  • Policy improvement using greedy approach:
    \text{argmax}_{a} \; \Big[ R^a_s + \gamma \sum_{s'} P^a_{ss'} \, v_{k+1}(s') \Big]
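
As a rough illustration of this iteration, here is a tabular policy-iteration sketch. The array shapes R[s, a] and P[a, s, s'] are assumptions made for the example, chosen to mirror the R^a_s and P^a_{ss'} notation above.

    import numpy as np

    def policy_iteration(R, P, gamma=0.9, n_eval_sweeps=50):
        """Alternate policy evaluation and greedy policy improvement on a tabular MDP.

        Assumed shapes: R[s, a] expected reward, P[a, s, s2] transition probabilities.
        """
        nS, nA = R.shape
        pi = np.zeros(nS, dtype=int)      # deterministic policy: one action per state
        v = np.zeros(nS)
        while True:
            # Policy evaluation: repeated Bellman backups under the current policy
            for _ in range(n_eval_sweeps):
                v = np.array([R[s, pi[s]] + gamma * P[pi[s], s] @ v for s in range(nS)])
            # Policy improvement: act greedily with respect to the evaluated values
            q = R + gamma * np.einsum('ast,t->sa', P, v)
            new_pi = q.argmax(axis=1)
            if np.array_equal(new_pi, pi):
                return pi, v
            pi = new_pi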

Case 2: model-free learning

  • For cases where the environment is too large to be completely evaluated or simply unknown
  • Learn based on Monte-Carlo sampling

Full episodes

V(S_t) \leftarrow V(S_t) + \alpha \big( G_t - V(S_t) \big)
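
A minimal sketch of this Monte-Carlo update, assuming the completed episode is available as a list of (S_t, R_{t+1}) pairs (a format chosen for the example):

    from collections import defaultdict

    def mc_update(V, episode, alpha=0.1, gamma=0.99):
        """Every-visit Monte-Carlo update V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
        G = 0.0
        # Walk the episode backwards so the return G_t = R_{t+1} + gamma * G_{t+1}
        # accumulates incrementally.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])
        return V

    V = defaultdict(float)
    V = mc_update(V, [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])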

Temporal-difference backup

V(S_t) \leftarrow V(S_t) + \alpha \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big)
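
In contrast to the Monte-Carlo update, the TD backup can be applied after every single step, bootstrapping from the current estimate V(S_{t+1}) instead of waiting for the full return. A minimal tabular sketch (names are illustrative):

    def td0_update(V, s_t, r_tp1, s_tp1, alpha=0.1, gamma=0.99):
        """One TD(0) backup: V(S_t) <- V(S_t) + alpha*(R_{t+1} + gamma*V(S_{t+1}) - V(S_t))."""
        td_error = r_tp1 + gamma * V[s_tp1] - V[s_t]
        V[s_t] += alpha * td_error
        return V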

Policy gradient

  • Policy function may be approximated using a neural network
  • This allows using gradient methods to improve the policy

\pi(a \mid s, \theta) = P(A_t = a \mid S_t = s, \theta_t = \theta)

θ: parameterization of π.

  • Example (a code sketch follows this list):
    \pi(a \mid s,\theta) = \frac{e^{\phi(s,a)^\top \theta}}{\sum_{a'} e^{\phi(s,a')^\top \theta}}
  • φ(s,a): vector of features, (φ_1(s,a), …, φ_k(s,a)).
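
A minimal sketch of this softmax policy over linear features; phi is assumed to be a user-supplied function returning a length-k feature vector (an assumption made for the example):

    import numpy as np

    def softmax_policy(phi, s, actions, theta):
        """pi(a|s,theta) = exp(phi(s,a)^T theta) / sum_a' exp(phi(s,a')^T theta)."""
        logits = np.array([phi(s, a) @ theta for a in actions])
        logits -= logits.max()          # shift logits for numerical stability
        weights = np.exp(logits)
        return weights / weights.sum()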

Policy gradient theorem (see notes)

Gradient used to update the weights θ:
\nabla_\theta J(\theta) \propto E_\pi \big[ G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \big]

  • G_t: discounted return from step t (obtained by Monte-Carlo sampling)
  • π(A_t|S_t, θ): parameterized policy
  • There is a proportionality constant, but it is not important for gradient algorithms (it can be absorbed into the step size α)

REINFORCE algorithm

Algorithm:

Repeat until convergence:
    Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
    following policy pi(.|., theta)
    For each step t = 0, ..., T-1 of the episode:
        G <- return from step t
        Update theta using the REINFORCE equation below

\theta \leftarrow \theta + \alpha \, \gamma^t \, G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \qquad \text{(REINFORCE)}
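
Below is a minimal REINFORCE sketch for the linear softmax policy from the previous section; for that policy, ∇_θ ln π(a|s,θ) = φ(s,a) − Σ_b π(b|s,θ) φ(s,b). The helper run_episode is an assumption (not from the lecture): it should return a list of (S_t, A_t, R_{t+1}) tuples generated by following the current policy.

    import numpy as np

    def reinforce(run_episode, phi, actions, theta,
                  alpha=0.01, gamma=0.99, n_episodes=1000):
        for _ in range(n_episodes):
            episode = run_episode(theta)              # [(S_t, A_t, R_{t+1}), ...]
            rewards = [r for (_, _, r) in episode]
            for t, (s, a, _) in enumerate(episode):
                # Return from step t: G_t = sum_k gamma^k * R_{t+k+1}
                G = sum(gamma**k * r for k, r in enumerate(rewards[t:]))
                # Softmax probabilities and grad of log pi for the action taken
                feats = np.array([phi(s, b) for b in actions])
                logits = feats @ theta
                probs = np.exp(logits - logits.max())
                probs /= probs.sum()
                grad_log_pi = phi(s, a) - probs @ feats
                # REINFORCE update: theta <- theta + alpha * gamma^t * G_t * grad ln pi
                theta = theta + alpha * gamma**t * G * grad_log_pi
        return theta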

AlphaGo!

  • AlphaGo: computer program that defeated a Go world champion. Arguably the strongest Go player in history.
  • Invented highly innovative Go moves.
  • Let's play Go!

Ingredients

Uses a combination of several of the methods we have seen:

  • Neural network to approximate policy and value functions
  • Monte-Carlo tree search with temporal difference
  • Policy gradient theorem to improve policy
  • UCT (multi-armed bandit) strategy to select moves

Neural networks: self-play

  • The program plays a game s_1, …, s_T against itself.
  • An MCTS is executed at each position using the latest neural network f_θ.
  • Moves are selected according to the search probabilities π_t computed by the MCTS (a sketch of this selection follows below).
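
As a rough sketch of this move selection: in AlphaGo Zero the search probabilities are built from the root visit counts with a temperature τ, π(a) ∝ N(s,a)^{1/τ}. The snippet below assumes the visit counts are already available from the search; all names are illustrative.

    import numpy as np

    def select_move(visit_counts, tau=1.0, rng=None):
        """Sample a move from pi(a) ~ N(s,a)^(1/tau); tau = 0 plays greedily."""
        if rng is None:
            rng = np.random.default_rng()
        counts = np.asarray(visit_counts, dtype=float)
        if tau == 0:
            return int(counts.argmax())
        pi = counts ** (1.0 / tau)
        pi /= pi.sum()
        return int(rng.choice(len(pi), p=pi))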

Neural networks: policy/value functions

Neural network parameters θ are updated to (a loss sketch follows the list):

  • maximize the similarity of the policy p_t to the search probabilities π_t, and to
  • minimize the error between the predicted winner v_t and the game winner z.
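
A sketch of the corresponding training loss for a single position: squared error for the value prediction plus cross-entropy between the search probabilities and the network policy (the published loss also includes an L2 weight penalty, omitted here; the names are illustrative).

    import numpy as np

    def alphazero_loss(p, v, pi, z, eps=1e-12):
        """(z - v)^2 - pi^T log p, for one position with policy vector p over moves."""
        value_loss = (z - v) ** 2
        policy_loss = -np.sum(pi * np.log(p + eps))
        return value_loss + policy_loss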

Example of MCTS

Policy improvement

Use the policy gradient theorem!

\Delta \theta \propto z_t \, \nabla_\theta \ln \pi_\theta(A_t \mid S_t)

UCT

A variant of the PUCT algorithm was used:

\begin{gathered} a_t = \text{argmax}_a \; \big[ Q(s_t, a) + U(s_t, a) \big] \\ U(s,a) = C \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \end{gathered}

  • Added P(s,a) to bias the search towards winning moves
  • Abandoned the log in favor of a square root; this leads to more exploration. A sketch of this selection rule follows below.
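
A minimal sketch of this selection rule at a single tree node, assuming Q, P, and N are arrays indexed by action for the current node (an assumption made for the example):

    import numpy as np

    def puct_select(Q, P, N, C=1.0):
        """a_t = argmax_a [ Q(s,a) + C * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) ]."""
        U = C * P * np.sqrt(N.sum()) / (1.0 + N)
        return int(np.argmax(Q + U))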

Training time graph

Joseki

  • Joseki: corner sequences common in professional Go
  • Next slide: frequency of occurrence over training time for different joseki
  • Ultimately, AlphaGo preferred new joseki variants that were previously unknown!

Training time is shown on the x-axis.

Strategies learned by AlphaGo

Try out AlphaGo

You pick moves and see how the game evolves from there.

https://alphagoteach.deepmind.com/

GitHub codes

https://github.com/suragnair/alpha-zero-general

https://github.com/yhyu13/AlphaGOZero-python-tensorflow

https://github.com/leela-zero/leela-zero