
Commit 4941d8f

FIX: Edits for LaTeX compatibility (#295)
* FIX: Edits for LaTeX compatibility
* remove tag
1 parent 4964a27 commit 4941d8f

lectures/mccall_q.md: 16 additions & 27 deletions
@@ -19,7 +19,6 @@ This lecture illustrates a powerful machine learning technique called Q-learning
 
 {cite}`Sutton_2018` presents Q-learning and a variety of other statistical learning procedures.
 
-
 The Q-learning algorithm combines ideas from
 
 * dynamic programming
@@ -30,8 +29,8 @@ This lecture applies a Q-learning algorithm to the situation faced by a McCal
 
 Relative to the dynamic programming formulation of the McCall worker model that we studied in {doc}`quantecon lecture <mccall_model>`, a Q-learning algorithm gives the worker less knowledge about
 
-* the random process that generates a sequence of wages
-* the reward function that tells consequences of accepting or rejecting a job
+* the random process that generates a sequence of wages
+* the reward function that tells consequences of accepting or rejecting a job
 
 The Q-learning algorithm invokes a statistical learning model to learn about these things.
 
@@ -266,10 +265,11 @@ This definition of $Q(w,a)$ presumes that in subsequent periods the worker take
 An optimal Q-function for our McCall worker satisfies
 
 $$
-\begin{align}
+\begin{aligned}
 Q\left(w,\text{accept}\right) & =\frac{w}{1-\beta} \\
 Q\left(w,\text{reject}\right) & =c+\beta\int\max_{\text{accept, reject}}\left\{ \frac{w'}{1-\beta},Q\left(w',\text{reject}\right)\right\} dF\left(w'\right)
-\end{align} $$ (eq:impliedq)
+\end{aligned}
+$$ (eq:impliedq)
 
 
 Note that the first equation of system {eq}`eq:impliedq` presumes that after the agent has accepted an offer, he will not have the objection to reject that same offer in the future.
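
For orientation while reading this hunk, here is a minimal sketch (not the lecture's code) of computing the fixed point in eq:impliedq by iterating on its second equation; the wage grid, the uniform offer distribution, and the values of c and beta are illustrative assumptions.

```python
# Illustrative sketch only: iterate on the second equation of eq:impliedq.
# The wage grid, uniform offer distribution F, and the values of c and
# beta below are assumptions, not taken from the lecture.
import numpy as np

n, c, beta = 50, 25.0, 0.99
w = np.linspace(10, 60, n)        # hypothetical wage grid
F = np.ones(n) / n                # hypothetical offer distribution

Q_accept = w / (1 - beta)         # first equation of eq:impliedq
Q_reject = 0.0                    # Q(w, reject) is constant in w here
for _ in range(5_000):
    Q_reject_new = c + beta * F @ np.maximum(Q_accept, Q_reject)
    if abs(Q_reject_new - Q_reject) < 1e-10:
        Q_reject = Q_reject_new
        break
    Q_reject = Q_reject_new

# Reservation wage: smallest wage at which accepting is weakly better
print(w[np.argmax(Q_accept >= Q_reject)])
```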
@@ -352,7 +352,7 @@ $$
 \begin{aligned}
 w & + \beta \max_{\textrm{accept, reject}} \left\{ Q (w, \textrm{accept}), Q(w, \textrm{reject}) \right\} - Q (w, \textrm{accept}) = 0 \cr
 c & +\beta\int\max_{\text{accept, reject}}\left\{ Q(w', \textrm{accept}),Q\left(w',\text{reject}\right)\right\} dF\left(w'\right) - Q\left(w,\text{reject}\right) = 0 \cr
-\end{aligned}
+\end{aligned}
 $$ (eq:probtosample1)
 
 Notice the integral over $F(w')$ on the second line.
@@ -366,7 +366,7 @@ $$
 \begin{aligned}
 w & + \beta \max_{\textrm{accept, reject}} \left\{ Q (w, \textrm{accept}), Q(w, \textrm{reject}) \right\} - Q (w, \textrm{accept}) = 0 \cr
 c & +\beta \max_{\text{accept, reject}}\left\{ Q(w', \textrm{accept}),Q\left(w',\text{reject}\right)\right\} - Q\left(w,\text{reject}\right) \approx 0 \cr
-\end{aligned}
+\end{aligned}
 $$(eq:probtosample2)
 
 
@@ -387,7 +387,7 @@ $$
 \begin{aligned}
 w & + \beta \max_{\textrm{accept, reject}} \left\{ \hat Q_t (w_t, \textrm{accept}), \hat Q_t(w_t, \textrm{reject}) \right\} - \hat Q_t(w_t, \textrm{accept}) = \textrm{diff}_{\textrm{accept},t} \cr
 c & +\beta\int\max_{\text{accept, reject}}\left\{ \hat Q_t(w_{t+1}, \textrm{accept}),\hat Q_t\left(w_{t+1},\text{reject}\right)\right\} - \hat Q_t\left(w_t,\text{reject}\right) = \textrm{diff}_{\textrm{reject},t} \cr
-\end{aligned}
+\end{aligned}
 $$ (eq:old105)
 
 The adaptive learning scheme would then be some version of
@@ -401,9 +401,6 @@ to objects in equation system {eq}`eq:old105`.
 
 This informal argument takes us to the threshold of Q-learning.
 
-
-+++
-
 ## Q-Learning
 
 Let's first describe a $Q$-learning algorithm precisely.
@@ -436,10 +433,10 @@ $$ (eq:old3)
 where
 
 $$
-\begin{align*}
+\begin{aligned}
 \widetilde{TD}\left(w,\text{accept}\right) & = \left[ w+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w,a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{accept}\right) \\
 \widetilde{TD}\left(w,\text{reject}\right) & = \left[ c+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{reject}\right),\;w'\sim F
-\end{align*}
+\end{aligned}
 $$ (eq:old4)
 
 The terms $\widetilde{TD}(w,a) $ for $a = \left\{\textrm{accept,reject} \right\}$ are the **temporal difference errors** that drive the updates.
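
As a companion to this hunk, here is a minimal sketch of the updates these temporal difference errors drive. It assumes that the recursion in eq:old3, which this hunk does not show, has the standard form Q_new = Q_old + alpha * TD, and that both actions at the sampled state are updated every period; the wage grid, offer distribution, and parameter values are illustrative assumptions, not the lecture's.

```python
# Illustrative sketch only: one-step TD updates in the spirit of eq:old4,
# assuming eq:old3 reads Q_new = Q_old + alpha * TD (that equation is not
# shown in this hunk).  Grid, distribution, and parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, c, beta, alpha = 50, 25.0, 0.99, 0.1
w = np.linspace(10, 60, n)          # hypothetical wage grid
Q = np.zeros((n, 2))                # column 0: accept, column 1: reject

s = rng.integers(n)                 # current wage offer index
for _ in range(200_000):
    s_next = rng.integers(n)        # w' ~ F, here uniform on the grid
    td_accept = (w[s] + beta * Q[s].max()) - Q[s, 0]
    td_reject = (c + beta * Q[s_next].max()) - Q[s, 1]
    Q[s, 0] += alpha * td_accept    # Q_new = Q_old + alpha * TD
    Q[s, 1] += alpha * td_reject
    s = s_next

# Implied reservation wage after learning
print(w[np.argmax(Q[:, 0] >= Q[:, 1])])
```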
@@ -516,9 +513,6 @@ By using the $\epsilon$-greedy method and also by increasing the number of episo
 
 **Remark:** Notice that $\widetilde{TD}$ associated with an optimal Q-table defined in equation (2) automatically above satisfies $\widetilde{TD}=0$ for all state action pairs. Whether a limit of our Q-learning algorithm converges to an optimal Q-table depends on whether the algorithm visits all state, action pairs often enough.
 
-
-
-
 We implement this pseudo code in a Python class.
 
 For simplicity and convenience, we let `s` represent the state index between $0$ and $n=50$ and $w_s=w[s]$.
@@ -736,19 +730,16 @@ plot_epochs(ns_to_plot=[100, 1000, 10000, 100000, 200000])
 
 The above graphs indicates that
 
-* the Q-learning algorithm has trouble learning the Q-table well for wages that are rarely drawn
-
-* the quality of approximation to the "true" value function computed by value function iteration improves for longer epochs
+* the Q-learning algorithm has trouble learning the Q-table well for wages that are rarely drawn
 
-+++
+* the quality of approximation to the "true" value function computed by value function iteration improves for longer epochs
 
 ## Employed Worker Can't Quit
 
 
 The preceding version of temporal difference Q-learning described in equation system (4) lets an an employed worker quit, i.e., reject her wage as an incumbent and instead accept receive unemployment compensation this period
 and draw a new offer next period.
 
-
 This is an option that the McCall worker described in {doc}`this quantecon lecture <mccall_model>` would not take.
 
 See {cite}`Ljungqvist2012`, chapter 7 on search, for a proof.
@@ -757,20 +748,18 @@ But in the context of Q-learning, giving the worker the option to quit and get u
 unemployed turns out to accelerate the learning process by promoting experimentation vis a vis premature
 exploitation only.
 
-
 To illustrate this, we'll amend our formulas for temporal differences to forbid an employed worker from quitting a job she had accepted earlier.
 
 With this understanding about available choices, we obtain the following temporal difference values:
 
 $$
-\begin{align*}
+\begin{aligned}
 \widetilde{TD}\left(w,\text{accept}\right) & = \left[ w+\beta\widetilde{Q}^{old}\left(w,\text{accept}\right) \right]-\widetilde{Q}^{old}\left(w,\text{accept}\right) \\
 \widetilde{TD}\left(w,\text{reject}\right) & = \left[ c+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{reject}\right),\;w'\sim F
-\tag{4'}
-\end{align*}
-$$
+\end{aligned}
+$$ (eq:temp-diff)
 
-It turns out that formulas (4') combined with our Q-learning recursion (3) can lead our agent to eventually learn the optimal value function as well as in the case where an option to redraw can be exercised.
+It turns out that formulas {eq}`eq:temp-diff` combined with our Q-learning recursion (3) can lead our agent to eventually learn the optimal value function as well as in the case where an option to redraw can be exercised.
 
 But learning is slower because an agent who ends up accepting a wage offer prematurally loses the option to explore new states in the same episode and to adjust the value associated with that state.
 
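Under the same illustrative assumptions as the sketch above, the no-quit temporal differences in eq:temp-diff change only the accept branch: the continuation value is Q_old(w, accept) itself rather than a max over actions. A hedged sketch:

```python
# Illustrative sketch only: TD errors for the no-quit variant in
# eq:temp-diff.  Grid, distribution, and parameters are assumptions;
# only the accept branch differs from the earlier sketch.
import numpy as np

rng = np.random.default_rng(0)
n, c, beta, alpha = 50, 25.0, 0.99, 0.1
w = np.linspace(10, 60, n)
Q = np.zeros((n, 2))                # column 0: accept, column 1: reject

def td_no_quit(s, s_next, Q):
    """TD errors when an employed worker cannot quit (eq:temp-diff)."""
    td_accept = (w[s] + beta * Q[s, 0]) - Q[s, 0]   # no max: accepting is absorbing
    td_reject = (c + beta * Q[s_next].max()) - Q[s, 1]
    return td_accept, td_reject

s = rng.integers(n)
for _ in range(200_000):
    s_next = rng.integers(n)        # w' ~ F, here uniform on the grid
    td_a, td_r = td_no_quit(s, s_next, Q)
    Q[s, 0] += alpha * td_a
    Q[s, 1] += alpha * td_r
    s = s_next

print(w[np.argmax(Q[:, 0] >= Q[:, 1])])
```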
