lectures/mccall_q.md
Lines changed: 16 additions & 27 deletions
@@ -19,7 +19,6 @@ This lecture illustrates a powerful machine learning technique called Q-learning
{cite}`Sutton_2018` presents Q-learning and a variety of other statistical learning procedures.
-
The Q-learning algorithm combines ideas from
* dynamic programming
@@ -30,8 +29,8 @@ This lecture applies a Q-learning algorithm to the situation faced by a McCal
Relative to the dynamic programming formulation of the McCall worker model that we studied in {doc}`quantecon lecture <mccall_model>`, a Q-learning algorithm gives the worker less knowledge about
-* the random process that generates a sequence of wages
-* the reward function that tells consequences of accepting or rejecting a job
+* the random process that generates a sequence of wages
+* the reward function that tells consequences of accepting or rejecting a job
The Q-learning algorithm invokes a statistical learning model to learn about these things.
@@ -266,10 +265,11 @@ This definition of $Q(w,a)$ presumes that in subsequent periods the worker take
An optimal Q-function for our McCall worker satisfies
Note that the first equation of system {eq}`eq:impliedq` presumes that after the agent has accepted an offer, he will not have the option to reject that same offer in the future.
\widetilde{TD}\left(w,\text{reject}\right) & = \left[ c+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{reject}\right),\;w'\sim F
-\end{align*}
+\end{aligned}
$$ (eq:old4)
The terms $\widetilde{TD}(w,a)$ for $a \in \left\{\textrm{accept, reject}\right\}$ are the **temporal difference errors** that drive the updates.
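As a rough sketch of how these temporal difference errors feed the Q-learning recursion, consider the following tabular update step. The function name, the learning-rate symbol $\alpha$, the array layout (column 0 for accept, column 1 for reject), and the reconstruction of the accept branch (which is not displayed in this excerpt) are all illustrative assumptions, not the lecture's actual code.

```python
import numpy as np

def td_update(Q, s, s_next, w_grid, c, β, α):
    """One temporal-difference update of a tabular Q for the McCall worker.

    Q is an (n+1, 2) array with column 0 = accept, column 1 = reject;
    s indexes the current wage offer and s_next a fresh draw w' ~ F
    (used only in the reject branch).  Layout and names are assumptions.
    """
    w = w_grid[s]

    # temporal difference errors for the two actions
    td_accept = (w + β * Q[s, :].max()) - Q[s, 0]
    td_reject = (c + β * Q[s_next, :].max()) - Q[s, 1]

    # Q-learning recursion: nudge each entry toward its target by α * TD
    Q[s, 0] += α * td_accept
    Q[s, 1] += α * td_reject
    return Q
```

The update leaves an entry unchanged exactly when its sampled temporal difference is zero.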
@@ -516,9 +513,6 @@ By using the $\epsilon$-greedy method and also by increasing the number of episo
**Remark:** Notice that the $\widetilde{TD}$ associated with an optimal Q-table defined in equation (2) above automatically satisfies $\widetilde{TD}=0$ for all state-action pairs. Whether a limit of our Q-learning algorithm converges to an optimal Q-table depends on whether the algorithm visits all state-action pairs often enough.
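One way to read this remark in code: at a Q-table produced by value function iteration, the temporal difference errors, taken in expectation over the next wage draw in the reject branch, should be zero up to numerical error. A hedged check, assuming the same (n+1, 2) Q-table layout and an array `q_dist` holding the offer probabilities, might look like this.

```python
import numpy as np

def max_abs_expected_td(Q, w_grid, c, β, q_dist):
    """Largest expected TD error across all state-action pairs.

    At an optimal Q-table this should be (numerically) zero; the layout
    (column 0 = accept, column 1 = reject) is an assumption.
    """
    EQmax = q_dist @ Q.max(axis=1)                      # E[max_a' Q(w', a')] under F
    td_accept = (w_grid + β * Q.max(axis=1)) - Q[:, 0]  # accept branch, no randomness
    td_reject = (c + β * EQmax) - Q[:, 1]               # reject branch, in expectation
    return max(np.abs(td_accept).max(), np.abs(td_reject).max())
```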
-
-
-
We implement this pseudocode in a Python class.
For simplicity and convenience, we let `s` represent the state index between $0$ and $n=50$ and $w_s=w[s]$.
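As a hedged, stripped-down sketch of the kind of class this pseudocode suggests — the lecture's actual class, its attribute names, wage distribution, and default parameters may all differ — the state index `s`, an $\epsilon$-greedy action choice, and the temporal difference update could fit together as follows.

```python
import numpy as np

class McCallQLearner:
    """Illustrative tabular Q-learner for the McCall worker (placeholder names)."""

    def __init__(self, c=25, β=0.99, α=0.1, ε=0.1, n=50):
        self.c, self.β, self.α, self.ε = c, β, α, ε
        self.w = np.linspace(10, 60, n + 1)      # wage grid, w[s] for s = 0, ..., n
        self.q_dist = np.ones(n + 1) / (n + 1)   # offer distribution F (placeholder)
        self.Q = np.zeros((n + 1, 2))            # column 0 = accept, column 1 = reject

    def choose(self, s):
        "ε-greedy action at state index s."
        if np.random.uniform() < self.ε:
            return np.random.randint(2)
        return int(np.argmax(self.Q[s, :]))

    def update(self, s, a):
        "Apply one temporal-difference update; return next period's state index."
        if a == 0:   # accept: target uses the current wage, state stays at s
            target = self.w[s] + self.β * self.Q[s, :].max()
            s_next = s
        else:        # reject: collect c and draw a new offer w' ~ F
            s_next = np.random.choice(len(self.w), p=self.q_dist)
            target = self.c + self.β * self.Q[s_next, :].max()
        self.Q[s, a] += self.α * (target - self.Q[s, a])
        return s_next
```

Repeatedly calling `choose` and `update` inside an episode loop, and resetting the state at the start of each episode, gives the kind of training loop described above.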
* the Q-learning algorithm has trouble learning the Q-table well for wages that are rarely drawn
-
-* the quality of approximation to the "true" value function computed by value function iteration improves for longer epochs
+* the Q-learning algorithm has trouble learning the Q-table well for wages that are rarely drawn
-+++
+* the quality of approximation to the "true" value function computed by value function iteration improves for longer epochs
## Employed Worker Can't Quit
The preceding version of temporal difference Q-learning described in equation system (4) lets an employed worker quit, i.e., reject her wage as an incumbent and instead receive unemployment compensation this period
and draw a new offer next period.
-
This is an option that the McCall worker described in {doc}`this quantecon lecture <mccall_model>` would not take.
See {cite}`Ljungqvist2012`, chapter 7 on search, for a proof.
@@ -757,20 +748,18 @@ But in the context of Q-learning, giving the worker the option to quit and get u
unemployed turns out to accelerate the learning process by promoting experimentation vis-à-vis premature
exploitation only.
-
To illustrate this, we'll amend our formulas for temporal differences to forbid an employed worker from quitting a job she had accepted earlier.
With this understanding about available choices, we obtain the following temporal difference values:
\widetilde{TD}\left(w,\text{reject}\right) & = \left[ c+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{reject}\right),\;w'\sim F
-\tag{4'}
-\end{align*}
-$$
+\end{aligned}
+$$ (eq:temp-diff)
-It turns out that formulas (4') combined with our Q-learning recursion (3) can lead our agent to eventually learn the optimal value function as well as in the case where an option to redraw can be exercised.
+It turns out that formulas {eq}`eq:temp-diff` combined with our Q-learning recursion (3) can lead our agent to eventually learn the optimal value function as well as in the case where an option to redraw can be exercised.
But learning is slower because an agent who ends up accepting a wage offer prematurely loses the option to explore new states in the same episode and to adjust the value associated with that state.
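For concreteness, here is a hedged sketch of how the amended temporal differences might look in code when quitting is forbidden: the accept branch's continuation value is simply the value of continuing to accept the same wage, rather than a max over next-period actions. The accept line is reconstructed by analogy with the reject line shown above and should be checked against equation {eq}`eq:temp-diff`; the names and array layout are assumptions.

```python
def td_errors_no_quit(Q, s, s_next, w_grid, c, β):
    """Temporal difference errors when an employed worker cannot quit.

    Q is a tabular array with column 0 = accept, column 1 = reject
    (assumed layout); s_next is a draw w' ~ F, relevant only if the
    current offer is rejected.
    """
    td_accept = (w_grid[s] + β * Q[s, 0]) - Q[s, 0]      # no option to quit next period
    td_reject = (c + β * Q[s_next, :].max()) - Q[s, 1]   # as in the reject line above
    return td_accept, td_reject
```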