_get_response without headers doesn't work (at least with 'yahoo' source) #867

Closed
galashour opened this issue Jul 2, 2021 · 39 comments · Fixed by dask/dask#7858 or #883
Comments

@galashour
Contributor

galashour commented Jul 2, 2021

To fix this, I put the following in base.py:

def _get_response(self, url, params=None, headers=None):
    """ send raw HTTP request to get requests.Response from the specified url
    Parameters
    ----------
    url : str
        target URL
    params : dict or None
        parameters passed to the URL
    headers : dict or None
        headers sent with the request; a browser-like User-Agent is used if None
    """

    # initial attempt + retry
    # fall back to a browser-like User-Agent when the caller supplies no headers
    if headers is None:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

    pause = self.pause
@ivofernandes

Thanks, I tested it here and it works like that. I had also been getting an error since yesterday.

@maxmekiska

Hi everyone, I am getting an error as well when I try to retrieve stock data from yahoo finance. Is this error related to the same issue as described above?

RemoteDataError: Unable to read URL: https://finance.yahoo.com/quote/F/history?period1=1264996800&period2=1517543999&interval=1d&frequency=1d&filter=history
Response Text:
b'<!DOCTYPE html>\n  <html lang="en-us"><head>\n  <meta http-equiv="content-type" content="text/html; charset=UTF-8">\n      <meta charset="utf-8">\n      <title>Yahoo</title>\n      <meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">\n      <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n      <style>\n  html {\n      height: 100%;\n  }\n  body {\n      background: #fafafc url(https://s.yimg.com/nn/img/sad-panda-201402200631.png) 50% 50%;\n      background-size: cover;\n      height: 100%;\n      text-align: center;\n      font: 300 18px "helvetica neue", helvetica, verdana, tahoma, arial, sans-serif;\n  }\n  table {\n      height: 100%;\n      width: 100%;\n      table-layout: fixed;\n      border-collapse: collapse;\n      border-spacing: 0;\n      border: none;\n  }\n  h1 {\n      font-size: 42px;\n      font-weight: 400;\n      color: #400090;\n  }\n  p {\n      color: #1A1A1A;\n  }\n  #message-1 {\n      font-weight: bold;\n      margin: 0;\n  }\n  #message-2 {\n      display: inline-block;\n      *display: inline;\n      zoom: 1;\n      max-width: 17em;\n      _width: 17em;\n  }\n      </style>\n  <script>\n    document.write(\'<img src="//geo.yahoo.com/b?s=1197757129&t=\'+new Date().getTime()+\'&src=aws&err_url=\'+encodeURIComponent(document.URL)+\'&err=%<pssc>&test=\'+encodeURIComponent(\'%<{Bucket}cqh[:200]>\')+\'" width="0px" height="0px"/>\');var beacon = new Image();beacon.src="//bcn.fp.yahoo.com/p?s=1197757129&t="+new Date().getTime()+"&src=aws&err_url="+encodeURIComponent(document.URL)+"&err=%<pssc>&test="+encodeURIComponent(\'%<{Bucket}cqh[:200]>\');\n  </script>\n  </head>\n  <body>\n  <!-- status code : 404 -->\n  <!-- Not Found on Server -->\n  <table>\n  <tbody><tr>\n      <td>\n      <img src="https://s.yimg.com/rz/p/yahoo_frontpage_en-US_s_f_p_205x58_frontpage.png" alt="Yahoo Logo">\n      <h1 style="margin-top:20px;">Will be right back...</h1>\n      <p id="message-1">Thank you for your 
patience.</p>\n      <p id="message-2">Our engineers are working quickly to resolve the issue.</p>\n      </td>\n  </tr>\n  </tbody></table>\n  </body></html>

@galashour
Contributor Author

Yes, it looks the same to me.

@shortsallday

I'm getting same error as well. Thanks for creating this fix.

Did yahoo change their HTML in a recent release?

@galashour
Contributor Author

galashour commented Jul 2, 2021

I'm getting same error as well. Thanks for creating this fix.

Did yahoo change their HTML in a recent release?

My guess (speculation) is that Yahoo tightened the requirements on URL queries to verify that they are legitimate browser requests, which breaks requests whose headers are missing (None).

The suggested two-line addition to base.py (around line 152) of pandas_datareader seems to resolve the issue by explicitly supplying a valid header when one is missing.
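The fallback described above can be illustrated in isolation. This is only a minimal sketch: `DEFAULT_HEADERS` and `resolve_headers` are hypothetical names invented for the example, not part of pandas_datareader, and the User-Agent string is simply the one quoted in this thread.

```python
# Hypothetical stand-in for the header fallback discussed above;
# this is not the library's actual code.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/39.0.2171.95 Safari/537.36"
    )
}


def resolve_headers(headers=None):
    """Return the caller's headers, or a copy of the defaults if none were given."""
    if headers is None:
        # Copy so that callers cannot mutate the shared default by accident.
        return dict(DEFAULT_HEADERS)
    return headers


print(resolve_headers())                     # falls back to the default header
print(resolve_headers({"User-Agent": "x"}))  # the caller's value is kept
```

Using `is None` rather than `== None` is the idiomatic identity check, and it leaves an explicitly passed empty dict untouched.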

@s-kust

s-kust commented Jul 2, 2021

My guess (speculation) is that yahoo tightened the requirements on url queries to verify these are legitimate browser requests

If it is true, then today's complication is just the beginning. Eventually, Yahoo engineers will force us to use paid services, but we can still make life a little more difficult for them.

Instead of this:

# initial attempt + retry
if headers is None:
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
pause = self.pause

do this:

# initial attempt + retry
if headers is None:
    # pick a random User-Agent string instead of a hard-coded one
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
pause = self.pause

or this:

# initial attempt + retry
if headers is None:
    ua = UserAgent(verify_ssl=False)
    headers = {'User-Agent': ua.random}
pause = self.pause

And don't forget to install and import the package:

from fake_useragent import UserAgent

Anyway, the news is pretty bad.
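For anyone who would rather not add a dependency, the same idea can be approximated with the standard library. This is a rough sketch, not a replacement for fake_useragent: `USER_AGENTS` and `random_headers` are hypothetical names, and the pool contains only the two browser strings quoted in this thread.

```python
import random

# Hypothetical pool of User-Agent strings; both are copied from this thread.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]


def random_headers(rng=random):
    """Return a headers dict with a User-Agent chosen at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


print(random_headers())
```

A fixed pool goes stale as browsers update, so it would need refreshing from time to time.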

@ralexx

ralexx commented Jul 2, 2021

No change to pandas_datareader code is required. You can instantiate your own requests.Session and update its .headers attribute with the user agent header of your choice. Then pass the Session instance to pandas_datareader.yahoo.daily.YahooDailyReader:

from pandas_datareader.yahoo.daily import YahooDailyReader as ydr
import requests

USER_AGENT = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                   ' Chrome/91.0.4472.124 Safari/537.36')
    }
sesh = requests.Session()
sesh.headers.update(USER_AGENT)

# your_kwargs holds the usual reader arguments (symbols, start, end, ...);
# read() performs the request and returns the DataFrame
df = ydr(**your_kwargs, session=sesh).read()

This works for me with e.g. the current Chrome user agent as above. See for example HSBC's HK listing 0005.HK (which has closed at the time of writing and should match your results if you check the "Historical Data" tab on finance.yahoo.com).

                 High        Low       Open      Close    Volume  Adj Close
Date                                                                       
2021-06-15  48.000000  47.000000  47.849998  47.349998  22382507  47.349998
2021-06-16  48.349998  47.400002  47.650002  48.099998  18503780  48.099998
2021-06-17  48.500000  47.650002  48.049999  48.299999  20179603  48.299999
2021-06-18  47.650002  47.099998  47.599998  47.200001  26316274  47.200001
2021-06-21  46.200001  45.099998  46.049999  45.549999  48269764  45.549999
2021-06-22  46.000000  45.650002  45.849998  45.750000  18034452  45.750000
2021-06-23  46.450001  45.299999  45.650002  46.049999  15193950  46.049999
2021-06-24  46.200001  45.650002  45.650002  45.900002  11924309  45.900002
2021-06-25  46.299999  45.650002  45.650002  46.099998  14515655  46.099998
2021-06-28  46.200001  45.500000  46.099998  45.900002   9692425  45.900002
2021-06-29  45.599998  45.000000  45.549999  45.250000  27624318  45.250000
2021-06-30  45.250000  44.799999  45.000000  44.849998  18489180  44.849998
2021-07-02  45.049999  44.400002  45.000000  44.849998  19693352  44.849998

@galashour
Contributor Author

True, but in my opinion it is more elegant to abstract these details from the user, and have the library handle the various 'subtleties' internally.

@ralexx

ralexx commented Jul 2, 2021

True, but in my opinion it is more elegant to abstract these details from the user, and have the library handle the various 'subtleties' internally.

That elegance would come at the cost of making the Yahoo module more brittle. If I were a maintainer here I would not want the stream of issues that could result if/when an arbitrary, hard-coded user agent string is blocked by Yahoo. Better to let users follow the pattern of supplying their own preferred user agent strings.

@kdvolder

kdvolder commented Jul 3, 2021

I also noticed this broke and started digging around in the code (but I'm a total newbie to Python, so I was really just groping around in the dark).

Eventually, after failing to figure out why pandas datareader stopped working for Yahoo, I wrote some code that uses a different URL, 'https://query1.finance.yahoo.com/v7/finance/download'. This is the URL that downloads data in CSV format when you click the download link inside Yahoo's pages. It is easy to 'curl' without special headers, cookies, or anything like that (which I would have no idea how to do anyway). It seems to return similar data in a convenient CSV format, so it looks much more convenient to use than the URL datareader currently uses.

Anyway, just in case this is useful (maybe someone can use it to figure out how to fix pandas datareader without special headers, cookies, and sessions), this is my amateurish code to read stock data from the https://query1.finance.yahoo.com/v7/finance/download URL:

import datetime
import urllib.parse

import pandas as pd
import requests

baseUrl = 'https://query1.finance.yahoo.com/v7/finance/download'

def timestamp(dt):
    # convert a datetime to whole seconds since the Unix epoch
    return round(datetime.datetime.timestamp(dt))

def get_csv_data(ticker='SPY', days=200):
    # request daily history for the last `days` days
    endDate = datetime.datetime.today()
    startDate = endDate - datetime.timedelta(days=days)
    response = requests.get(baseUrl+"/"+urllib.parse.quote(ticker), stream=True, params = {
        'period1': timestamp(startDate),
        'period2': timestamp(endDate),
        'interval': '1d',
        'events': 'history',
        'includeAdjustedClose': 'true'
    })
    response.raise_for_status()
    # parse the CSV body straight from the streamed response
    return pd.read_csv(response.raw)

data = get_csv_data()
print(data)

Produces output like:

           Date        Open        High         Low       Close   Adj Close     Volume
0    2020-12-15  367.399994  369.589996  365.920013  369.589996  365.623657   63865300
1    2020-12-16  369.820007  371.160004  368.869995  370.170013  366.197449   58420500
2    2020-12-17  371.940002  372.459991  371.049988  372.239990  368.245209   64119500
3    2020-12-18  370.970001  371.149994  367.019989  369.179993  366.774872  136542300
4    2020-12-21  364.970001  378.459991  362.029999  367.859985  365.463440   96386700
..          ...         ...         ...         ...         ...         ...        ...
133  2021-06-28  427.170013  427.649994  425.890015  427.470001  427.470001   53090800
134  2021-06-29  427.880005  428.559998  427.130005  427.700012  427.700012   35970500
135  2021-06-30  427.209991  428.779999  427.179993  428.059998  428.059998   64827900
136  2021-07-01  428.869995  430.600006  428.799988  430.429993  430.429993   53365900
137  2021-07-02  428.869995  434.100006  430.521790  433.720001  433.720001   57697668

[138 rows x 7 columns]
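To make the request easier to inspect without touching the network, the URL construction can be separated from the download. The sketch below assumes the same endpoint and query parameters as the code above; `build_download_url` is a hypothetical helper name, and timezone-aware datetimes are used so that the epoch values do not depend on the local clock.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote, urlencode

BASE_URL = "https://query1.finance.yahoo.com/v7/finance/download"


def build_download_url(ticker, start, end):
    """Build the CSV download URL for `ticker` between two aware datetimes."""
    params = {
        "period1": int(start.timestamp()),  # range start, seconds since epoch
        "period2": int(end.timestamp()),    # range end, seconds since epoch
        "interval": "1d",
        "events": "history",
        "includeAdjustedClose": "true",
    }
    return f"{BASE_URL}/{quote(ticker)}?{urlencode(params)}"


end = datetime(2021, 7, 2, tzinfo=timezone.utc)
start = end - timedelta(days=200)
url = build_download_url("SPY", start, end)
print(url)
```

The resulting string can be pasted into curl or a browser to check what the endpoint returns before wiring it into pandas.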

@fdfpy

fdfpy commented Jul 3, 2021

I have been running into the same error since yesterday. Here is my code, which fetches the SPYD stock price.

import pandas_datareader.data as web 
import pandas as pd
import datetime

Y = datetime.datetime.today().year
M = datetime.datetime.today().month
D = datetime.datetime.today().day
start=datetime.datetime(Y-1, M, D)
end=datetime.datetime(Y, M, D)

df=web.DataReader('SPYD' ,'yahoo',start,end)

Here is the error I got. I think the Yahoo URL has changed; does the Yahoo URL written in DataReader need to change?

RemoteDataError                           Traceback (most recent call last)
<ipython-input-2-7fc62741cac4> in <module>
     11 
     12 
---> 13 df=web.DataReader('SPYD' ,'yahoo',start,end)
     14 
     15 

~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~\anaconda3\lib\site-packages\pandas_datareader\data.py in DataReader(name, data_source, start, end, retry_count, pause, session, api_key)
    374 
    375     if data_source == "yahoo":
--> 376         return YahooDailyReader(
    377             symbols=name,
    378             start=start,

~\anaconda3\lib\site-packages\pandas_datareader\base.py in read(self)
    251         # If a single symbol, (e.g., 'GOOG')
    252         if isinstance(self.symbols, (string_types, int)):
--> 253             df = self._read_one_data(self.url, params=self._get_params(self.symbols))
    254         # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    255         elif isinstance(self.symbols, DataFrame):

~\anaconda3\lib\site-packages\pandas_datareader\yahoo\daily.py in _read_one_data(self, url, params)
    151         url = url.format(symbol)
    152 
--> 153         resp = self._get_response(url, params=params)
    154         ptrn = r"root\.App\.main = (.*?);\n}\(this\)\);"
    155         try:

~\anaconda3\lib\site-packages\pandas_datareader\base.py in _get_response(self, url, params, headers)
    179             msg += "\nResponse Text:\n{0}".format(last_response_text)
    180 
--> 181         raise RemoteDataError(msg)
    182 
    183     def _get_crumb(self, *args):

RemoteDataError: Unable to read URL: https://finance.yahoo.com/quote/SPYD/history?period1=1593716400&period2=1625338799&interval=1d&frequency=1d&filter=history
Response Text:
b'<!DOCTYPE html>\n  <html lang="en-us"><head>\n  <meta http-equiv="content-type" content="text/html; charset=UTF-8">\n      <meta charset="utf-8">\n      <title>Yahoo</title>\n      <meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">\n      <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n      <style>\n  html {\n      height: 100%;\n  }\n  body {\n      background: #fafafc url(https://s.yimg.com/nn/img/sad-panda-201402200631.png) 50% 50%;\n      background-size: cover;\n      height: 100%;\n      text-align: center;\n      font: 300 18px "helvetica neue", helvetica, verdana, tahoma, arial, sans-serif;\n  }\n  table {\n      height: 100%;\n      width: 100%;\n      table-layout: fixed;\n      border-collapse: collapse;\n      border-spacing: 0;\n      border: none;\n  }\n  h1 {\n      font-size: 42px;\n      font-weight: 400;\n      color: #400090;\n  }\n  p {\n      color: #1A1A1A;\n  }\n  #message-1 {\n      font-weight: bold;\n      margin: 0;\n  }\n  #message-2 {\n      display: inline-block;\n      *display: inline;\n      zoom: 1;\n      max-width: 17em;\n      _width: 17em;\n  }\n      </style>\n  <script>\n    document.write(\'<img src="//geo.yahoo.com/b?s=1197757129&t=\'+new Date().getTime()+\'&src=aws&err_url=\'+encodeURIComponent(document.URL)+\'&err=%<pssc>&test=\'+encodeURIComponent(\'%<{Bucket}cqh[:200]>\')+\'" width="0px" height="0px"/>\');var beacon = new Image();beacon.src="//bcn.fp.yahoo.com/p?s=1197757129&t="+new Date().getTime()+"&src=aws&err_url="+encodeURIComponent(document.URL)+"&err=%<pssc>&test="+encodeURIComponent(\'%<{Bucket}cqh[:200]>\');\n  </script>\n  </head>\n  <body>\n  <!-- status code : 404 -->\n  <!-- Not Found on Server -->\n  <table>\n  <tbody><tr>\n      <td>\n      <img src="https://s.yimg.com/rz/p/yahoo_frontpage_en-US_s_f_p_205x58_frontpage.png" alt="Yahoo Logo">\n      <h1 style="margin-top:20px;">Will be right back...</h1>\n      <p id="message-1">Thank you for your 
patience.</p>\n      <p id="message-2">Our engineers are working quickly to resolve the issue.</p>\n      </td>\n  </tr>\n  </tbody></table>\n  </body></html>'

@shortsallday

I agree with a lot of the comments here. Yes, the proposed solution by @galashour may make the yahoo module more brittle, but isn't the current implementation already brittle since it's currently broken?

I agree we have to play a balancing game here: making this user friendly vs. keeping the yahoo module flexible. My angle is a mix of both suggestions already mentioned above. I think there should be a code change, because one of the reasons to import a library or module is that it's ready to use "out of the box". The user shouldn't have to do extra and repetitive work, such as handling their own user agent. However, we do need to account for future changes on Yahoo's side so this doesn't break again.

What are your thoughts? Can we achieve both goals?

@s-kust

s-kust commented Jul 3, 2021

What are your thoughts?

Simply switch to Alpha Vantage.

500 free requests per day is currently enough for my purposes.

@vonOak

vonOak commented Jul 3, 2021

Hello,
I think we shouldn't change base.py, because this file is also used by many other reader classes and we don't know whether the change could have side effects for other readers.
I propose making the change directly in yahoo/daily.py, in the class YahooDailyReader. That class already defines and fills the variable self.headers, but the variable is not used anywhere. Why is it there if it is unused? How about passing self.headers as a parameter to the _get_response method, which is implemented in base.py?

So in yahoo/daily.py, in the method _read_one_data on row 153,
instead of this

        resp = self._get_response(url, params=params)

we can have this

        resp = self._get_response(url, params=params, headers=self.headers)
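
The effect of that one-line change can be seen with a pair of toy classes. Everything here is a hypothetical stand-in written for illustration (`BaseReader`, `YahooLikeReader`, and their methods are not the real pandas_datareader classes), and no HTTP request is made.

```python
class BaseReader:
    """Toy stand-in for the shared reader base class (hypothetical)."""

    def _get_response(self, url, params=None, headers=None):
        # A real reader would issue the HTTP request here; this sketch only
        # reports what would have been sent.
        return {"url": url, "params": params, "headers": headers}


class YahooLikeReader(BaseReader):
    """Toy stand-in for a reader that defines its own headers (hypothetical)."""

    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (example)"}

    def read_without_headers(self, url, params=None):
        # Pattern before the proposal: self.headers is never forwarded.
        return self._get_response(url, params=params)

    def read_with_headers(self, url, params=None):
        # Pattern after the proposal: forward the reader's own headers.
        return self._get_response(url, params=params, headers=self.headers)


reader = YahooLikeReader()
print(reader.read_without_headers("https://example.com")["headers"])
print(reader.read_with_headers("https://example.com")["headers"])
```

Because the change lives in the subclass, other readers built on the same base class keep their current behaviour.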

@galashour
Contributor Author

I agree with the spirit of the comment, but my perspective is slightly different:
The base class provides a value as a 'safety measure' in case we end up with None as a header.
Each of the derived reader classes can still override as applicable.

So think of it as a 'last safety measure': if neither the subclass nor the user provides a value, we still end up with a valid header (the lack of which introduced this little drama).
Whether that value is fixed or derived in some other way to introduce some 'variability' in the headers, as suggested above, is indeed something that could be considered.

@shortsallday

So we have some confidence a change to base.py won't affect other readers?

Also, as a thought exercise, what are some ideas to address the "brittle"-ness of this change if Yahoo were to make future changes?

Anyone on the YOLO route and just deal with Yahoo changes as they happen? Because, it's basically what we're doing right now with this current header issue.

@galashour
Contributor Author

So we have some confidence a change to base.py won't affect other readers?

Also, as a thought exercise, what are some ideas to address the "brittle"-ness of this change if Yahoo were to make future changes?

Anyone on the YOLO route and just deal with Yahoo changes as they happen? Because, it's basically what we're doing right now with this current header issue.

Indeed, it would be good if someone who has scripts that also use datareaders from other sources could verify that the suggested update doesn't degrade anything (it shouldn't, but it's worth double-checking).
It would also be good if they could report here which other datareader sources they checked after making the suggested two-line addition to the base class.

@kdvolder

kdvolder commented Jul 4, 2021

Also, as a thought exercise, what are some ideas to address the "brittle"-ness of this change if Yahoo were to make future changes?

There isn't really much you can do about that. The nature of the Yahoo data reader is that it essentially uses Yahoo's internal APIs. There will never be any guarantee that they won't change something that breaks datareader in the future, either deliberately or simply as a consequence of deciding to change the structure of their internal APIs for whatever reason. Personally, I don't actually think that Yahoo broke this deliberately. But in any case, using an internal API like this will always be somewhat brittle.

So we have some confidence a change to base.py won't affect other readers?

It would be good to check that, yes. It seems unlikely, though. In some sense, sending requests without setting a 'user-agent' is a bit unusual and is more likely to cause problems than the other way around. So overall I would expect a change that guarantees such a header is present to make things less brittle (for any data provider, not just for Yahoo).
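
As a small illustration of that point, a script that sets no User-Agent usually still sends one, namely the library's own default, which identifies it as a script rather than a browser. The sketch below uses only Python's standard-library urllib (not the requests library that pandas_datareader relies on), makes no network call, and uses a placeholder URL and User-Agent value.

```python
import urllib.request

# A fresh opener carries urllib's default User-Agent header.
opener = urllib.request.build_opener()
print(opener.addheaders)

# Supplying the header explicitly on a request overrides that default.
request = urllib.request.Request(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 (example)"},
)
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```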

@marcosinging

Hello,
I think, we shouldn't change base.py, because this file is used also by many other reader classes and we don't know if this change in base.py can't have some side effect for other readers.
I propose change directly in the file yahoo/daily.py in the class YahooDailyReader. In that class, we already have defined and filled the variable self.headers but this variable is not used anywhere. Why is this variable there without using it? How about we use the self.headers variable as a parameter in _get_response method, which is then called to base.py?

So in yahoo/daily.py in the method _read_one_data on row 153
instead of this

        resp = self._get_response(url, params=params)

can be this

        resp = self._get_response(url, params=params, headers=self.headers)

I did exactly the same thing and it works now. Probably the solution is not so elegant, but for now it works.

@uad1098

uad1098 commented Jul 8, 2021

It doesn't look like Yahoo is going to fix the problem they created. (Not found on server. Thank you for your patience. Our engineers are working quickly to resolve the issue.)
So what's next?

  • Enhance pandas datareader to work with Yahoo using marcosinging's fix?
  • Use aeolio's yfinance fix in #868?
  • Stop using datareader for Yahoo and find another data service such as Alpha Vantage?

@shortsallday

Correct, Yahoo won't "fix" because this is a deliberate header change on their part.

Since this is the datareader repo, we should try to agree on a fix. Moving to another API, such as Alpha Vantage, is an option, but not related to this repo.

I think we're choosing between @marcosinging's and @galashour's fixes. Any preference between the two?
Also, who ultimately decides which fix gets pushed?

@SaeedBohlooli

This one worked for me. Thanks

Hello,
I think, we shouldn't change base.py, because this file is used also by many other reader classes and we don't know if this change in base.py can't have some side effect for other readers.
I propose change directly in the file yahoo/daily.py in the class YahooDailyReader. In that class, we already have defined and filled the variable self.headers but this variable is not used anywhere. Why is this variable there without using it? How about we use the self.headers variable as a parameter in _get_response method, which is then called to base.py?

So in yahoo/daily.py in the method _read_one_data on row 153
instead of this

        resp = self._get_response(url, params=params)

can be this

        resp = self._get_response(url, params=params, headers=self.headers)

@bashtage
Contributor

bashtage commented Jul 8, 2021 via email

@uad1098

uad1098 commented Jul 8, 2021 via email

@ralexx mentioned this issue Jul 8, 2021
@marcosinging

Hi everybody. Honestly, I'm really new here on GitHub: I signed up last week because of the issue with pandas_datareader. So I don't know well how fixing an issue works, but I can say that I adopted that solution (also proposed by @vonOak) and it works. If we agree that it's the best solution, we can update the pandas_datareader code, but please help me with the process here on GitHub.

@galashour
Contributor Author

I just opened a pull request that uses the following in yahoo/daily.py:
resp = self._get_response(url, params=params, headers=self.headers)

and also adds a 'safeguard' check in the base class.

Hopefully the request will go through.

Thanks all for the constructive feedback.

bashtage pushed a commit to bashtage/pandas-datareader that referenced this issue Jul 12, 2021
@bashtage
Contributor

#883 builds on this and addresses the complaint.

@bkcollection

For various reasons, I have to use my existing version of pandas_datareader, because I am on Python 2.7 with some other old modules.
How can I add the user agent in this version?

import re
import time
import warnings
import numpy as np
from pandas import Panel
from pandas_datareader.base import (_DailyBaseReader, _in_chunks)
from pandas_datareader._utils import (RemoteDataError, SymbolWarning)


class YahooDailyReader(_DailyBaseReader):

    """
    Returns DataFrame/Panel of historical stock prices from symbols, over date
    range, start to end. To avoid being penalized by Yahoo! Finance servers,
    pauses between downloading 'chunks' of symbols can be specified.

    Parameters
    ----------
    symbols : string, array-like object (list, tuple, Series), or DataFrame
        Single stock symbol (ticker), array-like object of symbols or
        DataFrame with index containing stock symbols.
    start : string, (defaults to '1/1/2010')
        Starting date, timestamp. Parses many different kind of date
        representations (e.g., 'JAN-01-2010', '1/1/10', 'Jan, 1, 1980')
    end : string, (defaults to today)
        Ending date, timestamp. Same format as starting date.
    retry_count : int, default 3
        Number of times to retry query request.
    pause : int, default 0
        Time, in seconds, to pause between consecutive queries of chunks. If
        single value given for symbol, represents the pause between retries.
    session : Session, default None
        requests.sessions.Session instance to be used
    adjust_price : bool, default False
        If True, adjusts all prices in hist_data ('Open', 'High', 'Low',
        'Close') based on 'Adj Close' price. Adds 'Adj_Ratio' column and drops
        'Adj Close'.
    ret_index : bool, default False
        If True, includes a simple return index 'Ret_Index' in hist_data.
    chunksize : int, default 25
        Number of symbols to download consecutively before initiating pause.
    interval : string, default 'd'
        Time interval code, valid values are 'd' for daily, 'wk' for weekly
        and 'mo' for monthly ('w' and 'm' are accepted for backward
        compatibility).
    """

    def __init__(self, symbols=None, start=None, end=None, retry_count=3,
                 pause=0.35, session=None, adjust_price=False,
                 ret_index=False, chunksize=25, interval='d'):
        super(YahooDailyReader, self).__init__(symbols=symbols,
                                               start=start, end=end,
                                               retry_count=retry_count,
                                               pause=pause, session=session,
                                               chunksize=chunksize)
        # Ladder up the wait time between subsequent requests to improve
        # probability of a successful retry
        self.pause_multiplier = 2.5

        self.headers = {
            'Connection': 'keep-alive',
            'Expires': str(-1),
            'Upgrade-Insecure-Requests': str(1),
            # Google Chrome:
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'  # noqa
        }

        self.adjust_price = adjust_price
        self.ret_index = ret_index
        self.interval = interval

        if self.interval not in ['d', 'wk', 'mo', 'm', 'w']:
            raise ValueError("Invalid interval: valid values are 'd', 'wk' and 'mo'. 'm' and 'w' have been implemented for "  # noqa
                             "backward compatibility. 'v' has been moved to the yahoo-actions or yahoo-dividends APIs.")  # noqa
        elif self.interval in ['m', 'mo']:
            self.pdinterval = 'm'
            self.interval = 'mo'
        elif self.interval in ['w', 'wk']:
            self.pdinterval = 'w'
            self.interval = 'wk'

        self.interval = '1' + self.interval
        self.crumb = self._get_crumb(retry_count)

    @property
    def service(self):
        return 'history'

    @property
    def url(self):
        return 'https://query1.finance.yahoo.com/v7/finance/download/{}'\
            .format(self.symbols)

    @staticmethod
    def yurl(symbol):
        return 'https://query1.finance.yahoo.com/v7/finance/download/{}'\
            .format(symbol)

    def _get_params(self, symbol):
        unix_start = int(time.mktime(self.start.timetuple()))
        unix_end = int(time.mktime(self.end.timetuple()))

        params = {
            'period1': unix_start,
            'period2': unix_end,
            'interval': self.interval,
            'events': self.service,
            'crumb': self.crumb
        }
        return params

    def read(self):
        """ read one data from specified URL """
        df = super(YahooDailyReader, self).read()
        if self.ret_index:
            df['Ret_Index'] = _calc_return_index(df['Adj Close'])
        if self.adjust_price:
            df = _adjust_prices(df)
        return df.sort_index()

    def _dl_mult_symbols(self, symbols):
        stocks = {}
        failed = []
        passed = []
        for sym_group in _in_chunks(symbols, self.chunksize):
            for sym in sym_group:
                try:
                    stocks[sym] = self._read_one_data(self.yurl(sym),
                                                      self._get_params(sym))
                    passed.append(sym)
                except IOError:
                    msg = 'Failed to read symbol: {0!r}, replacing with NaN.'
                    warnings.warn(msg.format(sym), SymbolWarning)
                    failed.append(sym)

        if len(passed) == 0:
            msg = "No data fetched using {0!r}"
            raise RemoteDataError(msg.format(self.__class__.__name__))
        try:
            if len(stocks) > 0 and len(failed) > 0 and len(passed) > 0:
                df_na = stocks[passed[0]].copy()
                df_na[:] = np.nan
                for sym in failed:
                    stocks[sym] = df_na
            return Panel(stocks).swapaxes('items', 'minor')
        except AttributeError:
            # cannot construct a panel with just 1D nans indicating no data
            msg = "No data fetched using {0!r}"
            raise RemoteDataError(msg.format(self.__class__.__name__))

    def _get_crumb(self, retries):
        # Scrape a history page for a valid crumb ID:
        tu = "https://finance.yahoo.com/quote/{}/history".format(self.symbols)
        response = self._get_response(tu,
                                      params=self.params, headers=self.headers)
        out = str(self._sanitize_response(response))
        # Matches: {"crumb":"AlphaNumeric"}
        rpat = '"CrumbStore":{"crumb":"([^"]+)"}'

        crumb = re.findall(rpat, out)[0]
        return crumb.encode('ascii').decode('unicode-escape')


def _adjust_prices(hist_data, price_list=None):
    """
    Return modified DataFrame or Panel with adjusted prices based on
    'Adj Close' price. Adds 'Adj_Ratio' column.
    """
    if price_list is None:
        price_list = 'Open', 'High', 'Low', 'Close'
    adj_ratio = hist_data['Adj Close'] / hist_data['Close']

    data = hist_data.copy()
    for item in price_list:
        data[item] = hist_data[item] * adj_ratio
    data['Adj_Ratio'] = adj_ratio
    del data['Adj Close']
    return data


def _calc_return_index(price_df):
    """
    Return a returns index from an input price df or series. Initial value
    (typically NaN) is set to 1.
    """
    df = price_df.pct_change().add(1).cumprod()
    mask = df.ix[1].notnull() & df.ix[0].isnull()
    df.ix[0][mask] = 1

    # Check for first stock listings after starting date of index in ret_index
    # If True, find first_valid_index and set previous entry to 1.
    if (~mask).any():
        for sym in mask.index[~mask]:
            tstamp = df[sym].first_valid_index()
            t_idx = df.index.get_loc(tstamp) - 1
            df[sym].ix[t_idx] = 1

    return df
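
On the user-agent question above: the docstring lists a session parameter (a requests.sessions.Session), so one possible route is to put the header on the session rather than patching the reader. Below is a minimal sketch of the header-merging part only, using nothing outside the standard library. The helper name with_user_agent and the constant DEFAULT_USER_AGENT are illustrative assumptions, not part of pandas_datareader.

```python
# Hypothetical helper, not part of pandas_datareader.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
)


def with_user_agent(headers=None, user_agent=DEFAULT_USER_AGENT):
    """Return a copy of headers that is guaranteed to carry a User-Agent."""
    merged = dict(headers or {})
    # Header names are case-insensitive, so compare in lower case.
    if not any(key.lower() == "user-agent" for key in merged):
        merged["User-Agent"] = user_agent
    return merged


print(with_user_agent({"Connection": "keep-alive"}))
```

The result could then be applied with session.headers.update(with_user_agent()) on a requests.Session and that session passed through the session argument. Whether an old reader version routes every request through the session is something to verify locally.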

@bashtage bashtage added this to the 0.10 milestone Jul 13, 2021
@OT2022

OT2022 commented Dec 29, 2022

if headers == None:
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

Adding the 2 lines to base.py didn't work for me. Do you know if there is an updated UA I should be using?
I'm conscious the initial answer was over a year ago now.

Thanks

@galashour
Contributor Author

This thread is quite outdated, and the library has changed. Also, at the time (1.5 years ago) I think a different fix was eventually implemented.

Consider moving to version 0.2 which seems to be working fine (or 0.1.94 which I think was the last on the 0.1 branch).

@paulmcq

paulmcq commented Dec 29, 2022

headers == None should instead be headers is None
eg: https://stackoverflow.com/a/47366574/5593151
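
To illustrate the point above: == dispatches to the object's __eq__ method, which a class may override, whereas is checks identity and cannot be overridden. A small sketch follows; the class AlwaysEqual and the function resolve_headers are made up for illustration.

```python
class AlwaysEqual(object):
    """Toy object whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


def resolve_headers(headers=None):
    """Fall back to a default header dict only when nothing was passed."""
    if headers is None:  # identity check: true only for the None singleton
        headers = {"User-Agent": "Mozilla/5.0"}
    return headers


odd = AlwaysEqual()
print(odd == None)  # True: the overridden __eq__ answers the comparison
print(odd is None)  # False: odd is a different object from None
print(resolve_headers())
print(resolve_headers({"User-Agent": "custom"}))
```

For a plain dict or for None itself, both spellings give the same result, so this change is a style fix and would not by itself alter what the server returns.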

@OT2022

OT2022 commented Dec 29, 2022

https://stackoverflow.com/a/47366574/5593151

I tried headers is None, which returned the error below:

TypeError Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 wb.DataReader('PG', data_source='yahoo', start='2012-1-1')

File /Applications/Python/anaconda3/lib/python3.9/site-packages/pandas/util/_decorators.py:207, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
205 else:
206 kwargs[new_arg_name] = new_arg_value
--> 207 return func(*args, **kwargs)

File /Applications/Python/anaconda3/lib/python3.9/site-packages/pandas_datareader/data.py:370, in DataReader(name, data_source, start, end, retry_count, pause, session, api_key)
367 raise NotImplementedError(msg)
369 if data_source == "yahoo":
--> 370 return YahooDailyReader(
371 symbols=name,
372 start=start,
373 end=end,
374 adjust_price=False,
375 chunksize=25,
376 retry_count=retry_count,
377 pause=pause,
378 session=session,
379 ).read()
381 elif data_source == "iex":
382 return IEXDailyReader(
383 symbols=name,
384 start=start,
(...)
390 session=session,
391 ).read()

File /Applications/Python/anaconda3/lib/python3.9/site-packages/pandas_datareader/base.py:256, in _DailyBaseReader.read(self)
254 # If a single symbol, (e.g., 'GOOG')
255 if isinstance(self.symbols, (string_types, int)):
--> 256 df = self._read_one_data(self.url, params=self._get_params(self.symbols))
257 # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
258 elif isinstance(self.symbols, DataFrame):

File /Applications/Python/anaconda3/lib/python3.9/site-packages/pandas_datareader/yahoo/daily.py:153, in YahooDailyReader._read_one_data(self, url, params)
151 try:
152 j = json.loads(re.search(ptrn, resp.text, re.DOTALL).group(1))
--> 153 data = j["context"]["dispatcher"]["stores"]["HistoricalPriceStore"]
154 except KeyError:
155 msg = "No data fetched for symbol {} using {}"

TypeError: string indices must be integers

Any idea what I am doing wrong?
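
One reading of the traceback above, offered as an assumption rather than a diagnosis: the frame shows an except KeyError handler, but the failure is a TypeError, which is what Python raises when a str is indexed with a string key. That would happen if one of the nested values in the parsed JSON were a string instead of a dict. The payload below is made up to reproduce that shape; it is not Yahoo's actual response.

```python
import json

# Made-up payload: "stores" holds a string where a dict was expected.
raw = '{"context": {"dispatcher": {"stores": "opaque-string"}}}'
j = json.loads(raw)

message = ""
try:
    data = j["context"]["dispatcher"]["stores"]["HistoricalPriceStore"]
    outcome = "ok"
except KeyError:
    # Would run if a dict were simply missing the key.
    outcome = "KeyError"
except TypeError as exc:
    # Runs here: indexing a str with a str key raises TypeError.
    outcome = "TypeError"
    message = str(exc)

print(outcome, "-", message)
```

If this is the cause, changing the header check would not help, since the page structure itself differs from what the reader expects.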

@OT2022

OT2022 commented Dec 29, 2022

This thread is quite outdated, and the library has changed. Also, at the time (1.5 years ago) I think a different fix was eventually implemented.

Consider moving to version 0.2 which seems to be working fine (or 0.1.94 which I think was the last on the 0.1 branch).

Can you explain what you mean by moving to version 0.2? What application are you referring to here?

@galashour
Contributor Author

galashour commented Dec 29, 2022

I assume you use this library in the context of a python application.
In your python environment you can use the following to upgrade the library to the most recent one:

pip install yfinance --upgrade --no-cache-dir
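
Before and after upgrading, it can help to confirm which versions are actually installed in the active environment. A minimal sketch using importlib.metadata from the standard library (Python 3.8+); the helper name installed_version is illustrative.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("yfinance", "pandas-datareader"):
    print(name, installed_version(name))
```

A None result for a package you believe is installed usually means the script is running under a different interpreter than the one pip installed into.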

@OT2022

OT2022 commented Dec 29, 2022

got it - the library I was using was 'pandas_datareader'

I did however install yfinance and got the below error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.4 requires pathlib, which is not installed.

Basically, I just want to be able to pull historical pricing from Yahoo. Is there any way you can think of that doesn't require me to contact Yahoo support?

Alternatively, can you suggest any other libraries I can use to pull historical adjusted close prices for X number of stocks?

@galashour
Contributor Author

It seems you have an environment issue.

Try to install Anaconda:
https://docs.anaconda.com/anaconda/install/

and then create an 'Env' (or virtual env) using the relevant version of Python (I recommend using versions 3.8 or 3.9 since they have the majority of the relevant libraries you are likely to use).

Again, the issue you have seems to be related to the environment/paths on your setup, rather than to the Yahoo library (regardless of whether you use pandas-datareader or yfinance).

Resolving such issues can be tedious, but it is worth it in the long run.
Create a new environment, start with a simple 'hello world', and then incrementally expand it to contain your code in stages after installing the relevant libraries.
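
As a rough sketch of the "new environment" step without conda, the standard-library venv module can create an isolated environment. The directory name is an arbitrary assumption, and with_pip=False is used here only to keep the example quick; in practice you would want pip available inside the environment.

```python
import os
import tempfile
import venv

# Create the environment in a throwaway directory for demonstration.
target = os.path.join(tempfile.mkdtemp(), "yahoo-env")
venv.create(target, with_pip=False)

# A created environment has a pyvenv.cfg file at its root.
print(os.path.isfile(os.path.join(target, "pyvenv.cfg")))
```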

good luck

@OT2022

OT2022 commented Dec 29, 2022

It seems you have an environment issue.

Try to install Anaconda: https://docs.anaconda.com/anaconda/install/

and then create an 'Env' (or virtual env) using the relevant version of Python (I recommend using versions 3.8 or 3.9 since they have the majority of the relevant libraries you are likely to use).

Again, the issue you have seems to be related to the environment/paths on your setup, rather than to the Yahoo library (regardless of whether you use pandas-datareader or yfinance).

Resolving such issues can be tedious, but it is worth it in the long run. Create a new environment, start with a simple 'hello world', and then incrementally expand it to contain your code in stages after installing the relevant libraries.

good luck

understood - thank you

@OT2022

OT2022 commented Dec 29, 2022

It seems you have an environment issue.

Try to install Anaconda: https://docs.anaconda.com/anaconda/install/

and then create an 'Env' (or virtual env) using the relevant version of Python (I recommend using versions 3.8 or 3.9 since they have the majority of the relevant libraries you are likely to use).

Again, the issue you have seems to be related to the environment/paths on your setup, rather than to the Yahoo library (regardless of whether you use pandas-datareader or yfinance).

Resolving such issues can be tedious, but it is worth it in the long run. Create a new environment, start with a simple 'hello world', and then incrementally expand it to contain your code in stages after installing the relevant libraries.

good luck

Just out of interest - what gave it away as an environment issue rather than a Yahoo/yfinance one?

@galashour
Contributor Author

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.4 requires pathlib, which is not installed.

It seems some dependencies were not resolved; I can't say whether that is related to the pip command itself or to something else. There are also the 'dependency conflicts' reported for conda-repo-cli and the complaint that pathlib is not installed. These all seem to be generic complaints, and while I can't say what triggers them, they don't seem to be related to yfinance.
