Skip to content

resolve issue #867 by explicitly adding headers to url _get_response #873

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

galashour
Copy link
Contributor

@galashour galashour commented Jul 10, 2021

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • passes black --check pandas_datareader
  • added entry to docs/source/whatsnew/vLATEST.txt

@galashour
Copy link
Contributor Author

I'm relatively new, the fix is simple and addresses issue #867 , without the fix, using yahoo source fails (/crashes).

I'm not sure if / how to do some of the above, but I did test that before the fix the library crashes and after the fix it completes. Also others commenting on issue #867 confrimed the same.

I'm not sure if/how to performm the above checkboxes, and in which environment, possibly you guys can help.

@s-meinhardt
Copy link

s-meinhardt commented Jul 11, 2021

Hi galashour,

I made a similar PR a couple of days ago here but was made aware of an issue by @ralexx with whom you had a similar discussion here. As there is no reason to have multiple PRs for the same problem, I'll close mine and continue to contribute here. I hope the issue raised by @ralexx can be solved soon as I want the YDR back to work.

However, you should test if you actually get any data back and not just an empty DataFrame. I've noticed that I had to change the default interval from 'd' to '1d' in the daily.py file too in order to get actual data. See my commit #869 for further details. You might want to add another commit to fix that issue in your PR.

@s-meinhardt s-meinhardt mentioned this pull request Jul 11, 2021
@galashour
Copy link
Contributor Author

Hi galashour,

I made a similar PR a couple of days ago here but was made aware of an issue by @ralexx with whom you had a similar discussion here. As there is no reason to have multiple PRs for the same problem, I'll close mine and continue to contribute here. I hope the issue raised by @ralexx can be solved soon as I want the YDR back to work.

However, you should test if you actually get any data back and not just an empty DataFrame. I've noticed that I had to change the default interval from 'd' to '1d' in the daily.py file too in order to get actual data. See my commit #869 for further details. You might want to add another commit to fix that issue in your PR.

Thanks for the comment.
I verified that I do get valid daily data once the header issue was addressed.

df1 = web.DataReader(ticker1, data_source='yahoo', start=date_start, end=date_end)

@s-meinhardt
Copy link

s-meinhardt commented Jul 11, 2021

I can also get data now without the interval modification. Yahoo must have fixed their API as it definitely didn't work at the beginning of the week. It is actually the line self.interval = "1" + self.interval in the YahooDailyReader's init-function which adds the missing 1. When I was running my tests, I was only checking the requests call with default value for interval='d', hence got nothing back. My fault!

Well, I hope you get your PR accepted asap.

@ralexx
Copy link

ralexx commented Jul 12, 2021

Please do not merge this pull request as it appears now. It will break the idiomatic use of user-defined requests.Session instances in pandas_datareader. A much simpler change can achieve the desired result without breaking any use cases.

I appreciate that @galashour is intent on embedding a default user agent string in pandas_datareader. To be clear, I have already submitted that this is both unnecessary and an anti-pattern. But leaving aside my protest, this patch currently takes a harmful approach that will unnecessarily break valid use cases.

Problem

These changes force all users to monkey patch pandas_datareader.yahoo.daily.YahooDailyReader.headers if they do not want to use the request headers arbitrarily hard-coded into that module:

        self.headers = {
            "Connection": "keep-alive",
            "Expires": str(-1),
            "Upgrade-Insecure-Requests": str(1),
            # Google Chrome:
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                "(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
            ),
}

This is because in requests [my emphasis in the quote that follows],

Any dictionaries that you pass to a request method will be merged with the session-level values that are set. The method-level parameters override session parameters. [docs]

The proposed change to YahooDailyReader._read_one_data causes the headers argument to be passed to the pandas_datareader.base._BaseReader.session.get method, which overrides any headers that may have been set on _BaseReader.session. That includes the headers on any requests.Session instance passed to YahooDailyReader's session argument (which is passed up to _BaseReader.session as the intervening subclasses call super()).

Because no public headers argument exists in YahooDailyReader.__init__, a monkey patch would be required to avoid being compelled to use the hard-coded headers. Even if you add a public API for setting the headers, users will still be forced to pass their preferred headers to each YahooDailyReader instance. This defeats the design intent of a requests.Session: "Sessions can also be used to provide default data to the request methods." [docs]

Non-breaking approach

If you remain intent on providing a hard-coded, fallback user agent string, it can be accomplished without breaking any use case.

Note that the _BaseReader.session attribute is set by calling pandas_datareader._utils._init_session. If no Session instance has been passed to _BaseReader.__init__, _init_session instantiates one and binds _BaseReader.session to it. Make changes here and they will only be applied where users do not pass a Session instance.

# pandas_datareader._utils.py

DEFAULT_USER_AGENT = <some user agent string>  # so far this is the only change needed for Yahoo API

def _init_session(session, retry_count=3):
    if session is None:
        session = requests.Session()
        # do not set requests max_retries here to support arbitrary pause
        session.headers['User-Agent'] = DEFAULT_USER_AGENT  # <-- NON-BREAKING CHANGE
    return session

@galashour
Copy link
Contributor Author

There is No HARD CODED header content in the fix, not sure if you actually read the actual fix.

It puts on line 153 of the yahoo/daily.py the 3rd parameter that was missing, with NO hardcoded string:
resp = self._get_response(url, params=params, headers=self.headers)

Similarly, it also put a safeguard in the base class (again happens to be line 153) only for the case somehow we end up with headers == None:
if headers == None:
headers = self.headers

@ralexx
Copy link

ralexx commented Jul 12, 2021

There is No HARD CODED header content in the fix

self.headers in your patch is pandas_datareader.yahoo.daily.YahooDailyReader.headers (lines 84-93). There is no interface to set this value.

As I said, even if you sort this issue out, your patch still forces all users to pass headers if they don't want to use the headers as defined above. There are other ways of accomplishing your purpose without disadvantaging users.

@galashour
Copy link
Contributor Author

galashour commented Jul 12, 2021

Once again - wrong.

  1. No interface was changed (as is also stated in my above posts).
  2. Users don't provide headers (clearly since no interface was changed for them to do so)
  3. and obviously - I didn't provide any headers whatsoever in my testing (since no interface was changed)

Not sure where all these false claims are coming from, since clearly you didn't read the fix.

The original problem was that in yahoo/daily, the _read-one_data method didn't use the self.headers (that was already set correctly and just wasn't used in the url request). The very simple fix was just to add the 3rd parameter (with the existing value) to the function call:

Instead of:

resp = self._get_response(url, params=params)

The fix is:

resp = self._get_response(url, params=params, headers=self.headers)

You really can't get more elegant, minimalistic and without changing any paradigm or interface.

More so, in the internal testing I printed the 'existing' self.headers and verified it was (already) set with a valid value, and was just not used, and the trivial fix was to add it to resolve the issue.

I do ask that the PR will be handled soon (this way or another), since currently the interface in the official build is broken when working with yahoo (and the proposed fix elegantly addresses it).

@ralexx
Copy link

ralexx commented Jul 12, 2021

Please be kind, I am trying to help you. I understand that your changes work for you, but so far you have not provided any tests for your proposed changes so it's not clear that your changes work for other use cases. As the author of a PR it's incumbent upon you to address fellow users' concerns.

You propose a change that violates the public API of pandas_datareader.yahoo.daily.YahooDailyReader and I have explained why this is the case. If you believe I am wrong then your tests should show that clearly. Just pass a requests.Session instance into YahooDailyReader and show that YahooDailyReader(**kwds).read() fails (because we know that Yahoo is rejecting the default user agent string 'User-Agent': 'python-requests/2.25.1' used by requests.Session.

the interface in the official build is broken when working with yahoo

This is not accurate. I demonstrated this to you in #867. Anyone using version 0.9 can download historical prices today by passing a requests.Session instance with a valid user agent header into the session argument of the YahooDailyReader.__init__ interface.

Or, by making a trivial method call, you can even use the same hard-coded request headers on lines 84-93 of yahoo/daily.py that your patch points to:

from pandas_datareader.yahoo.daily import YahooDailyReader

reader = YahooDailyReader('MSFT', start='2021-07-01', end='2021-07-09')

# confirm that the default headers are the ones we don't want
print(reader.session.headers)

# {'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

# now update the *session* headers from `YahooDailyReader.headers`
reader.session.headers.update(reader.headers)
print(reader.session.headers)

# {'Connection': 'keep-alive',
#  'Expires': '-1',
#  'Upgrade-Insecure-Requests': '1',
#  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}

# download prices as usual
df = reader.read()
print(df)

#                  High         Low        Open       Close    Volume   Adj Close
# Date
# 2021-07-01  271.839996  269.600006  269.609985  271.600006  16725300  271.600006
# 2021-07-02  278.000000  272.500000  272.820007  277.649994  26458000  277.649994
# 2021-07-06  279.369995  274.299988  278.029999  277.660004  31565600  277.660004
# 2021-07-07  280.690002  277.149994  279.399994  279.929993  23260000  279.929993
# 2021-07-08  278.730011  274.869995  276.899994  277.420013  24618600  277.420013
# 2021-07-09  278.049988  275.320007  275.720001  277.940002  23905500  277.940002

Please confirm that you are able to run this code.

@galashour
Copy link
Contributor Author

galashour commented Jul 12, 2021

The API that used to work was simple and is used by many:

df1 = web.DataReader(ticker1, data_source='yahoo', start=date_start, end=date_end)

This API is now broken.
It is a common API used, by many.

It is not only for me, quite a few on issue #867 (and also on issue #868 that is a duplicate) - reported the same problem, and reported it works for them with the simple fix applied.

Original API was broken due to a 'bug' in the existing that yahoo.daily as I mentioned that 'forgot' to add the self.headers that was already set internally correctly (with no explicit value from the users), while a preferred practice regrdless to yahoo is to use headers in the url-request (as can be seen also in the base class).

I will say it one last time, as nicely as I can, I didn't violate any public API, nor did I change any API (public or internal). Simply added a parameter that was already there unused while it should have been used.
Also the base class does use the headers in the url-request, and just the yahoo-daily implementation forgot to add this 3rd parameter.

Lastly, I will add the yfinance package used the same approach as I'm suggesting to resolve the same failure they experienced last week with yahoo source.

Addition
I just made additional check with your code on top of my fix, to confirm it didn't break anything (and indeed it didn't break):

from pandas_datareader.yahoo.daily import YahooDailyReader
reader = YahooDailyReader('MSFT', start='2021-07-01', end='2021-07-09')

confirm that the default headers are the ones we don't want

print(reader.session.headers)
reader.session.headers.update(reader.headers)
print(reader.session.headers)

download prices as usual

df = reader.read()
print(df)

=== === ===

run results seem to work perfectly (thus, as desired, no breakage introduced by the proposed simple fix):

C:\Users\user\anaconda3\envs\p38web2\python.exe D:/Python/Sandbox/index_tables.py

{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '/', 'Connection': 'keep-alive'}

{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '/', 'Connection': 'keep-alive', 'Expires': '-1', 'Upgrade-Insecure-Requests': '1'}
High Low ... Volume Adj Close
Date ...
2021-07-01 271.839996 269.600006 ... 16725300 271.600006
2021-07-02 278.000000 272.500000 ... 26458000 277.649994
2021-07-06 279.369995 274.299988 ... 31565600 277.660004
2021-07-07 280.690002 277.149994 ... 23260000 279.929993
2021-07-08 278.730011 274.869995 ... 24618600 277.420013
2021-07-09 278.049988 275.320007 ... 23905500 277.940002

[6 rows x 6 columns]

Process finished with exit code 0

@bashtage
Copy link
Contributor

#883 builds on this while addressing the issues it created.

@bashtage
Copy link
Contributor

Please try master to see if it is working now.

pip install git+https://github.com/pydata/pandas-datareader.git

If there are positive reports, I'll release very soon.

@galashour
Copy link
Contributor Author

Please try master to see if it is working now.

pip install git+https://github.com/pydata/pandas-datareader.git

If there are positive reports, I'll release very soon.

Looks good from my side. Thanks (!)

@bashtage
Copy link
Contributor

Fixed in master. Thanks for the headstart.

@bashtage bashtage closed this Jul 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants