-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Use python-requests
library, if installed, to support broad http(s) scenarios including Basic Authentication
#17087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
doc/source/whatsnew/v0.21.0.txt
Outdated
``read_csv`` use `python-requests` (if installed) to support basic auth and much more | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
If `python-requests` library is installed try to use it first. If not, continue using urllib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given our previous discussion, let's not do this (at least for now).
Only use requests
in the cases when auth
is needed. That will help to preserve backwards compatibility with behavior from urllib2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got engaged as I opened an issue for basic auth + self signed certs.
Tomorrow someone might open an issue that pandas doesnt support today eg: proxies. Yet another issue might be setting custom headers, etc.
Here is a bug in released version of pandas. This does not involve https, or auth of any kind. This is supposed to work
df = pd.read_csv('http://handsome-equator.000webhostapp.com/no_auth/aaa.csv') # works fine
df = pd.read_csv('http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv') # fails with 404 if port number is added. it shouldnt. It doesnt in browser, or requests
It fails because urllib behavior. It works in use-requests if requests is installed and it is used even for existing scenarios when installed.
So, what is the direction for http(s) in pandas? Natively use urllib or move to requests? If the goal is to try to use requests the most we can, then what is the test acceptance criteria for trying to use requests most when available? use-requests passes all tests. Besides api design, what else remains?
@gfyoung we also discussed trying to offload most to requests
if available. If we start detecting auth
scenario, we are then parsing URL etc, adding code. Please see my comment above for 'future issues' as well as 'long term / new feature requests'. It would help if I could understand the reasoning behind:
let's not do this (at least for now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasoning is that we want our code to be backwards compatible, so we don't want to introduce any inconsistencies by using a new library to handle requests for the time being.
In the future, we may transition to using requests
entirely (the interface is definitely better, I won't deny that), but such transitions are gradual so that people have to adjust their code bases that depend on pandas
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we don't want to introduce any inconsistencies by using a new library to handle requests for the time being
@gfyoung Ok.. I totally agree with that specific part reasoning. And that is why earlier I had asked the question of acceptance-criteria
. So I agree it should not use requests by default, only for the unknown/destabilization issues. But we need some way to explicitly allow use of requests in many cases and not just limit it to scenario of basic-auth. So I propose instead of detecting only auth
scenario, can we change it to:
if url_params is None:
# only use existing code-base
# never attempt to use requests even if installed...
# existing codebase should continue to use urllib
else:
# try to use requests if installed, even if it is empty.
# provide a way for early adopters to try out requests library even for existing scenarios
In this scenario, old codebase is not impacted. Only brave souls who specify the new url_params
and if requests is installed, will use requests.
Perhaps we can set 'The acceptance criteria for using requests by default if installed - is that we get 3-6 months of user feedback on using url_params\requests
satisfactorily' ? Thoughts?
df = pd.read_csv('http://aaa.com/joker.csv') # never uses requests. always uses `urllib`
df = pd.read_csv('http://aaa.com/joker.csv', url_params={}) # only uses requests if installed
df = pd.read_csv('http://aaa.com/joker.csv', url_params={ 'auth':('u', 'p')}) # requires requests. fails otherwise.
df = pd.read_csv('http://user:[email protected]/joker.csv', url_params={}) # only uses requests if installed
df = pd.read_csv('http://user:[email protected]/joker.csv') # Should we parse URL and look for it? Code is already written, but one can argue, lets not check in code that needs to parse url to detect auth scenario. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a startup, so we don't have "early adopters" per se 😄 . The API should be that we use requests
only in cases where we cannot use existing read_csv
currently. And no, I would not use requests
if url_params
is empty. The condition to use requests
should be if url_params
is a non-empty dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets say I am a heavy duty pandas user with http (dont use basic auth. Just use plain old http). I get it that possibly in future requests might become the default (if installed). I want to be prepared and find issues early. How should i force pandas to use requests, even if I am not using http auth? How else do you expect to gain confidence in requests and detect scenarios of inconsistencies?
The condition to use requests should be if url_params is a non-empty dictionary
how about: url_params = { 'use_requests' : True }
forces use of requests even if it is not auth
or verify
scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to be prepared and find issues early. How should i force pandas to use requests, even if I am not using http auth? How else do you expect to gain confidence in requests and detect scenarios of inconsistencies?
That's fair, but you as the user should have to go out of your way to use it until we start enabling as the default. Notice I didn't prevent you at this point from passing in a non-empty dictionary, even if it contains parameters that are ultimately default parameters for requests
for example. That is one way you can "force" us to use requests
.
Now whether that's good design is another story 😄 , but I do understand where you're coming from as a user.
The |
pytest.skip('python-requests not installed, skipping') | ||
|
||
|
||
uname = 'pandasusr' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create fixtures out of these e.g.:
@pytest.fixture
def username():
return "pandasusr"
This is more idiomatic with pytest
. You can read more here:
Codecov Report
@@ Coverage Diff @@
## master #17087 +/- ##
==========================================
- Coverage 91.02% 90.99% -0.04%
==========================================
Files 161 161
Lines 49396 49434 +38
==========================================
+ Hits 44965 44984 +19
- Misses 4431 4450 +19
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17087 +/- ##
==========================================
- Coverage 91.58% 91.54% -0.04%
==========================================
Files 153 153
Lines 51250 51288 +38
==========================================
+ Hits 46935 46952 +17
- Misses 4315 4336 +21
Continue to review full report at Codecov.
|
so let's make this really simple. If auth (and I would use |
Only after I got that feedback from @gfyoung did I implement While I would like to help the project, I lack bandwidth on my end that this async conversation requires. And perhaps it might be best that I just hand what is here to you fine folks to take this forward. I have provided you multiple working solutions which you can choose to pick and modify (or just discard them):
If after reviewing all comments in previous pull request, and this pull request, if there are any minor modifications I can make that seem reasonable, I would be happy to do. Please let me know.. else, my apologies that I am unable to contribute to the level of depth that you folks need at the current time.. and request you folks to do what is best. |
Here is my previous comment. I am not sure why you are going back to the previous conversation.I think my comment is quite clear. I suggested you do a more limited PR that only implements using requests if |
@jreback url = 'https://aaa.com/t.csv'
df = pd.read_csv(url, url_params = { 'auth':('uname', 'pwd'), 'verify': False}) # Option 1: current check-in
df = pd.read_csv(url, auth=('uname', 'pwd'), verify_ssl= False) # Option 2: previous fork/pull request
df = pd.read_csv(url, auth=('uname', 'pwd')) # Option 3: New suggestion that dont even handle scenario allowing self signed certs |
option 3
I just don't want more keywords in all of the |
Almost no-one does basic auth over http (without ssl). Getting officially verified ssl certs is difficult especially in private environments and testing scenarios. So in those scenarios, the compromise is to use self-signed ssl certs or expired certs or certs whose domains dont match in those cases, and it becomes essential to bypass ssl cert verification to be able to test in that scenario. That is why requests library has a parameter None of us want to add additional params to I often read data that requires http timeout to be non-default. I would like to use the callback functionality that requests library provides to see at times "what the heck is going on in the http layer and capture it". If I wont be able to set this within pandas, then for me, its just easy to leave anything http (auth/ssl/timeout/callback/proxy etc) related out of pandas as a standard practice, first use requests separately and then use StringIO with read_csv. The currently checked code with If there is no plan for supporting above mentioned http scenarios, why even take a requests soft dependency? There is already a checked in solution with basic auth, that was already rejected (in previous fork) that worked with urllib/urllib3. |
@skynss I DO like the idea of only adding 1 new parameter (to basically all read_* routines). I object to it being called
would be fine. let's do step-by-step and simpler is better. |
@jreback This PR already meets all your new requirements (except for renaming of The dict is already well defined (and very short): It can contain There is one exception in this PR that doesnt meet your requirement:
So, after this PR merge, if I want to implement from requests import Session
s = Session()
s.timeout = (1.0, 3.5)
df = pd.read_csv( url, url_params=s) # great... now my read_csv will timeout as specified. So far, I cannot see of any disadvantage of allowing passing of |
great if u can change to http_params i'll take a look tom |
@jreback checked in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a number of comments about docs and testing. after these are updated will have a deeper look at the impl.
doc/source/whatsnew/v0.21.0.txt
Outdated
up = { 'auth': ('john', 'pwd') , 'verify' : False} | ||
df = pd.read_csv('https://aa.com/bb.csv', http_params=up ) # now url can contain username and pwd | ||
|
||
# Optionally, A requests.Session() can also be passed into http_params |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be much simpler in the whatsnew. rather make a new section on using authorization in io.rst since this is note restricted to just read_csv, and show a link from the whatsnew.
This should be a code-block as well. Pls make this readable simple code.
pandas/io/common.py
Outdated
@@ -93,6 +99,11 @@ def __next__(self): | |||
BaseIterator.next = lambda self: self.__next__() | |||
|
|||
|
|||
def is_requests_pkg_avail(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use the variable directly
pandas/io/common.py
Outdated
@@ -47,6 +47,12 @@ | |||
except: | |||
_PY_PATH_INSTALLED = False | |||
|
|||
try: | |||
import requests | |||
_PY_REQUESTS_INSTALLED = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_REQUESTS_INSTALLED, no _PY
""" | ||
Generate python-requests session from http_params dict | ||
""" | ||
s = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to assert that _REQUESTS_INSTALLED here as its a private function and the reader doesn't know this.
pls expand the doc-string a bit.
|
||
def fetch_url(url, http_params=None, skip_requests=False): | ||
""" | ||
If url is url, first try python-requests else try urllib. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just call this requests
"""Try to read from a url, file or string. | ||
|
||
Parameters | ||
---------- | ||
obj : str, unicode, or file-like | ||
|
||
http_params : dict or requests.Session(), default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so again put this template in pandas.io.common, then you can import it and add it rather than copy-pasting all over the place.
@@ -856,6 +869,12 @@ def read_html(io, match='.+', flavor=None, header=None, index_col=None, | |||
|
|||
.. versionadded:: 0.19.0 | |||
|
|||
http_params : requests.Session(), default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
pandas/tests/io/test_http_auth.py
Outdated
import pandas as pd | ||
import pandas.util.testing as tm | ||
|
||
if pd.io.common.is_requests_pkg_avail(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of doing this, simply import the _REQUESTS_INSTALLED
or as parameters, self signed ssl certs or trusted ssl certs, no auth | ||
or basic auth | ||
""" | ||
def gen_level1_tc(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these can be module level functions rather than nested
return str(df.to_json()) == j | ||
|
||
|
||
@tm.network |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parametrize this for the various read functions
this is not being tested on travis; rather all of the new code is skipped. you need to add requests to the pandas/ci/requirements-* files. But only add to 1 for 3.6 and 1 for 2.7 (of the main jobs). |
@skynss if you want to rebase / update can take a look. |
@jreback m trying to make time but just not getting enough time to the final fit and finish. I had thought i had checked in the correct solution with passing tests and rebased. What timeline is it for next milestone? If someone else wants to pick up you are welcome.. it might be few weeks for me. It would also help if i know what the outstanding things remain. There are lots of comments. |
@skynss 0.21.0 is soon, we just released the rc for it. I'd like this to be merged in master for a while, so not a problem to do for the next milestone. This is a nice PR, but still need to test in the broader community after merge. |
pls rebase / update when you can |
@skynss , we would love to see this capability in pandas. Are you still working on this? |
@rsignell-usgs I have been too occupied with my work.. but will try to see if I can. I am not a routine contributor, so the logistics of submitting is whats timeconsuming, not the actual fix. |
@skynss : Keep the PR as is and rebase onto |
@jreback @gfyoung import pandas.util.testing as tm
from urllib3.exceptions import InsecureRequestWarning
import requests
def test_this():
# succeeds in PY2, fails in PY3
with tm.assert_produces_warning(InsecureRequestWarning):
r = requests.get('https://cnn.com/', verify=False) This is the only issue remaining. |
needs a rebase |
Set skip human verification set and lounch in my com |
closing as stale. would like to have this, if one of the core wanted to pick this up, or if @skynss can rebase. pls ping to reopen. |
@skynss are you still interested in this? If not, do you mind if I try to re-submitted this PR? |
@ocefpaf : It's been inactive for some time now. Go for it! |
thanks @gfyoung |
Use
python-requests
(Python HTTP Requests for Humans) , if installed, to enable additional scenarios as Basic Authentication, permit self-signed SSL certs, and other advanced scenarios.A new optional parameter
url_params
is added intoread_csv
,read_html
,read_excel
andread_json
to allow passing in either a dict containing authentication, etc or even a requests session. It also allows passing in basic auth in url: http://:@:/ and fixes #Issue #16910 Also, better handling of specified:<port>
in url.If
python-requests
library is not installed, existing behavior prior to this check-in should continue as-is, which directly callsurllib\urllib3
git diff upstream/master -u -- "*.py" | flake8 --diff