Use python-requests library, if installed, to support broad http(s) scenarios including Basic Authentication #17087


Closed · wants to merge 3 commits into master

Conversation


@skynss skynss commented Jul 26, 2017

Use python-requests (Python HTTP Requests for Humans), if installed, to enable additional scenarios such as Basic Authentication, self-signed SSL certs, and other advanced scenarios.

A new optional parameter url_params is added to read_csv, read_html, read_excel and read_json to allow passing in either a dict containing authentication details, etc., or even a requests session. It also allows passing basic auth in the URL itself (http://user:password@host:port/) and fixes issue #16910. It also improves handling of an explicit :<port> in the URL.

If the python-requests library is not installed, existing behavior prior to this check-in continues as-is, calling urllib/urllib3 directly.

``read_csv`` uses ``python-requests`` (if installed) to support basic auth and more
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the ``python-requests`` library is installed, try to use it first; if not, continue using urllib.
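
For illustration, the proposed usage looks roughly like this (a sketch; the parameter is renamed to http_params later in this PR, and the URL and credentials are placeholders):

import pandas as pd

# opt in to python-requests for this call: basic auth plus a self-signed cert
df = pd.read_csv(
    'https://example.com/protected/data.csv',
    url_params={'auth': ('user', 'pwd'), 'verify': False},
)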
Member

@gfyoung gfyoung Jul 26, 2017


Given our previous discussion, let's not do this (at least for now).

Only use requests in the cases when auth is needed. That will help to preserve backwards compatibility with behavior from urllib2.

Author


I got engaged because I opened an issue for basic auth + self-signed certs.
Tomorrow someone might open an issue for something pandas doesn't support today, e.g. proxies. Yet another issue might be setting custom headers, etc.
Here is a bug in the released version of pandas. It does not involve https or auth of any kind. This is supposed to work:

df = pd.read_csv('http://handsome-equator.000webhostapp.com/no_auth/aaa.csv')  # works fine
df = pd.read_csv('http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv')  # fails with 404 when the port number is added; it shouldn't (it doesn't in a browser, or with requests)

It fails because of urllib behavior. It works with use-requests if requests is installed, since requests is then used even for existing scenarios.

So, what is the direction for http(s) in pandas? Natively use urllib, or move to requests? If the goal is to use requests as much as we can, then what are the test acceptance criteria for using requests whenever it is available? use-requests passes all tests. Besides API design, what else remains?

@gfyoung we also discussed trying to offload most of this to requests if available. If we start detecting the auth scenario, we are then parsing the URL etc., adding code. Please see my comment above for 'future issues' as well as 'long term / new feature requests'. It would help if I could understand the reasoning behind:

let's not do this (at least for now)

Member


The reasoning is that we want our code to be backwards compatible, so we don't want to introduce any inconsistencies by using a new library to handle requests for the time being.

In the future, we may transition to using requests entirely (the interface is definitely better, I won't deny that), but such transitions are gradual so that people have time to adjust the code bases that depend on pandas.

Author


so we don't want to introduce any inconsistencies by using a new library to handle requests for the time being

@gfyoung OK, I totally agree with that specific part of the reasoning, and that is why I had earlier asked about acceptance criteria. So I agree it should not use requests by default, if only because of the unknown/destabilization issues. But we need some way to explicitly allow use of requests in many cases and not just limit it to the basic-auth scenario. So instead of detecting only the auth scenario, can we change it to:

if url_params is None:
    # only use existing code-base
    # never attempt to use requests even if installed... 
    # existing codebase should continue to use urllib
else: 
    # try to use requests if installed, even if it is empty. 
    # provide a way for early adopters to try out requests library even for existing scenarios

In this scenario, the old codebase is not impacted. Only brave souls who specify the new url_params (and have requests installed) will use requests.
Perhaps the acceptance criterion for using requests by default (if installed) could be 3-6 months of satisfactory user feedback on url_params/requests? Thoughts?

df = pd.read_csv('http://aaa.com/joker.csv')  # never uses requests; always uses `urllib`
df = pd.read_csv('http://aaa.com/joker.csv', url_params={})  # only uses requests if installed
df = pd.read_csv('http://aaa.com/joker.csv', url_params={'auth': ('u', 'p')})  # requires requests; fails otherwise
df = pd.read_csv('http://user:[email protected]/joker.csv', url_params={})  # only uses requests if installed
df = pd.read_csv('http://user:[email protected]/joker.csv')  # Should we parse the URL and look for it? The code is already written, but one can argue we should not check in code that parses the URL to detect the auth scenario. Thoughts?

Member


This is not a startup, so we don't have "early adopters" per se 😄 . The API should be that we use requests only in cases where we cannot use existing read_csv currently. And no, I would not use requests if url_params is empty. The condition to use requests should be if url_params is a non-empty dictionary.

Author


Let's say I am a heavy-duty pandas user of http (I don't use basic auth, just plain old http). I get that requests might possibly become the default in the future (if installed). I want to be prepared and find issues early. How should I force pandas to use requests, even if I am not using http auth? How else do you expect to gain confidence in requests and detect inconsistencies?

The condition to use requests should be if url_params is a non-empty dictionary

How about: url_params = {'use_requests': True} forces use of requests even if it is not an auth or verify scenario?

Member

@gfyoung gfyoung Jul 29, 2017


I want to be prepared and find issues early. How should I force pandas to use requests, even if I am not using http auth? How else do you expect to gain confidence in requests and detect inconsistencies?

That's fair, but you as the user should have to go out of your way to use it until we start enabling it as the default. Notice that at this point I didn't prevent you from passing in a non-empty dictionary, even if it contains parameters that are ultimately requests defaults, for example. That is one way you can "force" us to use requests.

Now whether that's good design is another story 😄 , but I do understand where you're coming from as a user.

Author

skynss commented Jul 26, 2017

The whatsnew is outdated and wrong; I will push an update later today.

@gfyoung gfyoung added the Enhancement and IO CSV (read_csv, to_csv) labels Jul 26, 2017
pytest.skip('python-requests not installed, skipping')


uname = 'pandasusr'
Member


Create fixtures out of these e.g.:

@pytest.fixture
def username():
    return "pandasusr"

This is more idiomatic with pytest. You can read more here:

https://docs.pytest.org/en/latest/fixture.html
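
A test then requests the fixture by name and pytest injects the return value (a minimal sketch; the password fixture and its value are hypothetical):

import pytest


@pytest.fixture
def username():
    return "pandasusr"


@pytest.fixture
def password():
    # hypothetical companion fixture for the auth tests
    return "pandaspwd"


def test_basic_auth_credentials(username, password):
    # pytest passes the fixture return values in as arguments
    assert (username, password) == ("pandasusr", "pandaspwd")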


codecov bot commented Jul 29, 2017

Codecov Report

Merging #17087 into master will decrease coverage by 0.03%.
The diff coverage is 70.58%.


@@            Coverage Diff             @@
##           master   #17087      +/-   ##
==========================================
- Coverage   91.02%   90.99%   -0.04%     
==========================================
  Files         161      161              
  Lines       49396    49434      +38     
==========================================
+ Hits        44965    44984      +19     
- Misses       4431     4450      +19
Flag Coverage Δ
#multiple 88.77% <70.58%> (-0.02%) ⬇️
#single 40.27% <27.45%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/json.py 100% <ø> (ø) ⬆️
pandas/io/parsers.py 95.44% <100%> (ø) ⬆️
pandas/io/excel.py 80.34% <57.14%> (-0.21%) ⬇️
pandas/io/common.py 68.51% <69.44%> (+0.72%) ⬆️
pandas/io/html.py 85.17% <85.71%> (+0.31%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.66% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


codecov bot commented Jul 29, 2017

Codecov Report

Merging #17087 into master will decrease coverage by 0.03%.
The diff coverage is 65.3%.


@@            Coverage Diff             @@
##           master   #17087      +/-   ##
==========================================
- Coverage   91.58%   91.54%   -0.04%     
==========================================
  Files         153      153              
  Lines       51250    51288      +38     
==========================================
+ Hits        46935    46952      +17     
- Misses       4315     4336      +21
Flag Coverage Δ
#multiple 89.41% <65.3%> (-0.02%) ⬇️
#single 40.66% <24.48%> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/json.py 91.75% <ø> (ø) ⬆️
pandas/io/parsers.py 95.55% <100%> (ø) ⬆️
pandas/io/excel.py 89.85% <40%> (-0.28%) ⬇️
pandas/io/common.py 69.25% <63.88%> (-0.24%) ⬇️
pandas/io/html.py 86.29% <85.71%> (+0.31%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

jreback commented Jul 29, 2017

So let's make this really simple: use auth= instead of url_params (I don't find url_params intuitive). Use requests if auth is passed and not None; raise if requests is not installed. Otherwise use the existing code path. Let's do this first, then you can add fancier things.
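
A minimal sketch of that dispatch (the function name and file-like return type are illustrative, not the actual pandas internals):

import io
from urllib.request import urlopen


def _maybe_open_url(url, auth=None):
    """Open url, using requests only when auth is supplied (sketch)."""
    if auth is None:
        # existing behaviour: plain urllib, no new dependency involved
        return urlopen(url)
    try:
        import requests
    except ImportError:
        raise ImportError("the 'auth' keyword requires the requests package")
    resp = requests.get(url, auth=auth)
    resp.raise_for_status()
    # hand back a file-like object, as the parsers expect
    return io.BytesIO(resp.content)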

Author

skynss commented Jul 31, 2017

@jreback

  • please see my earlier comment on why I proposed 'url_params' instead of 'auth'
  • please see the comment where @gfyoung agreed with using 'url_params' instead of 'auth'

Only after I got that feedback from @gfyoung did I implement url_params. Now that we are going back to suggesting auth only, I unfortunately fail to understand why that is better; I have the exact same questions I raised previously.

While I would like to help the project, I lack the bandwidth that this async conversation requires, and perhaps it is best that I just hand over what is here for you fine folks to take forward. I have provided multiple working solutions which you can pick from and modify (or discard):

  1. auth and verify_ssl using only urllib/urllib3 + passing live test cases (previous fork)
  2. auth and verify_ssl using requests when installed + passing live test cases (previous fork)
  3. url_params using requests, if installed, while continuing to use urllib for existing scenarios when url_params is not provided + passing live test cases (this fork)
    In my sincere view, the current check-in in this pull request is a good, working, passing solution that I personally do not see any issues with; as a consumer, it would work for me.

If, after reviewing all comments in the previous pull request and this one, there are any minor modifications I can make that seem reasonable, I would be happy to do them. Please let me know; otherwise, my apologies that I am unable to contribute at the level of depth you folks need right now, and I ask that you do what is best.

Contributor

jreback commented Jul 31, 2017

@skynss

So let's make this really simple: use auth= instead of url_params (I don't find url_params intuitive). Use requests if auth is passed and not None; raise if requests is not installed. Otherwise use the existing code path. Let's do this first, then you can add fancier things.

Here is my previous comment. I am not sure why you are going back to the previous conversation. I think my comment is quite clear.

I suggested you do a more limited PR that only implements using requests if auth (instead of the name url_params) is passed. That's it. That gets us a lot of the way without any complication about falling back or anything else.

Author

skynss commented Jul 31, 2017

@jreback
Can you clarify what the API call would look like in your implementation where I need to perform basic auth and also bypass SSL verification?

url = 'https://aaa.com/t.csv'
df = pd.read_csv(url, url_params={'auth': ('uname', 'pwd'), 'verify': False})  # Option 1: current check-in
df = pd.read_csv(url, auth=('uname', 'pwd'), verify_ssl=False)  # Option 2: previous fork/pull request
df = pd.read_csv(url, auth=('uname', 'pwd'))  # Option 3: new suggestion, which does not handle the self-signed cert scenario

Contributor

jreback commented Jul 31, 2017

option 3

df = pd.read_csv(url, auth=('uname', 'pwd'))  # Option 3: new suggestion, which does not handle the self-signed cert scenario

I just don't want more keywords in all of the read_* calls (and I don't particularly like url_params as a kw). auth is fine; verify_ssl if you really, really need it.

Author

skynss commented Aug 1, 2017

I just don't want more keywords in all of the read_* calls (and I don't particularly like url_params as a kw). auth is fine; verify_ssl if you really, really need it.

Almost no one does basic auth over http (without SSL). Getting officially verified SSL certs is difficult, especially in private environments and testing scenarios. So the compromise there is to use self-signed certs, expired certs, or certs whose domains don't match, and it becomes essential to bypass SSL cert verification to be able to test in those scenarios. That is why the requests library has a verify parameter (with a warning logged every time such a request is made).

None of us wants to add additional params to read_csv, hence the suggestion to use url_params for anything URL related, be it auth or SSL verification (today) or potential future user asks such as changing headers, timeouts, proxies, or any other http(s)-related setting. That way read_csv's parameters never expand beyond url_params for http(s) concerns. If a new issue is filed in pandas to allow overriding the http timeout, or to allow a proxy server, would we add new parameters like read_csv(timeout=5000, http_proxy=['my.proxy.com'])? Or does it make more sense to just add them to 'url_params' along with auth, ssl, etc. (or pass in a requests.Session())?

I often read data that requires a non-default http timeout. I would also like to use the callback functionality that the requests library provides, to see at times "what the heck is going on in the http layer" and capture it. If I can't set this within pandas, then for me it's just easier to leave anything http-related (auth/ssl/timeout/callback/proxy, etc.) out of pandas as standard practice: first use requests separately and then use StringIO with read_csv, as sketched below. The currently checked-in code with url_params would allow me to "opt in" to every scenario mentioned above that requests supports, while leaving the current http flow unaffected.
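
That workaround looks roughly like this today (a sketch; the URL and credentials are placeholders):

import io

import pandas as pd
import requests

# fetch with whatever http behaviour is needed, outside of pandas
resp = requests.get(
    'https://example.com/data.csv',
    auth=('user', 'pwd'),
    verify=False,        # e.g. a self-signed certificate
    timeout=(1.0, 3.5),  # (connect, read) timeouts
)
resp.raise_for_status()

# then hand the text to pandas as an in-memory buffer
df = pd.read_csv(io.StringIO(resp.text))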

If there is no plan to support the above-mentioned http scenarios, why even take a soft dependency on requests? There is already a checked-in solution for basic auth, already rejected (in the previous fork), that worked with urllib/urllib3.

Contributor

jreback commented Aug 1, 2017

@skynss I DO like the idea of only adding 1 new parameter (to basically all read_* routines). I object to it being called url_params; maybe auth_params is better, or something similar. As I said, a PR that:

  • adds 1 new parameter (with a dict of a well-defined structure of keys)
  • adds requests to handle this parameter (and raises if it is not None and requests is not installed)
  • leaves all existing functionality alone

would be fine. Let's go step by step; simpler is better.

Author

skynss commented Aug 1, 2017

I DO like the idea of only adding 1 new parameter (to basically all read_* routines). I object to it being called url_params; maybe auth_params is better, or something similar.

@jreback This PR already meets all your new requirements (except for renaming url_params; perhaps it could be renamed to http_params, because only http(s) is handled, or req_params, because we pass it through to requests). We shouldn't call it auth_params, because it does and will contain non-auth parameters such as verify; SSL certificate verification is unrelated to authentication.

The dict is already well defined (and very short): it can contain auth or verify, and these are passed through to the similarly named parameters in requests.

There is one exception in this PR that doesn't meet your requirement: url_params can be one of three things:

  • None -- per your requirement
  • a dict containing auth or verify -- per your requirement. If anything else is passed, it is just ignored but used as a signal to force use of the requests library, which is actually desirable as it enables use of requests in existing http scenarios.
  • a requests.Session() object. If it is passed, we basically use that session instead of creating our own new requests.Session(). I implemented it this way because it does not interfere with the other requirements, but it allows a knowledgeable user to fully configure the requests session without new pandas feature requests: custom auth, proxies, timeouts, callback URLs, etc.

So, after this PR merge, if I want to implement timeout in my read_csv, even though it is not a feature implemented by pandas, I can still do the following:

from requests import Session
s = Session()
s.timeout = (1.0, 3.5)
df = pd.read_csv( url, url_params=s) # great... now my read_csv will timeout as specified.

So far, I cannot see any disadvantage to allowing a requests.Session() to be passed. I see the advantage of no longer having to play catch-up with changes in the requests library, and of moving the focus of the pandas team from IO-related issues to core pandas issues, which is what everyone really values pandas for.
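
A sketch of how a helper that interprets http_params might look (the function name and defaults are illustrative, not the PR's actual implementation):

import requests


def _session_from_http_params(http_params):
    """Build (or reuse) a requests.Session from http_params (sketch)."""
    if isinstance(http_params, requests.Session):
        # a pre-configured session is used as-is
        return http_params
    session = requests.Session()
    if isinstance(http_params, dict):
        # pass-through of the two documented keys
        session.auth = http_params.get('auth')
        session.verify = http_params.get('verify', True)
    return session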

Contributor

jreback commented Aug 1, 2017

great if u can change to http_params i'll take a look tom

Author

skynss commented Aug 2, 2017

great if u can change to http_params i'll take a look tom

@jreback checked in.

Contributor

@jreback jreback left a comment


A number of comments about docs and testing. After these are updated I will have a deeper look at the implementation.

up = {'auth': ('john', 'pwd'), 'verify': False}
df = pd.read_csv('https://aa.com/bb.csv', http_params=up)  # now url can contain username and pwd

# Optionally, a requests.Session() can also be passed into http_params
Contributor


This should be much simpler in the whatsnew. Rather, make a new section on using authorization in io.rst, since this is not restricted to just read_csv, and link to it from the whatsnew.

This should be a code-block as well. Please make it simple, readable code.

@@ -93,6 +99,11 @@ def __next__(self):
BaseIterator.next = lambda self: self.__next__()


def is_requests_pkg_avail():
Contributor


just use the variable directly

@@ -47,6 +47,12 @@
except:
    _PY_PATH_INSTALLED = False

try:
    import requests
    _PY_REQUESTS_INSTALLED = True
Contributor


_REQUESTS_INSTALLED, no _PY
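
i.e., something along these lines (a sketch of the suggested guard):

try:
    import requests  # noqa: F401
    _REQUESTS_INSTALLED = True
except ImportError:
    _REQUESTS_INSTALLED = False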

"""
Generate python-requests session from http_params dict
"""
s = None
Contributor


You need to assert _REQUESTS_INSTALLED here, since this is a private function and the reader doesn't know that precondition.
Please expand the doc-string a bit.


def fetch_url(url, http_params=None, skip_requests=False):
"""
If url is url, first try python-requests else try urllib.
Contributor


just call this requests

"""Try to read from a url, file or string.

Parameters
----------
obj : str, unicode, or file-like

http_params : dict or requests.Session(), default None
Contributor


So again, put this template in pandas.io.common; then you can import it and append it, rather than copy-pasting it all over the place.
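
In other words, something like the following (a rough sketch of the idea; pandas' own Appender/Substitution docstring decorators could be used instead of manual formatting):

# pandas/io/common.py (sketch): write the shared parameter description once
_HTTP_PARAMS_DOC = """http_params : dict or requests.Session, default None
    Parameters such as ``auth`` or ``verify`` passed through to
    python-requests when fetching a URL, or a pre-configured
    ``requests.Session`` used directly."""


def read_csv_stub(filepath_or_buffer, http_params=None):
    # placeholder body, only here to show the docstring wiring
    pass


# interpolate the shared text instead of copy-pasting it into every reader
read_csv_stub.__doc__ = """Read a CSV file into a DataFrame.

Parameters
----------
filepath_or_buffer : str or file-like
%(http_params)s
""" % {'http_params': _HTTP_PARAMS_DOC}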

@@ -856,6 +869,12 @@ def read_html(io, match='.+', flavor=None, header=None, index_col=None,

.. versionadded:: 0.19.0

http_params : requests.Session(), default None
Contributor


same as above

import pandas as pd
import pandas.util.testing as tm

if pd.io.common.is_requests_pkg_avail():
Contributor


instead of doing this, simply import the _REQUESTS_INSTALLED

or as parameters, self signed ssl certs or trusted ssl certs, no auth
or basic auth
"""
def gen_level1_tc():
Contributor


these can be module level functions rather than nested

return str(df.to_json()) == j


@tm.network
Contributor


parametrize this for the various read functions
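
For example (a sketch, assuming the http_params keyword from this PR; the URLs and credentials are placeholders):

import pandas as pd
import pandas.util.testing as tm
import pytest


@pytest.mark.parametrize(
    'reader, url',
    [
        (pd.read_csv, 'https://example.com/auth/data.csv'),
        (pd.read_json, 'https://example.com/auth/data.json'),
        (pd.read_html, 'https://example.com/auth/data.html'),
    ],
)
@tm.network
def test_reader_with_basic_auth(reader, url):
    result = reader(url, http_params={'auth': ('user', 'pwd')})
    # read_html returns a list of DataFrames, the others a DataFrame
    assert len(result)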

Contributor

jreback commented Aug 3, 2017

This is not being tested on Travis; rather, all of the new code is skipped. You need to add requests to the pandas/ci/requirements-* files, but only add it to one of the main 3.6 jobs and one of the main 2.7 jobs.

Contributor

jreback commented Sep 23, 2017

@skynss if you want to rebase / update, I can take a look.

Author

skynss commented Oct 8, 2017

@jreback I'm trying to make time but just not getting enough for the final fit and finish. I had thought I had checked in the correct solution with passing tests and rebased. What is the timeline for the next milestone? If someone else wants to pick this up, you are welcome; it might be a few weeks for me. It would also help to know what outstanding items remain. There are lots of comments.

Contributor

jreback commented Oct 14, 2017

@skynss 0.21.0 is soon; we just released the rc for it. I'd like this to sit merged in master for a while, so it's not a problem to do this for the next milestone. This is a nice PR, but it still needs testing in the broader community after merge.

Contributor

jreback commented Nov 12, 2017

Please rebase / update when you can.

@rsignell-usgs

@skynss, we would love to see this capability in pandas. Are you still working on this?

Author

skynss commented Dec 5, 2017

@rsignell-usgs I have been too occupied with my work, but will try to see if I can. I am not a routine contributor, so the logistics of submitting are what's time-consuming, not the actual fix.
@jreback @gfyoung which branch should I rebase onto, master or 0.21.x? Would it be better to just close this pull request and create a new one based on a new branch, so we don't have to filter through all the prior comments?

Member

gfyoung commented Dec 5, 2017

@skynss: Keep the PR as-is and rebase onto master. We definitely want to merge this!

Author

skynss commented Dec 6, 2017

@jreback @gfyoung
Can you take a look at the following simple test case? It works in Python 2 but fails in Python 3.

import pandas.util.testing as tm

from urllib3.exceptions import InsecureRequestWarning
import requests

def test_this():
    # succeeds in PY2, fails in PY3
    with tm.assert_produces_warning(InsecureRequestWarning):
        r = requests.get('https://cnn.com/', verify=False)

This is the only issue remaining.
Also please review the code. Next steps?

Contributor

jreback commented Jan 21, 2018

needs a rebase


Contributor

jreback commented Apr 11, 2018

Closing as stale. We would like to have this; if one of the core devs wants to pick it up, or if @skynss can rebase, please ping to reopen.

@jreback jreback closed this Apr 11, 2018

ocefpaf commented Jun 11, 2018

@skynss are you still interested in this? If not, do you mind if I try to re-submit this PR?

Member

gfyoung commented Jun 11, 2018

@ocefpaf: It's been inactive for some time now. Go for it!

Author

skynss commented Jun 12, 2018

thanks @gfyoung
@ocefpaf please go ahead; I am just not getting any time alongside my work. Thanks.
My last commit worked on Python 2, but Python 3 failed for no apparent reason; see my comment above from December 2017. I think the Python 3 failure had nothing to do with this pull request, but with some other pandas issue at the time.

@ocefpaf ocefpaf mentioned this pull request Jun 15, 2018
@ocefpaf ocefpaf mentioned this pull request Oct 10, 2019
Labels: Enhancement, IO CSV (read_csv, to_csv)
6 participants