-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG, ENH: Read Data From Password-Protected URL's and allow self signed SSL certs #16910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 8 commits
8b6e426
145c7f4
9473316
9c7524d
598cf7b
3b454dd
d359b2d
eb03fd3
cbe3f49
437e0a2
7b034b8
0a4607a
209dd58
f520f7b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -40,7 +40,8 @@ Other Enhancements | |
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`) | ||
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`) | ||
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`) | ||
|
||
- :func:`read_csv`, :func:`read_html`, :func:`read_json`, :func:`read_html` now accept auth in url //<user>:<password>@<host>:<port>/<url-path>, or ``auth`` tuple (username, password) parameter | ||
- :func:`read_csv`, :func:`read_html`, :func:`read_json`, :func:`read_html` now accept ``verify_ssl`` False to disable https/ssl certificate verification (eg: self signed ssl certs in testing) (:issue:`16716`) | ||
.. _whatsnew_0210.api_breaking: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's condense into one line:
Note that I also put the issue number at the end of the line. |
||
|
||
Backwards incompatible API changes | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,8 @@ | |
import csv | ||
import codecs | ||
import mmap | ||
import ssl | ||
import base64 | ||
from contextlib import contextmanager, closing | ||
|
||
from pandas.compat import StringIO, BytesIO, string_types, text_type | ||
|
@@ -49,7 +51,7 @@ | |
|
||
|
||
if compat.PY3: | ||
from urllib.request import urlopen, pathname2url | ||
from urllib.request import urlopen, pathname2url, Request | ||
_urlopen = urlopen | ||
from urllib.parse import urlparse as parse_url | ||
from urllib.parse import (uses_relative, uses_netloc, uses_params, | ||
|
@@ -58,6 +60,7 @@ | |
from http.client import HTTPException # noqa | ||
else: | ||
from urllib2 import urlopen as _urlopen | ||
from urllib2 import Request | ||
from urllib import urlencode, pathname2url # noqa | ||
from urlparse import urlparse as parse_url | ||
from urlparse import uses_relative, uses_netloc, uses_params, urljoin | ||
|
@@ -177,7 +180,8 @@ def _stringify_path(filepath_or_buffer): | |
|
||
|
||
def get_filepath_or_buffer(filepath_or_buffer, encoding=None, | ||
compression=None): | ||
compression=None, auth=None, | ||
verify_ssl=None): | ||
""" | ||
If the filepath_or_buffer is a url, translate and return the buffer. | ||
Otherwise passthrough. | ||
|
@@ -186,16 +190,39 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None, | |
---------- | ||
filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path), | ||
or buffer | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is misaligned. needs to be part of the sentence above. |
||
now supports 'https://<user>:<password>@<host>:<port>/<url-path>' | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
encoding : the encoding to use to decode py3 bytes, default is 'utf-8' | ||
|
||
compression : string, default None | ||
|
||
.. versionadded:: 0.18.1 | ||
|
||
auth : tuple, default None | ||
A tuple of string with (username, password) string for | ||
HTTP(s) basic auth: eg auth= ('roberto', 'panda$4life') | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
verify_ssl : boolean, Default True | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is this default True? shouldn't the onus be on the user to pass this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is an attempt to mirror what |
||
If False, allow self signed and invalid SSL certificates for https | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
<var_name> : <data_type>, <defaults>
<description> There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually the |
||
Returns | ||
------- | ||
a filepath_or_buffer, the encoding, the compression | ||
""" | ||
filepath_or_buffer = _stringify_path(filepath_or_buffer) | ||
|
||
if _is_url(filepath_or_buffer): | ||
req = _urlopen(filepath_or_buffer) | ||
ureq, kwargs = get_urlopen_args(filepath_or_buffer, | ||
auth=auth, | ||
verify_ssl=verify_ssl) | ||
req = _urlopen(ureq, **kwargs) | ||
content_encoding = req.headers.get('Content-Encoding', None) | ||
if content_encoding == 'gzip': | ||
# Override compression based on Content-Encoding header | ||
|
@@ -244,6 +271,93 @@ def file_path_to_url(path): | |
} | ||
|
||
|
||
def split_auth_from_url(url_with_uname): | ||
""" | ||
If a url contains username and password, it is extracted and returned | ||
along with a url that does not contain it. | ||
|
||
Parameters | ||
---------- | ||
url_with_uname : string | ||
a url that may or may not contain username and password | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. too much indentation |
||
see section 3.1 RFC 1738 https://www.ietf.org/rfc/rfc1738.txt | ||
//<user>:<password>@<host>:<port>/<url-path> | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
Returns | ||
------- | ||
(username, password), url_no_usrpwd : tuple, string Default ('', '') url | ||
A tuple with (username, pwd) pair and | ||
url without username or password (if it contained it ) | ||
|
||
Raises | ||
------ | ||
ValueError for empty url | ||
""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. show what this Raises There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
However, in this case, it would be preferable to describe the returned object without any naming, since this is a nested
|
||
if not url_with_uname: | ||
msg = "Empty url: {_type}" | ||
raise ValueError(msg.format(_type=type(url_with_uname))) | ||
o = parse_url(url_with_uname) | ||
uname = o.username if o.username else '' | ||
pwd = o.password if o.password else '' | ||
url_no_usrpwd = url_with_uname | ||
if uname or pwd: | ||
usrch = '{}:{}@{}'.format(o.username, o.password, o.hostname) | ||
url_no_usrpwd = url_with_uname.replace(usrch, o.hostname) | ||
return (uname, pwd), url_no_usrpwd | ||
|
||
|
||
def get_urlopen_args(url_with_uname, auth=None, verify_ssl=True): | ||
""" | ||
generate args to pass to urlopen - including basic auth and and support | ||
for disabling verification of SSL certificates ( useful where | ||
self-signed SSL certificates are acceptable security risk -eg: Testing ) | ||
|
||
Parameters | ||
---------- | ||
url_with_uname : string | ||
a url that may or may not contain username and password | ||
see section 3.1 RFC 1738 https://www.ietf.org/rfc/rfc1738.txt | ||
//<user>:<password>@<host>:<port>/<url-path> | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
auth : tuple, default None | ||
A tuple of string with (username, password) string for | ||
HTTP(s) basic auth: eg auth= ('roberto', 'panda$4life') | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
verify_ssl : boolean, Default True | ||
If False, allow self signed and invalid SSL certificates for https | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so there are 3 possibilites? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There really should be two here. I agree that @skynss : Could you just only check for |
||
|
||
.. versionadded:: 0.21.0 | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See my comment here to patch the formatting for your parameters listed above. |
||
Returns | ||
------- | ||
Request, kwargs to pass to urlopen. kwargs may be {} or {'context': obj } | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See my comment here to patch the formatting for your return variable listed above. |
||
""" | ||
uname = pwd = None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you have a username w/o a password (yes?), but of course a password w/o a username should be banned. where do you raise on this? |
||
url_no_usrpwd = url_with_uname | ||
if auth and len(auth) == 2: | ||
uname, pwd = auth | ||
if not uname and not pwd: | ||
(uname, pwd), url_no_usrpwd = split_auth_from_url(url_with_uname) | ||
req = Request(url_no_usrpwd) | ||
if uname or pwd: | ||
upstr = '{}:{}'.format(uname, pwd) | ||
if compat.PY3: | ||
b64str = base64.b64encode(bytes(upstr, 'ascii')).decode('utf-8') | ||
else: | ||
b64str = base64.encodestring(upstr).replace('\n', '') | ||
req.add_header("Authorization", "Basic {}".format(b64str)) | ||
kwargs = {} | ||
if verify_ssl not in [None, True]: | ||
kwargs['context'] = ssl._create_unverified_context() | ||
return req, kwargs | ||
|
||
|
||
def _infer_compression(filepath_or_buffer, compression): | ||
""" | ||
Get the compression method for filepath_or_buffer. If compression='infer', | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm personally still in favor of providing something more general like what I had suggested before and have a section explaining what you can do now.
@jreback : Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that we should condense into one line:
In addition, we should add a section about what you can do now with password-authenticated URL's.