-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG, ENH: Read Data From Password-Protected URL's and allow self signed SSL certs #16910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 12 commits
8b6e426
145c7f4
9473316
9c7524d
598cf7b
3b454dd
d359b2d
eb03fd3
cbe3f49
437e0a2
7b034b8
0a4607a
209dd58
f520f7b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -40,6 +40,8 @@ Other Enhancements | |
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`) | ||
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`) | ||
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`) | ||
- It is now possible to read data (i.e. CSV, JSON, HTML) from a HTTP/HTTPS URL that requires Basic Authentication (:issue:`16716`) | ||
- It is now possible to skip check of validity of SSL certificate (eg: Testing using self-signed SSL certificates) (:issue:`16716`) | ||
|
||
.. _whatsnew_0210.api_breaking: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's condense into one line:
Note that I also put the issue number at the end of the line. |
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,9 @@ | |
import csv | ||
import codecs | ||
import mmap | ||
import ssl | ||
import base64 | ||
import warnings | ||
from contextlib import contextmanager, closing | ||
|
||
from pandas.compat import StringIO, BytesIO, string_types, text_type | ||
|
@@ -49,7 +52,7 @@ | |
|
||
|
||
if compat.PY3: | ||
from urllib.request import urlopen, pathname2url | ||
from urllib.request import urlopen, pathname2url, Request | ||
_urlopen = urlopen | ||
from urllib.parse import urlparse as parse_url | ||
from urllib.parse import (uses_relative, uses_netloc, uses_params, | ||
|
@@ -58,6 +61,7 @@ | |
from http.client import HTTPException # noqa | ||
else: | ||
from urllib2 import urlopen as _urlopen | ||
from urllib2 import Request | ||
from urllib import urlencode, pathname2url # noqa | ||
from urlparse import urlparse as parse_url | ||
from urlparse import uses_relative, uses_netloc, uses_params, urljoin | ||
|
@@ -177,25 +181,49 @@ def _stringify_path(filepath_or_buffer): | |
|
||
|
||
def get_filepath_or_buffer(filepath_or_buffer, encoding=None, | ||
compression=None): | ||
compression=None, auth=None, | ||
verify_ssl=None): | ||
""" | ||
If the filepath_or_buffer is a url, translate and return the buffer. | ||
Otherwise passthrough. | ||
|
||
Parameters | ||
---------- | ||
filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path), | ||
or buffer | ||
or buffer now supports url with username and password | ||
eg: 'https://<user>:<password>@<host>:<port>/<url-path>' | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
encoding : the encoding to use to decode py3 bytes, default is 'utf-8' | ||
|
||
compression : string, default None | ||
|
||
.. versionadded:: 0.18.1 | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no blank lines in the doc-strings. Indent 4 spaces on the 2nd and other lines |
||
auth : tuple, default None | ||
A tuple of string with (username, password) string for | ||
HTTP(s) basic auth: eg auth= ('roberto', 'panda$4life') | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
verify_ssl : boolean, Default True | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is this default True? shouldn't the onus be on the user to pass this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is an attempt to mirror what |
||
If False, allow self signed and invalid SSL certificates for https | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
<var_name> : <data_type>, <defaults>
<description> There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually the |
||
Returns | ||
------- | ||
a filepath_or_buffer, the encoding, the compression | ||
""" | ||
filepath_or_buffer = _stringify_path(filepath_or_buffer) | ||
|
||
if _is_url(filepath_or_buffer): | ||
req = _urlopen(filepath_or_buffer) | ||
ureq, kwargs = get_urlopen_args(filepath_or_buffer, | ||
auth=auth, | ||
verify_ssl=verify_ssl) | ||
req = _urlopen(ureq, **kwargs) | ||
content_encoding = req.headers.get('Content-Encoding', None) | ||
if content_encoding == 'gzip': | ||
# Override compression based on Content-Encoding header | ||
|
@@ -244,6 +272,106 @@ def file_path_to_url(path): | |
} | ||
|
||
|
||
class InsecureRequestWarning(Warning): | ||
"Warned when making an unverified HTTPS request. Borrowed from requests" | ||
pass | ||
|
||
|
||
def split_auth_from_url(url_with_uname): | ||
""" | ||
If a url contains username and password, it is extracted and returned | ||
along with a url that does not contain it. | ||
|
||
Parameters | ||
---------- | ||
url_with_uname : string | ||
a url that may or may not contain username and password | ||
see section 3.1 RFC 1738 https://www.ietf.org/rfc/rfc1738.txt | ||
//<user>:<password>@<host>:<port>/<url-path> | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
Returns | ||
------- | ||
(username, password), url_no_usrpwd : tuple, string Default ('', '') url | ||
A tuple with (username, pwd) pair and | ||
url without username or password (if it contained it ) | ||
|
||
Raises | ||
------ | ||
ValueError for empty url | ||
""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. show what this Raises There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
However, in this case, it would be preferable to describe the returned object without any naming, since this is a nested
|
||
if not url_with_uname: | ||
msg = "Empty url: {_type}" | ||
raise ValueError(msg.format(_type=type(url_with_uname))) | ||
o = parse_url(url_with_uname) | ||
uname = o.username if o.username else '' | ||
pwd = o.password if o.password else '' | ||
url_no_usrpwd = url_with_uname | ||
if uname or pwd: | ||
usrch = '{}:{}@{}'.format(o.username, o.password, o.hostname) | ||
url_no_usrpwd = url_with_uname.replace(usrch, o.hostname) | ||
return (uname, pwd), url_no_usrpwd | ||
|
||
|
||
def get_urlopen_args(url_with_uname, auth=None, verify_ssl=True): | ||
""" | ||
generate args to pass to urlopen - including basic auth and and support | ||
for disabling verification of SSL certificates ( useful where | ||
self-signed SSL certificates are acceptable security risk -eg: Testing ) | ||
|
||
Parameters | ||
---------- | ||
url_with_uname : string | ||
a url that may or may not contain username and password | ||
see section 3.1 RFC 1738 https://www.ietf.org/rfc/rfc1738.txt | ||
//<user>:<password>@<host>:<port>/<url-path> | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
auth : tuple, default None | ||
A tuple of string with (username, password) string for | ||
HTTP(s) basic auth: eg auth= ('roberto', 'panda$4life') | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
verify_ssl : boolean, Default True | ||
If False, allow self signed and invalid SSL certificates for https | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See my comment here to patch the formatting for your parameters listed above. |
||
Returns | ||
------- | ||
Request, kwargs to pass to urlopen. kwargs may be {} or {'context': obj } | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See my comment here to patch the formatting for your return variable listed above. |
||
|
||
Raises | ||
------ | ||
ValueError if only one of username or password is provided. | ||
""" | ||
uname = pwd = None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you have a username w/o a password (yes?), but of course a password w/o a username should be banned. where do you raise on this? |
||
url_no_usrpwd = url_with_uname | ||
if auth and len(auth) == 2: | ||
uname, pwd = auth | ||
if not uname and not pwd: | ||
(uname, pwd), url_no_usrpwd = split_auth_from_url(url_with_uname) | ||
req = Request(url_no_usrpwd) | ||
if uname and pwd: | ||
upstr = '{}:{}'.format(uname, pwd) | ||
if compat.PY3: | ||
b64str = base64.b64encode(bytes(upstr, 'ascii')).decode('utf-8') | ||
else: | ||
b64str = base64.encodestring(upstr).replace('\n', '') | ||
req.add_header("Authorization", "Basic {}".format(b64str)) | ||
elif uname or pwd: | ||
raise ValueError('Only username or password provided without providing the other') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Change the error message to "username and password must both be provided" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, make sure to write a test that raises an error here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
kwargs = {} | ||
if verify_ssl not in [None, True]: | ||
kwargs['context'] = ssl._create_unverified_context() | ||
msg = 'SSL certificate verification is being disabled for HTTPS. Possible security risk' | ||
warnings.warn(msg, InsecureRequestWarning) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For this, you can use the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
return req, kwargs | ||
|
||
|
||
def _infer_compression(filepath_or_buffer, compression): | ||
""" | ||
Get the compression method for filepath_or_buffer. If compression='infer', | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should condense into one line:
Add a section right under "Other Enhancements" with my shortened description followed by a longer description about what you can do now with authentication. Try to follow what was done with "Improved error handling during item assignment in pd.eval" in terms of seeing what types of code-blocks to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, will be happy to answer any questions you might have regarding this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, you have a merge conflict with this
whatsnew
file.