-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
wrap urlopen with requests #21504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wrap urlopen with requests #21504
Conversation
@contextmanager | ||
def urlopen(*args, **kwargs): | ||
with closing(_urlopen(*args, **kwargs)) as f: | ||
with closing(urlopen2(*args, **kwargs)) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this workaround is used anywhere.
Maybe we can remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not remove this for now. Investigate in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NP. I'll leave it there. Just renamed the urlopen
variants to try to make it a little bit clear of what is going on.
compression = None | ||
content_encoding = None | ||
try: | ||
import requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to check if requests
is installed when using a session
object b/c requests
is needed to create the object, this is slightly simpler than the approach in #17087 and introduces less code to maintain.
compression = 'gzip' | ||
reader = BytesIO(req.read()) | ||
req.close() | ||
reader, compression = _urlopen(filepath_or_buffer, session=session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compression
is not mentioned in the docs and I wonder if session
should.
If so we can use the template/Appender approach, just not sure how to do that in a clean way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compression
should be mentioned in the docs. Would be great to add that.- Given my first point,
session
should be as well.
And yes, template / Appender approach sounds like a good idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need some guidance on how to do this when crossing modules, common.py
in get_filepath_or_buffer
, and parsers.py
for all the parsers defined in _parser_params
.
The current situation is:
options/module-function | get_filepath_or_buffer |
_parser_params |
---|---|---|
filepath_or_buffer |
present | NA |
mode |
present | NA |
encoding |
present | present |
compression |
missing (added in this PR) | present |
session |
missing | present |
It is unclear to me how to fix this. Some questions I have:
- Should the
compression
andencoding
docstring be standardized? - If yes on the question above what is the best strategy for this? Should
encoding
,compression
, andsession
be defined incommon.py
and imported inparsers.py
for appending in_parser_params
? Or the other way around?
My guess is that I should do get the common options from parser.py
and add it to common.py
like:
_common_params = r"""
Parameters
----------
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
session : requests.Session
object with the a requests session configuration for remote file.
(requires the requests library)
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and
`filepath_or_buffer` is path-like, then detect compression from the
following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
decompression). If using 'zip', the ZIP file must contain only one data
file to be read in. Set to None for no decompression.
.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
"""
Then compose the get_filepath_or_buffer
with the missing filepath_or_buffer
and mode
, and later import
_common_params
in parsers.py
to compose _parser_params
. Does that sound OK?
PS: if that is correct I'd prefer if a pandas doc expert did this instead of me, I'm kind of lost in the docstrings
and Appender
s 😄
pandas/io/common.py
Outdated
r = session.get(url) | ||
else: | ||
r = requests.get(url) | ||
r.raise_for_status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this line do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will raise an exception if there is any issue reading the URL (any non-200 OK status).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But don't you need to actually call the function? i.e. r.raise_for_status()
(with parentheses)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I totally forgot that when copying-and-pasting 😄
Codecov Report
@@ Coverage Diff @@
## master #21504 +/- ##
==========================================
- Coverage 92.29% 92.01% -0.29%
==========================================
Files 161 169 +8
Lines 51497 50797 -700
==========================================
- Hits 47530 46740 -790
- Misses 3967 4057 +90
Continue to review full report at Codecov.
|
Tested this on a username/password protected CSV endpoint and it's working! import requests
session = requests.sessions.Session()
session.auth = ('Nope', 'NoWay!')
url = 'https://cgoms.coas.oregonstate.edu/erddap/tabledap/CE01ISSM-BUOY-001-GPS.csv'
import pandas as pd
df = pd.read_csv(url, session=session) |
Hello @ocefpaf! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 25, 2018 at 15:30 Hours UTC |
@@ -304,7 +304,8 @@ def read_excel(io, | |||
**kwds): | |||
|
|||
if not isinstance(io, ExcelFile): | |||
io = ExcelFile(io, engine=engine) | |||
session = kwds.get('session', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just list session as a kwarg in read_excel (and in ExcelFile), then just pass it in
@@ -406,6 +406,7 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, | |||
compression = _infer_compression(path_or_buf, compression) | |||
filepath_or_buffer, _, compression, should_close = get_filepath_or_buffer( | |||
path_or_buf, encoding=encoding, compression=compression, | |||
session=session, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded tag when you add session
@@ -297,6 +297,9 @@ | |||
If a filepath is provided for `filepath_or_buffer`, map the file object | |||
directly onto memory and access the data directly from there. Using this | |||
option can improve performance because there is no longer any I/O overhead. | |||
session : requests.Session | |||
object with the a requests session configuration for remote file. | |||
(requires the requests library) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded tag
@ocefpaf can you address latest comments and merge in changes from master? |
I'm in transit now but I'll have some time next week to check this. However, if someone wants to take it over please do. I do recall asking about a documenting strategy to avoid the repetition and the comments above. It would seems that to address some of the comments I would touch current docs, and I'm not really comfortable with that. I'll re-visit this soon... |
Can you merge master? |
Done! But I'm not sure of I got it right, git rebases are not my forte 😬 BTW I'm happy to iterate on the docs questions in https://github.com/pandas-dev/pandas/pull/21504/files#r196079159 but I'm quite confused where I should start (it is my first time contributing here). |
You are better off doing the following on your branch locally: git fetch upstream
git merge upstream/master That retains the history of comments on GitHub so makes it easier to track. Looks like you have some test failures with the latest push - if you could take a look would be great |
That is what I did :-/
I'll take a look ASAP. |
Hmm OK. Saw the force push here so that's why I thought you had done something else. Shouldn't need the |
@ocefpaf CI is red, can you fix any problem, and push again so there are no tests failures. |
I believe that when I first submitted this PR the |
Not sure what the problem is. I'll close this for now, as we can't merge with the conflicts and the CI in red, but if you find a solution for what you said, feel free to reopen. |
Would you be interested in a Python 3 only version of this PR? That is working and I guess that all the Python 2.7 code will be removed soon anyway. |
@ocefpaf we drop Py2 support for 0.25 so sure I think Python 3 only is fine for that milestone onwards |
git diff upstream/master -u -- "*.py" | flake8 --diff
Alternative to #17087
I'll write the whatsnew entry, tests, and the session option for
html
,json
, andexcel
if this path is OK with the devs.Ping @skynss who is the original author of #17087 and @gfyoung who reviewed it.
Note that the main difference in this approach is that I made the option slightly simpler by allowing only the
session
instead ofhttp_params
.You can see an example of this in action in:
http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/ocefpaf/d6e9ab2c3569ff8fa181fc7885b6524d/raw/5d868fd4cbe2b81f84ccab7760b80251ef6e4651/pandas_test.ipynb