
wrap urlopen with requests #21504


Closed · wants to merge 6 commits

Conversation

@ocefpaf commented Jun 15, 2018

Alternative to #17087

I'll write the whatsnew entry, tests, and the session option for html, json, and excel if this path is OK with the devs.

Ping @skynss, who is the original author of #17087, and @gfyoung, who reviewed it.

Note that the main difference in this approach is that I made the option slightly simpler by allowing only the session instead of http_params.

You can see an example of this in action in:

http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/ocefpaf/d6e9ab2c3569ff8fa181fc7885b6524d/raw/5d868fd4cbe2b81f84ccab7760b80251ef6e4651/pandas_test.ipynb
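For readers who don't want to open the notebook, a minimal sketch of what the proposed session= option enables (the read_csv call below is the API proposed in this PR, not yet in pandas, and the URL is hypothetical): any requests.Session configuration, such as auth, custom headers, or retry behavior, is applied when pandas fetches the remote file.

import pandas as pd
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
session.headers.update({'User-Agent': 'pandas-reader'})  # custom header
session.mount('https://', HTTPAdapter(max_retries=3))    # retry flaky hosts

# proposed API: pandas uses the session to fetch the remote file
df = pd.read_csv('https://example.com/data.csv', session=session)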

 @contextmanager
 def urlopen(*args, **kwargs):
-    with closing(_urlopen(*args, **kwargs)) as f:
+    with closing(urlopen2(*args, **kwargs)) as f:
Author:

I'm not sure this workaround is used anywhere.
Maybe we can remove this?

Member:

I would not remove this for now. Investigate in another PR.

Author:

NP. I'll leave it there. Just renamed the urlopen variants to try to make it a little clearer what is going on.
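For context, a sketch of why the wrapper exists (based on the snippet above, not the exact PR code): on py2 the object returned by urlopen is not a context manager, so contextlib.closing() supplies the __enter__/__exit__ that the with statement needs.

from contextlib import closing, contextmanager
from urllib.request import urlopen as _urlopen  # py2 imported this from urllib2

@contextmanager
def urlopen(*args, **kwargs):
    # closing() guarantees f.close() runs even on py2, where the response
    # object lacks __enter__/__exit__
    with closing(_urlopen(*args, **kwargs)) as f:
        yield f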

compression = None
content_encoding = None
try:
    import requests
Author (@ocefpaf, Jun 15, 2018):

There is no need to check whether requests is installed when using a session object, because requests is needed to create that object in the first place. This is slightly simpler than the approach in #17087 and introduces less code to maintain.
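A minimal sketch of the logic these snippets add up to (names are illustrative, not the exact PR code): a user-supplied session is used when given, requests is tried otherwise, and plain urllib is the fallback when requests is not installed.

from urllib.request import urlopen

def _fetch(url, session=None):
    if session is not None:
        # a Session instance can only exist if requests is installed,
        # so no availability check is needed on this path
        return session.get(url)
    try:
        import requests  # optional dependency, imported lazily
    except ImportError:
        return urlopen(url)  # stdlib fallback
    return requests.get(url)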

-    compression = 'gzip'
-    reader = BytesIO(req.read())
-    req.close()
+    reader, compression = _urlopen(filepath_or_buffer, session=session)
Author:

compression is not mentioned in the docs, and I wonder if session should be.
If so, we can use the template/Appender approach; I'm just not sure how to do that in a clean way.

Member:

  1. compression should be mentioned in the docs. Would be great to add that.
  2. Given my first point, session should be as well.

And yes, the template/Appender approach sounds like a good idea!
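For reference, a minimal sketch of the Appender pattern being suggested (pandas.util._decorators.Appender appends a shared block of text to a function's __doc__; the parameter text below is illustrative):

from pandas.util._decorators import Appender

_session_param = """\
session : requests.Session
    A configured requests session for remote files (requires the
    requests library).
"""

@Appender(_session_param)
def get_filepath_or_buffer(filepath_or_buffer, session=None):
    """Return a file-like object from a path, URL, or buffer.

    Parameters
    ----------
    """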

Author (@ocefpaf, Jun 18, 2018):

I need some guidance on how to do this across modules: common.py for get_filepath_or_buffer, and parsers.py for all the parsers defined in _parser_params.

The current situation is:

option              get_filepath_or_buffer      _parser_params
filepath_or_buffer  present                     NA
mode                present                     NA
encoding            present                     present
compression         missing (added in this PR)  present
session             missing                     present

It is unclear to me how to fix this. Some questions I have:

  • Should the compression and encoding docstrings be standardized?
  • If yes, what is the best strategy for this? Should encoding, compression, and session be defined in common.py and imported in parsers.py for appending to _parser_params? Or the other way around?

My guess is that I should get the common options from parsers.py and add them to common.py, like:

_common_params = r"""
Parameters
----------
encoding : str, default None
    Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
    standard encodings
    <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
session : requests.Session
    A requests session object with the configuration for remote files
    (requires the requests library).
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
    For on-the-fly decompression of on-disk data. If 'infer' and
    `filepath_or_buffer` is path-like, then detect compression from the
    following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
    decompression). If using 'zip', the ZIP file must contain only one data
    file to be read in. Set to None for no decompression.

    .. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
"""

Then compose the get_filepath_or_buffer docstring by adding the missing filepath_or_buffer and mode entries, and later import _common_params in parsers.py to compose _parser_params. Does that sound OK?

PS: if that is correct, I'd prefer that a pandas docs expert did this instead of me; I'm kind of lost in the docstrings and Appenders 😄
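A hypothetical sketch of the composition being asked about (the module layout is an assumption; the %-substitution is the pattern parsers.py already uses for its docstring templates):

# pandas/io/common.py (assumed home of the shared text)
_common_params = """\
encoding : str, default None
    Encoding to use for UTF when reading/writing (ex. 'utf-8').
session : requests.Session
    A configured requests session for remote files (requires the
    requests library).
"""

# pandas/io/parsers.py (assumed usage)
_parser_params = """\
filepath_or_buffer : str, path object, or file-like object
    Any valid string path, URL, or object with a read() method.
%(common)s""" % {'common': _common_params}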

    r = session.get(url)
else:
    r = requests.get(url)
r.raise_for_status
Member:

What does this line do?

Author:

It will raise an exception if there is any issue reading the URL (any 4xx/5xx error status).

Member:

But don't you need to actually call the function? i.e. r.raise_for_status() (with parentheses)

Author:

Yep. I totally forgot that when copying and pasting 😄
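For the record, the corrected form (the URL is hypothetical): a bare r.raise_for_status is just a reference to the bound method and does nothing, while the call raises requests.exceptions.HTTPError for 4xx/5xx responses.

import requests

url = 'https://example.com/data.csv'  # hypothetical URL
r = requests.get(url)
r.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx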

@gfyoung added the IO CSV (read_csv, to_csv) and Enhancement labels on Jun 16, 2018
@codecov bot commented Jun 18, 2018

Codecov Report

Merging #21504 into master will decrease coverage by 0.28%.
The diff coverage is 26.47%.


@@            Coverage Diff             @@
##           master   #21504      +/-   ##
==========================================
- Coverage   92.29%   92.01%   -0.29%     
==========================================
  Files         161      169       +8     
  Lines       51497    50797     -700     
==========================================
- Hits        47530    46740     -790     
- Misses       3967     4057      +90
Flag       Coverage Δ
#multiple  90.42% <26.47%> (-0.27%) ⬇️
#single    42.21% <14.7%> (-0.22%) ⬇️

Impacted Files                Coverage Δ
pandas/io/json/json.py        92.47% <ø> (-0.63%) ⬇️
pandas/io/parsers.py          95.48% <100%> (+0.18%) ⬆️
pandas/io/common.py           66.78% <4.16%> (-6.08%) ⬇️
pandas/io/html.py             89.48% <77.77%> (-1.72%) ⬇️
pandas/io/parquet.py          71.79% <0%> (-12.83%) ⬇️
pandas/io/formats/console.py  65.15% <0%> (-10.61%) ⬇️
pandas/core/arrays/base.py    88% <0%> (-9.36%) ⬇️
pandas/io/formats/html.py     88.81% <0%> (-8.87%) ⬇️
pandas/errors/__init__.py     92.3% <0%> (-7.7%) ⬇️
pandas/core/dtypes/base.py    92.68% <0%> (-7.32%) ⬇️
... and 138 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5c90e5...b46dd04.

@rsignell-usgs commented:

Tested this on a username/password-protected CSV endpoint and it's working!

import requests
session = requests.sessions.Session()
session.auth = ('Nope', 'NoWay!')
url = 'https://cgoms.coas.oregonstate.edu/erddap/tabledap/CE01ISSM-BUOY-001-GPS.csv'

import pandas as pd
df = pd.read_csv(url, session=session)

@pep8speaks commented Jun 18, 2018

Hello @ocefpaf! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 25, 2018 at 15:30 UTC

@@ -304,7 +304,8 @@ def read_excel(io,
               **kwds):

    if not isinstance(io, ExcelFile):
        io = ExcelFile(io, engine=engine)
    session = kwds.get('session', None)
Contributor:

Just list session as a kwarg in read_excel (and in ExcelFile), then pass it through.
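A minimal sketch of that suggestion (the signature is heavily abridged from the real read_excel): accept session as an explicit keyword argument instead of digging it out of **kwds, and pass it through to ExcelFile.

def read_excel(io, sheet_name=0, engine=None, session=None, **kwds):
    # session is now a documented keyword argument rather than a hidden kwd
    if not isinstance(io, ExcelFile):
        io = ExcelFile(io, engine=engine, session=session)
    ...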

@@ -406,6 +406,7 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
    compression = _infer_compression(path_or_buf, compression)
    filepath_or_buffer, _, compression, should_close = get_filepath_or_buffer(
        path_or_buf, encoding=encoding, compression=compression,
        session=session,
Contributor:

add a versionadded tag when you add session
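For example, the docstring entry with its versionadded directive could look like this (the version number is a placeholder for whichever release this lands in):

session : requests.Session
    A configured requests session for remote files (requires the
    requests library).

    .. versionadded:: 0.25.0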

@@ -297,6 +297,9 @@
If a filepath is provided for `filepath_or_buffer`, map the file object
directly onto memory and access the data directly from there. Using this
option can improve performance because there is no longer any I/O overhead.
session : requests.Session
    A requests session object with the configuration for remote files
    (requires the requests library).
Contributor:

add a versionadded tag

@WillAyd (Member) commented Aug 17, 2018

@ocefpaf can you address latest comments and merge in changes from master?

@ocefpaf (Author) commented Aug 17, 2018

@ocefpaf can you address latest comments and merge in changes from master?

I'm in transit now but I'll have some time next week to check this. However, if someone wants to take it over please do.

I do recall asking about a documentation strategy to avoid the repetition; see the comments above. It seems that to address some of the comments I would have to touch the current docs, and I'm not really comfortable with that. I'll revisit this soon...

@WillAyd (Member) commented Nov 23, 2018

Can you merge master?

@ocefpaf (Author) commented Nov 23, 2018

Can you merge master?

Done! But I'm not sure if I got it right; git rebases are not my forte 😬

BTW I'm happy to iterate on the docs questions in https://github.com/pandas-dev/pandas/pull/21504/files#r196079159 but I'm quite confused about where I should start (it is my first time contributing here).

@WillAyd (Member) commented Nov 23, 2018

You are better off doing the following on your branch locally:

git fetch upstream
git merge upstream/master

That retains the history of comments on GitHub, which makes it easier to track.

Looks like you have some test failures with the latest push; if you could take a look, that would be great.

@ocefpaf (Author) commented Nov 23, 2018

You are better off doing the following on your branch locally:

git fetch upstream
git merge upstream/master

That is what I did :-/

Looks like you have some test failures with the latest push; if you could take a look, that would be great.

I'll take a look ASAP.

@WillAyd (Member) commented Nov 23, 2018

That is what I did :-/

Hmm, OK. I saw the force push here, which is why I thought you had done something else. You shouldn't need the -f flag when you do it this way.

@datapythonista (Member) commented:

@ocefpaf CI is red. Can you fix the problems and push again so there are no test failures?

@ocefpaf (Author) commented Dec 20, 2018

@ocefpaf CI is red. Can you fix the problems and push again so there are no test failures?

I believe that when I first submitted this PR the with-statement wrapper was unused or untested, which is why it passed. I tried to fix that, but my Python knowledge and available time are too limited to do it before the end of the year. If anyone has an idea, please let me know. (IMHO I would remove the wrapper, because it will be removed anyway when pandas drops py27.)
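For what it's worth, a sketch of the py3-only simplification alluded to here (the URL is hypothetical): in Python 3 the response returned by urllib.request.urlopen is itself a context manager, so the closing() wrapper becomes unnecessary.

from urllib.request import urlopen

with urlopen('https://example.com/data.csv') as f:  # closes automatically
    data = f.read()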

@datapythonista (Member) commented:

Not sure what the problem is. I'll close this for now, as we can't merge with the conflicts and a red CI, but if you find a solution for what you described, feel free to reopen.

@ocefpaf (Author) commented Feb 11, 2019

Not sure what the problem is. I'll close this for now, as we can't merge with the conflicts and a red CI, but if you find a solution for what you described, feel free to reopen.

Would you be interested in a Python 3-only version of this PR? That version works, and I guess all the Python 2.7 code will be removed soon anyway.

@WillAyd (Member) commented Feb 11, 2019

@ocefpaf we drop Py2 support in 0.25, so sure, I think Python 3 only is fine for that milestone onwards.

Labels: Enhancement, IO CSV (read_csv, to_csv)

Successfully merging this pull request may close these issues.

read_csv from HTTPs + basic-auth + custom port throws an error (urlopen error)
7 participants