-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: read_msgpack raise an error when passed an non existent path in Python 2 #16523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #16523 +/- ##
==========================================
+ Coverage 90.79% 90.8% +<.01%
==========================================
Files 161 161
Lines 51063 51064 +1
==========================================
+ Hits 46365 46367 +2
+ Misses 4698 4697 -1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16523 +/- ##
==========================================
- Coverage 91.24% 91.23% -0.02%
==========================================
Files 163 163
Lines 50168 50169 +1
==========================================
- Hits 45777 45770 -7
- Misses 4391 4399 +8
Continue to review full report at Codecov.
|
fh = compat.BytesIO(path_or_buf) | ||
return read(fh) | ||
# We can't distinguish between a path and a buffer of bytes in | ||
# Python 2 so instead assume the first byte of a valid path is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simply check os.path.exists(path_or_buf)
(see how this is done in pandas.io.json.json.read_json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw your expl. But if os.path.exists(path_or_buf)
fails then you know for sure its NOT a path. Then you can try to read. I agree you can't then distinguish between an invalid path and an invalid byte stream, but so what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit is when an incorrect path is accidentally used. As it stands the code can continue running with invalid data until it crashes mysteriously elsewhere or silently produces an incorrect result (the latter motivated me to make this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good.
pandas/tests/io/test_common.py
Outdated
@@ -107,6 +107,26 @@ def test_iterator(self): | |||
tm.assert_frame_equal(first, expected.iloc[[0]]) | |||
tm.assert_frame_equal(concat(it), expected.iloc[1:]) | |||
|
|||
@pytest.mark.parametrize('reader, module, error_class, fn_ext', [ | |||
(pd.read_csv, 'os', pd.compat.FileNotFoundError, 'csv'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just import FileNotFoundError
at the top to make this less verbose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make an issue to conform the error messages for feather, json to FileNotFound
(rather than ValueError
). and fix msgpack for same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@jreback I'm putting this here as it is relevant to both of your comments. Unfortunately I can't use df = pd.DataFrame({'a': np.random.random(5)})
print('Before = ', pd.read_json('[]'), end='\n\n')
df.to_json('[]')
print('After = ', pd.read_json('[]')) which results in this output:
Any examples that trigger this using json are somewhat contrived, however with msgpack almost any path has a valid decoded object associated with it. This also explains why The only place that this is clearly defined is for Python 3 where you can infer if it should be a path from the object type and all the solutions for the other cases that I can think of result in breaking changes to the |
doc/source/whatsnew/v0.20.2.txt
Outdated
@@ -39,6 +39,7 @@ Bug Fixes | |||
|
|||
- Bug in using ``pathlib.Path`` or ``py.path.local`` objects with io functions (:issue:`16291`) | |||
- Bug in ``DataFrame.update()`` with ``overwrite=False`` and ``NaN values`` (:issue:`15593`) | |||
- Bug in ``pd.read_msgpack()`` with a non existent file is passed in Python 2 (:issue:`15296`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.21.0
fh = compat.BytesIO(path_or_buf) | ||
return read(fh) | ||
# We can't distinguish between a path and a buffer of bytes in | ||
# Python 2 so instead assume the first byte of a valid path is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw your expl. But if os.path.exists(path_or_buf)
fails then you know for sure its NOT a path. Then you can try to read. I agree you can't then distinguish between an invalid path and an invalid byte stream, but so what.
I am not familiar with the msg code, so a basic question @jreback : is there a reason this cannot use the common machinery to deal with path_or_buf as the other readers? |
we are talking about we need a modified version for msgpack/json that checks for file existance, but does not open the file or decode the bytes. These are handled by the unpackers. So this could be generalized a bit. |
# We can't distinguish between a path and a buffer of bytes in | ||
# Python 2 so instead assume the first byte of a valid path is | ||
# less than 0x80. | ||
if compat.PY3 or ord(path_or_buf[0]) >= 0x80: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would fail for a 0-len buffer. what does the 0x80
compare against? is this platform dependent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seems fragile. IIUC, it's possible to have filenames that, according to Python, start with characters above 0x80, even if the filesystem does some encoding on the filename before reading on writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't see a way to make this avoid the edge cases like this for Python 2 with the current API. I think this way minimised the number of affected users and if it does affect someone the check can be bypassed using ./filename
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would fail for a 0-len buffer. what does the 0x80 compare against? is this platform dependent?
According to the msgpack spec "Applications can assign 0 to 127 to store application-specific type information.". I believe pandas doesn't currently use this so this assumes if the first byte is below 0x80
it was supposed to be a filename rather than a collection of bytes to decode.
I'm not sure what the correct behaviour should be for passing read_msgpack("")
.
@TomAugspurger thoughts? |
can you move the release note to 0.21.0 I think this was ok, will review again after rebase. |
@jreback Branch rebased and release notes moved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor change
pandas/tests/io/test_common.py
Outdated
@@ -107,6 +107,26 @@ def test_iterator(self): | |||
tm.assert_frame_equal(first, expected.iloc[[0]]) | |||
tm.assert_frame_equal(concat(it), expected.iloc[1:]) | |||
|
|||
@pytest.mark.parametrize('reader, module, error_class, fn_ext', [ | |||
(pd.read_csv, 'os', pd.compat.FileNotFoundError, 'csv'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this change
can you rebase and move note to 0.22.0 |
@jreback Done |
lgtm. @TomAugspurger if you can have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this seems like an improvement.
Maybe someday we should split the path_or_buf
argument into two: one for paths, one for buffers.
Thanks for your patience @chrisburr! |
…Python 2 (pandas-dev#16523) * TST: Add tests for trying to read non-existent files pandas-dev#15296 * BUG: Fix passing non-existant file to read_msgpack pandas-dev#15296 * TST: Fix io.test_common.test_read_non_existant for external modules * CLN: Import FileNotFoundError in tests/io/test_common.py
…Python 2 (pandas-dev#16523) * TST: Add tests for trying to read non-existent files pandas-dev#15296 * BUG: Fix passing non-existant file to read_msgpack pandas-dev#15296 * TST: Fix io.test_common.test_read_non_existant for external modules * CLN: Import FileNotFoundError in tests/io/test_common.py
git diff upstream/master --name-only -- '*.py' | flake8 --diff
This PR adds tests to check that a suitable error is raised when a non existent path is passed to
pd.read_xxxxx()
.A fix is also included for
read_msgpack
with python 2 which works by checking that the first byte of the passed string is>= 0x80
. So far as I can tell these are reserved and unused by pandas so I think this shouldn't break any existing functionality.