Skip to content

BUG: read_msgpack raise an error when passed an non existent path in Python 2 #16523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 30, 2017

Conversation

chrisburr
Copy link
Contributor

@chrisburr chrisburr commented May 28, 2017

This PR adds tests to check that a suitable error is raised when a non existent path is passed to pd.read_xxxxx().

A fix is also included for read_msgpack with python 2 which works by checking that the first byte of the passed string is >= 0x80. So far as I can tell these are reserved and unused by pandas so I think this shouldn't break any existing functionality.

@codecov
Copy link

codecov bot commented May 28, 2017

Codecov Report

Merging #16523 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16523      +/-   ##
==========================================
+ Coverage   90.79%    90.8%   +<.01%     
==========================================
  Files         161      161              
  Lines       51063    51064       +1     
==========================================
+ Hits        46365    46367       +2     
+ Misses       4698     4697       -1
Flag Coverage Δ
#multiple 88.64% <100%> (ø) ⬆️
#single 40.15% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/packers.py 88.58% <100%> (+0.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef487d9...f4b6892. Read the comment docs.

@codecov
Copy link

codecov bot commented May 28, 2017

Codecov Report

Merging #16523 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16523      +/-   ##
==========================================
- Coverage   91.24%   91.23%   -0.02%     
==========================================
  Files         163      163              
  Lines       50168    50169       +1     
==========================================
- Hits        45777    45770       -7     
- Misses       4391     4399       +8
Flag Coverage Δ
#multiple 89.04% <100%> (ø) ⬆️
#single 40.27% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/packers.py 88.65% <100%> (+0.34%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4489389...9d0f3b6. Read the comment docs.

fh = compat.BytesIO(path_or_buf)
return read(fh)
# We can't distinguish between a path and a buffer of bytes in
# Python 2 so instead assume the first byte of a valid path is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply check os.path.exists(path_or_buf) (see how this is done in pandas.io.json.json.read_json

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your expl. But if os.path.exists(path_or_buf) fails then you know for sure its NOT a path. Then you can try to read. I agree you can't then distinguish between an invalid path and an invalid byte stream, but so what.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit is when an incorrect path is accidentally used. As it stands the code can continue running with invalid data until it crashes mysteriously elsewhere or silently produces an incorrect result (the latter motivated me to make this PR).

@jreback jreback added Msgpack IO Data IO issues that don't fit into a more specific label labels May 29, 2017
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good.

@@ -107,6 +107,26 @@ def test_iterator(self):
tm.assert_frame_equal(first, expected.iloc[[0]])
tm.assert_frame_equal(concat(it), expected.iloc[1:])

@pytest.mark.parametrize('reader, module, error_class, fn_ext', [
(pd.read_csv, 'os', pd.compat.FileNotFoundError, 'csv'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just import FileNotFoundError at the top to make this less verbose

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make an issue to conform the error messages for feather, json to FileNotFound (rather than ValueError). and fix msgpack for same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@chrisburr
Copy link
Contributor Author

@jreback I'm putting this here as it is relevant to both of your comments.

Unfortunately I can't use os.path.isfile as the API allows path_or_buf to be either a path, or a buffer containing data to be read. This bug can also be triggered using read_json as follows:

df = pd.DataFrame({'a': np.random.random(5)})
print('Before = ', pd.read_json('[]'), end='\n\n')
df.to_json('[]')
print('After = ', pd.read_json('[]'))

which results in this output:

Before =  Empty DataFrame
Columns: []
Index: []

After =            a
0  0.637333
1  0.337618
2  0.331358
3  0.073023
4  0.595739

Any examples that trigger this using json are somewhat contrived, however with msgpack almost any path has a valid decoded object associated with it.

This also explains why to_json and to_msgpack raise ValueError as in most cases it's not possible to distinguish between a non-existent path and malformed data.

The only place that this is clearly defined is for Python 3 where you can infer if it should be a path from the object type and all the solutions for the other cases that I can think of result in breaking changes to the read_json and read_msgpack API.

@@ -39,6 +39,7 @@ Bug Fixes

- Bug in using ``pathlib.Path`` or ``py.path.local`` objects with io functions (:issue:`16291`)
- Bug in ``DataFrame.update()`` with ``overwrite=False`` and ``NaN values`` (:issue:`15593`)
- Bug in ``pd.read_msgpack()`` with a non existent file is passed in Python 2 (:issue:`15296`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.21.0

fh = compat.BytesIO(path_or_buf)
return read(fh)
# We can't distinguish between a path and a buffer of bytes in
# Python 2 so instead assume the first byte of a valid path is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your expl. But if os.path.exists(path_or_buf) fails then you know for sure its NOT a path. Then you can try to read. I agree you can't then distinguish between an invalid path and an invalid byte stream, but so what.

@jorisvandenbossche
Copy link
Member

I am not familiar with the msg code, so a basic question @jreback : is there a reason this cannot use the common machinery to deal with path_or_buf as the other readers?

@jreback
Copy link
Contributor

jreback commented Jun 15, 2017

am not familiar with the msg code, so a basic question @jreback : is there a reason this cannot use the common machinery to deal with path_or_buf as the other readers?

we are talking about pandas.io.common.get_filepath_or_buffer

we need a modified version for msgpack/json that checks for file existance, but does not open the file or decode the bytes. These are handled by the unpackers. So this could be generalized a bit.

# We can't distinguish between a path and a buffer of bytes in
# Python 2 so instead assume the first byte of a valid path is
# less than 0x80.
if compat.PY3 or ord(path_or_buf[0]) >= 0x80:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would fail for a 0-len buffer. what does the 0x80 compare against? is this platform dependent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seems fragile. IIUC, it's possible to have filenames that, according to Python, start with characters above 0x80, even if the filesystem does some encoding on the filename before reading on writing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I don't see a way to make this avoid the edge cases like this for Python 2 with the current API. I think this way minimised the number of affected users and if it does affect someone the check can be bypassed using ./filename.

Copy link
Contributor Author

@chrisburr chrisburr Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would fail for a 0-len buffer. what does the 0x80 compare against? is this platform dependent?

According to the msgpack spec "Applications can assign 0 to 127 to store application-specific type information.". I believe pandas doesn't currently use this so this assumes if the first byte is below 0x80 it was supposed to be a filename rather than a collection of bytes to decode.

I'm not sure what the correct behaviour should be for passing read_msgpack("").

@jreback
Copy link
Contributor

jreback commented Jul 19, 2017

@TomAugspurger thoughts?

@jreback
Copy link
Contributor

jreback commented Aug 18, 2017

can you move the release note to 0.21.0

I think this was ok, will review again after rebase.

@chrisburr
Copy link
Contributor Author

@jreback Branch rebased and release notes moved

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor change

@@ -107,6 +107,26 @@ def test_iterator(self):
tm.assert_frame_equal(first, expected.iloc[[0]])
tm.assert_frame_equal(concat(it), expected.iloc[1:])

@pytest.mark.parametrize('reader, module, error_class, fn_ext', [
(pd.read_csv, 'os', pd.compat.FileNotFoundError, 'csv'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this change

@jreback
Copy link
Contributor

jreback commented Oct 28, 2017

can you rebase and move note to 0.22.0

@chrisburr
Copy link
Contributor Author

@jreback Done

@jreback jreback added this to the 0.22.0 milestone Oct 28, 2017
@jreback
Copy link
Contributor

jreback commented Oct 28, 2017

lgtm. @TomAugspurger if you can have a look.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this seems like an improvement.

Maybe someday we should split the path_or_buf argument into two: one for paths, one for buffers.

@TomAugspurger TomAugspurger merged commit 8449ffd into pandas-dev:master Oct 30, 2017
@TomAugspurger
Copy link
Contributor

Thanks for your patience @chrisburr!

@chrisburr chrisburr deleted the fix-15296 branch October 30, 2017 13:02
peterpanmj pushed a commit to peterpanmj/pandas that referenced this pull request Oct 31, 2017
…Python 2 (pandas-dev#16523)

* TST: Add tests for trying to read non-existent files pandas-dev#15296

* BUG: Fix passing non-existant file to read_msgpack pandas-dev#15296

* TST: Fix io.test_common.test_read_non_existant for external modules

* CLN: Import FileNotFoundError in tests/io/test_common.py
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
…Python 2 (pandas-dev#16523)

* TST: Add tests for trying to read non-existent files pandas-dev#15296

* BUG: Fix passing non-existant file to read_msgpack pandas-dev#15296

* TST: Fix io.test_common.test_read_non_existant for external modules

* CLN: Import FileNotFoundError in tests/io/test_common.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_msgpack returns garbage for non-existing files in python2
4 participants