BUG: read_msgpack raise an error when passed an non existent path in Python 2 #16523

chrisburr · 2017-05-28T10:20:00Z

closes read_msgpack returns garbage for non-existing files in python2 #15296
11 tests added / passed
passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
whatsnew entry

This PR adds tests that check a suitable error is raised when a nonexistent path is passed to pd.read_xxxxx().

A fix is also included for read_msgpack on Python 2: the passed string is only decoded as msgpack data if its first byte is >= 0x80, otherwise it is treated as a path. As far as I can tell, first bytes below 0x80 are reserved and unused by pandas, so I think this shouldn't break any existing functionality.
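
To make the heuristic concrete, here is a minimal sketch of the first-byte check described above. The helper name `first_byte_suggests_msgpack` is hypothetical and is not part of pandas; the sketch only assumes the input arrives as a byte string, as `str` does on Python 2.

```python
def first_byte_suggests_msgpack(data):
    """Hypothetical helper mirroring the first-byte check described above."""
    # bytearray() yields integer byte values on both Python 2 and Python 3.
    return bytearray(data)[0] >= 0x80


# A typical ASCII path starts with a byte below 0x80, so it is treated as a path.
print(first_byte_suggests_msgpack(b"/tmp/missing.msg"))  # False
# 0x84 is the msgpack header for a small map, so this input is treated as data.
print(first_byte_suggests_msgpack(b"\x84\xa3typ"))  # True
```

Note that this sketch, like the check it mirrors, raises IndexError on an empty input.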

codecov · 2017-05-28T11:17:48Z

Codecov Report

Merging #16523 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16523      +/-   ##
==========================================
+ Coverage   90.79%    90.8%   +<.01%     
==========================================
  Files         161      161              
  Lines       51063    51064       +1     
==========================================
+ Hits        46365    46367       +2     
+ Misses       4698     4697       -1

Flag	Coverage Δ
#multiple	`88.64% <100%> (ø)`	⬆️
#single	`40.15% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/packers.py	`88.58% <100%> (+0.34%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef487d9...f4b6892. Read the comment docs.

codecov · 2017-05-28T11:17:49Z

Codecov Report

Merging #16523 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16523      +/-   ##
==========================================
- Coverage   91.24%   91.23%   -0.02%     
==========================================
  Files         163      163              
  Lines       50168    50169       +1     
==========================================
- Hits        45777    45770       -7     
- Misses       4391     4399       +8

Flag	Coverage Δ
#multiple	`89.04% <100%> (ø)`	⬆️
#single	`40.27% <0%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/packers.py	`88.65% <100%> (+0.34%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4489389...9d0f3b6. Read the comment docs.

jreback · 2017-05-29T16:18:21Z

pandas/io/packers.py

-            fh = compat.BytesIO(path_or_buf)
-            return read(fh)
+            # We can't distinguish between a path and a buffer of bytes in
+            # Python 2 so instead assume the first byte of a valid path is


you can simply check os.path.exists(path_or_buf) (see how this is done in pandas.io.json.json.read_json)

I saw your explanation. But if os.path.exists(path_or_buf) fails then you know for sure it's NOT a path. Then you can try to read. I agree you can't then distinguish between an invalid path and an invalid byte stream, but so what.

The benefit is when an incorrect path is accidentally used. As it stands the code can continue running with invalid data until it crashes mysteriously elsewhere or silently produces an incorrect result (the latter motivated me to make this PR).
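
A minimal sketch of the os.path.exists-first approach suggested in this thread, and of the failure mode raised in reply. `classify_input` and the missing file name are hypothetical, and the sketch assumes no file with that name exists in the temporary directory.

```python
import os
import tempfile


def classify_input(path_or_buf):
    """Hypothetical helper: 'path' if the string names an existing file, else 'bytes'."""
    if os.path.exists(path_or_buf):
        return "path"
    # Either a mistyped path or a real payload; the two cannot be told apart here.
    return "bytes"


with tempfile.NamedTemporaryFile(suffix=".msg") as tmp:
    existing = classify_input(tmp.name)

# Assumption: this file name does not exist on the machine running the sketch.
missing = classify_input(os.path.join(tempfile.gettempdir(), "no-such-file-16523.msg"))
print(existing, missing)
```

The missing path falls through to the "bytes" branch, which is the case where a mistyped path would be decoded as data instead of raising.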

jreback

looks pretty good.

jreback · 2017-05-29T16:20:46Z

pandas/tests/io/test_common.py

@@ -107,6 +107,26 @@ def test_iterator(self):
        tm.assert_frame_equal(first, expected.iloc[[0]])
        tm.assert_frame_equal(concat(it), expected.iloc[1:])

+    @pytest.mark.parametrize('reader, module, error_class, fn_ext', [
+        (pd.read_csv, 'os', pd.compat.FileNotFoundError, 'csv'),


just import FileNotFoundError at the top to make this less verbose

can you open an issue to make the errors raised for feather and json conform to FileNotFoundError (rather than ValueError), and fix msgpack to do the same.

can you make this change

chrisburr · 2017-05-30T08:10:23Z

@jreback I'm putting this here as it is relevant to both of your comments.

Unfortunately I can't use os.path.isfile as the API allows path_or_buf to be either a path, or a buffer containing data to be read. This bug can also be triggered using read_json as follows:

from __future__ import print_function
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.random(5)})
print('Before = ', pd.read_json('[]'), end='\n\n')
df.to_json('[]')
print('After = ', pd.read_json('[]'))

which results in this output:

Before =  Empty DataFrame
Columns: []
Index: []

After =            a
0  0.637333
1  0.337618
2  0.331358
3  0.073023
4  0.595739

Examples that trigger this with json are somewhat contrived. With msgpack, however, almost any path has a valid decoded object associated with it.

This also explains why read_json and read_msgpack raise ValueError: in most cases it's not possible to distinguish between a non-existent path and malformed data.

The only place where this is clearly defined is Python 3, where you can infer from the object type whether the argument should be a path. All the solutions I can think of for the other cases result in breaking changes to the read_json and read_msgpack API.
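
As a rough illustration of the Python 3 case, a dispatcher can decide from the object type alone. `classify_py3` is a hypothetical name, not a pandas function, and the file name used below is assumed not to exist.

```python
import os


def classify_py3(path_or_buf):
    """Hypothetical dispatcher: on Python 3, str means a path and bytes means data."""
    if isinstance(path_or_buf, str):
        if not os.path.exists(path_or_buf):
            raise FileNotFoundError("File does not exist: {!r}".format(path_or_buf))
        return "path"
    if isinstance(path_or_buf, bytes):
        return "data"
    raise TypeError("expected str or bytes, got {}".format(type(path_or_buf).__name__))


print(classify_py3(b"\x84"))  # data
try:
    classify_py3("definitely-missing-16523.msg")
except FileNotFoundError as exc:
    print(exc)
```

On Python 2 the same test cannot work, because str and bytes are the same type there and both kinds of input would take the first branch.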

jreback · 2017-06-14T23:13:56Z

doc/source/whatsnew/v0.20.2.txt

@@ -39,6 +39,7 @@ Bug Fixes

 - Bug in using ``pathlib.Path`` or ``py.path.local`` objects with io functions (:issue:`16291`)
 - Bug in ``DataFrame.update()`` with ``overwrite=False`` and ``NaN values`` (:issue:`15593`)
+- Bug in ``pd.read_msgpack()`` with a non existent file is passed in Python 2 (:issue:`15296`)


move to 0.21.0

jreback · 2017-06-14T23:15:28Z

pandas/io/packers.py

-            fh = compat.BytesIO(path_or_buf)
-            return read(fh)
+            # We can't distinguish between a path and a buffer of bytes in
+            # Python 2 so instead assume the first byte of a valid path is


I saw your explanation. But if os.path.exists(path_or_buf) fails then you know for sure it's NOT a path. Then you can try to read. I agree you can't then distinguish between an invalid path and an invalid byte stream, but so what.

jorisvandenbossche · 2017-06-15T07:33:41Z

I am not familiar with the msg code, so a basic question @jreback : is there a reason this cannot use the common machinery to deal with path_or_buf as the other readers?

jreback · 2017-06-15T10:14:39Z

> I am not familiar with the msg code, so a basic question @jreback : is there a reason this cannot use the common machinery to deal with path_or_buf as the other readers?

we are talking about pandas.io.common.get_filepath_or_buffer

we need a modified version for msgpack/json that checks for file existence, but does not open the file or decode the bytes. These are handled by the unpackers. So this could be generalized a bit.

jreback · 2017-06-19T22:53:14Z

pandas/io/packers.py

+            # We can't distinguish between a path and a buffer of bytes in
+            # Python 2 so instead assume the first byte of a valid path is
+            # less than 0x80.
+            if compat.PY3 or ord(path_or_buf[0]) >= 0x80:


this would fail for a 0-len buffer. what does the 0x80 compare against? is this platform dependent?

This seems fragile. IIUC, it's possible to have filenames that, according to Python, start with characters above 0x80, even if the filesystem does some encoding on the filename before reading or writing.

Unfortunately I don't see a way to avoid edge cases like this for Python 2 with the current API. I think this approach minimises the number of affected users, and if it does affect someone the check can be bypassed by using ./filename.

> this would fail for a 0-len buffer. what does the 0x80 compare against? is this platform dependent?

According to the msgpack spec, "Applications can assign 0 to 127 to store application-specific type information." I believe pandas doesn't currently use this, so the check assumes that an input whose first byte is below 0x80 was supposed to be a filename rather than a collection of bytes to decode.

I'm not sure what the correct behaviour should be for read_msgpack("").
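
The edge cases raised in this thread can be sketched as follows. `treated_as_data` is a hypothetical variant of the first-byte check, and returning False for an empty input is an assumption made only for illustration, since the thread leaves that case open.

```python
def treated_as_data(path_or_buf):
    """Hypothetical variant of the first-byte check that tolerates empty input."""
    raw = bytearray(path_or_buf)
    if not raw:
        # Assumption: an empty input is not decoded as data.
        return False
    return raw[0] >= 0x80


# A non-ASCII file name as a Python 2 style byte string (UTF-8 encoded).
name = u"\u00e9t\u00e9.msg".encode("utf-8")

print(treated_as_data(b""))           # False
print(treated_as_data(name))          # True: the first UTF-8 byte is 0xC3
print(treated_as_data(b"./" + name))  # False: the first byte is now 0x2E
```

The last line shows the ./filename bypass mentioned above: prefixing the name moves the non-ASCII byte away from the first position.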

jreback · 2017-07-19T10:34:50Z

@TomAugspurger thoughts?

jreback · 2017-08-18T00:59:45Z

can you move the release note to 0.21.0

I think this was ok, will review again after rebase.

chrisburr · 2017-08-20T04:14:28Z

@jreback Branch rebased and release notes moved

jreback

minor change

jreback · 2017-08-20T13:48:28Z

pandas/tests/io/test_common.py

@@ -107,6 +107,26 @@ def test_iterator(self):
        tm.assert_frame_equal(first, expected.iloc[[0]])
        tm.assert_frame_equal(concat(it), expected.iloc[1:])

+    @pytest.mark.parametrize('reader, module, error_class, fn_ext', [
+        (pd.read_csv, 'os', pd.compat.FileNotFoundError, 'csv'),


can you make this change

jreback · 2017-10-28T00:28:29Z

can you rebase and move note to 0.22.0

chrisburr · 2017-10-28T11:28:45Z

@jreback Done

jreback · 2017-10-28T15:24:50Z

lgtm. @TomAugspurger if you can have a look.

TomAugspurger

Yeah this seems like an improvement.

Maybe someday we should split the path_or_buf argument into two: one for paths, one for buffers.
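
One way the split could look, as a sketch only. `read_payload` and its `path` and `buf` parameters are hypothetical and do not correspond to any existing pandas signature.

```python
import io
import os


def read_payload(path=None, buf=None):
    """Hypothetical reader with separate arguments for a path and for in-memory bytes."""
    if (path is None) == (buf is None):
        raise TypeError("pass exactly one of 'path' or 'buf'")
    if path is not None:
        # A path argument is never reinterpreted as data, so a typo raises here.
        if not os.path.exists(path):
            raise IOError("File does not exist: {!r}".format(path))
        with open(path, "rb") as fh:
            return fh.read()
    return io.BytesIO(buf).read()


print(read_payload(buf=b"\x84"))
```

With two arguments the caller states the intent explicitly, so no first-byte or existence heuristic is needed.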

TomAugspurger · 2017-10-30T12:54:05Z

Thanks for your patience @chrisburr!

…Python 2 (pandas-dev#16523)

* TST: Add tests for trying to read non-existent files pandas-dev#15296
* BUG: Fix passing non-existant file to read_msgpack pandas-dev#15296
* TST: Fix io.test_common.test_read_non_existant for external modules
* CLN: Import FileNotFoundError in tests/io/test_common.py

chrisburr force-pushed the fix-15296 branch from b69e4ac to f4b6892 Compare May 28, 2017 11:17

chrisburr force-pushed the fix-15296 branch from f4b6892 to 24cf61c Compare May 28, 2017 15:49

jreback requested changes May 29, 2017

jreback added the Msgpack and IO Data labels May 29, 2017

jreback requested changes May 29, 2017

jreback requested changes Jun 14, 2017

jreback reviewed Jun 19, 2017

chrisburr force-pushed the fix-15296 branch from 24cf61c to 143d3b5 Compare August 20, 2017 04:12

jreback requested changes Aug 20, 2017

chrisburr added 3 commits October 28, 2017 09:38

TST: Add tests for trying to read non-existent files pandas-dev#15296

06a70b1

BUG: Fix passing non-existant file to read_msgpack pandas-dev#15296

02b041c

TST: Fix io.test_common.test_read_non_existant for external modules

6ea733c

chrisburr force-pushed the fix-15296 branch from 75ff84e to 6ea733c Compare October 28, 2017 08:51

CLN: Import FileNotFoundError in tests/io/test_common.py

9d0f3b6

jreback approved these changes Oct 28, 2017

jreback added this to the 0.22.0 milestone Oct 28, 2017

TomAugspurger approved these changes Oct 30, 2017

TomAugspurger merged commit 8449ffd into pandas-dev:master Oct 30, 2017

chrisburr deleted the fix-15296 branch October 30, 2017 13:02

rebecca-palmer mentioned this pull request Jan 21, 2020

Check for pyarrow not feather before pyarrow tests #31144

Merged
