Skip to content

Segfault when passing a mock.Mock instance to pandas.read_csv (it was an accident!) #15337

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
erewok opened this issue Feb 7, 2017 · 9 comments
Labels
Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv
Milestone

Comments

@erewok
Copy link

erewok commented Feb 7, 2017

Code Sample

In [1]: import pandas as pd

In [2]: from unittest import mock

In [3]: pd.read_csv(mock.Mock())
zsh: segmentation fault  ipython

Problem description

I wrote a test where I accidentally ended up passing a mock.Mock object into pandas.read_csv and it segfaults.

Expected Output

Well, it would nice if it threw an error telling me that I screwed up, instead of segfaulting.

Output of pd.show_versions()

In [6]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: None
pip: 9.0.1
setuptools: 24.0.2
Cython: 0.24.1
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
boto: None

@erewok
Copy link
Author

erewok commented Feb 7, 2017

Couldn't help exploring it a bit because the test-case is so easy.

I installed the develop branch of pandas from source just now and traced the bug back to where it enters the parser defined in cython at this point in the read_csv process: https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1534

Seems like a sound strategy would be to maybe rule out that whatever is getting passed to that parser inside parser.pyx is definitely not a string or file handle or buffer and if so to reject it?

@jreback
Copy link
Contributor

jreback commented Feb 7, 2017

sure it must be a string, bytes, or a file-like (which mainly involves having .read, .seek (maybe tell) I think might be enough).

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv labels Feb 7, 2017
@erewok
Copy link
Author

erewok commented Feb 7, 2017

Yeah, but I am suspicious that pandas is testing for read and friends in the wrong place or that simply asking for those attributes is too fuzzy.

For instance, I isolated the error to this chunk of code:
https://github.com/pandas-dev/pandas/blob/master/pandas/parser.pyx#L700

If you comment this elif hasattr... block out, it throws the following, which is way better than a segfault:

OSError: Expected file path name or file-like object, got <class 'unittest.mock.Mock'> type

However, I think the test inside TextReader here is little bit too soft: my mock object goes through because it pretends to have a read attribute, but it might also be possible to craft something malicious with a read attribute, and it seems like that thing-with-a-read-object would make it all the way through to this cython parser.

Anyway, I believe that the test for whether it really is a filepath_or_buffer should happen much earlier in the call to read_csv and I am not sure testing whether there's a read or seek attribute is good enough?

Edit: I may have misread your comment, a bit. Apologies for any confusion.

@jreback
Copy link
Contributor

jreback commented Feb 7, 2017

@pellagic-puffbomb sure, you can test earlier. We have a fairly complicated opening process (because of many many possibilities of what it could be, encoding, from url etc). So it may not be possible to do it earlier (though it certainly might be).

I wouldn't worry about malicious things, that is the resonsibility of the user. We necessarily have a string to a file (which may or may not exist), or a file-like object which we read bytes. If someone wants to give a duck-like that is just fine.

If you can check earlier and have everything passing that would be great. We DO have a fairly extensive test suite for things like this.

@erewok
Copy link
Author

erewok commented Feb 7, 2017

Alright, well, I have an idea that may help a bit.

@gfyoung
Copy link
Member

gfyoung commented Apr 3, 2017

Hmm...seems like this thread has gone a bit stale. Reading through this, I do agree that we can do a better job if we encounter invalid objects for parsing. For example, specifying the Python engine leads to this:

>>> read_csv(mock.Mock(), engine="python")
...
TypeError: argument 1 must be an iterator

That comes from Python's native csv library. Not really that useful either.

The problem with attribute checking is that mock objects have overloaded the __getattr__ method so that most attributes will exist. However, one that does catch my attention is this:

>>> mock.Mock().__iter__
...
AttributeError: __iter__

If you take a look at the C-code for Python's csv library here, that's how they catch that mock.Mock() is an invalid file object.

@jreback
Copy link
Contributor

jreback commented Apr 3, 2017

it might be enough to check if a file-like (IOW not a string) is_iterator and otherwise raise a nice error (before passing to to anything else). xref to #15862.

sure it must be a string, bytes, or a file-like (which mainly involves having .read, .seek (maybe tell) I think might be enough).

from above, we might need a more sophisticated test (which could use is_iterator)

@gfyoung
Copy link
Member

gfyoung commented Apr 3, 2017

How about adding that as a check in pandas.io.common? We do plenty of file-handling there for all of our I/O file operations. It seems like we should be able to abstract this validation logic.

@jreback
Copy link
Contributor

jreback commented Apr 3, 2017

@gfyoung

yes is_file_like would be great! (maybe could put in pandas.types.common alternatively). its kind of a 'type' check.

Then calling in parser (and other methods would be great).

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 4, 2017
@jreback jreback added this to the 0.20.0 milestone Apr 4, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 5, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 5, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 5, 2017
@jreback jreback closed this as completed in e4e87ec Apr 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv
Projects
None yet
Development

No branches or pull requests

3 participants