Skip to content

ENH: Add file buffer validation to I/O ops #15894

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Apr 4, 2017

  1. Allows for more uniform handling of invalid file buffers to our read_* functions.
  2. Adds a ton of new documentation to inference.py

Closes #15337.

Partially addresses #15895.



def is_file_like(obj):
file_attrs = ('read', 'write', 'seek', 'tell')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string describing this. (and of course any others you want to document; prob should open an issue for that).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't too hard to fix here in this PR. Will do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add a versionadded (for this one).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will make sure.

if not hasattr(obj, attr):
return False

if not is_iterator(obj):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so a file_like must be an iterator, but an iterator is not necessarily file_like right?

Copy link
Member Author

@gfyoung gfyoung Apr 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. The reason is the following:

>>> from unittest import mock
>>> m = mock.Mock()
>>> hasattr(m, "read")
True
>>> hasattr(m, "__iter__")
False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k, maybe add that statement (or describe). Remember these are actually user-level accessible :>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true. In the doc-string, I presume?

Copy link
Contributor

@jreback jreback Apr 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep exactly. after this is merged I am going to create an issue to improve these doc-strings (novice level); the ones you don't do :>

Copy link
Member Author

@gfyoung gfyoung Apr 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't be necessary. A commit is being added here for that.

EDIT: Ah, I see you expanded to include more than just this file. Fair enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah should document them all ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

examples too!

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv labels Apr 4, 2017
@jreback
Copy link
Contributor

jreback commented Apr 4, 2017

lgtm.

@jreback jreback added this to the 0.20.0 milestone Apr 4, 2017
if _is_url(io):
url_passed = _is_url(io)

if url_passed:
io = _urlopen(io)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be simpler to have this in an if/elif

if _is_url(io):
    io = _urlopen(io)
elif not isinstance(io, .....)
   ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed.

@codecov
Copy link

codecov bot commented Apr 4, 2017

Codecov Report

Merging #15894 into master will decrease coverage by <.01%.
The diff coverage is 95.12%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15894      +/-   ##
==========================================
- Coverage   90.97%   90.97%   -0.01%     
==========================================
  Files         145      145              
  Lines       49496    49518      +22     
==========================================
+ Hits        45027    45047      +20     
- Misses       4469     4471       +2
Flag Coverage Δ
#multiple 88.73% <95.12%> (ø) ⬆️
#single 40.64% <75.6%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/types/api.py 100% <ø> (ø) ⬆️
pandas/io/excel.py 79.73% <100%> (+0.03%) ⬆️
pandas/io/common.py 70.08% <90.9%> (+0.75%) ⬆️
pandas/types/inference.py 98.3% <96.42%> (-1.7%) ⬇️
pandas/io/packers.py 88.19% <0%> (-0.32%) ⬇️
pandas/core/common.py 91.03% <0%> (+0.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a37067...5a8f8da. Read the comment docs.

@gfyoung gfyoung force-pushed the validate-file-like branch from cc3a9b6 to 3c477be Compare April 5, 2017 00:49
@gfyoung
Copy link
Member Author

gfyoung commented Apr 5, 2017

@jreback : Everything is green and ready to go.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice docstrings!

# python 3 generators have __next__ instead of next
return hasattr(obj, 'next') or hasattr(obj, '__next__')
"""
Check if the object is an iterable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe good to notice here that a string does not give True (although it is iterable, but not an iterator)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nor is a datetime, yes would be good to mention these counter-examples

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

not isinstance(arg, string_and_binary_types))
def is_list_like(obj):
"""
Check if the object is list-like.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe give an inexhaustive list of objects that are regarded as list-like? (lists, sets, arrays, Series, ...)

I would also add example with np.array(..)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mention that strings & datetimes are NOT is_list_like .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung gfyoung force-pushed the validate-file-like branch from 3c477be to ba08e6d Compare April 5, 2017 13:45
3) seek
4) tell

In addition, the object must be iterable, as mock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is assuming a little bit of knowledge here. I would just eliminate this paragraph.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Done.

except TypeError:
return False
else:
return True


def is_sequence(x):
def is_sequence(obj):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would note that this is less strict than is_list_like. Further impl wise we almost never use this one. I don't really recall what the actual diffferences are though. you can have a look thru the code base and see if you can come up with a good reason for this (IIRC its in the printing routines somewhere).

Copy link
Member Author

@gfyoung gfyoung Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it's the other way around:

>>> i = iter([1, 2, 3])
>>> is_list_like(i)
True
>>> is_sequence(i)
False

is_sequence checks for len and iter properties, whereas is_list_like just checks for the __iter__ property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think leaving both functions are fine for now, especially in the context of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, just ideally wanted to document this.

@gfyoung gfyoung force-pushed the validate-file-like branch from ba08e6d to 5a8f8da Compare April 5, 2017 16:08
@gfyoung
Copy link
Member Author

gfyoung commented Apr 5, 2017

@jreback , @jorisvandenbossche : Everything is green and ready to go.

@jreback jreback closed this in e4e87ec Apr 5, 2017
@jreback
Copy link
Contributor

jreback commented Apr 5, 2017

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants