Skip to content

is_file_like requirements are too strict for boto3 S3 objects #16135

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
djlancelot opened this issue Apr 25, 2017 · 6 comments
Closed

is_file_like requirements are too strict for boto3 S3 objects #16135

djlancelot opened this issue Apr 25, 2017 · 6 comments
Labels
IO Data IO issues that don't fit into a more specific label Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@djlancelot
Copy link

Code Sample, a copy-pastable example if possible

my_s3_object = boto3.client('s3').get_object(Bucket=<S3 bucket>, Key=<S3 path>)
pd.read_csv(my_s3_object['Body'])

Problem description

When trying to read csv files from AWS S3 directly using boto3, the returned botocore.response. StreamingBody object does not have all the neccessary methods required in is_file_like function of pandas.core.dtypes.inference package. Although the pd.read_csv call worked flawlessly before, since the commit 20 days ago ( e4e87ec ) our app is broken.

See botocore reference ( http://botocore.readthedocs.io/en/latest/reference/response.html ) for details on the StreamingBody class.

The issue is caused by the too strict constraint in is_file_like function ( https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/inference.py#L140 )

Expected Output

No error should be raised if read method is available on read operations.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.0rc1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.1
Cython: None
numpy: 1.12.0
scipy: 0.19.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

klintan referenced this issue Apr 25, 2017
1) Allows for more uniform handling of invalid file buffers to our `read_*` functions.
2) Adds a ton of new documentation to `inference.py`

Closes #15337.
xref #15895.

Author: gfyoung <[email protected]>

Closes #15894 from gfyoung/validate-file-like and squashes the following commits:

5a8f8da [gfyoung] DOC: Document all of inference.py
81103f7 [gfyoung] ENH: Add file buffer validation to I/O ops
@chris-b1
Copy link
Contributor

chris-b1 commented Apr 26, 2017

cc @gfyoung - do we ever use anything more than read() with a file-like object?

xref #15894

@chris-b1 chris-b1 added IO Data IO issues that don't fit into a more specific label Regression Functionality that used to work in a prior pandas version labels Apr 26, 2017
@jreback
Copy link
Contributor

jreback commented Apr 26, 2017

is_file_like is correct
read is just not enough you need the ability to seek and tell as well
steaming does not meet this

@djlancelot
Copy link
Author

As far as I saw the read_csv uses the input as a buffer and it is only using the read and the close methods. It would be nice to have a less strict is_readonly_buffer_like helper function for read validation (and maybe an is_writeonly_buffer_like helper function for write validation).

@gfyoung
Copy link
Member

gfyoung commented Apr 26, 2017

Given this breakage, I think we can relax the requirements in two ways:

  1. The object has EITHER read OR write and have the user deal with the consequences if the wrong attribute is available (e.g. we have read-only on a file to which we want to write). The intent of the user with the file object does not impact whether the object is file-like or not.

  2. Remove checks for seek and tell, as it does not seem like we need either of them:

class BadStringIO(StringIO):
   def tell(self):
      raise NotImplementedError("nope")

   def seek(self, pos, whence=0):
      raise NotImplementedError("nope")

>>> data = "a\n1"
>>> read_csv(BadStringIO(data), engine="c")
   a
0  1
>>>
>>> read_csv(BadStringIO(data), engine="python")
   a
0  1

@jreback
Copy link
Contributor

jreback commented Apr 26, 2017

@gfyoung

I guess we could change is_file_like to be (pseudo-code) is_iterator & hasattr(read/write)
that seems to pass the mock tests (which is not an iterator anyhow)

@jreback jreback added this to the 0.20.0 milestone Apr 26, 2017
@jreback
Copy link
Contributor

jreback commented Apr 26, 2017

@djlancelot can you provide a test based on one of our s3 buckets that replicates this

e.g. a streaming version of this
df = read_csv('s3://pandas-test/tips.csv')

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 26, 2017
Previously, we were requiring that
all file-like objects had "read,"
"write," "seek," and "tell" methods,
but that was too strict (e.g. read-only
buffers). This commit relaxes those
requirements to having EITHER "read"
or "write" as attributes.

Closes pandas-devgh-16135.
gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 26, 2017
Previously, we were requiring that
all file-like objects had "read,"
"write," "seek," and "tell" methods,
but that was too strict (e.g. read-only
buffers). This commit relaxes those
requirements to having EITHER "read"
or "write" as attributes.

Closes pandas-devgh-16135.
gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 27, 2017
Previously, we were requiring that
all file-like objects had "read,"
"write," "seek," and "tell" methods,
but that was too strict (e.g. read-only
buffers). This commit relaxes those
requirements to having EITHER "read"
or "write" as attributes.

Closes pandas-devgh-16135.
gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 27, 2017
Previously, we were requiring that
all file-like objects had "read,"
"write," "seek," and "tell" methods,
but that was too strict (e.g. read-only
buffers). This commit relaxes those
requirements to having EITHER "read"
or "write" as attributes.

Closes pandas-devgh-16135.
jreback pushed a commit that referenced this issue Apr 27, 2017
Previously, we were requiring that
all file-like objects had "read,"
"write," "seek," and "tell" methods,
but that was too strict (e.g. read-only
buffers). This commit relaxes those
requirements to having EITHER "read"
or "write" as attributes.

Closes gh-16135.
pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017
Previously, we were requiring that
all file-like objects had "read,"
"write," "seek," and "tell" methods,
but that was too strict (e.g. read-only
buffers). This commit relaxes those
requirements to having EITHER "read"
or "write" as attributes.

Closes pandas-devgh-16135.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

No branches or pull requests

4 participants