is_file_like requirements are too strict for boto3 S3 objects #16135

djlancelot · 2017-04-25T22:25:08Z

Code Sample, a copy-pastable example if possible

my_s3_object = boto3.client('s3').get_object(Bucket=<S3 bucket>, Key=<S3 path>)
pd.read_csv(my_s3_object['Body'])

Problem description

When trying to read csv files from AWS S3 directly using boto3, the returned botocore.response. StreamingBody object does not have all the neccessary methods required in is_file_like function of pandas.core.dtypes.inference package. Although the pd.read_csv call worked flawlessly before, since the commit 20 days ago ( e4e87ec ) our app is broken.

See botocore reference ( http://botocore.readthedocs.io/en/latest/reference/response.html ) for details on the StreamingBody class.

The issue is caused by the too strict constraint in is_file_like function ( https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/inference.py#L140 )

Expected Output

No error should be raised if read method is available on read operations.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.0rc1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.1
Cython: None
numpy: 1.12.0
scipy: 0.19.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

1) Allows for more uniform handling of invalid file buffers to our `read_*` functions. 2) Adds a ton of new documentation to `inference.py` Closes #15337. xref #15895. Author: gfyoung <[email protected]> Closes #15894 from gfyoung/validate-file-like and squashes the following commits: 5a8f8da [gfyoung] DOC: Document all of inference.py 81103f7 [gfyoung] ENH: Add file buffer validation to I/O ops

chris-b1 · 2017-04-26T00:12:51Z

cc @gfyoung - do we ever use anything more than read() with a file-like object?

xref #15894

jreback · 2017-04-26T00:45:39Z

is_file_like is correct
read is just not enough you need the ability to seek and tell as well
steaming does not meet this

djlancelot · 2017-04-26T01:01:05Z

As far as I saw the read_csv uses the input as a buffer and it is only using the read and the close methods. It would be nice to have a less strict is_readonly_buffer_like helper function for read validation (and maybe an is_writeonly_buffer_like helper function for write validation).

gfyoung · 2017-04-26T02:23:21Z

Given this breakage, I think we can relax the requirements in two ways:

The object has EITHER read OR write and have the user deal with the consequences if the wrong attribute is available (e.g. we have read-only on a file to which we want to write). The intent of the user with the file object does not impact whether the object is file-like or not.
Remove checks for seek and tell, as it does not seem like we need either of them:

class BadStringIO(StringIO):
   def tell(self):
      raise NotImplementedError("nope")

   def seek(self, pos, whence=0):
      raise NotImplementedError("nope")

>>> data = "a\n1"
>>> read_csv(BadStringIO(data), engine="c")
   a
0  1
>>>
>>> read_csv(BadStringIO(data), engine="python")
   a
0  1

jreback · 2017-04-26T13:55:09Z

@gfyoung

I guess we could change is_file_like to be (pseudo-code) is_iterator & hasattr(read/write)
that seems to pass the mock tests (which is not an iterator anyhow)

jreback · 2017-04-26T13:56:39Z

@djlancelot can you provide a test based on one of our s3 buckets that replicates this

e.g. a streaming version of this
df = read_csv('s3://pandas-test/tips.csv')

Previously, we were requiring that all file-like objects had "read," "write," "seek," and "tell" methods, but that was too strict (e.g. read-only buffers). This commit relaxes those requirements to having EITHER "read" or "write" as attributes. Closes pandas-devgh-16135.

Previously, we were requiring that all file-like objects had "read," "write," "seek," and "tell" methods, but that was too strict (e.g. read-only buffers). This commit relaxes those requirements to having EITHER "read" or "write" as attributes. Closes gh-16135.

Previously, we were requiring that all file-like objects had "read," "write," "seek," and "tell" methods, but that was too strict (e.g. read-only buffers). This commit relaxes those requirements to having EITHER "read" or "write" as attributes. Closes pandas-devgh-16135.

chris-b1 added IO Data IO issues that don't fit into a more specific label Regression Functionality that used to work in a prior pandas version labels Apr 26, 2017

jreback added this to the 0.20.0 milestone Apr 26, 2017

gfyoung mentioned this issue Apr 26, 2017

API: Relax is-file-like conditions #16150

Merged

jreback closed this as completed in #16150 Apr 27, 2017

uiur mentioned this issue Aug 1, 2017

still doesn't work with boto3 StreamingBody #17135

Closed

jorisvandenbossche mentioned this issue Sep 19, 2017

why is_file_like needs to see __iter__ #17591

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

is_file_like requirements are too strict for boto3 S3 objects #16135

is_file_like requirements are too strict for boto3 S3 objects #16135

djlancelot commented Apr 25, 2017

INSTALLED VERSIONS

chris-b1 commented Apr 26, 2017 •

edited

Loading

jreback commented Apr 26, 2017

djlancelot commented Apr 26, 2017

gfyoung commented Apr 26, 2017 •

edited

Loading

jreback commented Apr 26, 2017

jreback commented Apr 26, 2017 •

edited

Loading

is_file_like requirements are too strict for boto3 S3 objects #16135

is_file_like requirements are too strict for boto3 S3 objects #16135

Comments

djlancelot commented Apr 25, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

chris-b1 commented Apr 26, 2017 • edited Loading

jreback commented Apr 26, 2017

djlancelot commented Apr 26, 2017

gfyoung commented Apr 26, 2017 • edited Loading

jreback commented Apr 26, 2017

jreback commented Apr 26, 2017 • edited Loading

Output of `pd.show_versions()`

chris-b1 commented Apr 26, 2017 •

edited

Loading

gfyoung commented Apr 26, 2017 •

edited

Loading

jreback commented Apr 26, 2017 •

edited

Loading