pandas read_csv no longer supports file-like objects from tarfile (pandas 0.20.1) #16530

Closed
gjanvier opened this issue May 29, 2017 · 17 comments
Labels
Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv

@gjanvier

Code Sample, a copy-pastable example if possible

import pandas as pd
import tarfile

tar = tarfile.open(name="xxx.tar.bz2", mode='r')
myfile = tar.extractfile('yyy.csv') # file-like object with a read() method
data = pd.read_csv(myfile, sep=r'\s+')

This code generates this error:

  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 392, in _read
    filepath_or_buffer, encoding, compression)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 210, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <class 'tarfile.ExFileObject'>

Problem description

This code works with pandas 0.19.2 but fails with 0.20.1.

According to pandas doc for read_csv:

filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)

I guess the new validations are too restrictive?
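
A possible workaround on affected versions, sketched under the assumption that the archive member fits in memory: copy the member's bytes into an io.BytesIO, which is an ordinary in-memory file object. The helper name member_to_buffer is hypothetical and not part of pandas or tarfile.

```python
import io
import tarfile


def member_to_buffer(tar_path, member_name):
    """Return the bytes of one archive member wrapped in an io.BytesIO.

    Hypothetical helper for illustration; assumes the member fits in memory.
    """
    with tarfile.open(tar_path, mode="r") as tar:
        extracted = tar.extractfile(member_name)
        if extracted is None:
            # extractfile() returns None for members that are not regular files
            raise ValueError("%s is not a regular file" % member_name)
        return io.BytesIO(extracted.read())
```

The returned buffer could then be passed on, e.g. pd.read_csv(member_to_buffer("xxx.tar.bz2", "yyy.csv"), sep=r'\s+').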

@bashtage
Contributor

Works as expected on Windows with 0.20.1.

pd.read_csv(myfile, sep=r'\s+')
Out[25]: 
  Col1,COl2
0       a,1
1       b,2
2       c,3
3       d,4

Which Python? You should include the show_versions() output in the details area of the template.

@gjanvier
Author

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@bashtage
Contributor

Did you mean to close? If it isn't working for you, please reopen. Could be a combination of Python version/OS.

@gjanvier gjanvier reopened this May 29, 2017
@gjanvier
Author

Oops, sorry. Reopened.

@TomAugspurger
Contributor

Working for me as well with python3 and master.

@gjanvier can you make a fully reproducible example (including writing the csv and adding it to the tar)?

@gjanvier
Author

Sure, here is a test case.

import tarfile
import pandas as pd

pd.show_versions()

data = pd.DataFrame(
    data=[[1,2], [3,4]],
    columns=['col1', 'col2']
)

print("data")
print(data)
print("")

data.to_csv('mydata.csv', sep="\t", index=False)

tar = tarfile.open('test.tar', 'w')
tar.add('mydata.csv')
tar.close()

tar = tarfile.open('test.tar', 'r')
myfile = tar.extractfile('mydata.csv')
data2 = pd.read_csv(myfile, sep=r'\s+')

print("data2")
print(data2)
print("")

FYI, I run my tests in a docker container...

Result with pandas 0.19.2

root@0a35b054b4da:xxxx# pip install pandas==0.19.2
Collecting pandas==0.19.2
  Downloading pandas-0.19.2-cp27-cp27mu-manylinux1_x86_64.whl (17.2MB)
    100% |################################| 17.2MB 25kB/s 
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /usr/local/lib/python2.7/dist-packages (from pandas==0.19.2)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/local/lib/python2.7/dist-packages (from pandas==0.19.2)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas==0.19.2)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /usr/lib/python2.7/dist-packages (from python-dateutil->pandas==0.19.2)
Installing collected packages: pandas
  Found existing installation: pandas 0.20.1
    Uninstalling pandas-0.20.1:
      Successfully uninstalled pandas-0.20.1
Successfully installed pandas-0.19.2
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
root@0a35b054b4da:xxxx# python test_tar_pd.py 

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.10.3
apiclient: 1.6.2
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
data
   col1  col2
0     1     2
1     3     4

data2
   col1  col2
0     1     2
1     3     4

Result with pandas 0.20.1

root@0a35b054b4da:xxx# pip install pandas==0.20.1
Collecting pandas==0.20.1
  Downloading pandas-0.20.1-cp27-cp27mu-manylinux1_x86_64.whl (22.3MB)
    100% |################################| 22.3MB 19kB/s 
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /usr/local/lib/python2.7/dist-packages (from pandas==0.20.1)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/local/lib/python2.7/dist-packages (from pandas==0.20.1)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas==0.20.1)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /usr/lib/python2.7/dist-packages (from python-dateutil->pandas==0.20.1)
Installing collected packages: pandas
  Found existing installation: pandas 0.19.2
    Uninstalling pandas-0.19.2:
      Successfully uninstalled pandas-0.19.2
Successfully installed pandas-0.20.1
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
root@0a35b054b4da:xxx# python test_tar_pd.py 

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
data
   col1  col2
0     1     2
1     3     4

Traceback (most recent call last):
  File "test_tar_pd.py", line 23, in <module>
    data2 = pd.read_csv(myfile, sep=r'\s+')
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 392, in _read
    filepath_or_buffer, encoding, compression)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 210, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <class 'tarfile.ExFileObject'>

@jreback
Contributor

jreback commented May 29, 2017

IIRC @gfyoung fixed this by patching is_file_like, can't find the issue ATM

@gfyoung
Member

gfyoung commented May 29, 2017

@jreback : The relevant PR is #16150.

@jreback
Contributor

jreback commented May 29, 2017

Hmm, so maybe this IS an issue on py2.7? Maybe the tarfile object is not a proper iterator (or doesn't have read)?

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv labels May 29, 2017
@gfyoung
Member

gfyoung commented May 29, 2017

This is indeed a compatibility issue. Turns out tarfile.ExFileObject is not a proper iterator object in our eyes under the Python 2.x implementation (it has no next or __next__ attribute, but Python 3.x tarfile.ExFileObject has the __next__ attribute).

It seems we only need to check for the __iter__ attribute in is_file_like?
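
To make the difference concrete, here is a rough sketch of a strict check versus a relaxed one. The function names is_file_like_strict and is_file_like_relaxed and the class IterableOnlyFile are made up for illustration; this is not the actual pandas implementation, only an approximation of the two behaviours described above.

```python
def is_file_like_strict(obj):
    """Require read/write plus a full iterator (__iter__ and a next method)."""
    if not (hasattr(obj, "read") or hasattr(obj, "write")):
        return False
    has_next = hasattr(obj, "next") or hasattr(obj, "__next__")
    return hasattr(obj, "__iter__") and has_next


def is_file_like_relaxed(obj):
    """Require read/write plus __iter__ only."""
    if not (hasattr(obj, "read") or hasattr(obj, "write")):
        return False
    return hasattr(obj, "__iter__")


class IterableOnlyFile(object):
    """Stand-in for an object like Python 2's tarfile.ExFileObject:
    it can be read and iterated, but is not itself an iterator."""

    def __init__(self, lines):
        self._lines = list(lines)

    def read(self):
        return "".join(self._lines)

    def __iter__(self):
        return iter(self._lines)


fp = IterableOnlyFile(["col1\tcol2\n", "1\t2\n"])
print(is_file_like_strict(fp))   # False: no next/__next__ attribute
print(is_file_like_relaxed(fp))  # True: read and __iter__ are enough
```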

@jreback
Contributor

jreback commented May 29, 2017

Yeah, maybe relax the check in is_file_like only.

gfyoung added a commit to forking-repos/pandas that referenced this issue May 29, 2017
Tarfile.ExFileObject has no "next" method in
Python 2.x, making it an invalid file-like
object in read_csv. However, they can be
read in just fine, meaning our check is too
strict for file-like. This commit relaxes
the check to just look for "__iter__".

Closes gh-16530.
@gfyoung
Member

gfyoung commented May 29, 2017

@jreback : So the C engine doesn't require that the file-like have a next method, but the Python engine does (we explicitly call next(self.data)). This presents a slight dilemma then: how do we check that a file-like has next and that the engine specified is Python? If possible, I would want to switch to the C engine.

@gfyoung
Member

gfyoung commented May 29, 2017

Also, I should add that reading tarfile objects isn't actually feasible with Python's csv library, so this is just for the C engine in Python 2.x.

@jreback jreback added this to the 0.20.2 milestone May 30, 2017
@jtratner
Contributor

@gfyoung - I hit the same issue when trying to pass Luigi's ReadableS3File to read_csv in pandas 0.20.1.

My understanding is that while this is not required to be true for a file-like:

next(fp)
iter(fp)

What is required is that it produce an iterator with a next method.

it = iter(fp)
next(it)

So if pandas wants to use next(self.data), perhaps it just needs to call iter on it first and work from there?
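
A small sketch of that distinction, using a made-up class LineSource (an assumption for illustration, not a real library type) that defines __iter__ but has no next method of its own:

```python
class LineSource(object):
    """Hypothetical file-like: iterable, but not itself an iterator."""

    def __init__(self, text):
        self._text = text

    def read(self):
        return self._text

    def __iter__(self):
        # keepends=True so each line keeps its trailing newline
        return iter(self._text.splitlines(True))


fp = LineSource("a,b\n1,2\n")

# Calling next() directly fails because fp has no __next__ method.
try:
    next(fp)
    direct_next_failed = False
except TypeError:
    direct_next_failed = True

# Going through iter() first yields a real iterator that next() accepts.
it = iter(fp)
first_line = next(it)
```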

@jtratner
Contributor

But I'm not sure this is explicitly defined anywhere, so much as it looks like an in-practice kind of thing.

@gfyoung
Member

gfyoung commented May 31, 2017

@jtratner : For an object to be file-like, I am proposing that the object just have an __iter__ method (it need not have next or __next__). What you propose might work but would require a little more beefing up, as iter objects are not file-like by themselves. If we combine attributes from iter and the object we are wrapping, we could get a valid file-like.

Interesting idea. Worth pursuing once this issue gets resolved.
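
For reference, one way the "combine attributes" idea might look. This is only a sketch under my own assumptions; the class name IterFileWrapper is hypothetical and nothing like it is claimed to exist in pandas.

```python
class IterFileWrapper(object):
    """Hypothetical wrapper: adds next()/__next__ to an iterable file-like
    and delegates every other attribute to the wrapped object."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._iterator = iter(fileobj)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._iterator)

    next = __next__  # Python 2 spelling of the same method

    def __getattr__(self, name):
        # Only called when normal lookup fails, so read(), close(), etc.
        # fall through to the wrapped object.
        return getattr(self._fileobj, name)
```

With this, next(wrapper) works even when the wrapped object only defines __iter__, while wrapper.read() still reaches the original object.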

@jtratner
Contributor

@gfyoung - cool, I agree with your definition :) I was just reinforcing that the definition of file-like as "has iter" is upheld in many places, whereas having a next() method is not.
