pandas 1.0.1 read_csv() is broken for some file-like objects #31819
Comments
I don't mind expanding support (since we used to have it before), but is there a generic way to check all of these? It seems they have varying interfaces. Would our existing file-like check suffice? |
I don't think it's enough to make sure it's just an idea.
Or consider a |
Alternatively, we could just drop the check for it being an instance of |
sadly |
I see. Given how many different file-like objects we have to support, I think this is the best way to go for the time being. It's better that we restore existing file-likes that worked before and then re-evaluate our logic later to catch "invalid file-likes" (for lack of a better term). |
Just an idea: we might classify the object by whether it has an encoding attribute.

    import os
    import pandas
    import tempfile

    fname = ""
    # Write a small Shift-JIS encoded CSV file to disk.
    with tempfile.NamedTemporaryFile(delete=False, mode="w+", encoding="shift-jis") as f:
        f.write("てすと\nbar")
        fname = f.name
        print(hasattr(f, 'encoding'))  # True
    print(fname)
    try:
        # Text mode: the handle carries its own encoding.
        with open(fname, mode="r", encoding="shift-jis") as f:
            print(hasattr(f, 'encoding'))  # True
            result = pandas.read_csv(f)
            print("read shift-jis")
            print(result)
        with open(fname, mode="r", encoding="shift-jis") as f:
            print(hasattr(f, 'encoding'))  # True
            result = pandas.read_csv(f, encoding="utf-8")
            print("open shift-jis file and read_csv with encoding: utf-8")
            print(result)
        # Binary mode: the handle has no encoding, so read_csv must decode.
        with open(fname, mode="rb") as f:
            print(hasattr(f, 'encoding'))  # False
            result = pandas.read_csv(f, encoding="shift-jis")
            print("open binary with buffered and read_csv with encoding: shift-jis")
            print(result)
        with open(fname, mode="rb", buffering=0) as f:
            print(hasattr(f, 'encoding'))  # False
            result = pandas.read_csv(f, encoding="shift-jis")
            print("open binary without buffered and read_csv with encoding: shift-jis")
            print(result)
    except Exception as e:
        print(e)
    os.unlink(fname)

TextIOBase has an encoding attribute; RawIOBase and BufferedIOBase do not.
|
Not sure I follow you here. How would this help determine which files to pass through or not? |
Sorry, I was thinking about @paihu's following comment:
TextIOBase has an encoding attribute; RawIOBase and BufferedIOBase do not.
So we could change if self.encoding and isinstance(source, (io.BufferedIOBase, io.RawIOBase)): to if self.encoding and hasattr(source, 'read') and not hasattr(source, 'encoding'): (or use |
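To illustrate how the two conditions differ, here is a minimal sketch that compares them on a few standard-library objects. The helper names old_check and proposed_check are hypothetical and exist only for this comparison; they are not pandas functions, and the self.encoding part of the condition is left out.

```python
import io
import tempfile


def old_check(source):
    # Hypothetical stand-in for the isinstance-based condition quoted above.
    return isinstance(source, (io.BufferedIOBase, io.RawIOBase))


def proposed_check(source):
    # Hypothetical stand-in for the duck-typed condition suggested above:
    # readable, but without an encoding attribute, so presumably binary.
    return hasattr(source, "read") and not hasattr(source, "encoding")


results = {}
results["BytesIO"] = (old_check(io.BytesIO(b"a")), proposed_check(io.BytesIO(b"a")))
results["StringIO"] = (old_check(io.StringIO("a")), proposed_check(io.StringIO("a")))

# NamedTemporaryFile returns a wrapper object, not an io.IOBase subclass.
with tempfile.NamedTemporaryFile() as binary_tmp:
    results["binary tempfile"] = (old_check(binary_tmp), proposed_check(binary_tmp))

with tempfile.NamedTemporaryFile(mode="w+", encoding="utf-8") as text_tmp:
    results["text tempfile"] = (old_check(text_tmp), proposed_check(text_tmp))

for name, (old, new) in results.items():
    print(f"{name}: old={old} proposed={new}")
```

The two checks agree on plain io objects and differ only for the binary temporary-file wrapper, which is the case this issue reports.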
@sasanquaneuf : You are more than welcome to give your suggestion a try! |
Not exactly sure; please take a look at #32392. It looks like a similar and/or connected issue and may affect the possible solution. |
We're hoping to release 1.0.2 soon. Have we reached an agreed behavior, and is anyone working on this? |
@TomAugspurger : We haven't come upon an agreed-upon fix yet. I was hoping to get some more community input on this, but if not, I'll see what I can change here to fix the regression. |
I'm roughly targeting Wednesday for the release so if you're able to get something together quickly it'd be welcome. |
Sounds good. Let me see what I can put together. It may not be optimal, but if we can at least restore functionality, that would be good for this release. |
Restores behavior down to the fact that the Python engine cannot handle NamedTemporaryFile. Closes pandas-dev#31819
Restores behavior down to the fact that the Python engine cannot handle NamedTemporaryFile. Closes pandas-dev#31819 Credit to @sasanquaneuf for originating idea.
PR is up and ready for review |
Using
Either it's a bug in the code or an issue in the documentation, as documentation is stating that |
@Colin-b can you open a new issue? |
Hi @jorisvandenbossche , While trying to reproduce the issue with the smallest possible sample, I figured out that the issue might not be on pandas side. The And I guess pandas uses Do you still consider this as an issue and want me to open one? In the meantime I will see if it can be reported to werkzeug as I would expect FileStorage to expose read only mode. |
Sorry, I don't know enough about this part of pandas to answer your question. (@gfyoung ?) |
I will, thanks 👍 |
I'm not sure why @Colin-b didn't follow up here, but I think this indicates an issue with Pandas as well. Pandas should accept |
Code Sample
Problem description
With pandas 1.0.1 this sample does not work, but with pandas 0.25.3 it works fine.
As stated in issue #31575, the encoding of a file-like object is ignored when its class is neither io.BufferedIOBase nor io.RawIOBase.
However, some file-like objects do NOT inherit from either of them, although the "actual" inner object is an instance of one of them.
In this code sample, according to the CPython implementation, the wrapper stores the file as an attribute (self.file = file), and __getattr__() returns the file's attributes as its own. So the code does not work. The same problem exists in other file-like objects, for example the tempfile.*File classes, werkzeug's FileStorage class, and so on.
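The wrapper behaviour described above can be observed with the standard library alone. This is a small sketch, assuming CPython's tempfile implementation, where NamedTemporaryFile returns a wrapper object around the real file:

```python
import io
import tempfile

with tempfile.NamedTemporaryFile() as wrapper:  # default mode is "w+b"
    # The wrapper itself is not an io base-class instance...
    is_io_subclass = isinstance(wrapper, (io.BufferedIOBase, io.RawIOBase))
    # ...but the inner file object it stores in .file is.
    inner_is_buffered = isinstance(wrapper.file, io.BufferedIOBase)
    # Attribute access is delegated to the inner file via __getattr__.
    has_read = hasattr(wrapper, "read")
    has_encoding = hasattr(wrapper, "encoding")

print(is_io_subclass, inner_is_buffered, has_read, has_encoding)
```

An isinstance check against the io base classes therefore misses the wrapper even though it behaves like a binary file.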
Note: I first noticed this problem when using pandas with a file posted via flask. The file-like object is an instance of werkzeug's FileStorage. I avoided the problem with the following code:
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.138-89.102.amzn1.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : ja_JP.UTF-8
LOCALE : ja_JP.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.3
setuptools : 36.2.7
Cython : None
pytest : 3.6.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.0.5
lxml.etree : 4.2.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : 4.6.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 3.6.2
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.4
tables : None
tabulate : None
xarray : None
xlrd : 1.1.0
xlwt : None
xlsxwriter : 1.0.5
numba : None