I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd
import tempfile

f = tempfile.SpooledTemporaryFile()
f.write('''\
beløp,valuta
10.0,BTC
5.0,ETH'''.encode('latin-1'))
f.seek(0)
pd.read_csv(f, encoding='latin-1')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 3: invalid start byte
# If I switch SpooledTemporaryFile with TemporaryFile, everything works
Issue Description
When a SpooledTemporaryFile is passed to pd.read_csv, the encoding argument is ignored, and pandas appears to decode the bytes as UTF-8. This fails whenever the file uses a different encoding and contains bytes that are not valid UTF-8.
In the example above the text contains a Norwegian letter in latin-1 encoding. Note that if I switch from using a SpooledTemporaryFile to a TemporaryFile, the problem goes away, and pd.read_csv uses the correct encoding.
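To see why the mismatch matters at the byte level, here is a small standard-library sketch (it does not involve pandas). The helper name `try_decode` is made up for illustration. It shows that the letter ø is the single byte 0xf8 in latin-1, which is not a valid start byte in UTF-8:

```python
def try_decode(raw, encoding):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


# Header row from the example above, encoded as latin-1.
raw = "beløp,valuta\n".encode("latin-1")

as_latin1 = try_decode(raw, "latin-1")  # decodes cleanly
as_utf8 = try_decode(raw, "utf-8")      # fails on the byte for "ø"
```

The failing offset is the index of ø in the header, which matches "position 3" in the traceback above.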
Expected Behavior
pd.read_csv should respect the encoding argument when receiving a SpooledTemporaryFile, and be able to correctly parse the example above without raising an exception.
Installed Versions
INSTALLED VERSIONS
commit : 73c6825
python : 3.8.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
We cannot open a binary tempfile.SpooledTemporaryFile because we need to wrap it in io.TextIOWrapper, which requires the attribute readable (and a few more) that tempfile.SpooledTemporaryFile does not implement.
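One possible way around the missing attributes, sketched with the standard library only: a thin adapter that delegates to the spooled file and supplies readable, writable, and seekable when they are absent. The names `IOAdapter` and `read_text` are hypothetical and are not presented as pandas API. Newer Python releases may already provide these methods, in which case the adapter simply delegates.

```python
import io
import tempfile


class IOAdapter:
    """Hypothetical helper, not part of pandas or the standard library.

    Delegates attribute access to the wrapped buffer and fills in the
    readable/writable/seekable methods when the buffer lacks them.
    """

    def __init__(self, buffer):
        self.buffer = buffer

    def __getattr__(self, name):
        # Only called for attributes the adapter does not define itself.
        return getattr(self.buffer, name)

    def readable(self):
        if hasattr(self.buffer, "readable"):
            return self.buffer.readable()
        return True  # assumption: the handle was opened for reading

    def writable(self):
        if hasattr(self.buffer, "writable"):
            return self.buffer.writable()
        return True  # assumption: the handle was opened for writing

    def seekable(self):
        if hasattr(self.buffer, "seekable"):
            return self.buffer.seekable()
        return True  # assumption: SpooledTemporaryFile supports seek/tell


def read_text(raw, encoding):
    """Write raw bytes to a spooled file and decode them via TextIOWrapper."""
    with tempfile.SpooledTemporaryFile() as f:
        f.write(raw)
        f.seek(0)
        text = io.TextIOWrapper(IOAdapter(f), encoding=encoding)
        try:
            return text.read()
        finally:
            # Detach so the wrapper does not close the spooled file itself.
            text.detach()
```

For example, `read_text("beløp,valuta\n".encode("latin-1"), "latin-1")` decodes the header row with the requested encoding instead of assuming UTF-8.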
We could implement a simpler version of io.TextIOWrapper with just the functions we need. This would allow us to support binary tempfile.SpooledTemporaryFile. Our own implementation will most likely be slower than io.TextIOWrapper. This would affect to_csv/read_csv and to_json/read_json when the user provides a binary handle, and any handle passed to those functions when the user requests compression. @jreback
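As a rough illustration of what such a simpler wrapper could look like, here is a read-only sketch built on codecs incremental decoders. `MinimalTextReader` and `decode_lines` are hypothetical names. The sketch only covers read, readline, and iteration; it ignores writing, universal newline handling, and seeking, all of which a real replacement would have to address.

```python
import codecs
import tempfile


class MinimalTextReader:
    """Hypothetical read-only stand-in for io.TextIOWrapper.

    It only needs a read(size) method on the underlying binary buffer.
    """

    def __init__(self, buffer, encoding, errors="strict", chunk_size=8192):
        self._buffer = buffer
        self._decoder = codecs.getincrementaldecoder(encoding)(errors)
        self._chunk_size = chunk_size
        self._pending = ""  # decoded text that has not been returned yet
        self._eof = False

    def _fill(self):
        """Decode one more chunk of bytes into the pending text."""
        chunk = self._buffer.read(self._chunk_size)
        if chunk:
            self._pending += self._decoder.decode(chunk)
        else:
            # Flush the decoder; raises if a multi-byte sequence is cut off.
            self._pending += self._decoder.decode(b"", final=True)
            self._eof = True

    def read(self, size=-1):
        while not self._eof and (size < 0 or len(self._pending) < size):
            self._fill()
        if size < 0:
            size = len(self._pending)
        result, self._pending = self._pending[:size], self._pending[size:]
        return result

    def readline(self):
        while "\n" not in self._pending and not self._eof:
            self._fill()
        # Cut after the first newline, or take everything left at EOF.
        end = self._pending.find("\n") + 1 or len(self._pending)
        line, self._pending = self._pending[:end], self._pending[end:]
        return line

    def __iter__(self):
        # Yield lines until readline() returns the empty string.
        return iter(self.readline, "")


def decode_lines(raw, encoding, chunk_size=4):
    """Round-trip raw bytes through a SpooledTemporaryFile and decode them."""
    with tempfile.SpooledTemporaryFile() as f:
        f.write(raw)
        f.seek(0)
        return list(MinimalTextReader(f, encoding, chunk_size=chunk_size))
```

The small default `chunk_size` in `decode_lines` is there only to exercise chunk boundaries. The per-chunk Python-level decoding is also where the expected slowdown relative to io.TextIOWrapper would come from.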
Installed Versions
INSTALLED VERSIONS
commit : 73c6825
python : 3.8.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.2.0.post0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None