BUG: pd.read_csv - encoding argument ignored for SpooledTemporaryFile #44748

Closed
2 of 3 tasks
bergkvist opened this issue Dec 4, 2021 · 3 comments · Fixed by #44761
Labels: Bug, IO CSV (read_csv, to_csv)
Milestone: 1.4

Comments

bergkvist commented Dec 4, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import tempfile
f = tempfile.SpooledTemporaryFile()
f.write('''\
beløp,valuta
10.0,BTC
5.0,ETH
'''.encode('latin-1'))
f.seek(0)
pd.read_csv(f, encoding='latin-1')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 3: invalid start byte

# If I switch SpooledTemporaryFile with TemporaryFile, everything works

Issue Description

When a SpooledTemporaryFile is passed to pd.read_csv, the encoding argument is ignored and the bytes appear to be parsed as UTF-8. This fails whenever the data uses a different encoding and contains bytes that are not valid UTF-8.

In the example above the text contains a Norwegian letter in latin-1 encoding. Note that if I switch from using a SpooledTemporaryFile to a TemporaryFile, the problem goes away, and pd.read_csv uses the correct encoding.
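The error message itself can be reproduced without pandas, which is consistent with the bytes being decoded as UTF-8 regardless of the encoding argument. A minimal sketch using only the standard library (the variable names are illustrative):

```python
# The header line from the example above, encoded as latin-1.
raw = "beløp,valuta".encode("latin-1")

# Decoding with the intended encoding round-trips cleanly.
decoded = raw.decode("latin-1")

# Decoding the same bytes as UTF-8 fails on the byte that encodes "ø".
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(decoded)
print(error)
```

The failing byte and its position match the traceback in the reproducible example.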

Expected Behavior

pd.read_csv should respect the encoding argument when receiving a SpooledTemporaryFile, and be able to correctly parse the example above without raising an exception.

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.8.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.2.0.post0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

bergkvist added the Bug and Needs Triage labels Dec 4, 2021
@twoertwein
Member

Similar to #43439

If you can open the SpooledTemporaryFile in text mode, it works:

import tempfile

import pandas as pd

with tempfile.SpooledTemporaryFile(mode="w+t") as f:
    f.write("\nbeløp,valuta\n10.0,BTC\n5.0,ETH")
    f.seek(0)
    pd.read_csv(f, encoding="latin-1")

twoertwein added the IO CSV (read_csv, to_csv) label and removed the Needs Triage label Dec 4, 2021
@twoertwein
Member

We cannot open a binary tempfile.SpooledTemporaryFile because we need to wrap it in io.TextIOWrapper, which requires the attribute readable (and a few more) that tempfile.SpooledTemporaryFile does not implement.

We could implement a simpler version of io.TextIOWrapper with just the functions we need. This would allow us to support binary tempfile.SpooledTemporaryFile. Our own implementation would most likely be slower than io.TextIOWrapper. This would affect to_csv/read_csv/to_json/read_json when the user provides a binary handle, and any handle for those functions when the user requests compression. @jreback
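To illustrate what a lighter-weight decoding layer could look like, here is a sketch that uses the standard library's codecs.StreamReader, which only needs a read() method on the underlying handle. This is a swapped-in technique for illustration, not what pandas implements, and it says nothing about relative performance.

```python
import codecs
import tempfile

with tempfile.SpooledTemporaryFile() as f:
    f.write("beløp,valuta\n10.0,BTC\n5.0,ETH\n".encode("latin-1"))
    f.seek(0)

    # codecs.getreader returns a StreamReader class for the codec;
    # instantiating it wraps the binary handle and decodes on read.
    reader = codecs.getreader("latin-1")(f)
    lines = reader.read().splitlines()

print(lines)
```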

@twoertwein
Member

I think I found a simple way to support SpooledTemporaryFile (by lying that certain attributes exist and are true): #44761
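A rough sketch of that idea, assuming a thin proxy that answers readable/seekable/writable itself when the wrapped handle lacks them and delegates everything else. The class name ReadableShim is hypothetical and this is not the code from #44761.

```python
import io
import tempfile


class ReadableShim:
    """Hypothetical proxy: supplies the capability methods that
    io.TextIOWrapper asks for when the wrapped handle lacks them."""

    def __init__(self, buffer):
        self.buffer = buffer

    def __getattr__(self, name):
        # Only called for attributes not found on the shim itself.
        return getattr(self.buffer, name)

    def _capability(self, name):
        # Ask the real handle if it can answer; otherwise claim True.
        method = getattr(self.buffer, name, None)
        return method() if callable(method) else True

    def readable(self):
        return self._capability("readable")

    def seekable(self):
        return self._capability("seekable")

    def writable(self):
        return self._capability("writable")


with tempfile.SpooledTemporaryFile() as f:
    f.write("beløp,valuta\n10.0,BTC\n5.0,ETH\n".encode("latin-1"))
    f.seek(0)

    text = io.TextIOWrapper(ReadableShim(f), encoding="latin-1")
    lines = text.read().splitlines()
    # Detach so the wrapper does not close the temporary file itself.
    text.detach()

print(lines)
```

The proxy falls back to True only when the handle cannot answer, so handles that already implement these methods keep their real behaviour.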

@oskli

jreback added this to the 1.4 milestone Dec 4, 2021
3 participants