BUG: pd.read_csv - encoding argument ignored for SpooledTemporaryFile #44748

bergkvist · 2021-12-04T02:21:45Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import tempfile
f = tempfile.SpooledTemporaryFile()
f.write('''\
beløp,valuta
10.0,BTC
5.0,ETH
'''.encode('latin-1'))
f.seek(0)
pd.read_csv(f, encoding='latin-1')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 3: invalid start byte

# If I switch SpooledTemporaryFile with TemporaryFile, everything works

Issue Description

When a SpooledTemporaryFile is passed to pd.read_csv, the encoding argument is ignored and the bytes appear to be decoded as UTF-8. This fails whenever the data uses a different encoding and contains bytes that are not valid UTF-8.

In the example above the text contains a Norwegian letter in latin-1 encoding. Note that if I switch from using a SpooledTemporaryFile to a TemporaryFile, the problem goes away, and pd.read_csv uses the correct encoding.

Expected Behavior

pd.read_csv should respect the encoding argument when receiving a SpooledTemporaryFile, and be able to correctly parse the example above without raising an exception.

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.8.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.2.0.post0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


twoertwein · 2021-12-04T03:23:24Z

Similar to #43439

If you can open the SpooledTemporaryFile in text mode, it works:

import tempfile

import pandas as pd

with tempfile.SpooledTemporaryFile(mode="w+t") as f:
    f.write("\nbeløp,valuta\n10.0,BTC\n5.0,ETH")
    f.seek(0)
    pd.read_csv(f, encoding="latin-1")

twoertwein · 2021-12-04T17:32:39Z

We cannot read a binary tempfile.SpooledTemporaryFile because we need to wrap it in io.TextIOWrapper, which requires the readable attribute (and a few more) that tempfile.SpooledTemporaryFile does not implement.
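To make the missing attributes concrete, here is a minimal stdlib-only sketch. It lists which of the io.IOBase-style methods a spooled file lacks, then sidesteps the problem by copying the bytes into an io.BytesIO before decoding. The helper names missing_io_methods and rows_from_spooled are hypothetical and not part of pandas; the list of missing methods depends on the Python version (it should be non-empty on the 3.8 reported here, while newer releases may already provide them).

```python
import csv
import io
import tempfile

# Methods io.TextIOWrapper queries on the buffer it wraps.
REQUIRED = ("readable", "writable", "seekable")


def missing_io_methods(handle):
    """Return the io.IOBase-style methods that `handle` does not provide."""
    return [name for name in REQUIRED if not hasattr(handle, name)]


def rows_from_spooled(handle, encoding):
    """Copy the spooled bytes into a BytesIO and decode them explicitly."""
    handle.seek(0)
    buffer = io.BytesIO(handle.read())
    # BytesIO implements the full io.IOBase interface, so wrapping it works.
    with io.TextIOWrapper(buffer, encoding=encoding, newline="") as text:
        return list(csv.reader(text))


with tempfile.SpooledTemporaryFile() as f:
    f.write("beløp,valuta\n10.0,BTC\n5.0,ETH\n".encode("latin-1"))
    print(missing_io_methods(f))  # result depends on the Python version
    rows = rows_from_spooled(f, "latin-1")
    print(rows)
```

The same io.BytesIO object could instead be passed to pd.read_csv(buffer, encoding="latin-1"). The cost is that the whole file is copied into memory, which defeats the purpose of spooling for large inputs.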

We could implement a simpler version of io.TextIOWrapper with just the functions we need. This would allow us to support a binary tempfile.SpooledTemporaryFile, but our own implementation would most likely be slower than io.TextIOWrapper. The change would affect to_csv/read_csv and to_json/read_json whenever the user provides a binary handle, and any handle passed to those functions when the user requests compression. @jreback

twoertwein · 2021-12-04T22:16:55Z

I think I found a simple way to support SpooledTemporaryFile (by lying that certain attributes exist and are true): #44761
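A rough sketch of what "lying that certain attributes exist" could look like, using only the standard library. This illustrates the idea and is not the code from #44761; the class name BinaryHandleShim and the helper _ask are hypothetical. The shim forwards everything to the wrapped handle and answers True for readable/writable/seekable when the handle does not define them, which is enough for io.TextIOWrapper to accept it.

```python
import io
import tempfile


class BinaryHandleShim:
    """Hypothetical wrapper: delegate to `buffer`, fake missing capability checks."""

    def __init__(self, buffer):
        self.buffer = buffer

    def __getattr__(self, name):
        # Only called for attributes the shim itself lacks (read, tell, closed, ...).
        return getattr(self.buffer, name)

    def _ask(self, name):
        # Use the real method when it exists; otherwise claim the capability.
        method = getattr(self.buffer, name, None)
        return method() if callable(method) else True

    def readable(self):
        return self._ask("readable")

    def writable(self):
        return self._ask("writable")

    def seekable(self):
        return self._ask("seekable")


with tempfile.SpooledTemporaryFile() as f:
    f.write("beløp,valuta\n10.0,BTC\n5.0,ETH\n".encode("latin-1"))
    f.seek(0)
    wrapper = io.TextIOWrapper(BinaryHandleShim(f), encoding="latin-1")
    text = wrapper.read()
    wrapper.detach()  # leave closing the file to the `with` block
    print(text)
```

Claiming a capability the handle does not actually have only defers the failure to the first real read, write, or seek, so this trades an early, clear error for broader compatibility.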

@oskli

bergkvist added the Bug and Needs Triage labels Dec 4, 2021

twoertwein added the IO CSV label and removed the Needs Triage label Dec 4, 2021

twoertwein mentioned this issue Dec 4, 2021

ENH: support SpooledTemporaryFile #44761

Merged

4 tasks

jreback added this to the 1.4 milestone Dec 4, 2021

jreback closed this as completed in #44761 Dec 11, 2021
