BUG: read_csv raising when null bytes are in skipped rows #38989

Closed
3 tasks done
phofl opened this issue Jan 6, 2021 · 6 comments · Fixed by #38997
Labels
Bug; IO CSV (read_csv, to_csv); Regression (functionality that used to work in a prior pandas version)
Comments

@phofl
Member

phofl commented Jan 6, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

I have a few odd CSV files with null bytes in the first row. Since 1.2.0, reading them raises even if I skip the first row:

import pandas as pd

pd.read_csv("test.csv", engine="python", skiprows=[0])
pd.read_csv("test.csv", engine="c", skiprows=[0])

I could not reproduce the failure with an example created in code.

Problem description

Both engines raise:

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.3/scratches/scratch_4.py", line 377, in <module>
    pd.read_csv("/media/sf_Austausch/test.csv", engine="c", skiprows=[0], nrows=2)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers.py", line 605, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers.py", line 814, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers.py", line 1045, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers.py", line 1894, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 517, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 619, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 813, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1942, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 8: invalid continuation byte

Process finished with exit code 1

This worked on 1.1.5.

Relevant changes are:

https://github.com/pandas-dev/pandas/blob/9b16b1e1ee049b042a7d59cddd8fbd913137223f/pandas/io/common.py#L556:L557
now infers the encoding, which leads to

https://github.com/pandas-dev/pandas/blob/9b16b1e1ee049b042a7d59cddd8fbd913137223f/pandas/io/common.py#L643:L651
where errors is always strict for read_csv. On 1.1.5, when no encoding was given, errors was set to replace, which is why this used to work. Was this an intended change? Labeling as Regression for now.
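
To illustrate the difference between the two error handlers, here is a minimal stdlib-only sketch. It does not use pandas internals: `RAW` and `read_skipping_first_line` are hypothetical names, the first line of the sample is modeled on the failing file, and the remaining rows are made up. It assumes the text layer decodes the stream before any row can be skipped, which is roughly what the traceback suggests.

```python
import io

# Hypothetical sample: the first line contains bytes that are not valid UTF-8.
RAW = b" xyz gem\xe4\xdf Tabelle 4 \r\n!,xyz\r\n*,2000\r\n"


def read_skipping_first_line(raw, errors):
    """Decode `raw` as UTF-8 with the given error handler, then drop line 0."""
    wrapper = io.TextIOWrapper(
        io.BytesIO(raw), encoding="utf-8", errors=errors, newline=""
    )
    with wrapper:
        lines = wrapper.read().splitlines()
    return lines[1:]


# "replace" substitutes U+FFFD for the bad bytes, so the skipped line is harmless.
print(read_skipping_first_line(RAW, "replace"))

# "strict" fails while decoding, before the first line can be dropped.
try:
    read_skipping_first_line(RAW, "strict")
except UnicodeDecodeError as exc:
    print("strict handler raised:", exc.reason)
```

With `errors="replace"` the undecodable bytes only affect the row that is thrown away, so the remaining rows come through intact.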

Expected Output

   !   xyz  0  1
0  *  2000  1  2
1  *  2001  0  0

cc @gfyoung @twoertwein
This was caused by #36997

Output of pd.show_versions()

INSTALLED VERSIONS

commit : cd0224d
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-58-generic
Version : #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0.dev0+356.ge0cb09e917
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.1
hypothesis : 5.37.1
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : 0.4.1
gcsfs : 0.7.1
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2

@phofl phofl added Bug Needs Triage Issue that has not been reviewed by a pandas team member IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 6, 2021
@twoertwein
Member

where errors is always strict for read_csv. On 1.1.5 in case of no encoding given, errors was set to replace, which caused this to work. Was this an intended change?

Thank you @phofl for already narrowing it down :) That wasn't an intentional change.

I'm happy to make a PR to restore the old behavior. Which behavior is preferred?

@phofl
Member Author

phofl commented Jan 6, 2021

I would say errors="replace" when the encoding is set to utf-8? That would be in line with the previous behavior.

@gfyoung
Member

gfyoung commented Jan 6, 2021

Agreed this is a regression if that's the case, and restoring behavior is the right call.

@twoertwein
Member

@phofl Could you please read the beginning of the CSV in binary mode? That would help to make a test case out of it. The following (unfortunately) does not trigger this bug:

from io import BytesIO
import pandas as pd

buffer = BytesIO()
buffer.write(b'\0' + '\na\n1'.encode('utf-8'))
buffer.seek(0)
pd.read_csv(buffer, skiprows=[0])
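
One likely reason this snippet does not fail: a NUL byte is valid UTF-8 (it encodes U+0000), so strict decoding succeeds. A small stdlib-only check, using a hypothetical helper `is_valid_utf8`:

```python
def is_valid_utf8(data):
    """Return True if `data` decodes as UTF-8 under the strict error handler."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Same bytes as written to the buffer above.
sample = b"\0" + "\na\n1".encode("utf-8")
print(is_valid_utf8(sample))
```

If that holds, a reproducer needs bytes that are actually invalid in UTF-8 rather than null bytes.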

@phofl
Member Author

phofl commented Jan 7, 2021

import io
import pandas as pd

buffer = io.BytesIO()
buffer.write(b' xyz gem\xe4\xdf Tabelle 4 \r\n')
buffer.seek(0)
pd.read_csv(buffer, skiprows=[0])

raises
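
The bytes `\xe4\xdf` are not valid UTF-8, but they do decode under single-byte encodings. As a guess only (the issue does not state the file's encoding), Latin-1 would turn them into "äß", giving the German word "gemäß". A stdlib-only sketch, with `header` and `decoded` as illustrative names:

```python
# The header line from the reproducer above.
header = b" xyz gem\xe4\xdf Tabelle 4 \r\n"

# Latin-1 maps every byte to a code point, so this decode never raises.
decoded = header.decode("latin-1")
print(decoded.strip())
```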

@twoertwein
Member

perfect, thank you :)
