BUG: read_csv with multi character sep does not raise error in Jupyter Notebook/Python in terminal #57885

auggie246 · 2024-03-18T10:13:48Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

data = pd.read_csv(
    "file.tsv",
    sep="\t",
    quotechar=False,
    engine="pyarrow",
    header=None,
    names=["class", "written", "normalized"],
    na_filter=False,
)

Issue Description

I ran the code above in a notebook and in a terminal with python and, to my surprise, it works.
A similar read_csv call through kedro's CSVDataset (which uses the same pandas in the same environment) throws the following error: the 'pyarrow' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex)
According to the read_csv documentation and the parser source code, that error is correct.

So I am not sure what is causing the method to work in a notebook when it is not supposed to. I have also checked, by inspecting the contents, that the data read by the code above is correct.

Expected Behavior

The following error should be raised, but it is not:
the 'pyarrow' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex)

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1048-gke
Version : #53-Ubuntu SMP Tue Nov 28 00:39:01 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_SG.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.1
numpy : 1.23.5
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 23.1.2
Cython : 3.0.9
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.12.2
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2023.12.2
scipy : 1.12.0
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None


asishm · 2024-03-18T11:44:04Z

Maybe I'm misinterpreting the issue, but you're expecting it to raise if sep='\t' with the pyarrow engine?

len('\t') == 1

auggie246 · 2024-03-18T14:29:04Z

> Maybe I'm misinterpreting the issue, but you're expecting it to raise if sep='\t' with the pyarrow engine?
>
> len('\t') == 1

Yes, exactly. For some reason it works fine in a notebook, which isn't supposed to be the right behavior.
However, the dataframe looks perfectly fine.
If it works in a notebook but doesn't in kedro (which uses pandas), what should the right behavior be?

phofl · 2024-03-18T20:30:46Z

Any chance they are using sep=r"\t"?
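To make the difference concrete, here is a minimal sketch in plain Python. The helper `looks_like_regex_sep` is hypothetical and written only for illustration; it restates the rule quoted in the error message (separators longer than one character and different from '\s+' are interpreted as regex) and is not pandas's actual implementation.

```python
# Hypothetical helper for illustration only; this is NOT pandas's code.
# It restates the rule from the error message: a separator longer than
# one character, and different from the pattern \s+, counts as a regex.
def looks_like_regex_sep(sep: str) -> bool:
    return len(sep) > 1 and sep != r"\s+"


tab = "\t"       # one character: an actual tab
raw_tab = r"\t"  # two characters: a backslash followed by the letter t

print(len(tab), looks_like_regex_sep(tab))          # 1 False
print(len(raw_tab), looks_like_regex_sep(raw_tab))  # 2 True
```

If the separator reaches read_csv as the two-character string (for example because it was written as a raw string or was not unescaped on the way in), the error would be expected; with a real tab character it would not.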

phofl · 2024-03-18T20:32:09Z

Closing this for now; it looks like this is on their end. Please ping to reopen if you have validated that this is indeed a pandas bug.

auggie246 added the Bug and Needs Triage labels Mar 18, 2024

auggie246 changed the title ~~BUG: Jupyter Notebook/python in terminal does not flag "engine does not support regex separators" in read_csv~~ BUG: Jupyter Notebook/python in terminal does not raise error "engine does not support regex separators" in read_csv Mar 18, 2024

auggie246 changed the title ~~BUG: Jupyter Notebook/python in terminal does not raise error "engine does not support regex separators" in read_csv~~ BUG: read_csv with multi character sep does not raise error in Jupyter Notebook/Python in terminal Mar 18, 2024

phofl closed this as completed Mar 18, 2024
