BUG: read_csv with multi character sep does not raise error in Jupyter Notebook/Python in terminal #57885

Closed
3 tasks done
auggie246 opened this issue Mar 18, 2024 · 4 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@auggie246

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

data = pd.read_csv(
    "file.tsv",
    sep="\t",
    quotechar=False,
    engine="pyarrow",
    header=None,
    names=["class", "written", "normalized"],
    na_filter=False,
)

Issue Description

I ran the code above both in a notebook and in a terminal with Python, and to my surprise it works.
A similar read_csv call made through kedro's CSVDataset (importantly, it uses the same pandas in the same environment) throws the following error: the 'pyarrow' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex)
According to the read_csv documentation and the parser source code, that error is correct.

So I am not sure what causes the call to work in a notebook when it is not supposed to. I have also checked that the data read by the code above is correct by inspecting its contents.

Expected Behavior

The following error should be raised, but it is not:
the 'pyarrow' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex)

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1048-gke
Version : #53-Ubuntu SMP Tue Nov 28 00:39:01 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_SG.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.1
numpy : 1.23.5
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 23.1.2
Cython : 3.0.9
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.12.2
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2023.12.2
scipy : 1.12.0
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@auggie246 auggie246 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 18, 2024
@auggie246 auggie246 changed the title BUG: Jupyter Notebook/python in terminal does not flag "engine does not support regex separators" in read_csv BUG: Jupyter Notebook/python in terminal does not raise error "engine does not support regex separators" in read_csv Mar 18, 2024
@auggie246 auggie246 changed the title BUG: Jupyter Notebook/python in terminal does not raise error "engine does not support regex separators" in read_csv BUG: read_csv with multi character sep does not raise error in Jupyter Notebook/Python in terminal Mar 18, 2024
@asishm
Contributor

asishm commented Mar 18, 2024

Maybe I'm misinterpreting the issue, but you're expecting it to raise if sep='\t' with the pyarrow engine?

len('\t') == 1

@auggie246
Author

auggie246 commented Mar 18, 2024

> Maybe I'm misinterpreting the issue, but you're expecting it to raise if sep='\t' with the pyarrow engine?
>
> len('\t') == 1

Yes, exactly. For some reason it works fine in the notebook, which isn't supposed to be the right behavior.
However, the dataframe looks perfectly fine.
If it works in the notebook but doesn't in kedro (which uses pandas), what should the right behavior be?

@phofl
Member

phofl commented Mar 18, 2024

Any chance they are using sep=r"\t"?
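
A minimal sketch of why that distinction would matter, using only the standard library. The helper `is_regex_separator` is hypothetical (it is not a pandas function); it only restates the rule quoted in the error message above, namely that separators longer than one character and different from '\s+' are interpreted as regex.

```python
def is_regex_separator(sep: str) -> bool:
    # Hypothetical helper, not part of pandas: restates the rule from the
    # error message (more than one character and not the special '\s+').
    return len(sep) > 1 and sep != r"\s+"


tab = "\t"       # escape sequence: one character, a literal tab
raw_tab = r"\t"  # raw string: two characters, a backslash followed by "t"

print(len(tab), is_regex_separator(tab))          # 1 False
print(len(raw_tab), is_regex_separator(raw_tab))  # 2 True
```

Under this rule, sep="\t" is an ordinary single-character separator, while sep=r"\t" would count as a regex separator and would be expected to trigger the error reported from kedro.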

@phofl
Member

phofl commented Mar 18, 2024

Closing this for now; it looks like this is on their end. Please ping to reopen if you have validated that this is indeed a pandas bug.

@phofl phofl closed this as completed Mar 18, 2024
3 participants