Problem detecting separator (sep=None) with a DataFrame with a single column #17333

Closed
Leobouloc opened this issue Aug 25, 2017 · 5 comments · Fixed by #17822
Labels
IO CSV read_csv, to_csv Usage Question

@Leobouloc

Code Sample

from io import StringIO

import pandas as pd

TESTDATA=StringIO("""col1
    val1
    val2d
    val3
    val4
    """)

df = pd.read_csv(TESTDATA, sep=None, engine='python')
print(df)

returns:

           c  l1
0       val1 NaN
1      val2d NaN
2       val3 NaN
3       val4 NaN

Problem

I would expect the input to be detected as a DataFrame with one column (with no separator), which is not the case here.

Expected Output

        col1
0       val1
1      val2d
2       val3
3       val4

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-92-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.23.4
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
feather: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
@gfyoung gfyoung added IO CSV read_csv, to_csv Bug labels Aug 25, 2017
@gfyoung
Member

gfyoung commented Aug 25, 2017

@Leobouloc : Thanks for reporting this! A little odd, especially since the "o" somehow gets dropped from col1. As a workaround, you can pass in delim_whitespace=True, and you should get the table that you expect. I can't tell at the moment whether this error stems from Python's native csv parsing or from our own wrapper around it.

Investigation (and PR) are welcome!

@Licht-T
Contributor

Licht-T commented Oct 8, 2017

@gfyoung This seems to be the correct behavior.
When sep=None, pandas sniffs the first line to determine what the separator is.
https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L2167

The csv library treats 'o' as the separator.
https://docs.python.org/2/library/csv.html#csv.Sniffer

Here is the example.

Input

import csv

sniffed = csv.Sniffer().sniff('col1')
sniffed.delimiter

Output

'o'
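
For anyone hitting this with single-column files, one possible way around it is to restrict the characters the sniffer may consider, using the `delimiters` argument of `csv.Sniffer.sniff`. The sketch below is only an illustration: `guess_delimiter` is a hypothetical helper (it is not part of pandas or the csv module), and the candidate set `",;\t|"` is an assumption about which separators are plausible for your data.

```python
import csv


def guess_delimiter(sample, candidates=",;\t|"):
    """Hypothetical helper: sniff a delimiter, but only among `candidates`.

    Returns None when none of the candidate characters looks like a
    delimiter, which is what happens for single-column input.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The sniffer raises csv.Error when it cannot determine a delimiter.
        return None


single_column = "col1\nval1\nval2d\nval3\nval4\n"
two_columns = "a,b\n1,2\n3,4\n"

print(guess_delimiter(single_column))
print(guess_delimiter(two_columns))
```

With the candidate set limited this way, letters such as 'o' can no longer be picked, so the single-column sample should come back as None and the comma-separated one as ','. The caller can then decide how to read the file when no separator is found, for example by passing an explicit sep to read_csv instead of sep=None.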

@gfyoung
Member

gfyoung commented Oct 8, 2017

@Licht-T : Ah, interesting...this issue is beyond our control then, since the sniffing behavior comes from Python itself. Do you mind adding this example as a test for the Python engine?

@Leobouloc : We will close this issue after the test is added because the behavior causing this output is outside of our control. Feel free to ping the Python devs about this if you think this is sufficiently bothersome to address.

@gfyoung gfyoung added Usage Question and removed Bug labels Oct 8, 2017
@gfyoung gfyoung added this to the 0.21.0 milestone Oct 8, 2017
@gfyoung
Member

gfyoung commented Oct 8, 2017

As this issue can be resolved by simply adding a test, I would like to include it in the release. If it is not resolved before 0.21.0, ping me, and we'll get it added in time.

@gfyoung gfyoung self-assigned this Oct 8, 2017
@Licht-T
Contributor

Licht-T commented Oct 8, 2017

@gfyoung Okay! I will write this test asap.
