
pd.read_csv prefix parameter does not seem to work #27394


Closed
Sniperq2 opened this issue Jul 15, 2019 · 13 comments · Fixed by #31383
Labels
Error Reporting (Incorrect or improved errors from pandas), IO CSV (read_csv, to_csv)
Milestone

Comments

@Sniperq2

Sniperq2 commented Jul 15, 2019

Code Sample, a copy-pastable example if possible

test.csv:

field,data,values
text1,34,hh,fail
text2,76,tt,fail2

Script:

import pandas as pd

if __name__ == "__main__":
    data_list = pd.read_csv('test.csv', prefix="X")
    print(data_list)

Problem description

The documentation says:

prefix : str, optional
    Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …

If I understand the documentation correctly, I should get this:

Expected Output

field    data    values  X0
text1    34      hh      fail
text2    76      tt      fail2

But I got this instead:

Current Output

       field  data  values
text1     34    hh    fail
text2     76    tt   fail2
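
The shifted layout in the current output can be reproduced without a file on disk. This is a minimal sketch, assuming the same CSV content as in the code sample and leaving out the prefix argument. When the header row has one field fewer than the data rows, read_csv uses the first column as the index, which is why text1 and text2 show up as row labels and the remaining three fields land under field, data and values.

```python
import io

import pandas as pd

# Same content as test.csv in the report: 3 header fields, 4 fields per data row.
csv_text = "field,data,values\ntext1,34,hh,fail\ntext2,76,tt,fail2\n"

df = pd.read_csv(io.StringIO(csv_text))

# The extra leading field becomes the index rather than a column.
print(df.index.tolist())
print(df.columns.tolist())
print(df)
```

So no column is actually missing a name here; the header names are simply assigned to the last three fields of each row.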

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 40.1.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.1.1
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None

@TomAugspurger
Contributor

I think you misunderstand. That's only used when there's no header and the default range(n) header is used.

In [9]: pd.read_csv(io.StringIO('0,1\n2,3'), header=None, prefix="X_")
Out[9]:
   X_0  X_1
0    0    1
1    2    3

In [10]: pd.read_csv(io.StringIO('0,1\n2,3'), prefix="X_")
Out[10]:
   0  1
0  2  3

We should probably raise when header and prefix are both not None.
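
The header=None result above can also be produced without the prefix argument, which may help if the installed pandas version does not accept it (that is an assumption about the reader's environment, not something stated in this thread). A minimal sketch using DataFrame.add_prefix on the default integer column labels:

```python
import io

import pandas as pd

# With header=None the columns get the default integer labels 0, 1, ...
# add_prefix then turns them into X_0, X_1, ...
df = pd.read_csv(io.StringIO("0,1\n2,3"), header=None).add_prefix("X_")
print(df)
```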

@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Jul 15, 2019
@TomAugspurger TomAugspurger added IO CSV read_csv, to_csv Error Reporting Incorrect or improved errors from pandas good first issue and removed Difficulty Intermediate good first issue labels Jul 15, 2019
@Sniperq2
Author

Ah. That's it. "Prefix to add to column numbers when no header at all, e.g. ‘X’ for X0, X1, …"
So I consider this NOTABUG.

I looked around for a function to drop columns without a header; df.dropna() did not help. Could I create a feature request?

@TomAugspurger
Contributor

I looked around for a function to drop columns without a header

I don't understand. Every column in a DataFrame always has a name. You can't have a "column without a header".

@sameshl
Contributor

sameshl commented Jul 15, 2019

@TomAugspurger, what do you mean by

We should probably raise when header and prefix are both not None.

Could you explain the issue to me? I would like to take it up.

@TomAugspurger
Contributor

@sameshl great!

If you look at #27394 (comment), In[10] should raise a TypeError saying that prefix can only be used when header=None.
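
The check being proposed can be sketched outside pandas. The wrapper below is hypothetical (read_csv_checked is not a pandas function, and this is not the change that was eventually merged); it only illustrates the rule that prefix is rejected unless header=None, and it applies the prefix with add_prefix so it does not depend on read_csv accepting a prefix argument.

```python
import io

import pandas as pd


def read_csv_checked(source, header="infer", prefix=None, **kwargs):
    """Hypothetical wrapper, not pandas API: allow prefix only with header=None."""
    if prefix is not None and header is not None:
        raise TypeError("prefix can only be used when header=None")
    df = pd.read_csv(source, header=header, **kwargs)
    if prefix is not None:
        # header=None gives integer labels 0, 1, ...; prepend the prefix to each.
        df = df.add_prefix(prefix)
    return df


# Allowed: no header row, so the prefix is applied to the generated labels.
print(read_csv_checked(io.StringIO("0,1\n2,3"), header=None, prefix="X_"))

# Rejected: the header is inferred from the first row, so prefix would be ignored.
try:
    read_csv_checked(io.StringIO("0,1\n2,3"), prefix="X_")
except TypeError as exc:
    print(f"TypeError: {exc}")
```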

@WillAyd @simonjayhawkins does that sound right to you?

@WillAyd
Member

WillAyd commented Jul 15, 2019

Makes sense to me

@sameshl
Contributor

sameshl commented Jul 15, 2019

@TomAugspurger I see. Thanks for the info. Will work on it!

@Sniperq2
Author

I looked around for a function to drop columns without a header

I don't understand. Every column in a DataFrame always has a name. You can't have a "column without a header".

Yes, I totally agree with you. But some CSV files have columns without headers. People make mistakes... It would be nice if pandas.read_csv had an option to cut off such columns (those without a header).

ex. https://stackoverflow.com/questions/50301544/how-to-delete-columns-without-headers-in-python-pandas-read-csv
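
For the case where a header cell is simply blank, one common approach is to filter on the placeholder name that read_csv assigns. This is a sketch on a made-up file (the CSV text below is hypothetical, not the reporter's test.csv) and is not claimed to be the method from the linked answers. A blank header cell is given a name of the form "Unnamed: N", so those columns can be dropped after reading.

```python
import io

import pandas as pd

# Hypothetical file: the third header cell is empty.
csv_text = "field,data,\ntext1,34,fail\ntext2,76,fail2\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # the blank header is auto-named "Unnamed: 2"

# Keep only the columns whose name was not auto-generated.
unnamed = df.columns.str.startswith("Unnamed:")
cleaned = df.loc[:, ~unnamed]
print(cleaned.columns.tolist())
```

Note that this would also drop a column whose real header happens to start with "Unnamed:", so it is a convenience filter rather than a general solution.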

@TomAugspurger
Contributor

TomAugspurger commented Jul 15, 2019 via email

@Sniperq2
Author

Sniperq2 commented Jul 15, 2019

The answers there seem reasonable to me.

Yes, but that is not the same as what I am suggesting.

pandas.read_csv currently has this:

error_bad_lines : bool, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned.

Maybe do the same for columns?

error_bad_columns : bool, default True
Columns with empty header and/or [other "damages"] will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these **“bad columns” will be dropped from the DataFrame that is returned.**

What do you think?

@TomAugspurger
Contributor

TomAugspurger commented Jul 15, 2019 via email

@samyak-jn
Contributor

Hey, can I work on this issue if no one is working on it now?

@TomAugspurger
Contributor

Sure. It looks like #27998 started, but stalled. You may want to read through that PR first.

@jreback jreback modified the milestones: Contributions Welcome, 1.1 Feb 1, 2020
vinceatbluelabs added a commit to bluelabsio/records-mover that referenced this issue Aug 12, 2020
Pandas 1.1 has broken Record Mover's usage of the read_csv() function by adding error checking in cases where a certain argument would be unused.

Details of the Pandas change:

* pandas-dev/pandas#27394
* pandas-dev/pandas#31383

See Records Mover test failures here: 
* https://app.circleci.com/pipelines/github/bluelabsio/records-mover/1089/workflows/e62f1cf0-f8d0-4e22-9652-112df72b02b8/jobs/9439
7 participants