Skip to content

BUG: read_csv adding additional columns as integers instead of strings #47137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 27, 2022

Conversation

phofl
Copy link
Member

@phofl phofl commented May 27, 2022

All columns are interpreted as strings, even if they are numeric. So, we have to be consistent with the bogus columns

@phofl phofl added IO CSV read_csv, to_csv Bug labels May 27, 2022
@@ -1304,8 +1304,10 @@ cdef class TextReader:
if self.header is not None:
j = i - self.leading_cols
# generate extra (bogus) headers if there are more columns than headers
# These should be strings, not integers, because otherwise we might get
# issues with callables as usecols GH#46997
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if these are truely bogus, why are they even passed to the usecols callable? There are no circumstances where these "auto-generated" column names are/should be in the result?

Copy link
Member Author

@phofl phofl May 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #47138, this is the bogus case

You can access them when using an index-based usecols setting, but the column name is shifted afterwards.

Edit: If we want to change this (I think I would be in favor of that), we have to deprecate first

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the column name is shifted afterwards.

right, so if they don't currently make their way into the result, we are not at risk of changing existing behavior with this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

At least I can not see a way in how this would impact anything.

You can not select these columns by name via usecols. If you select them by position, an available name is used, not the bogus name.

The only impact this should have is consistensy when feeding them into the usecols callable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least I can not see a way in how this would impact anything.

cool.

@@ -809,6 +809,7 @@ I/O
- Bug in :func:`read_parquet` when ``engine="pyarrow"`` which caused partial write to disk when column of unsupported datatype was passed (:issue:`44914`)
- Bug in :func:`DataFrame.to_excel` and :class:`ExcelWriter` would raise when writing an empty DataFrame to a ``.ods`` file (:issue:`45793`)
- Bug in :func:`read_html` where elements surrounding ``<br>`` were joined without a space between them (:issue:`29528`)
- Bug in :func:`read_csv` adding columns as integers instead of string when data is longer than header leading to issue with ``usecols`` (:issue:`46997`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you reword so others are not confused like I was. "adding columns as integers" is misleading?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but this is tricky. Technically this is an implementation detail that should not leak into the outside world...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback added this to the 1.5 milestone May 27, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ex @simonjayhawkins comments

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm merge on green

@phofl phofl merged commit c9ce063 into pandas-dev:main May 27, 2022
@phofl phofl deleted the 46997 branch May 27, 2022 14:18
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
pandas-dev#47137)

* BUG: read_csv adding additional columns as integers instead of strings

* Reword whatsnew
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
3 participants