
BUG: Python Parser skipping over items if BOM present in first element of header #36365


Merged: 6 commits into pandas-dev:master on Sep 19, 2020

Conversation

@asishm (Contributor) commented Sep 14, 2020

@WillAyd (Member) left a comment


Thanks for taking a look. Can you also add a whatsnew note for 1.2?

@@ -2876,7 +2876,7 @@ def _check_for_bom(self, first_row):
             return [new_row] + first_row[1:]

         elif len(first_row_bom) > 1:
-            return [first_row_bom[1:]]
+            return [first_row_bom[1:]] + first_row[1:]
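The one-line fix above covers the case where the BOM-carrying first header element is unquoted: previously `first_row[1:]` was dropped, so every column after the first silently disappeared. A minimal reproduction of the reported behavior (gh-36343), which should now keep all three columns on a fixed pandas (the sample data is illustrative):

```python
import io

import pandas as pd

# Header whose first (unquoted) field starts with a UTF-8 BOM
# (U+FEFF, ZERO WIDTH NO-BREAK SPACE).
data = "\ufeffHead1,Head2,Head3\n1,2,3\n"

# Before this fix, the Python parser returned only ["Head1"] from
# _check_for_bom, silently dropping Head2 and Head3.
df = pd.read_csv(io.StringIO(data), engine="python")
print(list(df.columns))
```

With the fix applied, the printed column list contains all three headers, BOM stripped from the first.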
Hmm, I don't find it very clear why we would do this. Can you try refactoring the code above this to better suit the requirement?

@asishm (Contributor, Author) replied:
From what I understand, the code block above removes the BOM from the first element of the row and, if that element was a quoted string, also removes the quotes; i.e. if it was of the format <BOM><QUOTE>abc<QUOTE>def then it extracts abcdef.

I've tried to refactor it slightly.
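For readers following along, the logic being discussed can be sketched roughly like this. This is a simplified illustration paraphrased from the description above, not the actual pandas implementation; the function name and quote handling here are assumptions:

```python
# Sketch (not pandas source): strip a UTF-8 BOM from the first field of a
# parsed header row, handling the quoted case, while keeping the remaining
# fields -- the "+ first_row[1:]" that the fix restores.
BOM = "\ufeff"

def check_for_bom(first_row, quotechar='"'):
    first_elt = first_row[0]
    if not first_elt.startswith(BOM):
        return first_row
    first_elt = first_elt[len(BOM):]  # drop the BOM
    if first_elt.startswith(quotechar):
        # e.g. <BOM>"abc"def -> abcdef: remove the surrounding quotes
        end = first_elt[1:].index(quotechar) + 1
        new_first = first_elt[1:end] + first_elt[end + 1:]
    else:
        new_first = first_elt
    # Crucially, keep the rest of the row; the bug was returning only
    # the first element here.
    return [new_first] + first_row[1:]
```

For example, `check_for_bom(["\ufeffHead1", "Head2"])` keeps both fields, which is exactly what the buggy branch failed to do for unquoted headers.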

@WillAyd WillAyd added the IO CSV read_csv, to_csv label Sep 15, 2020
@asishm asishm requested a review from WillAyd September 18, 2020 19:10
@jreback jreback added the Bug label Sep 19, 2020
@@ -2127,6 +2127,12 @@ def test_first_row_bom(all_parsers):
     expected = DataFrame(columns=["Head1", "Head2", "Head3"])
     tm.assert_frame_equal(result, expected)

+# see gh-36343
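The added test presumably mirrors the existing `test_first_row_bom` but with an unquoted header. A self-contained sketch along those lines (the test name is illustrative, not necessarily the exact test that was merged):

```python
from io import StringIO

import pandas as pd
import pandas._testing as tm


def test_first_row_bom_unquoted():
    # see gh-36343: BOM followed by an *unquoted* first header element
    data = "\ufeffHead1\tHead2\tHead3"
    result = pd.read_csv(StringIO(data), delimiter="\t", engine="python")
    expected = pd.DataFrame(columns=["Head1", "Head2", "Head3"])
    tm.assert_frame_equal(result, expected)


test_first_row_bom_unquoted()
```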
Can you make this a new test?

@jreback (Contributor) commented Sep 19, 2020

@gfyoung if you'd have a look

@gfyoung (Member) commented Sep 19, 2020

@jreback's comments notwithstanding, this looks good. Nice catch @asishm !

@WillAyd (Member) commented Sep 19, 2020

/azp run

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

@WillAyd (Member) left a comment

lgtm when green

@jreback jreback added this to the 1.2 milestone Sep 19, 2020
@jreback jreback merged commit a90d559 into pandas-dev:master Sep 19, 2020
@jreback (Contributor) commented Sep 19, 2020

Thanks @asishm, nice. I think we might have some other BOM/EOM issues in the tracker, if you'd have a look (and maybe this will close them).

@asishm (Contributor, Author) commented Sep 19, 2020

@jreback

I searched for 'bom' and only found #31609, which can be closed; it describes the same solution as this PR. Not really sure what else to search for ('eom' doesn't lead to anything).

@jreback (Contributor) commented Sep 19, 2020

OK, thanks @asishm. Yeah, there might be some others, but IIRC those are for the C parser.

@asishm asishm deleted the python_parser_bom branch September 20, 2020 03:35
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Labels: Bug; IO CSV (read_csv, to_csv)
Milestone: 1.2
Successfully merging this pull request may close these issues.

BUG: ZERO WIDTH NO-BREAK SPACE in column name causes a reading failure
4 participants