Skip to content

BUG: Throw a ParserError when header rows have unequal column counts … #43118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Sep 5, 2021

Conversation

quantumalaviya
Copy link
Contributor

@quantumalaviya quantumalaviya commented Aug 20, 2021

(GH43102)

  • closes BUG: IndexError when header rows have unequal column counts #43102
  • Fixed uncaught IndexError that is raised when the iterator i in extract(r) inside base_parser.py exceeds the length of a header row (when field_count > len(r)).
  • A Parser error is raised when the header row had unequal columns. Outputs the first row where a mismatch is found.
  • Added a test to match the error being raised.

@pep8speaks
Copy link

pep8speaks commented Aug 20, 2021

Hello @quantumalaviya! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-05 08:27:17 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls always start with tests. these should fail w/o the code change and pass after.

@jreback jreback added the IO CSV read_csv, to_csv label Aug 20, 2021
@quantumalaviya
Copy link
Contributor Author

@jreback I added a test inside pandas/tests/io/parser/test_header.py. Is this fine?

@@ -341,6 +341,14 @@ def _extract_multi_indexer_columns(
# extract the columns
field_count = len(header[0])

# check if header lengths are equal
for header_iter in range(len(header)):
Copy link
Member

@phofl phofl Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do this like all(len(x) == len(ls[0]) for x in ls[1:])

has the downside that the element where the error was found is not clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you do suggest removing that part of the error? It can simply say "Header rows must have an equal number of columns."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would go with that.

match="Header rows must have equal number of columns. "
"Mismatch found at header 1.",
):
parser.read_csv(StringIO(case), sep=",", header=[0, 2])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sep is the default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I will get rid of it, thanks

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add whatsnew

@quantumalaviya
Copy link
Contributor Author

I had added "A Parser error is raised when the header row had unequal columns. Outputs the first row where a mismatch is found." as a substitute for whatsnew. Do you want it to be more detailed or am I missing something?

@phofl
Copy link
Member

phofl commented Aug 20, 2021

Please add something about the uncaught error before to avoid the impression that a new error is raised

@quantumalaviya quantumalaviya requested a review from phofl August 20, 2021 18:43
@quantumalaviya
Copy link
Contributor Author

@jreback @phofl Does this look good?

@phofl
Copy link
Member

phofl commented Aug 21, 2021

Please be patient, we are all volunteers and will look when we have time.

Whatsnew is still missing

@quantumalaviya
Copy link
Contributor Author

quantumalaviya commented Aug 22, 2021

Thanks for the clarification, I was confused as to what the next step is.

EDIT: Changed whatsnew.

@@ -319,7 +319,7 @@ I/O
- Bug in :func:`json_normalize` where ``errors=ignore`` could fail to ignore missing values of ``meta`` when ``record_path`` has a length greater than one (:issue:`41876`)
- Bug in :func:`read_csv` with multi-header input and arguments referencing column names as tuples (:issue:`42446`)
- Bug in :func:`Series.to_json` and :func:`DataFrame.to_json` where some attributes were skipped when serialising plain Python objects to JSON (:issue:`42768`, :issue:`33043`)
-
- Bug in :func:`read_csv` where reading multi-header input with unequal lengths incorrectly raises an ``IndexError`` (:issue:`43102`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug in :func:read_csv where reading multi-header input with unequal lengths incorrectly raising uncontrolled IndexError (:issue:43102)

@phofl
Copy link
Member

phofl commented Aug 22, 2021

Yes the whatsnews are our release notes. Small comment, otherwise lgtm

@quantumalaviya quantumalaviya requested a review from phofl August 22, 2021 20:48
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Aug 31, 2021
@jreback jreback added this to the 1.4 milestone Aug 31, 2021
Copy link
Member

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge master and skip the test for the pyarrow engine?

Copy link
Member

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. This LGTM.

@jreback
Copy link
Contributor

jreback commented Sep 5, 2021

@lithomas1 merge away

@lithomas1 lithomas1 merged commit 343ac2a into pandas-dev:master Sep 5, 2021
@lithomas1
Copy link
Member

thanks @quantumalaviya

@quantumalaviya
Copy link
Contributor Author

My pleasure. Thanks for guiding me through it!

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
pandas-dev#43118)

* BUG: Throw a ParserError when header rows have unequal column counts (GH43102)

* BUG: Throw a ParserError when header rows have unequal column counts. Updated to comply with PEP8 (GH43102)

* Added Test. (GH43102)

* Added Test. (GH43102)

* Added Test. (GH43102)

* Added Changes. (GH43102)

* Added whatsnew

* Added whatsnew

* Test without whatsnew

* Add whatsnew again

* Update v1.4.0.rst

* Merge upstream

* Skipping test on PyArrow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: IndexError when header rows have unequal column counts
5 participants