Skip to content

ENH: Better error message if usecols doesn't match columns #17310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
12 changes: 10 additions & 2 deletions pandas/io/parsers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1662,14 +1662,22 @@ def __init__(self, src, **kwds):
# GH 14671
if (self.usecols_dtype == 'string' and
not set(usecols).issubset(self.orig_names)):
raise ValueError("Usecols do not match names.")
missing = [c for c in usecols if c not in self.orig_names]
raise ValueError(
"Usecols do not match columns, "
"columns expected but not found: %s" % missing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use .format(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, should we be using .format everywhere?

If yes there's a lot of places where we don't currently do that, I'd be happy to start changing some of those over too (in a seperate PR).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#16130 (it's being worked on in chunks; feel free to pitch in)

)

if len(self.names) > len(usecols):
self.names = [n for i, n in enumerate(self.names)
if (i in usecols or n in usecols)]

if len(self.names) < len(usecols):
raise ValueError("Usecols do not match names.")
missing = [c for c in usecols if c not in self.orig_names]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, it should be checking for membership in self.names.

raise ValueError(
"Usecols do not match columns, "
"columns expected but not found: %s" % missing
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


self._set_noconvert_columns()

Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/io/parser/usecols.py
Original file line number Diff line number Diff line change
Expand Up @@ -481,7 +481,7 @@ def test_raise_on_usecols_names_mismatch(self):
data = 'a,b,c,d\n1,2,3,4\n5,6,7,8'

if self.engine == 'c':
msg = 'Usecols do not match names'
msg = "do not match columns, columns expected but not found"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to have a stronger check for which columns are listed (both in the self.orig_names and self.names case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is this non-helpful error an issue when you pass in engine='python' ? We would like to make sure that both engines are equally expressive about these types of errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on the test, it'll require a minor reformat of the structure of the test - if that's not an issue happy to go ahead (unless wildcarding the regex for the arguments is sufficient).

Will see if it's also an issue on the Python engine and if so will implement a fix!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that you explicitly put in the string which elements are missing. If that's what you're going for, then go for it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung I've just tested with the Python engine, and the error is:

ValueError: ' column3' is not in list

Would we want the error to be consistent between both engines? If yes, would we prefer the proposed or existing format?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your error message is better. Let's use it for both engines.

else:
msg = 'is not in list'

Expand Down