Skip to content

BUG: DataFrame.dropna must not allow to set both how= and thresh= parameters #46644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Apr 27, 2022

Conversation

arnaudlegout
Copy link
Contributor

@arnaudlegout arnaudlegout commented Apr 5, 2022

@pep8speaks
Copy link

pep8speaks commented Apr 5, 2022

Hello @arnaudlegout! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-04-27 16:38:33 UTC

@arnaudlegout arnaudlegout marked this pull request as ready for review April 6, 2022 14:59
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, this'll need a whatsnewnote

@MarcoGorelli MarcoGorelli added the Error Reporting Incorrect or improved errors from pandas label Apr 6, 2022
@MarcoGorelli MarcoGorelli added this to the 1.5 milestone Apr 6, 2022
@MarcoGorelli MarcoGorelli changed the title BUG attempt to fix GH46575 BUG: DataFrame.dropna must not allow to set both how= and thresh= parameters Apr 6, 2022
@arnaudlegout
Copy link
Contributor Author

I added a what's new entry for this bug fix

how: str = "any",
how: str | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if users aren't supposed to pass None, this should probably be lib.no_default, e.g.

group_keys: bool_t | lib.NoDefault = lib.no_default,

Copy link
Contributor Author

@arnaudlegout arnaudlegout Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure no_default is appropriate? Most of its usage is for a method parameter that is deprecated. So if this parameter is not no_default a warning could be raised. This is not the case here. Users aren't supposed to pass anything but "any" or "all", and the code correctly raises if how is not "any" or "all".

Using no_default seems semantically wrong. If you confirm how must default to no_default I am fine with it, but I want to cross check with you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check with @simonjayhawkins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the following use cases cover lib.no_default

  • when we need to detect that a user explicitly passed None where None is a valid value, i.e. hashable index label
  • where there is a non None default and we need to detect if a user passed any value if the keyword arg is depecated so that we can raise a warning.
  • where users must not specify two or more kwargs arguments simultaneously and we need to detect that the user passed values to distinguish from the defaults (for validation)

so at first glance it appears this issues falls into the third category.

Users aren't supposed to pass anything but "any" or "all", and the code correctly raises if how is not "any" or "all".

so it really depends on the order of validation on whether we need to use lib.no_default.

I would have though that the exception for explicit passing of both how and thresh regardless of the values passed should take precedence over the exception for passing anything but "any" or "all" to how?

so the question is should df.dropna(how=None, thresh=2) raise? With this PR it gives an empty DataFrame (the same as 1.4.2). But also df.dropna(how=object(), thresh=2) returns an empty DataFrame for 1.4.2 and now correctly raises TypeError: You cannot set both the how and thresh arguments at the same time.

Also, df.dropna(how=None) returns the same as df.dropna(how="any") whereas on 1.4.2 this gave TypeError: must specify how or thresh. This was tested behavior and the test for this has been removed in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I reworked this PR in light of the third use case (I also defaulted thresh to no_default and added typing)

I removed the test for TypeError: must specify how or thresh as it is not correct. We can call dropna without specifying how or thresh.

df.dropna(how=None) not raising was an "acceptable" side effect of my PR, but using no_default should fix it and make it raises, admittedly better.

By the way, would it make sense to add your three cases directly as a comment in the source code defining no_default. I would personally add this comment as it helps a lot developers to understand the usage, but I noticed that the pandas code is not that much commented, most likely for legacy reasons. If you think it makes sense, I can make a separate bug report and make the corresponding PR.

@arnaudlegout
Copy link
Contributor Author

take

@arnaudlegout
Copy link
Contributor Author

@simonjayhawkins are you still waiting for something from my side, or is this PR on its way to the next release. I believe I addressed all the required changes.

@jreback
Copy link
Contributor

jreback commented Apr 27, 2022

@arnaudlegout this looks fine, can you merge master and ping on green.

@jreback jreback merged commit 94280f0 into pandas-dev:main Apr 27, 2022
@jreback
Copy link
Contributor

jreback commented Apr 27, 2022

thanks @arnaudlegout

@arnaudlegout arnaudlegout deleted the fix-dropna-signature branch April 28, 2022 08:44
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.dropna must not allow to set both how= and thresh= parameters
5 participants