-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: DataFrame.dropna must not allow to set both how= and thresh= parameters #46644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @arnaudlegout! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-04-27 16:38:33 UTC |
…t/pandas into fix-dropna-signature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work, this'll need a whatsnewnote
I added a what's new entry for this bug fix |
pandas/core/frame.py
Outdated
how: str = "any", | ||
how: str | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if users aren't supposed to pass None
, this should probably be lib.no_default
, e.g.
Line 7935 in 361021b
group_keys: bool_t | lib.NoDefault = lib.no_default, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure no_default
is appropriate? Most of its usage is for a method parameter that is deprecated. So if this parameter is not no_default
a warning could be raised. This is not the case here. Users aren't supposed to pass anything but "any" or "all", and the code correctly raises if how
is not "any" or "all".
Using no_default
seems semantically wrong. If you confirm how
must default to no_default
I am fine with it, but I want to cross check with you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's check with @simonjayhawkins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the following use cases cover lib.no_default
- when we need to detect that a user explicitly passed
None
whereNone
is a valid value, i.e. hashable index label - where there is a non
None
default and we need to detect if a user passed any value if the keyword arg is depecated so that we can raise a warning. - where users must not specify two or more kwargs arguments simultaneously and we need to detect that the user passed values to distinguish from the defaults (for validation)
so at first glance it appears this issues falls into the third category.
Users aren't supposed to pass anything but "any" or "all", and the code correctly raises if
how
is not "any" or "all".
so it really depends on the order of validation on whether we need to use lib.no_default.
I would have though that the exception for explicit passing of both how
and thresh
regardless of the values passed should take precedence over the exception for passing anything but "any" or "all" to how
?
so the question is should df.dropna(how=None, thresh=2)
raise? With this PR it gives an empty DataFrame (the same as 1.4.2). But also df.dropna(how=object(), thresh=2)
returns an empty DataFrame for 1.4.2 and now correctly raises TypeError: You cannot set both the how and thresh arguments at the same time.
Also, df.dropna(how=None)
returns the same as df.dropna(how="any")
whereas on 1.4.2 this gave TypeError: must specify how or thresh
. This was tested behavior and the test for this has been removed in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I reworked this PR in light of the third use case (I also defaulted thresh to no_default and added typing)
I removed the test for TypeError: must specify how or thresh
as it is not correct. We can call dropna without specifying how
or thresh
.
df.dropna(how=None)
not raising was an "acceptable" side effect of my PR, but using no_default
should fix it and make it raises, admittedly better.
By the way, would it make sense to add your three cases directly as a comment in the source code defining no_default. I would personally add this comment as it helps a lot developers to understand the usage, but I noticed that the pandas code is not that much commented, most likely for legacy reasons. If you think it makes sense, I can make a separate bug report and make the corresponding PR.
take |
@simonjayhawkins are you still waiting for something from my side, or is this PR on its way to the next release. I believe I addressed all the required changes. |
@arnaudlegout this looks fine, can you merge master and ping on green. |
thanks @arnaudlegout |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.