Skip to content

BUG: DataFrame.dropna must not allow to set both how= and thresh= parameters #46644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Apr 27, 2022
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 11 additions & 4 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -6073,7 +6073,7 @@ def notnull(self) -> DataFrame:
def dropna(
self,
axis: Axis = 0,
how: str = "any",
how: str | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if users aren't supposed to pass None, this should probably be lib.no_default, e.g.

group_keys: bool_t | lib.NoDefault = lib.no_default,

Copy link
Contributor Author

@arnaudlegout arnaudlegout Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure no_default is appropriate? Most of its usage is for a method parameter that is deprecated. So if this parameter is not no_default a warning could be raised. This is not the case here. Users aren't supposed to pass anything but "any" or "all", and the code correctly raises if how is not "any" or "all".

Using no_default seems semantically wrong. If you confirm how must default to no_default I am fine with it, but I want to cross check with you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check with @simonjayhawkins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the following use cases cover lib.no_default

  • when we need to detect that a user explicitly passed None where None is a valid value, i.e. hashable index label
  • where there is a non None default and we need to detect if a user passed any value if the keyword arg is depecated so that we can raise a warning.
  • where users must not specify two or more kwargs arguments simultaneously and we need to detect that the user passed values to distinguish from the defaults (for validation)

so at first glance it appears this issues falls into the third category.

Users aren't supposed to pass anything but "any" or "all", and the code correctly raises if how is not "any" or "all".

so it really depends on the order of validation on whether we need to use lib.no_default.

I would have though that the exception for explicit passing of both how and thresh regardless of the values passed should take precedence over the exception for passing anything but "any" or "all" to how?

so the question is should df.dropna(how=None, thresh=2) raise? With this PR it gives an empty DataFrame (the same as 1.4.2). But also df.dropna(how=object(), thresh=2) returns an empty DataFrame for 1.4.2 and now correctly raises TypeError: You cannot set both the how and thresh arguments at the same time.

Also, df.dropna(how=None) returns the same as df.dropna(how="any") whereas on 1.4.2 this gave TypeError: must specify how or thresh. This was tested behavior and the test for this has been removed in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I reworked this PR in light of the third use case (I also defaulted thresh to no_default and added typing)

I removed the test for TypeError: must specify how or thresh as it is not correct. We can call dropna without specifying how or thresh.

df.dropna(how=None) not raising was an "acceptable" side effect of my PR, but using no_default should fix it and make it raises, admittedly better.

By the way, would it make sense to add your three cases directly as a comment in the source code defining no_default. I would personally add this comment as it helps a lot developers to understand the usage, but I noticed that the pandas code is not that much commented, most likely for legacy reasons. If you think it makes sense, I can make a separate bug report and make the corresponding PR.

thresh=None,
subset: IndexLabel = None,
inplace: bool = False,
Expand Down Expand Up @@ -6106,7 +6106,7 @@ def dropna(
* 'all' : If all values are NA, drop that row or column.

thresh : int, optional
Require that many non-NA values.
Require that many non-NA values. Cannot be combined with how.
subset : column label or sequence of labels, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
Expand Down Expand Up @@ -6181,6 +6181,15 @@ def dropna(
name toy born
1 Batman Batmobile 1940-04-25
"""
if (how is not None) and (thresh is not None):
raise ValueError(
"You cannot set both the how and thresh arguments at the same time."
)

# default value for how
if how is None:
how = "any"

inplace = validate_bool_kwarg(inplace, "inplace")
if isinstance(axis, (tuple, list)):
# GH20987
Expand Down Expand Up @@ -6213,8 +6222,6 @@ def dropna(
else:
if how is not None:
raise ValueError(f"invalid how option: {how}")
else:
raise TypeError("must specify how or thresh")

if np.all(mask):
result = self.copy()
Expand Down
13 changes: 10 additions & 3 deletions pandas/tests/frame/methods/test_dropna.py
Original file line number Diff line number Diff line change
Expand Up @@ -158,9 +158,6 @@ def test_dropna_corner(self, float_frame):
msg = "invalid how option: foo"
with pytest.raises(ValueError, match=msg):
float_frame.dropna(how="foo")
msg = "must specify how or thresh"
with pytest.raises(TypeError, match=msg):
float_frame.dropna(how=None)
# non-existent column - 8303
with pytest.raises(KeyError, match=r"^\['X'\]$"):
float_frame.dropna(subset=["A", "X"])
Expand Down Expand Up @@ -274,3 +271,13 @@ def test_no_nans_in_frame(self, axis):
expected = df.copy()
result = df.dropna(axis=axis)
tm.assert_frame_equal(result, expected, check_index_type=True)

def test_how_thresh_param_incompatible(self):
# GH46575
df = DataFrame([1, 2, pd.NA])
msg = "You cannot set both the how and thresh arguments at the same time"
with pytest.raises(ValueError, match=msg):
df.dropna(how="all", thresh=2)

with pytest.raises(ValueError, match=msg):
df.dropna(how="any", thresh=2)