ENH: Allow Iterable[Hashable] in drop_duplicates #59392

Merged: 9 commits, Aug 13, 2024
Changes from 3 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -51,6 +51,7 @@ Other enhancements
 - :meth:`Series.cummin` and :meth:`Series.cummax` now supports :class:`CategoricalDtype` (:issue:`52335`)
 - :meth:`Series.plot` now correctly handle the ``ylabel`` parameter for pie charts, allowing for explicit control over the y-axis label (:issue:`58239`)
 - Restore support for reading Stata 104-format and enable reading 103-format dta files (:issue:`58554`)
+- Support passing a :class:`Set` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
 - Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
 - Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)

12 changes: 6 additions & 6 deletions pandas/core/frame.py
@@ -6534,7 +6534,7 @@ def dropna(
     @overload
     def drop_duplicates(
         self,
-        subset: Hashable | Sequence[Hashable] | None = ...,
+        subset: Hashable | Sequence[Hashable] | set | None = ...,

Contributor:

I think you should just replace Sequence[Hashable] with Iterable[Hashable] here (and in the other overloads), because even a dict works now.

Also, I suggest making a separate PR in pandas-stubs once this PR is accepted.
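
To illustrate why `Iterable[Hashable]` is the broader fit, here is a minimal sketch in plain Python. `normalize_subset` is a hypothetical helper, not part of pandas; it only shows that a list, a set, and a dict (iterated by key) can all be reduced to a list of column labels, while a string is kept as a single label.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterable


def normalize_subset(subset: Hashable | Iterable[Hashable]) -> list[Hashable]:
    """Hypothetical helper (not pandas API): turn `subset` into a list of labels."""
    # A str is iterable, but here it is treated as one column label.
    if isinstance(subset, (str, bytes)) or not isinstance(subset, Iterable):
        return [subset]
    # Any other iterable is expanded; iterating a dict yields its keys.
    return list(subset)


print(normalize_subset(["AAA", "B"]))            # ['AAA', 'B']
print(sorted(normalize_subset({"AAA", "B"})))    # ['AAA', 'B']
print(normalize_subset({"AAA": 1, "B": 2}))      # ['AAA', 'B']
print(normalize_subset("AAA"))                   # ['AAA']
```

The set result is sorted before printing because set iteration order is arbitrary.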

Contributor Author:

@Dr-Irv,

Updated. Could you specify which docs I should be updating to reflect these changes?

Also, what is the difference between pandas and pandas-stubs?
I am a new contributor and am happy to raise the PR there but would like to understand more about each repo.

Contributor:

> Also, what is the difference between pandas and pandas-stubs? I am a new contributor and am happy to raise the PR there but would like to understand more about each repo.

pandas is the code that executes.

pandas-stubs is a set of typing declarations meant for users to use when type checking their code. It is bundled within Visual Studio Code. See https://github.com/pandas-dev/pandas-stubs/
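
The split between code that executes and typing declarations can be sketched without pandas. `first_label` below is a hypothetical function, not taken from either repository. Its annotation says `Sequence`, but Python does not enforce annotations at runtime, so only a static type checker reading the declarations would object to the call with a set.

```python
from __future__ import annotations

from collections.abc import Hashable, Sequence


def first_label(subset: Sequence[Hashable]) -> Hashable:
    """Hypothetical function: return one label from `subset`."""
    # next(iter(...)) works for any iterable, so the body is more
    # permissive than the Sequence annotation suggests.
    return next(iter(subset))


# Runs fine: annotations are not checked when the code executes.
# A static type checker would typically report this call, because a set
# is not a Sequence.
label = first_label({"AAA"})
print(label)  # AAA
```

This is why a change to the accepted argument types generally needs to be reflected in the type declarations as well as in the runtime code.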

Contributor Author:

Thanks for clarifying. I will do as you suggested and open a PR there once this one is merged.

         *,
         keep: DropKeep = ...,
         inplace: Literal[True],
@@ -6544,7 +6544,7 @@ def drop_duplicates(
     @overload
     def drop_duplicates(
         self,
-        subset: Hashable | Sequence[Hashable] | None = ...,
+        subset: Hashable | Sequence[Hashable] | set | None = ...,
         *,
         keep: DropKeep = ...,
         inplace: Literal[False] = ...,
@@ -6554,7 +6554,7 @@ def drop_duplicates(
     @overload
     def drop_duplicates(
         self,
-        subset: Hashable | Sequence[Hashable] | None = ...,
+        subset: Hashable | Sequence[Hashable] | set | None = ...,
         *,
         keep: DropKeep = ...,
         inplace: bool = ...,
@@ -6563,7 +6563,7 @@ def drop_duplicates(

     def drop_duplicates(
         self,
-        subset: Hashable | Sequence[Hashable] | None = None,
+        subset: Hashable | Sequence[Hashable] | set | None = None,
         *,
         keep: DropKeep = "first",
         inplace: bool = False,
@@ -6667,7 +6667,7 @@ def drop_duplicates(

     def duplicated(
         self,
-        subset: Hashable | Sequence[Hashable] | None = None,
+        subset: Hashable | Sequence[Hashable] | set | None = None,
         keep: DropKeep = "first",
     ) -> Series:
         """
@@ -6793,7 +6793,7 @@ def f(vals) -> tuple[np.ndarray, int]:

         if len(subset) == 1 and self.columns.is_unique:
             # GH#45236 This is faster than get_group_index below
-            result = self[subset[0]].duplicated(keep)
+            result = self[next(iter(subset))].duplicated(keep)
             result.name = None
         else:
             vals = (col.values for name, col in self.items() if name in subset)
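
The switch from `subset[0]` to `next(iter(subset))` matters because a set does not support positional indexing, while every iterable supports `iter`. A minimal sketch in plain Python (the helper name `only_label` is an assumption made for illustration, not pandas code):

```python
def only_label(subset):
    """Hypothetical helper: return the single element of a one-element collection."""
    if len(subset) != 1:
        raise ValueError("expected exactly one label")
    # Works for lists, tuples, sets, and dicts (a dict yields its key).
    return next(iter(subset))


labels = {"AAA"}
try:
    labels[0]  # sets are unordered, so indexing is unsupported
except TypeError as exc:
    print(f"TypeError: {exc}")

print(only_label({"AAA"}))     # AAA
print(only_label(["AAA"]))     # AAA
print(only_label({"AAA": 1}))  # AAA
```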
38 changes: 38 additions & 0 deletions pandas/tests/frame/methods/test_drop_duplicates.py
@@ -476,3 +476,41 @@ def test_drop_duplicates_non_boolean_ignore_index(arg):
     msg = '^For argument "ignore_index" expected type bool, received type .*.$'
     with pytest.raises(ValueError, match=msg):
         df.drop_duplicates(ignore_index=arg)
+
+
+def test_drop_duplicates_set():
+    # GH#59237
+    df = DataFrame(
+        {
+            "AAA": ["foo", "bar", "foo", "bar", "foo", "bar", "bar", "foo"],
+            "B": ["one", "one", "two", "two", "two", "two", "one", "two"],
+            "C": [1, 1, 2, 2, 2, 2, 1, 2],
+            "D": range(8),
+        }
+    )
+    # single column
+    result = df.drop_duplicates({"AAA"})
+    expected = df[:2]
+    tm.assert_frame_equal(result, expected)
+
+    result = df.drop_duplicates({"AAA"}, keep="last")
+    expected = df.loc[[6, 7]]
+    tm.assert_frame_equal(result, expected)
+
+    result = df.drop_duplicates({"AAA"}, keep=False)
+    expected = df.loc[[]]
+    tm.assert_frame_equal(result, expected)
+    assert len(result) == 0
+
+    # multi column
+    expected = df.loc[[0, 1, 2, 3]]
+    result = df.drop_duplicates({"AAA", "B"})
+    tm.assert_frame_equal(result, expected)
+
+    result = df.drop_duplicates({"AAA", "B"}, keep="last")
+    expected = df.loc[[0, 5, 6, 7]]
+    tm.assert_frame_equal(result, expected)
+
+    result = df.drop_duplicates({"AAA", "B"}, keep=False)
+    expected = df.loc[[0]]
+    tm.assert_frame_equal(result, expected)
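
The expected row labels in the test above can be cross-checked with a small pure-Python sketch of the `keep` semantics. This is not how pandas implements deduplication; `kept_positions` and its row-of-dicts representation are assumptions made for illustration. Like the generator expression in `duplicated`, it builds keys in the data's own column order, so the arbitrary iteration order of a set does not affect the result.

```python
from collections import Counter


def kept_positions(rows, subset, keep="first"):
    """Sketch (not pandas code): positions of rows that survive deduplication.

    rows   : list of dicts mapping column name -> value
    subset : any iterable of column names (list, set, dict, ...)
    keep   : "first", "last", or False (drop every duplicated row)
    """
    wanted = set(subset)
    # Keys follow each row's column order, not the order of `subset`.
    keys = [tuple(v for name, v in row.items() if name in wanted) for row in rows]
    if keep is False:
        counts = Counter(keys)
        return [i for i, key in enumerate(keys) if counts[key] == 1]
    if keep not in ("first", "last"):
        raise ValueError('keep must be "first", "last", or False')
    seen = {}
    for i, key in enumerate(keys):
        # "first": record a key only once; "last": always overwrite.
        if keep == "last" or key not in seen:
            seen[key] = i
    return sorted(seen.values())


aaa = ["foo", "bar", "foo", "bar", "foo", "bar", "bar", "foo"]
b = ["one", "one", "two", "two", "two", "two", "one", "two"]
rows = [{"AAA": x, "B": y} for x, y in zip(aaa, b)]

print(kept_positions(rows, {"AAA"}))                    # [0, 1]
print(kept_positions(rows, {"AAA"}, keep="last"))       # [6, 7]
print(kept_positions(rows, {"AAA"}, keep=False))        # []
print(kept_positions(rows, {"AAA", "B"}))               # [0, 1, 2, 3]
print(kept_positions(rows, {"AAA", "B"}, keep="last"))  # [0, 5, 6, 7]
print(kept_positions(rows, {"AAA", "B"}, keep=False))   # [0]
```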