ENH: Allow Iterable[Hashable] in drop_duplicates #59392

Merged
merged 9 commits into from
Aug 13, 2024
Changes from 3 commits
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v3.0.0.rst
@@ -51,7 +51,7 @@ Other enhancements
- :meth:`Series.cummin` and :meth:`Series.cummax` now supports :class:`CategoricalDtype` (:issue:`52335`)
- :meth:`Series.plot` now correctly handle the ``ylabel`` parameter for pie charts, allowing for explicit control over the y-axis label (:issue:`58239`)
- Restore support for reading Stata 104-format and enable reading 103-format dta files (:issue:`58554`)
-- Support passing a :class:`Set` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
+- Support passing a :class:`Set` and :class:`Iterable[Hashable]` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
Member

Same here - can remove Set now.

Contributor Author

Updated. Thanks!

- Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
- Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)

24 changes: 9 additions & 15 deletions pandas/core/frame.py
@@ -20,7 +20,6 @@
Iterator,
Mapping,
Sequence,
-Set as AbstractSet,
)
import functools
from io import StringIO
@@ -6405,7 +6404,7 @@ def dropna(

thresh : int, optional
Require that many non-NA values. Cannot be combined with how.
-subset : column label or sequence of labels, optional
+subset : column label or iterable of labels, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
@@ -6535,7 +6534,7 @@ def dropna(
@overload
def drop_duplicates(
self,
-subset: Hashable | Sequence[Hashable] | AbstractSet | None = ...,
+subset: Hashable | Iterable[Hashable] | set | None = ...,
Member

No need for set anymore now that you've changed Sequence to Iterable, right?

Contributor Author

That's true. Updated.

*,
keep: DropKeep = ...,
inplace: Literal[True],
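
The review comment above (dropping `set` from the annotation once `Sequence` became `Iterable`) can be checked at runtime with the standard-library ABCs. This is an illustrative sketch only: `candidates` and `checks` are made-up names, and runtime `isinstance` checks only approximate what a static type checker does with these annotations.

```python
from collections.abc import Iterable, Sequence

# Hypothetical sample inputs a caller might pass as `subset`.
candidates = {
    "list": ["a", "b"],
    "tuple": ("a", "b"),
    "set": {"a", "b"},
    "dict keys": {"a": 1, "b": 2}.keys(),
}

# Map each kind of input to (is a Sequence, is an Iterable).
checks = {
    name: (isinstance(value, Sequence), isinstance(value, Iterable))
    for name, value in candidates.items()
}

for name, (is_sequence, is_iterable) in checks.items():
    print(f"{name:>9}: Sequence={is_sequence}, Iterable={is_iterable}")
```

Sets and dict key views are iterables but not sequences, and every `Sequence` is also an `Iterable`, so an `Iterable[Hashable]` annotation already covers both the old `Sequence[Hashable]` case and the separately listed `set`.
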
@@ -6545,7 +6544,7 @@ def drop_duplicates(
@overload
def drop_duplicates(
self,
-subset: Hashable | Sequence[Hashable] | AbstractSet | None = ...,
+subset: Hashable | Iterable[Hashable] | set | None = ...,
*,
keep: DropKeep = ...,
inplace: Literal[False] = ...,
@@ -6555,7 +6554,7 @@ def drop_duplicates(
@overload
def drop_duplicates(
self,
-subset: Hashable | Sequence[Hashable] | AbstractSet | None = ...,
+subset: Hashable | Iterable[Hashable] | set | None = ...,
*,
keep: DropKeep = ...,
inplace: bool = ...,
@@ -6564,7 +6563,7 @@ def drop_duplicates(

def drop_duplicates(
self,
-subset: Hashable | Sequence[Hashable] | AbstractSet | None = None,
+subset: Hashable | Iterable[Hashable] | set | None = None,
*,
keep: DropKeep = "first",
inplace: bool = False,
@@ -6578,7 +6577,7 @@ def drop_duplicates(

Parameters
----------
-subset : column label or sequence of labels, optional
+subset : column label or iterable of labels, optional
Only consider certain columns for identifying duplicates, by
default use all of the columns.
keep : {'first', 'last', ``False``}, default 'first'
@@ -6668,7 +6667,7 @@ def drop_duplicates(

def duplicated(
self,
-subset: Hashable | Sequence[Hashable] | AbstractSet | None = None,
+subset: Hashable | Iterable[Hashable] | set | None = None,
keep: DropKeep = "first",
) -> Series:
"""
@@ -6678,7 +6677,7 @@ def duplicated(

Parameters
----------
-subset : column label or sequence of labels, optional
+subset : column label or iterable of labels, optional
Only consider certain columns for identifying duplicates, by
default use all of the columns.
keep : {'first', 'last', False}, default 'first'
@@ -6793,13 +6792,8 @@ def f(vals) -> tuple[np.ndarray, int]:
raise KeyError(Index(diff))

if len(subset) == 1 and self.columns.is_unique:
-# GH#59237 adding support for single element sets
-if isinstance(subset, set):
-    elem = subset.pop()
-else:
-    elem = subset[0]
# GH#45236 This is faster than get_group_index below
-result = self[elem].duplicated(keep)
+result = self[next(iter(subset))].duplicated(keep)
result.name = None
else:
vals = (col.values for name, col in self.items() if name in subset)
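
The last hunk swaps the `set.pop()` / `subset[0]` branch for a single `next(iter(subset))` call. A minimal, pandas-free sketch of why that one expression covers lists, tuples, and sets alike is shown below; `first_label` is a hypothetical helper named here for illustration and is not part of pandas.

```python
def first_label(subset):
    """Return the first element of any iterable without mutating it."""
    return next(iter(subset))


labels_list = ["a"]
labels_tuple = ("a",)
labels_set = {"a"}

# The same expression works for every container type.
results = [first_label(labels) for labels in (labels_list, labels_tuple, labels_set)]

# Indexing is not available on sets, which is why the old code special-cased them.
try:
    labels_set[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

# set.pop() removes the element from the set it is called on;
# demonstrated on a copy so labels_set itself stays intact.
scratch = set(labels_set)
scratch.pop()

print(results, indexing_failed, labels_set, scratch)
```

Besides being shorter, `next(iter(...))` leaves the input untouched, whereas `pop()` empties a single-element set in place.
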