ENH: Reduce type requirements for the subset parameter in drop_duplicates/duplicated #59237

Closed
@behrenhoff

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I would like to pass a set to drop_duplicates like so:

df = get_some_df()
subset = {"column1", "column3"}
df_dropped = df.drop_duplicates(subset=subset)

According to the type hints, subset is annotated as Hashable | Sequence[Hashable] | None. The documentation says "column label or sequence of labels, optional".

The problem is that I would like to pass a set of columns. The name subset suggests that this should be OK, and it does indeed work at runtime. However, a set is not a Sequence, so type checkers reject it.
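The mismatch can be checked directly against the abstract base classes in the standard library. This is only an illustrative sketch using the column names from the example above:

```python
from collections.abc import Collection, Sequence

subset = {"column1", "column3"}

# A set is sized, iterable and supports "in", so it is a Collection ...
set_is_collection = isinstance(subset, Collection)

# ... but it has no ordering or integer indexing, so it is not a Sequence.
set_is_sequence = isinstance(subset, Sequence)

# A list satisfies both, which is why list arguments pass type checking today.
list_is_sequence = isinstance(["column1", "column3"], Sequence)
list_is_collection = isinstance(["column1", "column3"], Collection)

print(set_is_collection, set_is_sequence, list_is_sequence, list_is_collection)
```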

Can the requirements be lowered? Maybe to Collection? (or even Iterable, but that might come with problems).

Looking at the code: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/core/frame.py#L6935

The code checks whether subset is iterable using np.iterable, then builds a set from it and calls len(subset). Both of those operations work on any Collection as well.
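To make the argument concrete, here is a simplified pure-Python paraphrase of the checks described above. It is not the pandas source: `normalize_subset` is a hypothetical helper, `iter()` stands in for `np.iterable`, and it covers only the steps mentioned here, so other parts of the real method may have further requirements.

```python
from collections.abc import Collection, Hashable


def normalize_subset(subset, columns: Collection[Hashable]):
    """Hypothetical helper that mimics the described checks (not pandas code)."""
    # Stand-in for np.iterable: can we iterate over the argument at all?
    try:
        iter(subset)
        iterable = True
    except TypeError:
        iterable = False

    # A single label (non-iterable, or a string) is wrapped in a tuple.
    if not iterable or isinstance(subset, str):
        subset = (subset,)

    # Building a set only needs iteration.
    missing = set(subset) - set(columns)
    if missing:
        raise KeyError(sorted(missing, key=str))

    # len() only needs the argument to be sized; no indexing is involved here.
    return subset, len(subset)


labels, count = normalize_subset({"column1", "column3"}, ["column1", "column2", "column3"])
```

Nothing in this sketch needs ordering or integer indexing, which is the basis for suggesting Collection.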

Am I missing anything, or could this Sequence be made a Collection?

Feature Description

If I am not mistaken, Sequence could simply be replaced with Collection in duplicated as well as in drop_duplicates, plus in the documentation.
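As a rough sketch of what the relaxed annotation would look like, the stand-in function below uses the proposed type for its parameter. `describe_subset` is a hypothetical name used only for illustration, and its body is not how pandas handles the argument:

```python
from __future__ import annotations

from collections.abc import Collection, Hashable


# Hypothetical stand-in; only the annotation of `subset` reflects the proposal.
def describe_subset(subset: Hashable | Collection[Hashable] | None = None) -> str:
    if subset is None:
        return "all columns"
    # Strings are collections of characters, but here they denote one label.
    if isinstance(subset, str) or not isinstance(subset, Collection):
        return "single label"
    return f"{len(subset)} label(s)"


print(describe_subset({"column1", "column3"}))
```

With this annotation a type checker would accept a set, a frozenset, or dict keys in addition to lists and tuples.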

Alternative Solutions

no alt solution required

Additional Context

No response
