ENH: Support mask in duplicated algorithm #48150

phofl · 2022-08-19T10:01:19Z

xref ENH: support masked arrays in hashtable-based functions (duplicated, isin, mode) #45776
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Benchmarks don't show any changes for duplicated.

import pandas as pd

ser = pd.Series([1, 2, pd.NA] * 1_000_000, dtype="Int64")
%timeit ser.duplicated(keep="first")
# 7.22 ms ± 42.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ser = pd.Series((list(range(1, 1_000_000)) + [pd.NA]) * 3, dtype="Int64")
%timeit ser.duplicated(keep="first")
# 12.3 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

phofl · 2022-08-19T10:01:42Z

pandas/core/algorithms.py

@@ -1041,6 +1041,9 @@ def duplicated(
    -------
    duplicated : ndarray[bool]
    """
+    if hasattr(values, "dtype") and isinstance(values.dtype, BaseMaskedDtype):


This assumes that we don't want to add duplicated to the EA interface.

Might be good to add an issue to discuss this. Kinda unfortunate we need this exception for BaseMaskedDtype.

Yeah, not happy about this either. Will open a follow-up issue to discuss.
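
To make the special case concrete, here is a minimal sketch of the dispatch discussed in this thread, written against stand-in classes rather than pandas itself. MaskedDtypeStub, MaskedArrayStub, and the two helper functions are hypothetical names; only the hasattr/isinstance check mirrors the diff. The idea is that masked input is unpacked into a data buffer and a boolean mask before the duplicate scan, while everything else takes the existing path.

```python
from dataclasses import dataclass, field


class MaskedDtypeStub:
    """Hypothetical stand-in for pandas' BaseMaskedDtype."""


@dataclass
class MaskedArrayStub:
    """Hypothetical masked array: raw values plus a parallel NA mask."""
    data: list
    mask: list
    dtype: MaskedDtypeStub = field(default_factory=MaskedDtypeStub)


def _duplicated_plain(values):
    # Existing path: every value is hashed as-is.
    seen, out = set(), []
    for v in values:
        out.append(v in seen)
        seen.add(v)
    return out


def _duplicated_masked(data, mask):
    # Masked path: masked slots are all treated as one NA value,
    # whatever placeholder happens to sit in the data buffer.
    seen, seen_na, out = set(), False, []
    for v, is_na in zip(data, mask):
        if is_na:
            out.append(seen_na)
            seen_na = True
        else:
            out.append(v in seen)
            seen.add(v)
    return out


def duplicated(values):
    # Same shape as the check in the diff: special-case masked dtypes.
    if hasattr(values, "dtype") and isinstance(values.dtype, MaskedDtypeStub):
        return _duplicated_masked(values.data, values.mask)
    return _duplicated_plain(values)


arr = MaskedArrayStub(data=[1, 2, 1, 1], mask=[False, False, True, True])
print(duplicated(arr))           # -> [False, False, False, True]
print(duplicated([1, 2, 1, 1]))  # -> [False, False, True, True]
```

Note how the placeholder 1 stored under the mask is never compared with the real 1 at position 0. That is the reason the masked path needs the mask at all.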


phofl · 2022-09-05T14:38:40Z

@mroeschke would you mind having a look?

mroeschke · 2022-09-06T17:56:32Z

pandas/_libs/hashtable_func_helper.pxi.in

@@ -158,9 +167,16 @@ cdef duplicated_{{dtype}}(const {{dtype}}_t[:] values, object keep='first'):
        with nogil:
        {{endif}}
            for i in range(n):


It would be nice if we could template the for i in ... loop based on {{ if keep == "first"/"last" }} to deduplicate this similar logic, but that could be a follow-up.

Oh, good idea. I would do that as a follow-up to keep the diff reviewable.
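
One way to picture the suggested deduplication: the keep="first" and keep="last" branches differ only in the direction of iteration, so a single loop body can serve both. The sketch below is plain Python, not the Cython template. duplicated_masked and its arguments are hypothetical names, and handling keep=False with a count is just one simple way to write it.

```python
from collections import Counter


def duplicated_masked(values, mask, keep="first"):
    """Flag duplicates; all masked slots count as one shared NA value."""
    n = len(values)
    # Map every masked slot to a single sentinel key so NAs compare equal.
    na = object()
    keys = [na if mask[i] else values[i] for i in range(n)]

    if keep is False:
        # Mark every member of a group that occurs more than once.
        counts = Counter(keys)
        return [counts[k] > 1 for k in keys]
    if keep not in ("first", "last"):
        raise ValueError('keep must be "first", "last" or False')

    # The only difference between "first" and "last" is scan direction.
    order = range(n) if keep == "first" else range(n - 1, -1, -1)
    out = [False] * n
    seen = set()
    for i in order:
        if keys[i] in seen:
            out[i] = True
        else:
            seen.add(keys[i])
    return out


vals = [1, 2, 0, 0, 1]
mask = [False, False, True, True, False]
print(duplicated_masked(vals, mask, keep="first"))  # -> [False, False, False, True, True]
print(duplicated_masked(vals, mask, keep="last"))   # -> [True, False, True, False, False]
print(duplicated_masked(vals, mask, keep=False))    # -> [True, False, True, True, True]
```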

pandas/_libs/hashtable_func_helper.pxi.in

mroeschke

Do we need any additional testing for this?

phofl · 2022-09-06T18:55:44Z

Good point, I assumed we had some but could not find them 😂 Will add some for all three cases.

jorisvandenbossche · 2022-09-06T19:06:06Z

This is a much bigger change, but from looking at this PR, I started wondering: should we see multiple NAs as duplicated, or should we rather propagate NAs?
In the past, propagating was never an option, since we didn't have a nullable boolean dtype. But just as NAs now propagate in comparison ops, they could also propagate here?

mroeschke · 2022-09-06T19:15:01Z

> I started wondering: should we see multiple NAs as duplicated, or should we rather propagate NAs?

IMO, if I were using duplicated, I would be more surprised if NAs persisted than if the result indicated whether the row with NA was a duplicate of another row with NA (in the same position). I get, though, that comparison semantics with NA should propagate NA.

I think a natural follow-up question with propagating NAs would be "How can I check whether this row with NA is a duplicate of this row with NA?"

jorisvandenbossche · 2022-09-06T19:27:08Z

Ah, good point, I wasn't considering the DataFrame case where you are checking for duplicated rows (multiple columns at once). I was only thinking about the simpler Series.duplicated (in which case the position doesn't matter, it's just which values are NA in that Series).

Yes, for DataFrames it certainly makes sense to consider NAs at the same location as equal when considering whether two rows are equal (that's e.g. also what we do in equals()).
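
The two semantics under discussion can be contrasted in a small sketch. It uses plain Python with None standing in for pd.NA, and both function names are hypothetical; neither is pandas code. The first treats NAs in the same column position as equal when comparing rows (the behaviour this PR keeps), the second shows what propagating NA might look like for a single column.

```python
def duplicated_rows(rows):
    """Row-wise duplicates where None (standing in for NA) equals None."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)  # None hashes and compares equal to itself
        out.append(key in seen)
        seen.add(key)
    return out


def duplicated_propagate(values):
    """Hypothetical alternative: an NA input yields an NA (None) result."""
    seen, out = set(), []
    for v in values:
        if v is None:
            out.append(None)  # unknown value, so "duplicated?" is unknown
        else:
            out.append(v in seen)
            seen.add(v)
    return out


rows = [(1, None), (1, None), (None, 1), (1, 2)]
print(duplicated_rows(rows))                      # -> [False, True, False, False]
print(duplicated_propagate([1, None, 1, None]))   # -> [False, None, True, None]
```

With the propagating variant there is no way to ask whether two NA-containing rows match, which is the follow-up question raised above.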


phofl · 2022-09-06T20:50:03Z

Added the tests

I think this topic is worth a discussion. From a user's perspective, it would be a bit confusing if we considered them equal in one case and unequal in another.

Would this impact operations like isin and merges and similar?

phofl · 2022-09-07T18:57:14Z

@mroeschke are you ok with merging this as is?

mroeschke · 2022-09-07T19:01:42Z

pandas/tests/series/methods/test_duplicated.py

+)
+def test_duplicated_mask(keep, vals):
+    # GH#48150
+    ser = Series([1, 2, NA, NA], dtype="Int64")


Nit: Could you add one more NA at the end? IIRC, technically the last

else: out[i] = 1

doesn't have coverage with this test.

Good point, thx. Added

But we should probably also cover the case when out[i] does not get set, i.e. there is only a single NA?

Thx, this found a bug: we did not set out[i] at all in that case, but since the output array was empty (uninitialized), the value was cast to True.
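
The bug described here can be reproduced in miniature. The sketch below is plain Python with hypothetical names; UNSET imitates the arbitrary contents of an uninitialised output buffer, which is an assumption about how the result array was allocated, inferred from the comment above. If the branch for the first NA never writes to out[i], that slot keeps whatever was in the buffer, which is why a test with a single NA catches it.

```python
UNSET = "garbage"  # stands in for whatever an uninitialised buffer holds


def duplicated_first(values, mask, write_first_na):
    """keep='first' scan; write_first_na toggles the fix on and off."""
    out = [UNSET] * len(values)
    seen, seen_na = set(), False
    for i, v in enumerate(values):
        if mask[i]:
            if seen_na:
                out[i] = True
            else:
                seen_na = True
                if write_first_na:
                    out[i] = False  # the fix: always assign a result
        else:
            out[i] = v in seen
            seen.add(v)
    return out


vals, mask = [1, 2, 0], [False, False, True]  # exactly one NA
print(duplicated_first(vals, mask, write_first_na=False))  # -> [False, False, 'garbage']
print(duplicated_first(vals, mask, write_first_na=True))   # -> [False, False, False]
```

With two or more NAs the later ones are always written, so only the single-NA (or first-NA) position exposes the missing assignment.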


mroeschke · 2022-09-07T19:04:50Z

Small comment, otherwise LGTM. Generally I am in favor of not propagating NA for both DataFrame/Series.duplicated, as this PR does. Is it worth discussing that in a separate issue @jorisvandenbossche before merging?


jorisvandenbossche

Small request for an additional test with non-duplicate NA, otherwise looks good!

jorisvandenbossche · 2022-09-12T20:35:37Z

> Is it worth discussing that in a separate issue @jorisvandenbossche before merging?

Given that this PR just preserves the current behaviour, that certainly doesn't need to block this PR from being merged.

If this were only about Series.duplicated, I would take the time to open an issue to argue for propagating NAs, but since it's less clear how to deal with the DataFrame.duplicated case, I don't feel strongly enough about it.

mroeschke · 2022-09-13T00:13:53Z

Thanks @phofl

* ENH: Support mask in duplicated algorithm
* Fix mypy
* Add tests
* Improve test
* Fix bug

ENH: Support mask in duplicated algorithm

0cf2abe

phofl added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), NA - MaskedArrays (Related to pd.NA and nullable extension arrays), and duplicated (duplicated, drop_duplicates) labels Aug 19, 2022

phofl added 5 commits August 19, 2022 13:32

Fix mypy

b40deb1

Merge remote-tracking branch 'upstream/main' into enh_support_mask_in…

1b1fb11

…_duplicated

Merge remote-tracking branch 'upstream/main' into enh_support_mask_in…

fe63bad

…_duplicated

Merge remote-tracking branch 'upstream/main' into enh_support_mask_in…

308cd94

…_duplicated

Merge remote-tracking branch 'upstream/main' into enh_support_mask_in…

4cabede

…_duplicated

mroeschke reviewed Sep 6, 2022

pandas/_libs/hashtable_func_helper.pxi.in

mroeschke reviewed Sep 6, 2022

phofl added 2 commits September 6, 2022 22:44

Merge remote-tracking branch 'upstream/main' into enh_support_mask_in…

8d58b9e

…_duplicated

Add tests

c5226a4

phofl mentioned this pull request Sep 6, 2022

ENH: Add duplicated to MaskedArray/ExtensionArray interface #48424 (closed)

mroeschke reviewed Sep 7, 2022

phofl added 2 commits September 7, 2022 21:03

Improve test

a1ae1d8

Merge remote-tracking branch 'upstream/main' into enh_support_mask_in…

801134b

…_duplicated

Merge remote-tracking branch 'upstream/main' into enh_support_mask_in…

d9a89fa

…_duplicated

mroeschke added this to the 1.6 milestone Sep 12, 2022

mroeschke approved these changes Sep 12, 2022

jorisvandenbossche approved these changes Sep 12, 2022

Fix bug

a334865

mroeschke merged commit 94044c8 into pandas-dev:main Sep 13, 2022

phofl deleted the enh_support_mask_in_duplicated branch September 13, 2022 07:27

tehunter mentioned this pull request Sep 26, 2022

BUG: pd.core.algorithms.duplicated fails with masked Series #48786 (closed)

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

ENH: Support mask in duplicated algorithm (pandas-dev#48150)

e6a544c

* ENH: Support mask in duplicated algorithm
* Fix mypy
* Add tests
* Improve test
* Fix bug
