Skip to content

ENH: Support mask in duplicated algorithm #48150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Sep 13, 2022

Conversation

phofl
Copy link
Member

@phofl phofl commented Aug 19, 2022

cc @jorisvandenbossche

Benchmarks don't show any changes for duplicated.

ser = pd.Series([1, 2, pd.NA] * 1_000_000, dtype="Int64")
%timeit ser.duplicated(keep="first")
# 7.22 ms ± 42.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ser = pd.Series((list(range(1, 1_000_000)) + [pd.NA]) * 3, dtype="Int64")
%timeit ser.duplicated(keep="first")
# 12.3 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@@ -1041,6 +1041,9 @@ def duplicated(
-------
duplicated : ndarray[bool]
"""
if hasattr(values, "dtype") and isinstance(values.dtype, BaseMaskedDtype):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes, that we don't want to add duplicated to the ea interface

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add an issue to discuss this. Kinda unfortunate we need this exception for BasedMaskedDtype

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah not happy about this either, will open a follow up issue to discuss

@phofl phofl added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff NA - MaskedArrays Related to pd.NA and nullable extension arrays duplicated duplicated, drop_duplicates labels Aug 19, 2022
@phofl
Copy link
Member Author

phofl commented Sep 5, 2022

@mroeschke would you mind having a look?

@@ -158,9 +167,16 @@ cdef duplicated_{{dtype}}(const {{dtype}}_t[:] values, object keep='first'):
with nogil:
{{endif}}
for i in range(n):
Copy link
Member

@mroeschke mroeschke Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice if we could template for i in ... based on {{ if keep == "first"/"last" }} to deduplicate this similar logic, but could be a follow up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good idea, would do as a follow up to keep diff reviewable.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need any additional testing for this?

@phofl
Copy link
Member Author

phofl commented Sep 6, 2022

Good point, assumed we had some but could not finde them 😂 will add some for all three cases

@jorisvandenbossche
Copy link
Member

This is a much bigger change, but from looking at this PR, I started wondering: should we see multiple NAs as duplicated, or should we rather propagate NAs?
In the past, propagating was never an option, since we didn't have a nullable boolean dtype. But similarly as NAs now propagate in comparison ops, they could also propagate here?

@mroeschke
Copy link
Member

I started wondering: should we see multiple NAs as duplicated, or should we rather propagate NAs?

IMO if I was using duplicated, I would be a little more surprised If NAs persisted instead of indicating whether the row with NA was a duplicate with another row with NA (in the same position). I get though that comparison semantics with NA should propagate NA.

I think a natural follow up question with propagating NAs would be "How can i check whether this row with NA is a duplicate of this row with NA?"

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 6, 2022

Ah, good point, I wasn't considering the DataFrame case where you are checking for duplicated rows (multiple columns at once), but I was only thinking about the more simple Series.duplicated (in which case the position doesn't matter, it's just which values are NA in that Series).

Yes, for DataFrames it certainly makes sense to consider NAs at the same location as equal, when considering whether two rows are equal (that's eg also what do in equals())

@phofl
Copy link
Member Author

phofl commented Sep 6, 2022

Added the tests

I think this is a topic worth having a discussion, from a users perspective it would be a bit confusing If we consider them equal in one case and unequal in another case.

Would this impact operations like isin and merges and similar?

@phofl
Copy link
Member Author

phofl commented Sep 7, 2022

@mroeschke are you ok with merging this as is?

)
def test_duplicated_mask(keep, vals):
# GH#48150
ser = Series([1, 2, NA, NA], dtype="Int64")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could you add one more NA at the end? IIRC I think technically the last

                    else:
                        out[i] = 1

Doesn't have coverage with this test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thx. Added

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we should probably also cover the case when out[i] does not get set, i.e. there is only a single NA?

Copy link
Member Author

@phofl phofl Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, this found a bug, we did not set out[i] at all, but since it was empty it was cast to True

@mroeschke
Copy link
Member

Small comment otherwise LGTM. Generally I am in favor of not propagating NA for both DataFrame/Series.duplicated as this PR does. Is it worth discussing that in a separate issue @jorisvandenbossche before merging?

@mroeschke mroeschke added this to the 1.6 milestone Sep 12, 2022
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small request for an additional test with non-duplicate NA, otherwise looks good!

@jorisvandenbossche
Copy link
Member

Is it worth discussing that in a separate issue @jorisvandenbossche before merging?

Given that this PR just preserves the current behaviour, that certainly doesn't need to block this PR from being merged.

If it would only be for Series.duplicated, I would take the time to open an issue to argue for propagating NAs, but since it's less clear how to deal with the DataFrame.duplicated case, I don't feel strongly enough about it.

@mroeschke mroeschke merged commit 94044c8 into pandas-dev:main Sep 13, 2022
@mroeschke
Copy link
Member

Thanks @phofl

@phofl phofl deleted the enh_support_mask_in_duplicated branch September 13, 2022 07:27
@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* ENH: Support mask in duplicated algorithm

* Fix mypy

* Add tests

* Improve test

* Fix bug
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff duplicated duplicated, drop_duplicates NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants