-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Support mask in duplicated algorithm #48150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support mask in duplicated algorithm #48150
Conversation
@@ -1041,6 +1041,9 @@ def duplicated( | |||
------- | |||
duplicated : ndarray[bool] | |||
""" | |||
if hasattr(values, "dtype") and isinstance(values.dtype, BaseMaskedDtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes, that we don't want to add duplicated to the ea interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add an issue to discuss this. Kinda unfortunate we need this exception for BasedMaskedDtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah not happy about this either, will open a follow up issue to discuss
@mroeschke would you mind having a look? |
@@ -158,9 +167,16 @@ cdef duplicated_{{dtype}}(const {{dtype}}_t[:] values, object keep='first'): | |||
with nogil: | |||
{{endif}} | |||
for i in range(n): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if we could template for i in ...
based on {{ if keep == "first"/"last" }}
to deduplicate this similar logic, but could be a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good idea, would do as a follow up to keep diff reviewable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need any additional testing for this?
Good point, assumed we had some but could not finde them 😂 will add some for all three cases |
This is a much bigger change, but from looking at this PR, I started wondering: should we see multiple NAs as duplicated, or should we rather propagate NAs? |
IMO if I was using I think a natural follow up question with propagating NAs would be "How can i check whether this row with NA is a duplicate of this row with NA?" |
Ah, good point, I wasn't considering the DataFrame case where you are checking for duplicated rows (multiple columns at once), but I was only thinking about the more simple Series.duplicated (in which case the position doesn't matter, it's just which values are NA in that Series). Yes, for DataFrames it certainly makes sense to consider NAs at the same location as equal, when considering whether two rows are equal (that's eg also what do in |
Added the tests I think this is a topic worth having a discussion, from a users perspective it would be a bit confusing If we consider them equal in one case and unequal in another case. Would this impact operations like isin and merges and similar? |
@mroeschke are you ok with merging this as is? |
) | ||
def test_duplicated_mask(keep, vals): | ||
# GH#48150 | ||
ser = Series([1, 2, NA, NA], dtype="Int64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Could you add one more NA
at the end? IIRC I think technically the last
else:
out[i] = 1
Doesn't have coverage with this test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, thx. Added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we should probably also cover the case when out[i]
does not get set, i.e. there is only a single NA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx, this found a bug, we did not set out[i]
at all, but since it was empty it was cast to True
Small comment otherwise LGTM. Generally I am in favor of not propagating NA for both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small request for an additional test with non-duplicate NA, otherwise looks good!
Given that this PR just preserves the current behaviour, that certainly doesn't need to block this PR from being merged. If it would only be for Series.duplicated, I would take the time to open an issue to argue for propagating NAs, but since it's less clear how to deal with the DataFrame.duplicated case, I don't feel strongly enough about it. |
Thanks @phofl |
* ENH: Support mask in duplicated algorithm * Fix mypy * Add tests * Improve test * Fix bug
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.cc @jorisvandenbossche
Benchmarks don't show any changes for duplicated.