-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH/PERF: add ExtensionArray.duplicated #55255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH/PERF: add ExtensionArray.duplicated #55255
Conversation
pandas/core/algorithms.py
Outdated
values = values._to_masked() # type: ignore[union-attr] | ||
else: | ||
values = ( | ||
values._maybe_convert_datelike_array() # type: ignore[union-attr] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to implement ExtensionArray.duplicated
. This is now a bit more general in that it is faster for more than just the pyarrow timestamp and duration types. See updated timings in OP.
pandas/core/arrays/sparse/array.py
Outdated
def duplicated( | ||
self, keep: Literal["first", "last", False] = "first" | ||
) -> npt.NDArray[np.bool_]: | ||
return algos.duplicated(np.asarray(self), keep=keep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does sparse need to pass mask for it's NA values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to pass the mask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment otherwise looks good
asv_bench/benchmarks/algorithms.py
Outdated
@@ -80,7 +81,7 @@ class Duplicated: | |||
"datetime64[ns]", | |||
"datetime64[ns, tz]", | |||
"timestamp[ms][pyarrow]", | |||
"duration[s][pyarrow]", | |||
"string[pyarrow]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why getting rid of duration here?
i think at some pint "string" on L80 will become redundant with "string[pyarrow]" here, so maybe a TODO(3.0) to get rid of one when that happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original version of this PR only targeted pyarrow timestamp and duration types. The current version improves perf for a larger number of pyarrow types so thought I'd add a non-temporal type as well. I can add it back if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im fine either way, just curious. usually for asvs we go throw the kitchen sink at it but there are downsides to that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @lukemanley |
This patch may have induced a performance regression. If it was a necessary behavior change, this may have been expected and everything is okay. Please check the links below. If any ASVs are parameterized, the combinations of parameters that a regression has been detected for appear as subbullets. Subsequent benchmarks may have skipped some commits. The link below lists the commits that are between the two benchmark runs where the regression was identified. |
pd.Series.duplicated
/pd.core.algorithms.duplicated
fails with float-basedSparseArray
#48788doc/source/whatsnew/v2.2.0.rst
file if fixing a bug or adding a new feature.[updated]
asv continuous -f 1.1 upstream/main arrow-temporal-duplicated -b algorithms.Duplicated