ENH/PERF: ExtensionArray should offer a duplicated function #48747
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
In version 1.5.0, functions that use `Series.duplicated` (including `DataFrame.duplicated` with a single-column subset, and `.drop_duplicates`) go through `pd.core.algorithms.duplicated`, which calls `_ensure_data`. Currently there is no method for an `ExtensionArray` to offer its own `duplicated` behavior, so when execution reaches `pd.core.algorithms._ensure_data`, it may be forced to fall back on `np.asarray(values, dtype=object)` if the `ExtensionArray` is not a coercible type; this object-dtype fallback is what the documentation for `_ensure_data` describes.
This `np.asarray` call can be very expensive. For example, I have an ExtensionArray with several million rows backed by 10 Categorical/numerical arrays. The `np.asarray` function uses `__iter__` to loop through my array and construct an `np.array` out of base objects. This is a very expensive task. Unfortunately, I have no means of hinting to pandas that I have much more efficient algorithms for computing the duplicates, which can take advantage of vectorization; the sketch below illustrates the cost gap.
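The scale of that cost can be illustrated with a rough, self-contained sketch. This is not the pandas source and the workload is synthetic (a `Categorical` actually has a fast path in pandas; it is only used here to generate data); the point is to contrast a vectorized duplicate check on integer codes with the same check after every element has been boxed into an object array, which is roughly what the `np.asarray(values, dtype=object)` fallback forces on a custom ExtensionArray:

```python
import numpy as np
import pandas as pd

# Illustration only: compare duplicate detection on small integer codes with
# the same detection after object boxing, which is roughly the cost an EA pays
# when _ensure_data falls back to np.asarray(values, dtype=object).
n = 1_000_000
codes = np.random.randint(0, 1_000, size=n)
cat = pd.Categorical.from_codes(codes, categories=[f"cat{i}" for i in range(1_000)])

dup_fast = pd.Series(codes).duplicated().to_numpy()   # vectorized, int64 codes

boxed = np.asarray(cat, dtype=object)                 # boxes a million scalars
dup_slow = pd.Series(boxed).duplicated().to_numpy()   # object-dtype hashing

assert (dup_fast == dup_slow).all()                   # same answer, very different cost
```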
Feature Description

Add a `duplicated` method to `ExtensionArray` with the following signature:
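The code block from the original issue is not reproduced here, so the following is only a guess at the shape of the proposed method, modeled on `Series.duplicated`; the exact typing and the object-dtype default fallback are assumptions:

```python
from __future__ import annotations

from typing import Literal

import numpy as np


class ExtensionArray:  # stand-in for pandas.core.arrays.base.ExtensionArray
    def duplicated(
        self, keep: Literal["first", "last", False] = "first"
    ) -> np.ndarray:
        """Return a boolean mask marking duplicate values, like Series.duplicated.

        The base class could preserve today's behavior (materialize to an
        object ndarray and hash), while EA authors override this with a
        vectorized implementation suited to their storage.
        """
        from pandas.core import algorithms

        return algorithms.duplicated(np.asarray(self, dtype=object), keep=keep)
```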
Then have `IndexOpsMixin` call that function instead of directly calling `pd.core.algorithms.duplicated`:
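Again the original snippet is not preserved, so here is a paraphrased sketch of the dispatch it proposes, written as a free function rather than the actual `IndexOpsMixin` method (the `_duplicated` name and the use of `self._values` are assumptions about pandas internals):

```python
from typing import Literal

import numpy as np

from pandas.api.extensions import ExtensionArray
from pandas.core import algorithms


def _duplicated(self, keep: Literal["first", "last", False] = "first") -> np.ndarray:
    # Sketch of an IndexOpsMixin method body: dispatch to the array's own
    # duplicated() when the backing values are an ExtensionArray, otherwise
    # keep the existing pd.core.algorithms.duplicated path unchanged.
    values = self._values
    if isinstance(values, ExtensionArray):
        return values.duplicated(keep=keep)
    return algorithms.duplicated(values, keep=keep)
```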
Alternative Solutions

Currently, for a `duplicated` with a multiple-column subset, pandas routes each column through `algorithms.factorize` and then passes the result through `get_group_index`. The Series `duplicated` function could instead implement an algorithm based on `factorize` for custom `ExtensionArray`s. This may eliminate the need for users to write their own `duplicated` check if they already have a custom `factorize` in place; a sketch of that approach follows.
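For the single-array case, a hedged sketch of what that factorize-based route might look like, using only public pandas API (the helper name is made up, and NA handling is simplified):

```python
from typing import Literal

import numpy as np
import pandas as pd


def duplicated_via_factorize(
    values, keep: Literal["first", "last", False] = "first"
) -> np.ndarray:
    # Equal values share a factorize code, so duplicate detection on the int64
    # codes matches detection on the values themselves. With
    # use_na_sentinel=False (pandas >= 1.5), missing values also get a real
    # code, matching Series.duplicated's treatment of NA as equal to NA.
    codes, _ = pd.factorize(values, use_na_sentinel=False)
    return pd.Series(codes).duplicated(keep=keep).to_numpy()


# Any ExtensionArray with an efficient factorize benefits automatically:
arr = pd.array(["a", "b", "a", None, None], dtype="string")
print(duplicated_via_factorize(arr))           # [False False  True False  True]
print(pd.Series(arr).duplicated().to_numpy())  # same result via the current path
```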
Additional Context

This came out of PR #45534 as a result of #45236, so this might be viewed as a v1.5 regression; I hadn't implemented this `ExtensionArray` feature in my own code prior to 1.5, so I haven't backtested it. It's also potentially related to #27035 and #15929.