ENH/PERF: ExtensionArray should offer a duplicated function #48747

Closed
2 of 3 tasks
tehunter opened this issue Sep 23, 2022 · 1 comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@tehunter
Contributor

tehunter commented Sep 23, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

In version 1.5.0, functions that rely on Series.duplicated (including DataFrame.duplicated with a single-column subset, and .drop_duplicates) go through pd.core.algorithms.duplicated, which calls _ensure_data. An ExtensionArray currently has no way to supply its own duplicated behavior, so when execution reaches pd.core.algorithms._ensure_data, it may be forced to fall back on np.asarray(values, dtype=object) if the ExtensionArray is not a coercible type. Here's the docstring for _ensure_data:

def _ensure_data(values: ArrayLike) -> np.ndarray:
    """
    routine to ensure that our data is of the correct
    input dtype for lower-level routines

    This will coerce:
    - ints -> int64
    - uint -> uint64
    - bool -> uint8
    - datetimelike -> i8
    - datetime64tz -> i8 (in local tz)
    - categorical -> codes

    Parameters
    ----------
    values : np.ndarray or ExtensionArray

    Returns
    -------
    np.ndarray
    """

This np.asarray call can be very expensive. For example, I have an ExtensionArray with several million rows backed by 10 Categorical/numerical arrays. np.asarray uses __iter__ to loop through my array and builds an np.ndarray of Python objects one element at a time. Unfortunately, I have no way to tell pandas that I have much more efficient, vectorized algorithms for computing the duplicates.
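To make the cost difference concrete, here is a minimal sketch rather than the actual ExtensionArray from this report. It assumes a composite array with just two integer fields (field_a and field_b are hypothetical names) and compares a per-row object path with a columnar path that never boxes rows. Both produce the same mask; only the second avoids building one Python object per row.

```python
import numpy as np
import pandas as pd

# Two hypothetical backing arrays standing in for the fields of a composite array.
rng = np.random.default_rng(0)
field_a = rng.integers(0, 50, size=10_000)
field_b = rng.integers(0, 50, size=10_000)

# Object path: one Python tuple per row, similar in spirit to the dtype=object fallback.
as_objects = pd.Series(list(zip(field_a.tolist(), field_b.tolist())), dtype=object)
slow = as_objects.duplicated(keep="first").to_numpy()

# Columnar path: the fields stay as integer arrays and are never boxed per row.
fast = pd.DataFrame({"a": field_a, "b": field_b}).duplicated(keep="first").to_numpy()

# Both paths agree on which rows are duplicates.
assert np.array_equal(slow, fast)
```

Timing the two paths (for example with timeit) on larger inputs should show the gap, though the exact ratio depends on the data and the pandas version.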

Feature Description

Add a duplicated method to ExtensionArray with the following signature:

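def duplicated(self, keep: Literal["first", "last", False]) -> npt.NDArray[np.bool_]:
    # Default implementation: returns a boolean array indicating duplicated values in the ExtensionArray.
    # Subclasses could override this with a faster, type-specific algorithm.
    # Note: inside the ExtensionArray the values are `self`; `_values` belongs to Series/Index.
    return pd.core.algorithms.duplicated(self, keep=keep)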

Then have IndexOpsMixin call that function instead of directly calling pd.core.algorithms.duplicated:

    @final
    def _duplicated(
        self, keep: Literal["first", "last", False] = "first"
    ) -> npt.NDArray[np.bool_]:
        # Since self._values can be ExtensionArray or np.ndarray, may need to add a type check here
        # and fall back on pd.core.algorithms.duplicated if an np.ndarray
        return self._values.duplicated(keep=keep)
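The dispatch described above can be sketched outside pandas as follows. This is an illustration under stated assumptions, not pandas internals: ToyArray is a hypothetical stand-in (not a real ExtensionArray subclass), and dispatch_duplicated is a hypothetical helper that plays the role of the type check mentioned in the comment. The fallback uses the public Series.duplicated instead of the private pd.core.algorithms.duplicated.

```python
from typing import Literal

import numpy as np
import pandas as pd


class ToyArray:
    """Hypothetical stand-in for an ExtensionArray; not a real pandas subclass."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self.calls = 0  # counts how often the array's own method is used

    def duplicated(self, keep: Literal["first", "last", False] = "first") -> np.ndarray:
        self.calls += 1
        # A real array would put its specialised, vectorized algorithm here.
        return pd.Series(self._data).duplicated(keep=keep).to_numpy()


def dispatch_duplicated(values, keep="first") -> np.ndarray:
    """Hypothetical helper: generic path for ndarrays, otherwise the array's own method."""
    if isinstance(values, np.ndarray):
        return pd.Series(values).duplicated(keep=keep).to_numpy()
    return values.duplicated(keep=keep)


toy = ToyArray([1, 2, 1, 3, 2])
from_method = dispatch_duplicated(toy, keep="last")
from_ndarray = dispatch_duplicated(np.array([1, 2, 1, 3, 2]), keep="last")
```

Both calls return the same mask, but only the first goes through ToyArray.duplicated, which is the hook a custom array would use to avoid the object conversion.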

Alternative Solutions

Currently, when duplicated is called with a multi-column subset, pandas routes each column through algorithms.factorize and then passes the results to get_group_index. Series.duplicated could possibly use a factorize-based algorithm for custom ExtensionArrays instead. That may eliminate the need for users to write their own duplicated check if they already have a custom factorize in place.
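A rough sketch of the factorize-based idea, using only public pandas functions. The helper name duplicated_via_factorize is hypothetical, and this is not how pandas implements Series.duplicated; it only shows that integer codes from factorize are enough to compute the mask.

```python
import numpy as np
import pandas as pd


def duplicated_via_factorize(values, keep="first") -> np.ndarray:
    """Hypothetical helper: find duplicates from integer codes instead of boxed objects."""
    # pd.factorize dispatches to the array's own factorize for ExtensionArrays.
    codes, _ = pd.factorize(values)
    # Equal values share a code, so duplicates among the codes are duplicates among the values.
    return pd.Series(codes).duplicated(keep=keep).to_numpy()


cat = pd.Categorical(["a", "b", "a", "c", "b"])
mask = duplicated_via_factorize(cat, keep="first")
```

For this input the mask matches pd.Series(cat).duplicated(). Missing values all receive the code -1 by default, so they are treated as duplicates of one another.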

Additional Context

This came out of PR #45534 as a result of #45236, so this might be viewed as a v1.5 regression. I hadn't implemented this ExtensionArray feature in my own code prior to 1.5, so I haven't backtested. It's potentially related to #27035 and #15929 as well.

@tehunter tehunter added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 23, 2022
@phofl
Member

phofl commented Sep 23, 2022

Hi, thanks for your report. This is a duplicate of #48424

@phofl phofl closed this as completed Sep 23, 2022