Converting ExtensionArrays to indices causes coercion to object #29426

Closed
Detry322 opened this issue Nov 6, 2019 · 1 comment
Labels: Duplicate Report (duplicate issue or pull request); ExtensionArray (extending pandas with custom dtypes or arrays); Index (related to the Index class or subclasses)

Comments

Detry322 commented Nov 6, 2019

Code Sample (from pandas source)

        # pandas/core/indexes/base.py:342
        # extension dtype
        elif is_extension_array_dtype(data) or is_extension_array_dtype(dtype):
            data = np.asarray(data)
            if not (dtype is None or is_object_dtype(dtype)):
                # coerce to the provided dtype
                ea_cls = dtype.construct_array_type()
                data = ea_cls._from_sequence(data, dtype=dtype, copy=False)

            # coerce to the object dtype
            data = data.astype(object)
            return Index(data, dtype=object, copy=copy, name=name, **kwargs)

Problem description

Right now, creating an index from an ExtensionArray coerces it to a NumPy array of Python objects (see the branch above). For many use cases this is expensive: many people switch to ExtensionArrays precisely to avoid NumPy arrays of objects in the first place! Ideally, it would be possible to store ExtensionArrays inside indices (or to provide an interface for ExtensionArrays to implement) so that a conversion to object dtype would not be necessary.
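
A minimal way to observe this from user code, as a sketch: it assumes the nullable `Int64` extension array that ships with pandas as a stand-in for any ExtensionArray, and simply compares the dtype before and after wrapping it in an `Index`. The result depends on the pandas version, so no output is claimed for the index.

```python
import pandas as pd

# Stand-in extension array (an assumption for illustration; any
# ExtensionArray would go through the same Index constructor).
arr = pd.array([1, 2, 3], dtype="Int64")
idx = pd.Index(arr)

print(arr.dtype)  # the extension dtype of the source array
print(idx.dtype)  # object on the code path quoted above; other pandas
                  # versions may keep the extension dtype instead
```

If the second line prints `object`, every element has been boxed into a Python object on the way into the index.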

For my use case, I have many 8-byte Python objects that I want to do fast operations on. I use an ExtensionArray to store a contiguous np.uint64 array that has the proper repr inside a Series and whose elements get lazily converted on access. I want to be able to use them inside indices without a massive performance degradation!
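
To make the storage pattern concrete, here is a rough sketch in plain NumPy. `LazyIds` and `to_object_array` are hypothetical names invented for this illustration (this is not my actual ExtensionArray, and not a pandas API); the point is only the contrast between boxing one element on access and boxing all of them up front.

```python
import numpy as np


class LazyIds:
    """Hypothetical container: packed uint64 storage, boxed only on access."""

    def __init__(self, values):
        # One contiguous buffer, 8 bytes per element.
        self._data = np.asarray(values, dtype=np.uint64)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        # Scalar access converts a single element; the rest stay packed.
        return int(self._data[i])

    def to_object_array(self):
        # Eager conversion: allocates one Python object per element,
        # which is what an object-dtype index needs.
        return self._data.astype(object)


ids = LazyIds(np.arange(100_000, dtype=np.uint64))
boxed = ids.to_object_array()

print(ids._data.dtype, ids._data.nbytes)  # packed storage and its size in bytes
print(boxed.dtype, type(boxed[0]))        # object array of Python ints
```

The eager path has to create a Python object for every element before the index can be used at all, which is the cost I would like to avoid.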

Expected Output

A version of pandas where ExtensionArrays are stored directly inside indices, as opposed to coercing to object.

Output of pd.show_versions()

Current master branch.

A big thank you to all of the pandas maintainers. Without your work, I wouldn't be here in the first place! Thanks for considering this issue.

jreback (Contributor) commented Nov 6, 2019

duplicate of #22861

there is a start of a PR linked there as well

this is non trivial to do; would happily take a PR for steps in this direction

jreback closed this as completed Nov 6, 2019
jreback added the Duplicate Report, ExtensionArray, and Index labels Nov 6, 2019
jreback added this to the No action milestone Nov 6, 2019