Performance of ExtensionArray display in DataFrame/Series #43020

Closed
samgd opened this issue Aug 13, 2021 · 3 comments
Labels
Duplicate Report, ExtensionArray, Output-Formatting, Performance

Comments

@samgd

samgd commented Aug 13, 2021

When printing a DataFrame/Series that is backed by an ExtensionArray the values are first converted to a numpy array:

def _format_strings(self) -> list[str]:
    values = extract_array(self.values, extract_numpy=True)
    formatter = self.formatter
    if formatter is None:
        # error: Item "ndarray" of "Union[Any, Union[ExtensionArray, ndarray]]" has
        # no attribute "_formatter"
        formatter = values._formatter(boxed=True)  # type: ignore[union-attr]
    if isinstance(values, Categorical):
        # Categorical is special for now, so that we can preserve tzinfo
        array = values._internal_get_values()
    else:
        array = np.asarray(values)

This is problematic for ExtensionArrays that are very expensive to convert to numpy arrays: an extremely large ExtensionArray is converted to a numpy array in full, only to ultimately print ~10 values.

Can this be overcome or worked around? There is already a special case for Categorical here.
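
One possible direction, sketched here under assumptions, is to truncate before converting: slice out only the head and tail elements that will be displayed and convert just those. `LazyArray` and `format_truncated` are hypothetical names for this illustration and are not part of pandas; `to_list` stands in for the expensive `np.asarray(values)` call, and the sketch assumes that slicing the array is cheap.

```python
from typing import Dict, List


class LazyArray:
    """Hypothetical stand-in for an ExtensionArray that is costly to convert."""

    def __init__(self, indices: range, stats: Dict[str, int]) -> None:
        self._indices = indices
        self._stats = stats  # shared counter of converted elements

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, key: slice) -> "LazyArray":
        # Slicing is cheap: it only narrows the index range.
        return LazyArray(self._indices[key], self._stats)

    def to_list(self) -> List[int]:
        # Stand-in for np.asarray(values): cost grows with the length.
        self._stats["converted"] += len(self._indices)
        return list(self._indices)


def format_truncated(values: LazyArray, max_rows: int = 10) -> List[str]:
    """Convert only the elements that will actually be shown."""
    n = len(values)
    if n <= max_rows:
        return [str(v) for v in values.to_list()]
    head = max_rows // 2
    tail = max_rows - head
    # Slice first, convert second, so at most max_rows elements are converted.
    shown = values[:head].to_list() + values[n - tail:].to_list()
    strings = [str(v) for v in shown]
    return strings[:head] + ["..."] + strings[head:]


stats = {"converted": 0}
arr = LazyArray(range(1_000_000), stats)
print(format_truncated(arr, max_rows=4))
print(stats["converted"])  # number of elements converted
```

Only the displayed elements go through the expensive conversion here; whether the same reordering is practical inside pandas' formatting code is a separate question.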

@samgd samgd changed the title ExtensionArray display in DataFrame/Series Performance of ExtensionArray display in DataFrame/Series Aug 13, 2021
@samgd
Author

samgd commented Aug 13, 2021

Tagging @jreback as you may be able to help given the git blame?

@simonjayhawkins simonjayhawkins added the ExtensionArray and Performance labels Aug 16, 2021
@jbrockmendel
Member

PR to address this would be welcome

@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Aug 19, 2021
@mroeschke mroeschke added the Output-Formatting label Aug 21, 2021
@TomAugspurger
Contributor

I believe this is a duplicate of / covered by #26837.

#26837 (comment) has a proposed addition to the ExtensionArray interface to allow the EA to control the conversion from array -> List[str]
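
As a rough illustration of what an array-controlled conversion could look like, the sketch below lets the array supply its own strings and falls back to element-wise conversion otherwise. The hook name `_to_display_strings`, the `format_values` function, and the `CentsArray` class are all made up for this example; they are not the names proposed in #26837 and are not part of the pandas ExtensionArray interface.

```python
from typing import Any, Iterator, List


def format_values(values: Any) -> List[str]:
    """Format an array-like, letting the array override the conversion.

    `_to_display_strings` is a hypothetical hook used only in this sketch.
    """
    hook = getattr(values, "_to_display_strings", None)
    if callable(hook):
        # The array knows a cheaper route to strings than full conversion.
        return list(hook())
    # Fallback: materialise every element, then stringify it.
    return [str(v) for v in values]


class CentsArray:
    """Hypothetical array storing non-negative amounts as integer cents."""

    def __init__(self, cents: List[int]) -> None:
        self._cents = list(cents)

    def __iter__(self) -> Iterator[float]:
        # Stand-in for an expensive conversion to another representation.
        return (c / 100 for c in self._cents)

    def _to_display_strings(self) -> List[str]:
        # Format straight from the internal storage, skipping __iter__.
        return [f"{c // 100}.{c % 100:02d}" for c in self._cents]


print(format_values(CentsArray([150, 5])))
print(format_values([1, 2]))
```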

@TomAugspurger TomAugspurger added the Duplicate Report label Nov 15, 2021
@TomAugspurger TomAugspurger modified the milestones: Contributions Welcome, No action Nov 15, 2021
5 participants