REF: reindex_indexer use np.void to avoid JoinUnit #43376

Merged
merged 6 commits into from
Sep 8, 2021

jbrockmendel
Member

ATM we do a kludgy thing where we sometimes reindex before calling concat (e.g. in DataFrame.append), which produces all-NA columns that we then choose to ignore when picking the dtype for the concatenated column. This leads to a) funky behavior when we encounter a "real" all-NA column and b) a big performance hit from checking for all-NA columns.

Instead of doing that, this PR implements the keyword use_na_proxy (chosen to match ArrayManager's keyword) that ends up creating ndarrays with np.void dtype instead of all-NA float64 ndarrays.
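To illustrate the idea, here is a minimal NumPy-only sketch, not the code in this PR. The helper names `make_na_proxy`, `is_na_proxy`, and `concat_with_proxies` are hypothetical, and the sketch assumes 1-D numeric arrays only. It shows how a placeholder with `np.void` dtype can be recognized from its dtype alone, so dtype selection never has to scan values for all-NA columns, and a genuine all-NaN float64 column is not mistaken for a placeholder.

```python
import numpy as np


def make_na_proxy(shape):
    """Hypothetical helper: placeholder for a column that did not exist."""
    # A void-dtype array carries no meaningful values, only a shape.
    return np.empty(shape, dtype=np.void)


def is_na_proxy(arr):
    """A proxy is identified by dtype kind, without inspecting any values."""
    return arr.dtype.kind == "V"


def concat_with_proxies(arrays):
    """Concatenate 1-D numeric arrays, ignoring proxies when picking the dtype."""
    real_dtypes = [a.dtype for a in arrays if not is_na_proxy(a)]
    if not real_dtypes:
        raise ValueError("need at least one non-proxy array")

    target = np.result_type(*real_dtypes)
    has_proxy = len(real_dtypes) < len(arrays)
    # Assumption for this sketch: integer/bool results are upcast to float64
    # so that the proxy positions can hold NaN.
    if has_proxy and target.kind in "iub":
        target = np.dtype(np.float64)

    pieces = [
        np.full(a.shape, np.nan, dtype=target) if is_na_proxy(a) else a.astype(target)
        for a in arrays
    ]
    return np.concatenate(pieces)


ints = np.array([1, 2], dtype=np.int64)
proxy = make_na_proxy((2,))
real_all_nan = np.array([np.nan, np.nan], dtype=np.float64)

result = concat_with_proxies([ints, proxy])
```

In this sketch `is_na_proxy(real_all_nan)` is `False`, so a real all-NaN column still takes part in dtype selection, while `proxy` does not. Whether pandas handles the two cases exactly this way is left to the follow-ups mentioned in this PR.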

This will allow us to simplify a bunch of code in internals.concat (e.g. get rid of the case where unit.block is None) and improve the funky behavior mentioned above. Both of these are left to follow-ups to keep the diff manageable.

@jreback jreback added Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code labels Sep 4, 2021
@jreback jreback added this to the 1.4 milestone Sep 4, 2021
@@ -245,6 +247,38 @@ def concatenate_managers(
return BlockManager(tuple(blocks), axes)


def _reindex_columns_void(
Contributor

I would rename this to _maybe_reindex_columns_na_proxy, i.e. don't use the term "void", since we don't use it elsewhere; use na_proxy instead (I know the dtype is V, which is fine). Alternatively, make this clear in the docstring here.

Member Author

changed + green

@jbrockmendel
Member Author

Greenish; the only failure is test_to_csv_compression_encoding_gcs.

@jreback jreback merged commit 7d27588 into pandas-dev:master Sep 8, 2021
@jbrockmendel jbrockmendel deleted the ref-concat-void branch September 8, 2021 14:07