REF: reindex_indexer use np.void to avoid JoinUnit #43376
pandas/core/internals/concat.py
Outdated
@@ -245,6 +247,38 @@ def concatenate_managers(
    return BlockManager(tuple(blocks), axes)


def _reindex_columns_void(
I would rename this to _maybe_reindex_columns_na_proxy, i.e. don't use the term void, as we don't use it elsewhere and instead use na_proxy (I know the dtype is V, which is fine); alternatively, make this clear in the docstring here.
changed + green
greenish, only test_to_csv_compression_encoding_gcs is failing
ATM we do a kludgy thing where we sometimes reindex before calling concat (e.g. in DataFrame.append), which leads to all-NA columns that we choose to ignore when choosing the dtype for the concatenated column. This leads to a) funky behavior when we encounter a "real" all-NA column and b) a big performance hit in checking for all-NA columns.
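A rough illustration of both problems, using NumPy only. This is a simplified sketch, not the actual code in internals.concat; the helper names is_all_na and result_dtype_ignoring_all_na are hypothetical.

```python
import numpy as np


def is_all_na(values):
    # Hypothetical helper: a full O(n) scan is needed to learn that a
    # float column holds nothing but NaN.
    return values.dtype.kind == "f" and bool(np.isnan(values).all())


def result_dtype_ignoring_all_na(pieces):
    # Hypothetical helper: pick the result dtype while skipping all-NA pieces.
    kept = [p.dtype for p in pieces if not is_all_na(p)]
    if not kept:
        return np.dtype(np.float64)
    return np.result_type(*kept)


dates = np.array(["2021-01-01", "2021-01-02"], dtype="datetime64[ns]")
from_reindex = np.full(2, np.nan)       # filler produced by a prior reindex
user_data = np.array([np.nan, np.nan])  # a "real" all-NA column from the user

# The two float columns are indistinguishable, so both are ignored.
dtype_with_filler = result_dtype_ignoring_all_na([dates, from_reindex])
dtype_with_user_data = result_dtype_ignoring_all_na([dates, user_data])
```

Because the filler and the user's column look identical, the user's float64 column silently loses its say in the result dtype, and every column pays for the scan.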
Instead of doing that, this PR implements the keyword use_na_proxy (chosen to match ArrayManager's keyword) that ends up creating ndarrays with np.void dtype instead of all-NA float64 ndarrays.
This will allow us to simplify a bunch of code in internals.concat (e.g. get rid of the case where unit.block is None) and improve the funky behavior mentioned above. Both of these are left to follow-ups to keep the diff manageable.
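To make the na-proxy idea concrete, here is a minimal NumPy-only sketch. It assumes 1-D columns, and the helpers make_na_proxy, is_na_proxy and concat_columns are hypothetical names for illustration, not the functions added in this PR. The point is that a placeholder is recognized by its dtype (kind "V") without scanning values, and a real all-NaN float column stays distinguishable from it.

```python
import numpy as np


def make_na_proxy(length):
    # Hypothetical helper: a placeholder column whose dtype is np.void.
    # Its contents are never read; only the dtype matters.
    return np.empty(length, dtype=np.void)


def is_na_proxy(arr):
    # O(1) check on the dtype, no scan over the values.
    return arr.dtype.kind == "V"


def concat_columns(pieces):
    # Hypothetical helper: concatenate 1-D pieces, letting only the
    # non-proxy pieces decide the result dtype.
    real = [p for p in pieces if not is_na_proxy(p)]
    total = sum(len(p) for p in pieces)
    if not real:
        return np.full(total, np.nan)
    target = np.result_type(*[p.dtype for p in real])
    has_proxy = len(real) != len(pieces)
    if has_proxy and target.kind in "iu":
        # Integers cannot hold NaN, so upcast (assumption: simplified rule).
        target = np.result_type(target, np.float64)
    filled = [
        np.full(len(p), np.nan, dtype=target) if is_na_proxy(p) else p.astype(target)
        for p in pieces
    ]
    return np.concatenate(filled)


ints = np.array([1, 2], dtype=np.int64)
combined = concat_columns([ints, make_na_proxy(2)])
```

The sketch only covers integer, float and object pieces; dtypes with their own missing-value sentinel (e.g. datetime64 with NaT) would need a dtype-specific fill value.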