BUG: Fix getitem dtype preservation with multiindexes #51895
Changes from 7 commits
```diff
@@ -206,3 +206,20 @@ def test_multiindex_with_na_missing_key(self):
         )
         with pytest.raises(KeyError, match="missing_key"):
             df[[("missing_key",)]]
+
+    def test_multiindex_dtype_preservation(self):
+        # GH51261
+        columns = MultiIndex.from_tuples([("A", "B")], names=["lvl1", "lvl2"])
+        df = DataFrame(["value"], columns=columns).astype("category")
+        assert (df["A"].dtypes == "category").all()
```
Review comment: Could you use …
Reply: sure
```diff
+
+        # geopandas 1763 analogue
+        df = DataFrame(
+            [[1, 0], [0, 1]],
+            columns=[
+                ["foo", "foo"],
+                ["location", "location"],
+                ["x", "y"],
+            ],
+        ).assign(bools=Series([True, False], dtype="boolean"))
+        assert df["bools"].dtype == "boolean"
```
Review comment: Could you use …
Reply: For this case we actually want to be more explicit and check the extension dtype (…)
Review comment: I guess this is faster than iloc? I did not profile, but that's the only reason I can think of that it is done this way. To preserve the performance, I guess we can check something like `_can_fast_transpose` here as well?
Reply: But in the case where this matters, I expect that `loc` is a slice? And then I would expect `iloc` to be fast as well (it has separate paths for single blocks, AFAIR). But yes, something to verify: indexing the numpy array will always be faster, since you don't have the overhead of our indexing machinery, but if we also end up doing the same, the difference should be acceptable.
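The expectation that a top-level column lookup resolves to a slice can be checked with the public API. A small sketch with an illustrative index (an assumption, not code from this PR):

```python
import pandas as pd

# On a lexsorted (monotonic) MultiIndex, a partial key on the first
# level resolves to a contiguous slice rather than an integer array.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]])
loc = mi.get_loc("a")
print(type(loc), loc)
```

For non-contiguous or unsorted matches, `get_loc` can instead return a boolean mask, which is the case where an actual "take" happens downstream.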
Reply: (And I think a bit of extra overhead is also acceptable to avoid needing the custom `add_references` code here, as in #51944.)
Reply: Sounds good, as long as it is not too much overhead. We should check at least; that was my reasoning for adding the reference code instead of switching to iloc.
Reply: Small test: iloc is a bit slower, having a bit more overhead (mostly in creating the resulting MultiIndex columns, which we then override). But this is mostly fixed overhead regardless of the size of the data, and we are speaking about microseconds. The actual subsetting still happens with a slice. And just to be sure, I also compared against a non-slice case (where we do an actual "take") selecting the same subset.
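A sketch of this kind of timing comparison (the column layout, sizes, and repetition counts here are illustrative assumptions, not the original benchmark):

```python
import timeit

import numpy as np
import pandas as pd

# Frame with MultiIndex columns where the "foo" top level occupies a
# contiguous block, so its lookup resolves to a slice.
cols = pd.MultiIndex.from_product([["bar", "foo"], ["x", "y"]])
df = pd.DataFrame(np.random.randn(100_000, 4), columns=cols)

loc = df.columns.get_loc("foo")  # a slice on this sorted index

t_getitem = timeit.timeit(lambda: df["foo"], number=200)
t_iloc = timeit.timeit(lambda: df.iloc[:, loc], number=200)

# Non-slice case: a positional integer array forces an actual "take"
# of the same columns instead of a zero-copy slice.
take = np.arange(df.shape[1])[loc]
t_take = timeit.timeit(lambda: df.iloc[:, take], number=200)

print(f"df[key]: {t_getitem:.4f}s  iloc[slice]: {t_iloc:.4f}s  iloc[take]: {t_take:.4f}s")
```

Note that `df["foo"]` drops the selected level from the result's columns, while `df.iloc[:, loc]` keeps the full MultiIndex, which is part of the extra work being discussed.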