-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DataFrame.update bool dtype being converted to object #55509 #55634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
9183b31
af983c8
82ecad2
07046c0
e8ee233
4fd4102
85a107e
b2acbc4
c7958e0
2635c05
bb41261
a1786ba
79af494
ee6c3cf
ce1b7a7
fef821f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8857,11 +8857,11 @@ def update( | |
if not isinstance(other, DataFrame): | ||
other = DataFrame(other) | ||
|
||
other = other.reindex(self.index) | ||
indexes_intersection = self.index.intersection(other.index) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For a monotonic index, I'm seeing a perf increase
But a perf regression for nonmonotonic:
And a perf regression when the caller's index is small and the argument's index is large:
When there are duplicates and the index of the caller and argument are the same, both main and this PR succeed. When they are not the same, both fail but with different error messages
Also, when there is no overlap (the intersection is empty), this PR succeeds whereas main fails with the same error message. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A bit mixed when it comes to performance, but I'd guess sorted indices and having a large caller with a small updater is the common cause, and this is where it's an improvement. My recommendation would be to detect duplicate indices (in either the caller or the argument) and raise an informative error message, even when we would otherwise succeed (when there is no overlap). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cc @mroeschke - do you agree? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I think that makes sense to me. The error message for duplicates in this PR is not great compared to the error message on main. |
||
|
||
for col in self.columns.intersection(other.columns): | ||
this = self[col]._values | ||
that = other[col]._values | ||
this = self[col].loc[indexes_intersection]._values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could we avoid chained indexing here? |
||
that = other[col].loc[indexes_intersection]._values | ||
|
||
if filter_func is not None: | ||
mask = ~filter_func(this) | isna(that) | ||
|
@@ -8881,7 +8881,7 @@ def update( | |
if mask.all(): | ||
continue | ||
|
||
self.loc[:, col] = expressions.where(mask, this, that) | ||
self.loc[indexes_intersection, col] = expressions.where(mask, this, that) | ||
|
||
# ---------------------------------------------------------------------- | ||
# Data reshaping | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -177,3 +177,12 @@ def test_update_dt_column_with_NaT_create_column(self): | |
{"A": [1.0, 3.0], "B": [pd.NaT, pd.to_datetime("2016-01-01")]} | ||
) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
def test_update_preserve_column_dtype_bool(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add a test for when the index has duplicate values |
||
# GH#55509 | ||
df = DataFrame({"A": [True, True]}, index=[1, 2]) | ||
other = DataFrame({"A": [False]}, index=[1]) | ||
expected = DataFrame({"A": [False, True]}, index=[1, 2]) | ||
df.update(other) | ||
|
||
tm.assert_frame_equal(df, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a regression, so would go into 2.2.0.