BUG: DataFrame.update bool dtype being converted to object #55509 #55634

Closed · wants to merge 16 commits · showing changes from 2 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.2.rst
@@ -27,6 +27,7 @@ Bug fixes
- Fixed bug in :meth:`DataFrame.__setitem__` not inferring string dtype for zero-dimensional array with ``infer_string=True`` (:issue:`55366`)
- Fixed bug in :meth:`DataFrame.idxmin` and :meth:`DataFrame.idxmax` raising for arrow dtypes (:issue:`55368`)
- Fixed bug in :meth:`DataFrame.interpolate` raising incorrect error message (:issue:`55347`)
- Fixed bug in :meth:`DataFrame.update` bool dtype being converted to object (:issue:`55509`)
Review comment (Member):

This isn't a regression, so would go into 2.2.0.

- Fixed bug in :meth:`Index.insert` raising when inserting ``None`` into :class:`Index` with ``dtype="string[pyarrow_numpy]"`` (:issue:`55365`)
- Fixed bug in :meth:`Series.all` and :meth:`Series.any` not treating missing values correctly for ``dtype="string[pyarrow_numpy]"`` (:issue:`55367`)
- Fixed bug in :meth:`Series.floordiv` for :class:`ArrowDtype` (:issue:`55561`)
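For context, the behavior this whatsnew entry describes can be reproduced with a minimal sketch (hypothetical repro, not taken from the PR):

```python
import pandas as pd

# On affected versions, update() upcast the bool column to object dtype;
# with the fix, the column keeps its bool dtype.
df = pd.DataFrame({"A": [True, True]}, index=[1, 2])
other = pd.DataFrame({"A": [False]}, index=[1])
df.update(other)
print(df["A"].tolist())  # [False, True]
```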
8 changes: 4 additions & 4 deletions pandas/core/frame.py
@@ -8857,11 +8857,11 @@ def update(
        if not isinstance(other, DataFrame):
            other = DataFrame(other)

-        other = other.reindex(self.index)
+        indexes_intersection = self.index.intersection(other.index)
Review comment (@rhshadrach, Member, Oct 28, 2023):

For a monotonic index, I'm seeing a perf increase

import numpy as np
import pandas as pd

size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=range(1, size+1))
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])

%timeit df1.update(df2)
# 1.33 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- main
# 372 µs ± 663 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)    <-- PR

But a perf regression for nonmonotonic:

size = 100_000
index = list(range(1, size+1))
index[1], index[0] = index[0], index[1]
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=index)
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])
%timeit df1.update(df2)
# 1.1 ms ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- main
# 2.52 ms ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-- PR

And a perf regression when the caller's index is small and the argument's index is large:

size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 11, 12])
df2 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
%timeit df1.update(df2)
# 186 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  <-- main
# 329 µs ± 595 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)   <-- PR

When there are duplicates and the indexes of the caller and argument are the same, both main and this PR succeed. When they are not the same, both fail, but with different error messages:

size = 10
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
df2 = pd.DataFrame({'a': [-1, -2, -3]}, index=[4, 4, 6])
df1.update(df2)
# ValueError: cannot reindex on an axis with duplicate labels  <-- main
# ValueError: operands could not be broadcast together with shapes (3,) (2,) (3,)  <-- PR

Also, when there is no overlap (the intersection is empty), this PR succeeds whereas main fails with the same error message.
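The two alignment strategies being benchmarked can be sketched side by side (illustrative, with a smaller size):

```python
import numpy as np
import pandas as pd

size = 1_000
df1 = pd.DataFrame({"a": np.arange(size)}, index=range(1, size + 1))
df2 = pd.DataFrame({"a": [-1, -2, -3]}, index=[10, 12, 11])

# main: align the updater to the caller's full index; work scales with len(df1)
aligned = df2.reindex(df1.index)

# PR: operate only on the overlapping labels; work scales with the overlap
overlap = df1.index.intersection(df2.index)
```

The reindex produces a frame of `size` rows that is mostly NaN, while the intersection is just the three overlapping labels, which is why the PR wins when the caller is large and the updater small.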

Review comment (Member):

A bit mixed when it comes to performance, but I'd guess sorted indices and a large caller with a small updater is the common case, and this is where it's an improvement.

My recommendation would be to detect duplicate indices (in either the caller or the argument) and raise an informative error message, even when we would otherwise succeed (when there is no overlap).
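The suggested guard could look something like this (a hypothetical sketch of the recommendation, not code from the PR):

```python
import pandas as pd

def raise_on_duplicate_index(caller: pd.DataFrame, other: pd.DataFrame) -> None:
    # Detect duplicate labels up front and fail with an informative message,
    # rather than surfacing an opaque reindex/broadcast error later.
    if not caller.index.is_unique:
        raise ValueError("update() does not support duplicate labels in the caller's index")
    if not other.index.is_unique:
        raise ValueError("update() does not support duplicate labels in the argument's index")
```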

Review comment (Member):

cc @mroeschke - do you agree?

Review comment (Member):

Yeah I think that makes sense to me. The error message for duplicates in this PR is not great compared to the error message on main.


        for col in self.columns.intersection(other.columns):
-            this = self[col]._values
-            that = other[col]._values
+            this = self[col].loc[indexes_intersection]._values
Review comment (Member):

Could we avoid chained indexing here?
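The suggestion amounts to collapsing the two chained lookups into a single `.loc` call, e.g. (illustrative sketch; `_values` is the private accessor already used in the diff):

```python
import pandas as pd

other = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 11, 12])
idx = pd.Index([10, 12])

chained = other["a"].loc[idx]._values  # column lookup, then row lookup
single = other.loc[idx, "a"]._values   # one indexing operation

print(list(single))  # [1, 3]
```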

+            that = other[col].loc[indexes_intersection]._values

            if filter_func is not None:
                mask = ~filter_func(this) | isna(that)
@@ -8881,7 +8881,7 @@ def update(
            if mask.all():
                continue

-            self.loc[:, col] = expressions.where(mask, this, that)
+            self.loc[indexes_intersection, col] = expressions.where(mask, this, that)

# ----------------------------------------------------------------------
# Data reshaping
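For reference, `expressions.where(mask, this, that)` follows `numpy.where` semantics, keeping the caller's value where the mask is True (sketch using numpy directly):

```python
import numpy as np

mask = np.array([True, False, True])  # True: keep the existing value
this = np.array([1, 2, 3])            # caller's current values
that = np.array([10, 20, 30])         # candidate updates
out = np.where(mask, this, that)
print(out.tolist())  # [1, 20, 3]
```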
10 changes: 10 additions & 0 deletions pandas/core/generic.py
@@ -8785,6 +8785,16 @@ def clip(
3 -1 6
4 5 -4

Clips using specific lower and upper thresholds per column:

>>> df.clip([-2, -1], [4, 5])
   col_0  col_1
0      4     -1
1     -2     -1
2      0      5
3     -1      5
4      4     -1

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])
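The new docstring example can be checked against the `df` defined earlier in the `clip` docstring (assuming the standard example frame from that docstring):

```python
import pandas as pd

df = pd.DataFrame({"col_0": [9, -3, 0, -1, 5], "col_1": [-2, -7, 6, 8, -5]})
result = df.clip([-2, -1], [4, 5])  # per-column lower and upper bounds
print(result["col_0"].tolist())
print(result["col_1"].tolist())
```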
9 changes: 9 additions & 0 deletions pandas/tests/frame/methods/test_update.py
@@ -177,3 +177,12 @@ def test_update_dt_column_with_NaT_create_column(self):
            {"A": [1.0, 3.0], "B": [pd.NaT, pd.to_datetime("2016-01-01")]}
        )
        tm.assert_frame_equal(df, expected)

    def test_update_preserve_column_dtype_bool(self):
Review comment (Member):

Can you add a test for when the index has duplicate values?

        # GH#55509
        df = DataFrame({"A": [True, True]}, index=[1, 2])
        other = DataFrame({"A": [False]}, index=[1])
        expected = DataFrame({"A": [False, True]}, index=[1, 2])
        df.update(other)

        tm.assert_frame_equal(df, expected)