Skip to content

BUG: Fix dataframe.update not updating dtype #55509 v2 #57637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 2, 2024
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
15 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -264,6 +264,7 @@ Bug fixes
~~~~~~~~~
- Fixed bug in :meth:`DataFrame.join` inconsistently setting result index name (:issue:`55815`)
- Fixed bug in :meth:`DataFrame.to_string` that raised ``StopIteration`` with nested DataFrames. (:issue:`16098`)
- Fixed bug in :meth:`DataFrame.update` bool dtype being converted to object (:issue:`55509`)
- Fixed bug in :meth:`Series.diff` allowing non-integer values for the ``periods`` argument. (:issue:`56607`)

Categorical
Expand Down
23 changes: 19 additions & 4 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -8692,6 +8692,10 @@ def update(
dict.update : Similar method for dictionaries.
DataFrame.merge : For column(s)-on-column(s) operations.

Notes
--------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having the ------- longer than the title used to make the CI crash. Not sure why the CI is happy now, but maybe worth making Notes and it the same length in case it fails again in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it has the same size.

1. Duplicate indices on `other` are not supported and raises `ValueError`.

Examples
--------
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [400, 500, 600]})
Expand Down Expand Up @@ -8764,11 +8768,22 @@ def update(
if not isinstance(other, DataFrame):
other = DataFrame(other)

other = other.reindex(self.index)
if other.index.has_duplicates:
raise ValueError("Update not allowed with duplicate indexes on other.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a note that this is not supported to the docstring. There isn't currently a notes section, something like this:

pandas/pandas/core/frame.py

Lines 1450 to 1454 in 737d390

Notes
-----
1. Because ``iterrows`` returns a Series for each row,
it does **not** preserve dtypes across the rows (dtypes are
preserved across columns for DataFrames).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach Done.

DataFrame.reindex raises on the same condition and does not mention it on the docs. What about adding a similar note there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference on keeping this PR just modifying update, but no strong objection to modifying the reindex docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It can be done in another PR.


rows = other.index.intersection(self.index)
if rows.empty:
raise ValueError(
"Can't update dataframe when other has no index in common with "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No big deal, but in the previous error message you say "update not allowed" and here "can't update dataframe" to mean the same thing. Not sure how we usually write error messages, but I think it'd be better to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with keeping "update not allowed" because it states that it is possible but not allowed. Done,

"this dataframe."
)

other = other.reindex(rows)
this_data = self.loc[rows]

for col in self.columns.intersection(other.columns):
this = self[col]._values
that = other[col]._values
this = this_data[col]
that = other[col]

if filter_func is not None:
mask = ~filter_func(this) | isna(that)
Expand All @@ -8788,7 +8803,7 @@ def update(
if mask.all():
continue

self.loc[:, col] = self[col].where(mask, that)
self.loc[rows, col] = this.where(mask, that)

# ----------------------------------------------------------------------
# Data reshaping
Expand Down
44 changes: 44 additions & 0 deletions pandas/tests/frame/methods/test_update.py
Original file line number Diff line number Diff line change
Expand Up @@ -184,3 +184,47 @@ def test_update_dt_column_with_NaT_create_column(self):
{"A": [1.0, 3.0], "B": [pd.NaT, pd.to_datetime("2016-01-01")]}
)
tm.assert_frame_equal(df, expected)

@pytest.mark.parametrize(
"value_df, value_other, dtype",
[
(True, False, bool),
(1, 2, int),
(np.uint64(1), np.uint(2), np.dtype("uint64")),
(1.0, 2.0, float),
(1.0 + 1j, 2.0 + 2j, complex),
("a", "b", pd.StringDtype()),
(
pd.to_timedelta("1 ms"),
pd.to_timedelta("2 ms"),
np.dtype("timedelta64[ns]"),
),
(
np.datetime64("2000-01-01T00:00:00"),
np.datetime64("2000-01-02T00:00:00"),
np.dtype("datetime64[ns]"),
),
],
)
def test_update_preserve_dtype(self, value_df, value_other, dtype):
# GH#55509
df = DataFrame({"a": [value_df] * 2}, index=[1, 2])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't seem to use the dtype parameter in any fixture. Did you forget to specify it in the constructors, or am I missing something?

Copy link
Contributor Author

@aureliobarbosa aureliobarbosa Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and suggestions @datapythonista .

I will address all points as soon as possible (hope to finish by tomorrow).

Regarding the tests, it's an interesting that it is not there! I will take a more detailed look on them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is, why do you have a parameter for the dtype in the pytest parametrization, if it's not being used? I had the impression that you are planning to set the dtype explicitly when creating the dataframe, but maybe you forgot. If it's not needed, maybe better to remove from the pytest parameters?

Copy link
Contributor Author

@aureliobarbosa aureliobarbosa Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista
You are absolutely right!

My tests were originally base on the tests proposed by @MichaelTiemannOSC. I tried to turn them simpler and similar to other tests in the same test set, but since they were failing on main and passing this PR I forgot about them.

This is addressed in the last two commits, which make two of the tests that I propose strictier. Nothing changes but those tests are better now! Thanks for the guidance on this point.

EDIT: I worked on this first because it looked more potentially problematic. As soon as possible the other points will be addressed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complementing, I think it is better to let the dtypes to be explicitly stated on those cases. The reason is that in this way it is possible to check dtype consistency for native python int, numpy, and Extended Arrays, as expected through the 'parametrization'.

other = DataFrame({"a": [value_other]}, index=[1])
expected = DataFrame({"a": [value_other, value_df]}, index=[1, 2])
df.update(other)
tm.assert_frame_equal(df, expected)

def test_update_raises_on_duplicate_argument_index(self):
# GH#55509
df = DataFrame({"a": [1, 1]}, index=[1, 2])
other = DataFrame({"a": [2, 3]}, index=[1, 1])
with pytest.raises(ValueError, match="duplicate index"):
df.update(other)

def test_update_on_duplicate_frame_unique_argument_index(self):
# GH#55509
df = DataFrame({"a": [1, 1, 1]}, index=[1, 1, 2])
other = DataFrame({"a": [2, 3]}, index=[1, 2])
expected = DataFrame({"a": [2, 2, 3]}, index=[1, 1, 2])
df.update(other)
tm.assert_frame_equal(df, expected)