Skip to content

BUG: fix DataFrame.apply returning wrong result when dealing with dtype (#28773) #30304

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1161,6 +1161,7 @@ Other
- Bug in :meth:`DataFrame.append` that raised ``IndexError`` when appending with empty list (:issue:`28769`)
- Fix :class:`AbstractHolidayCalendar` to return correct results for
years after 2030 (now goes up to 2200) (:issue:`27790`)
- Bug in :meth:`DataFrame.apply` returning wrong result in some cases when dtype was involved in passed function (:issue:`28773`)
- Fixed :class:`~arrays.IntegerArray` returning ``inf`` rather than ``NaN`` for operations dividing by ``0`` (:issue:`27398`)
- Fixed ``pow`` operations for :class:`~arrays.IntegerArray` when the other value is ``0`` or ``1`` (:issue:`29997`)
- Bug in :meth:`Series.count` raises if use_inf_as_na is enabled (:issue:`29478`)
Expand Down
18 changes: 15 additions & 3 deletions pandas/core/apply.py
Original file line number Diff line number Diff line change
Expand Up @@ -274,13 +274,17 @@ def apply_standard(self):

# we cannot reduce using non-numpy dtypes,
# as demonstrated in gh-12244
if (
flag = (
self.result_type in ["reduce", None]
and not self.dtypes.apply(is_extension_array_dtype).any()
# Disallow complex_internals since libreduction shortcut
# cannot handle MultiIndex
and not isinstance(self.agg_axis, ABCMultiIndex)
):
and len(set(self.dtypes))
)
return_result = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using it to determine if line 314 was run.


if flag:

values = self.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Reksbril the underlying problem here is that this call to self.values is casting to object dtype which you want to avoid. if you add to the check in 277-283 and self.obj._is_homogeneous_type that should solve the whole thing

index = self.obj._get_axis(self.axis)
Expand Down Expand Up @@ -308,11 +312,19 @@ def apply_standard(self):
# reached via numexpr; fall back to python implementation
pass
else:
return self.obj._constructor_sliced(result, index=labels)
return_result = self.obj._constructor_sliced(result, index=labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you trying to do here? we have a specific subclass for Series where this could live, but I am not clear what you are trying to catch

if self.axis != 0 and self.axis != "index":
return return_result

# compute the result using the series generator
results, res_index = self.apply_series_generator()

if flag and return_result is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this case tryng to catch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 cases

  1. can_reduce is False - the function doesn't change its behaviour, compared to previous version
  2. can_reduce is True - the return_result is calculated like before, but it's returned only if we don't use apply on columns or we don't have mixed dtypes. In other case I use apply_series_generator to obtain column-by-column result and after that I'm wrapping it using _constructor_sliced. It's done this way to preserve the "old" way - it was solution to problem with my function sometimes returning Series instead of DataFrame (or the other way). I assumed, that already existing way of getting result would be better, than thinking about some new solutions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

results = np.array(list(results.values()))
return self.obj._constructor_sliced(
results, index=res_index, dtype=return_result.dtype
)

# wrap results
return self.wrap_results(results, res_index)

Expand Down
6 changes: 6 additions & 0 deletions pandas/tests/frame/test_apply.py
Original file line number Diff line number Diff line change
Expand Up @@ -691,6 +691,12 @@ def test_apply_dup_names_multi_agg(self):

tm.assert_frame_equal(result, expected)

def test_apply_get_dtype(self):
# GH 28773
df = DataFrame({"col_1": [1, 2, 3], "col_2": ["hi", "there", "friend"]})
expected = Series(data=["int64", "object"], index=["col_1", "col_2"])
tm.assert_series_equal(df.apply(lambda x: x.dtype), expected)

def test_apply_nested_result_axis_1(self):
# GH 13820
def apply_list(row):
Expand Down