BUG: fix DataFrame.apply returning wrong result when dealing with dtype (#28773) #30304

Closed
wants to merge 14 commits into from

Conversation

Reksbril
Contributor

@Reksbril Reksbril commented Dec 17, 2019

DataFrame.apply sometimes returned a wrong result when the passed
function depended on dtypes. The cause was retrieving
DataFrame.values for the whole DataFrame and applying the function
to that: the values are represented by a single NumPy array, which has
one dtype for all of its data. This sometimes caused objects in the
DataFrame to be treated as if they shared one common type. Notably,
the problem only occurred when applying the function on columns.
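The effect can be seen on a small mixed-dtype frame (a minimal sketch; the exact buggy output depends on the pandas version):

```python
import numpy as np
import pandas as pd

# A frame with one integer column and one string column.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# For mixed dtypes, DataFrame.values is a single object-dtype ndarray,
# so per-column dtype information is lost.
print(df.values.dtype)  # object

# A function that inspects each column's dtype: on affected versions,
# apply could hand it object-cast data for every column instead of the
# real int64/object pair.
print(df.apply(lambda col: col.dtype))
```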

The implemented solution "cuts" the DataFrame by columns and applies the
function to each part as if it were a whole DataFrame. After that, all
partial results are concatenated into the final result for the whole
DataFrame. The cuts are made in the following way: the first column is
taken, then we iterate through the following columns and add each one to
the first cut as long as its dtype is identical to that of the first
column. The process is then repeated for the rest of the DataFrame.
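The cutting step described above can be sketched as follows (a hypothetical helper, not the PR's actual code, using only public pandas API):

```python
import pandas as pd

def apply_by_dtype_runs(df, func):
    """Group consecutive columns that share a dtype, apply func to each
    homogeneous slice, then concatenate the partial results."""
    results = []
    dtypes = list(df.dtypes)
    start = 0
    for i in range(1, len(dtypes) + 1):
        # Close the current cut when the dtype changes or we hit the end.
        if i == len(dtypes) or dtypes[i] != dtypes[start]:
            chunk = df.iloc[:, start:i]  # homogeneous column slice
            results.append(chunk.apply(func))
            start = i
    return pd.concat(results, axis=1)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": ["x", "y"]})
out = apply_by_dtype_runs(df, lambda col: col.astype(str))
```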

Mateusz Górski added 2 commits December 17, 2019 13:13
…pe (pandas-dev#28773)

@pep8speaks

pep8speaks commented Dec 17, 2019

Hello @Reksbril! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 290:89: E501 line too long (101 > 88 characters)

Comment last updated at 2020-04-17 17:15:35 UTC

@jbrockmendel
Member

The rest of the changes are the result of black pandas reformatting.

I guess you're using a different version of black than our CI is? Regardless, can you revert the non-relevant changes so we can focus on the important parts?

@Reksbril
Contributor Author

@jbrockmendel done.

@jbrockmendel
Member

This is a lot of new code. A couple things:

  • Assuming this is the correct solution to this problem, the new code would belong in core.apply
  • the "cuts" are a lot like the DataFrame._data.blocks that already exist. Usually we discourage people from bothering with blocks but it is an option
  • The hard part of the problem is identifying when you should be operating column-wise (i.e. not call .values). Finding an elegant-ish way of doing that will be key here.

@Reksbril
Contributor Author

@jreback
I've resolved a minor problem which caused most of tests to fail.

* the "cuts" are a lot like the `DataFrame._data.blocks` that already exist.  Usually we discourage people from bothering with blocks but it is an option

What's the benefit of using DataFrame._data.blocks instead of the current solution? I'm just using DataFrame.iloc to get only part of the DF, and it seems to be a very "clean" approach.

* The hard part of the problem is identifying when you should be operating column-wise (i.e. not call .values).  Finding an elegant-ish way of doing that will be key here.

I'm not sure we can determine that at all. The obvious thing we can check is whether all dtypes are the same (I think the current code already does this, and a specific apply variant is used in that case). Since the functions given as input can be very complicated, the only solution I see is to feed some experimental input to the function to determine whether it relies on dtypes. However, that would be neither elegant nor easy to do (if it is even possible).

@jbrockmendel
Member

@Reksbril you're going to have to go through frame_apply and try to adapt the existing machinery rather than pile a new layer on top of it. Within FrameApply there is an apply_series_generator that exists precisely to have a function operate column-by-column. So a good solution to this problem would be to edit FrameApply.apply_standard so that this case goes through the apply_series_generator route.
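Conceptually, the series-generator route hands the function one real Series per column, so each call sees that column's own dtype (a rough model of the idea, not the actual FrameApply internals):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Iterating columns as Series preserves per-column dtypes, unlike a
# single .values array, which would be object-dtype for this frame.
func = lambda s: s.dtype.kind
kinds = {name: func(df[name]) for name in df.columns}
print(kinds)  # {'a': 'i', 'b': 'O'}
```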

Contributor

@jreback jreback left a comment


pls delve deeper into how apply works; we already have a column-by-column approach, your test case needs to inform when to use that

@jreback jreback added Apply Apply, Aggregate, Transform, Map Dtype Conversions Unexpected or buggy dtype conversions labels Dec 23, 2019
Mateusz Górski and others added 3 commits January 6, 2020 08:20
In the new solution, the existing machinery is used to apply the function
column-wise and to recreate the final result.
@Reksbril Reksbril requested a review from jreback January 6, 2020 20:51
@Reksbril
Contributor Author

@jreback ping

1 similar comment
@Reksbril
Contributor Author

@jreback ping

@jbrockmendel
Member

@Reksbril needs a rebase, I'll take a look

@Reksbril Reksbril requested a review from jbrockmendel January 14, 2020 09:04
Contributor

@jreback jreback left a comment


your patch makes this much more complex; pls try to simplify

):
and len(set(self.dtypes))
)
return_result = None
Contributor


why is this needed?

Contributor Author


I'm using it to determine if line 314 was run.

@@ -308,11 +312,19 @@ def apply_standard(self):
# reached via numexpr; fall back to python implementation
pass
else:
return self.obj._constructor_sliced(result, index=labels)
return_result = self.obj._constructor_sliced(result, index=labels)
Contributor


what are you trying to do here? we have a specific subclass for Series where this could live, but I am not clear what you are trying to catch


# compute the result using the series generator
results, res_index = self.apply_series_generator()

if flag and return_result is not None:
Contributor


what is this case trying to catch?

Contributor Author


There are 2 cases:

  1. can_reduce is False: the function doesn't change its behaviour compared to the previous version.
  2. can_reduce is True: return_result is calculated as before, but it is returned only if we don't apply on columns or don't have mixed dtypes. Otherwise I use apply_series_generator to obtain the column-by-column result, and after that I wrap it using _constructor_sliced. It's done this way to preserve the "old" behaviour; it was the solution to a problem where my function sometimes returned a Series instead of a DataFrame (or the other way around). I assumed that the already existing way of getting the result would be better than devising some new solution.


@Reksbril Reksbril requested a review from jreback January 16, 2020 09:04
@Reksbril
Contributor Author

@jreback ping

@WillAyd
Member

WillAyd commented Feb 12, 2020

@Reksbril looks like a merge conflict and CI failure - can you fix those up and someone will take a look

)
return_result = None

if can_reduce:

values = self.values
Member


@Reksbril the underlying problem here is that this call to self.values is casting to object dtype, which you want to avoid. If you add `and self.obj._is_homogeneous_type` to the check in 277-283, that should solve the whole thing.
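The suggested guard can be illustrated with a public-API equivalent of the `_is_homogeneous_type` property mentioned above (the helper name here is hypothetical):

```python
import pandas as pd

def is_homogeneous(df: pd.DataFrame) -> bool:
    # True only when every column shares a single dtype, in which case
    # consolidating through .values would not cast anything to object.
    return df.dtypes.nunique() == 1

homogeneous = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

print(is_homogeneous(homogeneous))  # True
print(is_homogeneous(mixed))        # False
```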

@jbrockmendel
Member

@Reksbril can you rebase and address comments

@Reksbril
Contributor Author

Yeah, I will. I'm sorry for the long delay.

@Reksbril
Contributor Author

I won't be able to finish this for a few more months. The current version is the best I could do right now.

@simonjayhawkins
Member

I won't be able to finish this for a few more months. The current version is the best I could do right now.

Thanks @Reksbril for working on this. closing to clear queue for now. ping when you want to continue and will reopen.

Labels
Apply Apply, Aggregate, Transform, Map Dtype Conversions Unexpected or buggy dtype conversions
Development

Successfully merging this pull request may close these issues.

apply sometimes unexpectedly casts int64 series to objects
6 participants