BUG: fix DataFrame.apply returning wrong result when dealing with dtype (#28773) #30304

Closed
wants to merge 14 commits into from

Conversation

Reksbril
Contributor

@Reksbril Reksbril commented Dec 17, 2019

DataFrame.apply sometimes returned a wrong result when the passed
function depended on dtypes. The cause was retrieving
DataFrame.values for the whole DataFrame and applying the function
to that: the values are represented by a single NumPy array, which has
one dtype for all of its data. This sometimes caused objects in the
DataFrame to be treated as if they shared one common type. Notably,
the problem only occurred when applying the function on columns.
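The effect can be seen on a small mixed-dtype frame (a minimal sketch; the exact buggy output depends on the pandas version):

```python
import numpy as np
import pandas as pd

# A frame with one integer column and one string column.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# For mixed dtypes, DataFrame.values is a single object-dtype ndarray,
# so per-column dtype information is lost.
print(df.values.dtype)  # object

# A function that inspects each column's dtype: on affected versions,
# apply could hand it object-cast data for every column instead of the
# real int64/object pair.
print(df.apply(lambda col: col.dtype))
```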

The implemented solution "cuts" the DataFrame by columns and applies the
function to each part as if it were a whole DataFrame. After that, all
partial results are concatenated into the final result for the whole
DataFrame. The cuts are made in the following way: the first column is
taken, then we iterate through the following columns and add each one to
the first cut as long as its dtype is identical to that of the first
column. The process is then repeated for the rest of the DataFrame.
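The cutting step described above can be sketched as follows (a hypothetical helper, not the PR's actual code, using only public pandas API):

```python
import pandas as pd

def apply_by_dtype_runs(df, func):
    """Group consecutive columns that share a dtype, apply func to each
    homogeneous slice, then concatenate the partial results."""
    results = []
    dtypes = list(df.dtypes)
    start = 0
    for i in range(1, len(dtypes) + 1):
        # Close the current cut when the dtype changes or we hit the end.
        if i == len(dtypes) or dtypes[i] != dtypes[start]:
            chunk = df.iloc[:, start:i]  # homogeneous column slice
            results.append(chunk.apply(func))
            start = i
    return pd.concat(results, axis=1)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": ["x", "y"]})
out = apply_by_dtype_runs(df, lambda col: col.astype(str))
```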

Mateusz Górski added 2 commits December 17, 2019 13:13
…pe (pandas-dev#28773)

@pep8speaks

pep8speaks commented Dec 17, 2019

Hello @Reksbril! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 290:89: E501 line too long (101 > 88 characters)

Comment last updated at 2020-04-17 17:15:35 UTC

@jbrockmendel
Member

The rest of the changes are the result of black pandas reformatting.

I guess you're using a different version of black than our CI is? Regardless, can you revert the non-relevant changes so we can focus on the important parts?

@Reksbril
Contributor Author

@jbrockmendel done.

@jbrockmendel
Member

This is a lot of new code. A couple things:

  • Assuming this is the correct solution to this problem, the new code would belong in core.apply
  • the "cuts" are a lot like the DataFrame._data.blocks that already exist. Usually we discourage people from bothering with blocks but it is an option
  • The hard part of the problem is identifying when you should be operating column-wise (i.e. not call .values). Finding an elegant-ish way of doing that will be key here.

@Reksbril
Contributor Author

@jreback
I've resolved a minor problem which caused most of tests to fail.

* the "cuts" are a lot like the `DataFrame._data.blocks` that already exist.  Usually we discourage people from bothering with blocks but it is an option

What's the benefit of using DataFrame._data.blocks instead of the current solution? I'm just using DataFrame.iloc to get only part of the DF, and it seems to be a very "clean" approach.

* The hard part of the problem is identifying when you should be operating column-wise (i.e. not call .values).  Finding an elegant-ish way of doing that will be key here.

I'm not sure we can determine that at all. The obvious thing we can check is whether all dtypes are the same (I think the current code already does this, and a specific apply variant is used in that case). Since the functions given as input can be very complicated, the only solution I see is to feed some experimental input to the function to determine whether it relies on dtypes. However, that would be neither elegant nor easy to do (if it is even possible).

@jbrockmendel
Member

@Reksbril you're going to have to go through frame_apply and try to adapt the existing machinery rather than pile a new layer on top of it. Within FrameApply there is an apply_series_generator that exists precisely to have a function operate column-by-column. So a good solution to this problem would be to edit FrameApply.apply_standard so that this case goes through the apply_series_generator route.
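Conceptually, the series-generator route hands the function one real Series per column, so each call sees that column's own dtype (a rough model of the idea, not the actual FrameApply internals):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Iterating columns as Series preserves per-column dtypes, unlike a
# single .values array, which would be object-dtype for this frame.
func = lambda s: s.dtype.kind
kinds = {name: func(df[name]) for name in df.columns}
print(kinds)  # {'a': 'i', 'b': 'O'}
```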

Contributor

@jreback jreback left a comment


pls delve deeper into how apply works; we already have a column-by-column approach, your test case needs to inform when to use that

@jreback jreback added Apply Apply, Aggregate, Transform, Map Dtype Conversions Unexpected or buggy dtype conversions labels Dec 23, 2019
Mateusz Górski and others added 3 commits January 6, 2020 08:20
In the new solution, the existing machinery is used to apply the function
column-wise and to recreate the final result.
@Reksbril Reksbril requested a review from jreback January 6, 2020 20:51
@Reksbril
Contributor Author

@jreback ping

1 similar comment
@Reksbril
Contributor Author

@jreback ping

@jbrockmendel
Member

@Reksbril needs a rebase, I'll take a look

@Reksbril Reksbril requested a review from jbrockmendel January 14, 2020 09:04
Contributor

@jreback jreback left a comment


your patch makes this much more complex; pls try to simplify

):
and len(set(self.dtypes))
)
return_result = None
Contributor


why is this needed?

Contributor Author


I'm using it to determine if line 314 was run.

@@ -308,11 +312,19 @@ def apply_standard(self):
# reached via numexpr; fall back to python implementation
pass
else:
return self.obj._constructor_sliced(result, index=labels)
return_result = self.obj._constructor_sliced(result, index=labels)
Contributor


what are you trying to do here? we have a specific subclass for Series where this could live, but I am not clear what you are trying to catch


# compute the result using the series generator
results, res_index = self.apply_series_generator()

if flag and return_result is not None:
Contributor


what is this case trying to catch?

Contributor Author


There are 2 cases:

  1. can_reduce is False: the function doesn't change its behaviour compared to the previous version.
  2. can_reduce is True: return_result is calculated as before, but it is returned only if we don't apply on columns or don't have mixed dtypes. Otherwise I use apply_series_generator to obtain the column-by-column result, and after that I wrap it using _constructor_sliced. It's done this way to preserve the "old" behaviour; it was the solution to a problem where my function sometimes returned a Series instead of a DataFrame (or the other way around). I assumed that the already existing way of getting the result would be better than devising some new solution.


@Reksbril Reksbril requested a review from jreback January 16, 2020 09:04
@Reksbril
Contributor Author

@jreback ping

@WillAyd
Member

WillAyd commented Feb 12, 2020

@Reksbril looks like a merge conflict and CI failure - can you fix those up and someone will take a look

)
return_result = None

if can_reduce:

values = self.values
Member


@Reksbril the underlying problem here is that this call to self.values is casting to object dtype, which you want to avoid. If you add `and self.obj._is_homogeneous_type` to the check in 277-283, that should solve the whole thing.
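The suggested guard can be illustrated with a public-API equivalent of the `_is_homogeneous_type` property mentioned above (the helper name here is hypothetical):

```python
import pandas as pd

def is_homogeneous(df: pd.DataFrame) -> bool:
    # True only when every column shares a single dtype, in which case
    # consolidating through .values would not cast anything to object.
    return df.dtypes.nunique() == 1

homogeneous = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

print(is_homogeneous(homogeneous))  # True
print(is_homogeneous(mixed))        # False
```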

@jbrockmendel
Member

@Reksbril can you rebase and address comments

@Reksbril
Contributor Author

Yeah, I will. I'm sorry for the long delay.

@Reksbril
Contributor Author

I won't be able to finish this for a few more months. The current version is the best I could do right now.

@simonjayhawkins
Member

I won't be able to finish this for a few more months. The current version is the best I could do right now.

Thanks @Reksbril for working on this. closing to clear queue for now. ping when you want to continue and will reopen.

Labels
Apply Apply, Aggregate, Transform, Map Dtype Conversions Unexpected or buggy dtype conversions
Development

Successfully merging this pull request may close these issues.

apply sometimes unexpectedly casts int64 series to objects
6 participants