-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: fix DataFrame.apply returning wrong result when dealing with dtype (#28773) #30304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…pe (pandas-dev#28773) The DataFrame.apply was sometimes returning wrong result when we passed function, that was dealing with dtypes. It was caused by retrieving the DataFrame.values of whole DataFrame, and applying the function to it: values are represented by NumPy array, which has one type for all data inside. It sometimes caused treating objects in DataFrame as if they had one common type. What's worth mentioning, the problem only existed, when we were applying function on columns. The implemented solution "cuts" the DataFrame by columns and applies function to each part, as it was whole DataFrame. After that, all results are concatenated into final result on whole DataFrame. The "cuts" are done in following way: the first column is taken, and then we iterate through next columns and take them into first cut while their dtype is identical as in the first column. The process is then repeated for the rest of DataFrame
Hello @Reksbril! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-04-17 17:15:35 UTC |
I guess you're using a different version of black than our CI is? Regardless, can you revert non-relevant changes so we can focus on the important parts |
@jbrockmendel done. |
This is a lot of new code. A couple things:
|
@jreback
What's the benefit from using
I'm not sure if we can determine that at all. The obvious thing we can check is if all dtypes are the same (I think it's already done in current code and some specific apply variant is used in that case). As the functions given in input can be very complicated, the only solution to this problem I see, is to give some experimental input to the function, to determine if it relies on dtypes. However, that would be neither elegant nor easy to do (if even possible). |
@Reksbril you're going to have to go through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls delve deeper in how apply; we already have a column-by-column approach, your test case needs to inform when to use that
In new solution, existing machinery is used to apply the function column-wise, and to recreate final result.
@jreback ping |
1 similar comment
@jreback ping |
@Reksbril needs rebase, ill take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your patch makes this much more complex; pls try to simplify
pandas/core/apply.py
Outdated
): | ||
and len(set(self.dtypes)) | ||
) | ||
return_result = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using it to determine if line 314 was run.
pandas/core/apply.py
Outdated
@@ -308,11 +312,19 @@ def apply_standard(self): | |||
# reached via numexpr; fall back to python implementation | |||
pass | |||
else: | |||
return self.obj._constructor_sliced(result, index=labels) | |||
return_result = self.obj._constructor_sliced(result, index=labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you trying to do here? we have a specific subclass for Series where this could live, but I am not clear what you are trying to catch
pandas/core/apply.py
Outdated
|
||
# compute the result using the series generator | ||
results, res_index = self.apply_series_generator() | ||
|
||
if flag and return_result is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this case tryng to catch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 cases
- can_reduce is False - the function doesn't change its behaviour, compared to previous version
- can_reduce is True - the return_result is calculated like before, but it's returned only if we don't use apply on columns or we don't have mixed dtypes. In other case I use apply_series_generator to obtain column-by-column result and after that I'm wrapping it using _constructor_sliced. It's done this way to preserve the "old" way - it was solution to problem with my function sometimes returning Series instead of DataFrame (or the other way). I assumed, that already existing way of getting result would be better, than thinking about some new solutions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback ping |
@Reksbril looks like a merge conflict and CI failure - can you fix those up and someone will take a look |
) | ||
return_result = None | ||
|
||
if can_reduce: | ||
|
||
values = self.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Reksbril the underlying problem here is that this call to self.values
is casting to object dtype which you want to avoid. if you add to the check in 277-283 and self.obj._is_homogeneous_type
that should solve the whole thing
@Reksbril can you rebase and address comments |
Yeah, I will. I'm sorry for long delay. |
I won't be able to finish this for a few more months. The current version is the best I could do right now. |
Thanks @Reksbril for working on this. closing to clear queue for now. ping when you want to continue and will reopen. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The DataFrame.apply was sometimes returning wrong result when we passed
function, that was dealing with dtypes. It was caused by retrieving
the DataFrame.values of whole DataFrame, and applying the function
to it: values are represented by NumPy array, which has one
type for all data inside. It sometimes caused treating objects
in DataFrame as if they had one common type. What's worth mentioning,
the problem only existed, when we were applying function on columns.
The implemented solution "cuts" the DataFrame by columns and applies
function to each part, as it was whole DataFrame. After that, all
results are concatenated into final result on whole DataFrame.
The "cuts" are done in following way: the first column is taken, and
then we iterate through next columns and take them into first cut
while their dtype is identical as in the first column. The process
is then repeated for the rest of DataFrame.