-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Fixed DF.apply for functions returning a dict, #8735 #10740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -142,6 +142,46 @@ Other enhancements | |||
|
|||
- ``pd.pivot`` will now allow passing index as ``None`` (:issue:`3962`). | |||
|
|||
- ``DataFrame.apply`` will return a Series of dicts if the passed function returns a dict and ``reduce=True`` (:issue:`8735`). | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just the issue line is fine. This is not a very common case.
Thanks, fixed. |
pls add a test cases, including |
Done! I found another issue when the original DataFrame is of dtype int, and I added tests for int and str dtypes. |
# Unlike filling with NA, this works for any dtype | ||
index = self._get_axis(axis) | ||
empty_arr = np.empty(len(index), dtype=values.dtype) | ||
dummy = Series(empty_arr, index=self._get_axis(axis), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just pass index here
The issue was with dtype=int (and maybe others) where NaN can't be casted to the dtype. The original code raised an exception when trying to do I updated the test to check one object DataFrame and one int DataFrame. The dtype issue happens with int, but not object, so I think that covers the relevant cases. Also, I noticed the Travis CI build failed. The only error is: |
that Travis error is legit run your tests locally and see if u can reproduce as its related to the change |
I tried "nosetests pandas" and I get "OK (SKIP=436)". The error is in the STATA format I/O tests, which makes me think it could be a random issue. It doesn't look like |
index = self._get_axis(axis) | ||
empty_arr = np.empty(len(index), dtype=values.dtype) | ||
dummy = Series(empty_arr, index=self._get_axis(axis), | ||
dtype=values.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a test that will isolate this change, as the code is wrapped in DF._apply_standard, and the dummy array isn't returned. Part of the previous issue was that all exceptions were caught, so when it failed to create the dummy array, this branch silently failed. I took the dummy generation code outside of the try block, so if it fails, it will raise an exception in the one test I added. Otherwise, I'm not sure what else I can do to test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well something must have caused you to change it. what was that? The point is we cannot make changes that are not tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is the previous code couldn't create an empty series of ints. I guess I could make this into a Series.empty_like
class method and add tests for that, then replace this block with a single call to that method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or I could fix Series(index=..., dtype=int)
to return a series of 0's or something, but I would have to make sure there's a well-defined empty/zero value for any dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before you actually change anything. a test would be helpful.
@ringw can you rebase and update for comments above? in particular don't change code which is not getting tested (e.g. the dtype issue you reported), which might be valid but need a case for it. |
@ringw can you update according to comments |
I've rebased. The dtype change is necessary to fix this issue with an int array. Because it's inside of _apply_standard, the only test I can do (with the code as is) is to make sure the output of _apply_standard is correct (which I did). If it fails to create an empty Series of the right shape and dtype, then it will either raise an exception, or catch an exception and do the apply without reduce, in which case the output will not be a Series of dicts as expected. The test_apply_dict test will catch either of these problems. The only other option is to refactor those 4 lines as a utility method, so that I can test their output directly. If you think that's necessary, then where should that method go? There are other cases in the codebase where a Series is initialized with an array created with np.empty, so if that becomes a utility method, I think it would make sense to replace all of those instances with a call to that method. |
What issue with an int array. ok where's the test for that? show a reproducible example that shows this behavior. |
The test I added, test_apply_dict, tests a string DataFrame and an int DataFrame. Using the 2 arrays in the test:
Previous behavior:
New behavior:
The first fix (in pandas/src/reduce.pyx) is necessary to get the correct behavior with object and float dtypes. The dtype fix is necessary to make the behavior consistent with all other dtypes. |
merged via 59da781 thanks! fyi, doing this is quite inefficient, so hope you have a really good reason for this. |
closes #8735
Previously, when the function argument to DataFrame.apply returned a dict, the reduction code would mistake its "values" property for the values of a Pandas Series, and return a Series of "values" instance methods. The new check ensures that the "values" property is an np.ndarray.
Previous behavior:
New behavior:
If reduce=False, the result is a DataFrame (this did not change):