ENH: Fixed DF.apply for functions returning a dict, #8735 #10740

ringw · 2015-08-03T20:05:56Z

Previously, when the function passed to DataFrame.apply returned a dict, the reduction code mistook the dict's "values" attribute (a bound method) for the values array of a pandas Series, and returned a Series of "values" bound methods. The new check ensures that the "values" attribute is an np.ndarray before it is used.

Previous behavior:

 In [1]: A = DataFrame([['foo', 'bar'], ['spam', 'eggs']])

 In [2]: A.apply(lambda c: c.to_dict(), reduce=True)
 Out[2]: 
 0    <built-in method values of dict object at 0x7f...
 1    <built-in method values of dict object at 0x7f...
 dtype: object

New behavior:

 In [1]: A = DataFrame([['foo', 'bar'], ['spam', 'eggs']])

 In [2]: A.apply(lambda c: c.to_dict(), reduce=True)
 Out[2]:
 0    {0: u'foo', 1: u'spam'}
 1    {0: u'bar', 1: u'eggs'}
 dtype: object

If reduce=False, the result is a DataFrame (this did not change):

 In [3]: A.apply(lambda c: c.to_dict(), reduce=False)
 Out[3]:
       0     1
 0   foo   bar
 1  spam  eggs
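The failure mode described above can be reproduced outside pandas. The following is a minimal sketch, not the actual code in pandas/src/reduce.pyx: the helper names unwrap_naive and unwrap_checked are hypothetical, and only illustrate why duck-typing on a "values" attribute misfires for a dict while an np.ndarray check does not.

```python
import numpy as np
import pandas as pd


def unwrap_naive(res):
    # Hypothetical helper: treats anything with a ``values`` attribute
    # as Series-like. A dict also has ``values``, but it is a method.
    if hasattr(res, "values"):
        return res.values
    return res


def unwrap_checked(res):
    # Hypothetical helper: only unwrap when ``values`` is really an ndarray.
    values = getattr(res, "values", None)
    if isinstance(values, np.ndarray):
        return values
    return res


column_as_dict = {0: "foo", 1: "spam"}

naive_result = unwrap_naive(column_as_dict)      # the bound method dict.values
checked_result = unwrap_checked(column_as_dict)  # the dict itself, untouched
series_result = unwrap_checked(pd.Series([1, 2]))  # a real ndarray
```

With the naive helper the dict's bound method leaks into the result, which matches the "built-in method values of dict object" entries in the old output.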

jreback · 2015-08-03T20:13:35Z

doc/source/whatsnew/v0.17.0.txt

@@ -142,6 +142,46 @@ Other enhancements

 - ``pd.pivot`` will now allow passing index as ``None`` (:issue:`3962`).

+- ``DataFrame.apply`` will return a Series of dicts if the passed function returns a dict and ``reduce=True`` (:issue:`8735`).
+


just the issue line is fine. This is not a very common case.

ringw · 2015-08-03T20:22:21Z

Thanks, fixed.

jreback · 2015-08-03T21:34:03Z

pls add test cases, including reduce=True|False|None

ringw · 2015-08-04T01:02:18Z

Done! I found another issue when the original DataFrame is of dtype int, and I added tests for int and str dtypes.

jreback · 2015-08-04T21:29:33Z

pandas/core/frame.py

+                # Unlike filling with NA, this works for any dtype
+                index = self._get_axis(axis)
+                empty_arr = np.empty(len(index), dtype=values.dtype)
+                dummy = Series(empty_arr, index=self._get_axis(axis),


you can just pass index here

ringw · 2015-08-05T01:31:20Z

The issue was with dtype=int (and maybe others) where NaN can't be cast to the dtype. The original code raised an exception when trying to do Series(NA, dtype=int). If I do Series(index=..., dtype=int), it actually returns a Series of NaNs with dtype=float. I'm assuming I can't just pass lib.reduce a dummy series of the wrong dtype, so I think I need to get dummy values of the right dtype using np.empty or something similar.

I updated the test to check one object DataFrame and one int DataFrame. The dtype issue happens with int, but not object, so I think that covers the relevant cases.
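A small sketch of the dtype point, assuming only NumPy and the public Series constructor (the index and the list of dtypes are made up for illustration): NaN has no integer representation, while an array from np.empty keeps whatever dtype is requested, so a dummy Series built from it matches the frame's dtype.

```python
import numpy as np
import pandas as pd

# NaN is a float; converting it to an integer fails.
try:
    int(np.nan)
    nan_fits_in_int = True
except ValueError:
    nan_fits_in_int = False

index = pd.Index(["a", "b", "c"])  # illustrative index, not from the PR

dummy_dtypes = {}
for dtype in (np.int64, np.float64, object):
    # The contents are uninitialized (arbitrary); only shape and dtype matter
    # for a placeholder that will be overwritten before use.
    empty_arr = np.empty(len(index), dtype=dtype)
    dummy = pd.Series(empty_arr, index=index, dtype=dtype)
    dummy_dtypes[np.dtype(dtype)] = dummy.dtype
```

Because np.empty does not initialize memory, the dummy values must never be read as data; this only works when the placeholder is overwritten before use.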

Also, I noticed the Travis CI build failed. The only error is: AssertionError: Caused unexpected warning(s): ['ResourceWarning']., so I think maybe their system was just running out of resources.

jreback · 2015-08-05T01:39:34Z

that Travis error is legit
it means that something is asserting inside a deprecation somewhere

run your tests locally and see if u can reproduce, as it's related to the change

ringw · 2015-08-05T03:39:51Z

I tried "nosetests pandas" and I get "OK (SKIP=436)". The error is in the STATA format I/O tests, which makes me think it could be a random issue. It doesn't look like pandas/io/stata.py uses apply on a DataFrame at all. Are you able to rerun the Travis build?

ringw · 2015-08-06T12:12:51Z

pandas/core/frame.py

+            index = self._get_axis(axis)
+            empty_arr = np.empty(len(index), dtype=values.dtype)
+            dummy = Series(empty_arr, index=self._get_axis(axis),
+                           dtype=values.dtype)


I can't think of a test that will isolate this change, as the code is wrapped in DF._apply_standard, and the dummy array isn't returned. Part of the previous issue was that all exceptions were caught, so when it failed to create the dummy array, this branch silently failed. I took the dummy generation code outside of the try block, so if it fails, it will raise an exception in the one test I added. Otherwise, I'm not sure what else I can do to test it.

jreback · 2015-08-06

well something must have caused you to change it. What was that? The point is we cannot make changes that are not tested.

ringw · 2015-08-06

The problem is the previous code couldn't create an empty series of ints. I guess I could make this into a Series.empty_like class method and add tests for that, then replace this block with a single call to that method.

Or I could fix Series(index=..., dtype=int) to return a series of 0's or something, but I would have to make sure there's a well-defined empty/zero value for any dtype.

jreback · 2015-08-06

before you actually change anything, a test would be helpful.
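The point about the try block can be shown in isolation. This is a sketch under stated assumptions, not the code in DataFrame._apply_standard: the function names and the fast_path/slow_path callables are hypothetical stand-ins for the reduce branch and the non-reduce fallback.

```python
def apply_setup_inside_try(make_dummy, fast_path, slow_path):
    # Setup lives inside the broad ``try``: a failure while building the
    # dummy is swallowed and the slow path runs silently.
    try:
        dummy = make_dummy()
        return fast_path(dummy)
    except Exception:
        return slow_path()


def apply_setup_outside_try(make_dummy, fast_path, slow_path):
    # Setup lives outside the ``try``: a failure while building the dummy
    # propagates, so a test exercising this branch will notice it.
    dummy = make_dummy()
    try:
        return fast_path(dummy)
    except Exception:
        return slow_path()


def broken_dummy():
    # Stand-in for "could not create an empty Series of the right dtype".
    raise ValueError("cannot build dummy")


silent_result = apply_setup_inside_try(
    broken_dummy, lambda d: "fast", lambda: "slow"
)

try:
    apply_setup_outside_try(broken_dummy, lambda d: "fast", lambda: "slow")
    setup_error_surfaced = False
except ValueError:
    setup_error_surfaced = True
```

In the first variant the caller only sees the fallback result, which is how a dummy-construction failure could go unnoticed.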

jreback · 2015-08-20T18:32:22Z

@ringw can you rebase and update for comments above? in particular don't change code which is not getting tested (e.g. the dtype issue you reported), which might be valid but needs a test case for it.

jreback · 2015-08-26T01:25:36Z

@ringw can you update according to comments

ringw · 2015-08-27T14:38:55Z

I've rebased. The dtype change is necessary to fix this issue with an int array. Because it's inside of _apply_standard, the only test I can do (with the code as is) is to make sure the output of _apply_standard is correct (which I did). If it fails to create an empty Series of the right shape and dtype, then it will either raise an exception, or catch an exception and do the apply without reduce, in which case the output will not be a Series of dicts as expected. The test_apply_dict test will catch either of these problems.

The only other option is to refactor those 4 lines as a utility method, so that I can test their output directly. If you think that's necessary, then where should that method go? There are other cases in the codebase where a Series is initialized with an array created with np.empty, so if that becomes a utility method, I think it would make sense to replace all of those instances with a call to that method.

jreback · 2015-08-28T12:53:05Z

> I've rebased. The dtype change is necessary to fix this issue with an int array.

What issue with an int array? OK, where's the test for that? Show a reproducible example that shows this behavior.

ringw · 2015-08-28T13:47:25Z

The test I added, test_apply_dict, tests a string DataFrame and an int DataFrame. Using the 2 arrays in the test:

>>> A = DataFrame([['foo', 'bar'], ['spam', 'eggs']])
>>> B = DataFrame([[0, 1], [2, 3]])

Previous behavior:

>>> A.apply(lambda x: x.to_dict())
0    <built-in method values of dict object at 0x7f...
1    <built-in method values of dict object at 0x7f...
dtype: object
>>> B.apply(lambda x: x.to_dict())
   0  1
0  0  1
1  2  3

New behavior:

>>> A.apply(lambda x: x.to_dict())
0    {0: u'foo', 1: u'spam'}
1    {0: u'bar', 1: u'eggs'}
dtype: object
>>> B.apply(lambda x: x.to_dict())
0    {0: 0, 1: 2}
1    {0: 1, 1: 3}
dtype: object

The first fix (in pandas/src/reduce.pyx) is necessary to get the correct behavior with object and float dtypes. The dtype fix is necessary to make the behavior consistent with all other dtypes.
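A sketch of the kind of check test_apply_dict performs, not the test as merged. It assumes a later pandas release in which the reduce keyword was replaced by result_type (so result_type="reduce" stands in for reduce=True); the helper column_dicts is hypothetical.

```python
import pandas as pd


def column_dicts(frame):
    # Hypothetical helper: one dict per column, collected into a Series.
    # ``result_type="reduce"`` is assumed to be the newer spelling of
    # ``reduce=True`` used in this thread.
    return frame.apply(lambda col: col.to_dict(), result_type="reduce")


A = pd.DataFrame([["foo", "bar"], ["spam", "eggs"]])  # object dtype
B = pd.DataFrame([[0, 1], [2, 3]])                    # int dtype

res_a = column_dicts(A)
res_b = column_dicts(B)

# Both dtypes should take the same path and yield a Series of dicts.
same_shape = isinstance(res_a, pd.Series) and isinstance(res_b, pd.Series)
```

Checking an object frame and an int frame side by side is what exposes the dtype-dependent behavior discussed above.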

jreback · 2015-08-28T18:17:41Z

merged via 59da781

thanks!

fyi, doing this is quite inefficient, so hope you have a really good reason for this.
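If the goal is only a mapping from each column to a dict of its values, one possible way to avoid calling a Python function per column is DataFrame.to_dict(), which with its default orientation returns a dict of column -> {index -> value}. This is an assumption about the use case, not something suggested in the thread.

```python
import pandas as pd

A = pd.DataFrame([["foo", "bar"], ["spam", "eggs"]])

# Default orientation: {column -> {index -> value}}
nested = A.to_dict()

# Wrapping in a Series gives the same layout as the apply-based result:
# one dict per column label.
as_series = pd.Series(nested)
```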

jreback reviewed Aug 3, 2015

ringw force-pushed the apply-dict-fix branch from 4eb61af to 44c7e2e Compare August 3, 2015 20:22

jreback changed the title ~~ENH: Fixed DF.apply for functions returning a dict (closes #8735)~~ ENH: Fixed DF.apply for functions returning a dict, #8735 Aug 3, 2015

jreback added the Bug, Reshaping, and Algos labels Aug 3, 2015

jreback added this to the 0.17.0 milestone Aug 3, 2015

ringw force-pushed the apply-dict-fix branch from 44c7e2e to bf2a120 Compare August 4, 2015 01:00

jreback reviewed Aug 4, 2015

ringw reviewed Aug 6, 2015

ringw added 2 commits August 27, 2015 10:26

ENH: Fixed DF.apply for functions returning a dict (closes pandas-dev…

f235902

…#8735)

Avoid catching exceptions unnecessarily in DF.apply

3871a6b

ringw force-pushed the apply-dict-fix branch from 32b895b to 3871a6b Compare August 27, 2015 14:27

jreback closed this Aug 28, 2015

AlexHentschel mentioned this pull request Dec 14, 2017

DataFrame.apply returns NaN if DataFrame contains datetime column #18775

Closed
