Skip to content

ENH: Fixed DF.apply for functions returning a dict, #8735 #10740

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

ringw
Copy link
Contributor

@ringw ringw commented Aug 3, 2015

closes #8735

Previously, when the function argument to DataFrame.apply returned a dict, the reduction code would mistake its "values" property for the values of a Pandas Series, and return a Series of "values" instance methods. The new check ensures that the "values" property is an np.ndarray.

Previous behavior:

 In [1]: A = DataFrame([['foo', 'bar'], ['spam', 'eggs']])

 In [2]: A.apply(lambda c: c.to_dict(), reduce=True)
 Out[2]: 
 0    <built-in method values of dict object at 0x7f...
 1    <built-in method values of dict object at 0x7f...
 dtype: object

New behavior:

 In [1]: A = DataFrame([['foo', 'bar'], ['spam', 'eggs']])

 In [2]: A.apply(lambda c: c.to_dict(), reduce=True)
 Out[2]:
 0    {0: u'foo', 1: u'spam'}
 1    {0: u'bar', 1: u'eggs'}
 dtype: object

If reduce=False, the result is a DataFrame (this did not change):

 In [3]: A.apply(lambda c: c.to_dict(), reduce=False)
 Out[3]:
       0     1
 0   foo   bar
 1  spam  eggs

@@ -142,6 +142,46 @@ Other enhancements

- ``pd.pivot`` will now allow passing index as ``None`` (:issue:`3962`).

- ``DataFrame.apply`` will return a Series of dicts if the passed function returns a dict and ``reduce=True`` (:issue:`8735`).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just the issue line is fine. This is not a very common case.

@ringw
Copy link
Contributor Author

ringw commented Aug 3, 2015

Thanks, fixed.

@jreback jreback changed the title ENH: Fixed DF.apply for functions returning a dict (closes #8735) ENH: Fixed DF.apply for functions returning a dict, #8735 Aug 3, 2015
@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 3, 2015
@jreback jreback added this to the 0.17.0 milestone Aug 3, 2015
@jreback
Copy link
Contributor

jreback commented Aug 3, 2015

pls add a test cases, including reduce=True|False|None

@ringw
Copy link
Contributor Author

ringw commented Aug 4, 2015

Done! I found another issue when the original DataFrame is of dtype int, and I added tests for int and str dtypes.

# Unlike filling with NA, this works for any dtype
index = self._get_axis(axis)
empty_arr = np.empty(len(index), dtype=values.dtype)
dummy = Series(empty_arr, index=self._get_axis(axis),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just pass index here

@ringw
Copy link
Contributor Author

ringw commented Aug 5, 2015

The issue was with dtype=int (and maybe others) where NaN can't be casted to the dtype. The original code raised an exception when trying to do Series(NA, dtype=int). If I do Series(index=..., dtype=int), it actually returns a Series of NaNs with dtype=float. I'm assuming I can't just pass lib.reduce a dummy series of the wrong dtype, so I think I need to get dummy values of the right dtype using np.empty or something similar.

I updated the test to check one object DataFrame and one int DataFrame. The dtype issue happens with int, but not object, so I think that covers the relevant cases.

Also, I noticed the Travis CI build failed. The only error is: AssertionError: Caused unexpected warning(s): ['ResourceWarning']., so I think maybe their system was just running out of resources.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2015

that Travis error is legit
it means that something is asserting inside s deprecation somewhere

run your tests locally and see if u can reproduce as its related to the change

@ringw
Copy link
Contributor Author

ringw commented Aug 5, 2015

I tried "nosetests pandas" and I get "OK (SKIP=436)". The error is in the STATA format I/O tests, which makes me think it could be a random issue. It doesn't look like pandas/io/stata.py uses apply on a DataFrame at all. Are you able to rerun the Travis build?

index = self._get_axis(axis)
empty_arr = np.empty(len(index), dtype=values.dtype)
dummy = Series(empty_arr, index=self._get_axis(axis),
dtype=values.dtype)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a test that will isolate this change, as the code is wrapped in DF._apply_standard, and the dummy array isn't returned. Part of the previous issue was that all exceptions were caught, so when it failed to create the dummy array, this branch silently failed. I took the dummy generation code outside of the try block, so if it fails, it will raise an exception in the one test I added. Otherwise, I'm not sure what else I can do to test it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well something must have caused you to change it. what was that? The point is we cannot make changes that are not tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is the previous code couldn't create an empty series of ints. I guess I could make this into a Series.empty_like class method and add tests for that, then replace this block with a single call to that method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or I could fix Series(index=..., dtype=int) to return a series of 0's or something, but I would have to make sure there's a well-defined empty/zero value for any dtype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before you actually change anything. a test would be helpful.

@jreback
Copy link
Contributor

jreback commented Aug 20, 2015

@ringw can you rebase and update for comments above? in particular don't change code which is not getting tested (e.g. the dtype issue you reported), which might be valid but need a case for it.

@jreback
Copy link
Contributor

jreback commented Aug 26, 2015

@ringw can you update according to comments

@ringw
Copy link
Contributor Author

ringw commented Aug 27, 2015

I've rebased. The dtype change is necessary to fix this issue with an int array. Because it's inside of _apply_standard, the only test I can do (with the code as is) is to make sure the output of _apply_standard is correct (which I did). If it fails to create an empty Series of the right shape and dtype, then it will either raise an exception, or catch an exception and do the apply without reduce, in which case the output will not be a Series of dicts as expected. The test_apply_dict test will catch either of these problems.

The only other option is to refactor those 4 lines as a utility method, so that I can test their output directly. If you think that's necessary, then where should that method go? There are other cases in the codebase where a Series is initialized with an array created with np.empty, so if that becomes a utility method, I think it would make sense to replace all of those instances with a call to that method.

@jreback
Copy link
Contributor

jreback commented Aug 28, 2015

I've rebased. The dtype change is necessary to fix this issue with an int array.

What issue with an int array. ok where's the test for that? show a reproducible example that shows this behavior.

@ringw
Copy link
Contributor Author

ringw commented Aug 28, 2015

The test I added, test_apply_dict, tests a string DataFrame and an int DataFrame. Using the 2 arrays in the test:

>>> A = DataFrame([['foo', 'bar'], ['spam', 'eggs']])
>>> B = DataFrame([[0, 1], [2, 3]])

Previous behavior:

>>> A.apply(lambda x: x.to_dict())
0    <built-in method values of dict object at 0x7f...
1    <built-in method values of dict object at 0x7f...
dtype: object
>>> B.apply(lambda x: x.to_dict())
   0  1
0  0  1
1  2  3

New behavior:

>>> A.apply(lambda x: x.to_dict())
0    {0: u'foo', 1: u'spam'}
1    {0: u'bar', 1: u'eggs'}
dtype: object
>>> B.apply(lambda x: x.to_dict())
0    {0: 0, 1: 2}
1    {0: 1, 1: 3}
dtype: object

The first fix (in pandas/src/reduce.pyx) is necessary to get the correct behavior with object and float dtypes. The dtype fix is necessary to make the behavior consistent with all other dtypes.

@jreback
Copy link
Contributor

jreback commented Aug 28, 2015

merged via 59da781

thanks!

fyi, doing this is quite inefficient, so hope you have a really good reason for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strange output from DataFrame.apply when applied func creates a dict
2 participants