
BUG: Allow 'apply' to be used with non-numpy-dtype DataFrames #12284


Closed
wants to merge 1 commit into master from df_apply_tz_aware

Conversation

@gfyoung gfyoung commented Feb 10, 2016

Addresses the issue in #12244, in which a non-numpy dtype for DataFrame.values causes a TypeError to be thrown in the reduce == True case of DataFrame.apply. Resolved by first passing DataFrame.values through Series initialization and taking its values attribute, which is an ndarray and hence will have a valid dtype. Note that the output of apply will still have the original dtype.
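The failure mode can be reproduced with a minimal sketch (the column name and dates below are illustrative, not taken from the original report):

```python
import pandas as pd

# A DataFrame whose column has a non-numpy (tz-aware) dtype, as in
# gh-12244; before this fix, the reduce=True path of DataFrame.apply
# could raise a TypeError for such frames.
df = pd.DataFrame({'a': pd.date_range('2016-01-01', periods=3,
                                      tz='US/Eastern')})
result = df.apply(lambda x: x)
assert result.equals(df)  # the original tz-aware dtype is preserved
```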

values = self.values
# The 'values' attribute of a pandas Series
# is a numpy ndarray, so the dtype will be
# guaranteed to be valid when passed into
Contributor

No; instead, where np.empty is used, use pandas.core._sanitize_array.create_from_value (probably should strip that out into a private helper function). This provides compat for things like this.

Member Author

It will not help in this case because, if I'm not mistaken, this logic will just return the same data structure with the exact same dtype that we are trying to avoid in the context of the bug.

Member Author

Nevertheless, as has become painfully clear with Travis just now, an alternative solution will be needed, since casting to a Series first does not work well with non-1-dimensional arrays.

@gfyoung gfyoung force-pushed the df_apply_tz_aware branch 2 times, most recently from 3af9d30 to d01f9a5 Compare February 11, 2016 01:06
@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype labels Feb 11, 2016
@@ -400,3 +400,10 @@ def test_applymap(self):
result = df.applymap(str)
for f in ['datetime', 'timedelta']:
self.assertEqual(result.loc[0, f], str(df.loc[0, f]))

# See gh-12244
def test_apply_non_numpy_dtype(self):
Contributor

Can you add a test for categorical as well (another extension type)?

Member Author

Done.
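A hedged sketch of what such a categorical test might look like (the column name and values are illustrative, not the exact test added in the PR):

```python
import pandas as pd

# Categorical is another extension dtype that is not a numpy dtype
df = pd.DataFrame({'c': pd.Categorical(['a', 'b', 'c'])})
result = df.apply(lambda x: x)
assert result.equals(df)
assert str(result['c'].dtype) == 'category'  # dtype survives apply
```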

gfyoung commented Feb 11, 2016

In light of #12291, I think there could be a better way to handle the Series dummy creation that will also address the issue in #12244 that I am testing right now.

jreback commented Feb 11, 2016

See my very first comment: there is a way to do this in _sanitize_array, but it needs to be exposed as a private function.

gfyoung commented Feb 11, 2016

See my response to that comment of yours: _sanitize_array, at least when I tried implementing it, did not do what I needed it to do, which is return an object with a valid numpy dtype.

gfyoung commented Feb 11, 2016

Also, there is a possibility that the numpy array creation is not even necessary if we can just create an empty Series object right from the get-go, rendering any of this discussion about array sanitization (which in some ways implies that the array is 'dirty' when it really isn't in this context) moot.

jreback commented Feb 11, 2016

You don't need to make this use numpy at all.

gfyoung commented Feb 11, 2016

Yes, exactly. That's what I am testing right now. Removing that np.empty dependency would be even better than what I had initially done.

# before:
empty_arr = np.empty(len(index), dtype=values.dtype)
dummy = Series(empty_arr, index=self._get_axis(axis),
               dtype=values.dtype)
# after:
dummy = Series(index=index, dtype=values.dtype)
Contributor

just use self.dtype

Member Author

I don't believe DataFrame objects have a dtype attribute, if I'm not mistaken.

Contributor

of course you have to handle

Member Author

I don't quite understand your last comment. Could you elaborate?

gfyoung commented Feb 11, 2016

Test failures are due to the fact that Series initialization without any data defaults to np.float64 for all numerical types. Reverting changes. See also my comment in #12291.
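The dtype behavior being relied on can be sketched as follows (illustrative values, not the PR's actual code): seeding the dummy Series with an explicit numpy array preserves an integer dtype, whereas, per the comment above, a data-less Series in the pandas of that era defaulted to np.float64.

```python
import numpy as np
import pandas as pd

# Seeding the dummy with an explicit (even uninitialized) array
# keeps the requested integer dtype intact
empty_arr = np.empty(3, dtype=np.int64)
dummy = pd.Series(empty_arr, index=range(3), dtype=np.int64)
assert dummy.dtype == np.int64
# By contrast, a Series built with no data at all defaulted to
# float64 at the time, which broke integer apply results
```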

@gfyoung gfyoung force-pushed the df_apply_tz_aware branch 2 times, most recently from 1bec774 to 2190480 Compare February 11, 2016 13:13
jreback commented Feb 11, 2016

you can do something like: getattr(self, 'dtype', self.values.dtype)

gfyoung commented Feb 11, 2016

I'm confused as to why I would need to do this. Isn't that just more complicated? Also, using the code provided in the issue, it just defaults to self.values.dtype, which is what I don't want.

jreback commented Feb 11, 2016

I think the issue is that the code in reduce.pyx cannot deal with non-numpy dtypes. In theory it should, but it is heavily numpy-based currently. I think #11970 will eventually have to deal with this. Not sure exactly how we want to track a test/fix like this. In other words, where it is makes sense, but I want to remember to deal with this issue.

@wesm ?

gfyoung commented Feb 11, 2016

lib.reduce can handle non-numpy dtypes as far as I can tell: I attempted to simplify the code by removing my original check for an ExtensionDtype and changing the dummy Series initialization to dummy = Series(index=index, dtype=values.dtype), and the problematic code brought up in the issue behaved just fine. It's the casting to np.float64 for integer dtypes that I mentioned earlier that prevented me from keeping it in this PR.

gfyoung commented Feb 11, 2016

Given that tests are passing and the fact that I've already dealt with several other alternative designs that couldn't quite pass through the testing phase, I think this PR should be good to merge unless there are other suggestions or input?

jreback commented Feb 11, 2016

Please add a whatsnew note; ping when green.

@jreback jreback added this to the 0.18.0 milestone Feb 11, 2016
jreback commented Feb 11, 2016

Also, I think we should add a test where we actually do something in the apply (this only makes sense with a single tz-aware dtype, but that's OK), e.g.
df.apply(lambda x: x + pd.Timedelta('1day'))

just as a confirming test
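A minimal version of such a confirming test might look like this (the column name and dates are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': pd.date_range('2016-01-01', periods=3,
                                      tz='US/Eastern')})
# Shift every timestamp by one day via apply; the tz-aware dtype
# should survive the round trip
result = df.apply(lambda x: x + pd.Timedelta('1day'))
assert result.loc[0, 'a'] == df.loc[0, 'a'] + pd.Timedelta('1day')
```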

gfyoung commented Feb 11, 2016

Changes have been made, though it looks like it will be some time before my build gets under way. On a separate note, how do I subscribe to the pandas developer list (is it the Google group or the Python email list)? Also, is there a way for me to get updates when PRs and commits have been made to the master branch?

gfyoung commented Feb 12, 2016

@jreback : Finally! Travis is happy, and changes have all been made. Ready to merge.

# as demonstrated in gh-12244
if (reduce and ExtensionDtype.is_dtype(
self.values.dtype)):
reduce = False
Contributor

Move this into the block below (which only activates if reduce).

The reason is that a mixed-dtype frame could trigger .values twice.

Member Author

Done.
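For reference, the non-numpy-dtype condition that this check targets can be exercised through today's public API (is_extension_array_dtype is the modern public spelling; the PR itself called ExtensionDtype.is_dtype internally, and this sketch is not the PR's code):

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

# Extension dtypes (categorical, tz-aware datetimes) are not numpy dtypes
assert is_extension_array_dtype(pd.Categorical(['a', 'b']))
assert is_extension_array_dtype(
    pd.Series(pd.date_range('2016-01-01', periods=2, tz='UTC')))
# A plain numpy array is not an extension type
assert not is_extension_array_dtype(pd.Series([1, 2]).to_numpy())
```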

jreback commented Feb 12, 2016

lgtm. I'll rebase when I merge. For future reference, I put random blank lines in the Bug Fixes whatsnew section; if you put the whatsnew note there, it won't have conflicts with other people.

gfyoung commented Feb 12, 2016

@jreback : Okay, good to know! Will rebase. Btw, is there any way for me to get notifications when commits (or merges) are made to master?

Fixes bug in DataFrame.apply by avoiding reducing DataFrames
whose values dtype is not a numpy dtype.

Closes gh-12244.
jreback commented Feb 12, 2016

Yes, the issue and PR will be closed, and GitHub will notify you.

gfyoung commented Feb 12, 2016

Sorry, clarification: I am talking about other people's PRs and commits merged into master. I didn't realize I had a merge conflict until I saw the GitHub notification with your comment.

jreback commented Feb 12, 2016

If you look at the PRs, it will tell you if it's clean to merge.

You can also watch the repo to get notifications; I don't think it's very granular, just on or off or mentions only.

gfyoung commented Feb 12, 2016

@jreback : Rebased and Travis gives the green light.

@jorisvandenbossche (Member)

@gfyoung to respond to your other question about the mailing list: there are indeed two. The pydata Google group (https://groups.google.com/forum/?fromgroups#!forum/pydata) is a more general mailing list, also with user questions for the wider ecosystem, and the pandas-dev mailing list (https://mail.python.org/mailman/listinfo/pandas-dev) is more focused on pandas development. You can subscribe to both on the web interface.

@jreback jreback closed this in 370f45f Feb 12, 2016
jreback commented Feb 12, 2016

thanks!
