
BUG: Allow 'apply' to be used with non-numpy-dtype DataFrames #12284


Closed
wants to merge 1 commit into master from df_apply_tz_aware

Conversation

@gfyoung gfyoung commented Feb 10, 2016

Addresses the issue in #12244, in which a non-numpy dtype for DataFrame.values causes a TypeError to be thrown in the reduce == True case of DataFrame.apply. Resolved by first passing DataFrame.values through Series initialization and taking its values attribute, which is an ndarray and hence will have a valid dtype. Note that the output of apply will still have the original dtype.
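The failure mode can be reproduced with a minimal sketch (the column name and dates below are illustrative, not taken from the original report):

```python
import pandas as pd

# A DataFrame whose column has a non-numpy (tz-aware) dtype, as in
# gh-12244; before this fix, the reduce=True path of DataFrame.apply
# could raise a TypeError for such frames.
df = pd.DataFrame({'a': pd.date_range('2016-01-01', periods=3,
                                      tz='US/Eastern')})
result = df.apply(lambda x: x)
assert result.equals(df)  # the original tz-aware dtype is preserved
```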

values = self.values
# The 'values' attribute of a pandas Series
# is a numpy ndarray, so the dtype will be
# guaranteed to be valid when passed into
Contributor

No; instead, where np.empty is used, use pandas.core._sanitize_array.create_from_value (probably should strip that out into a private helper function). This provides compat for things like this.

Member Author

It will not help in this case because, if I'm not mistaken, this logic will just return the same data structure with the exact same dtype that we are trying to avoid in the context of the bug.

Member Author

Nevertheless, as has become painfully clear with Travis just now, an alternative solution will be needed, since casting to a Series first does not work well with non-1-dimensional arrays.

@gfyoung gfyoung force-pushed the df_apply_tz_aware branch 2 times, most recently from 3af9d30 to d01f9a5 Compare February 11, 2016 01:06
@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype labels Feb 11, 2016
@@ -400,3 +400,10 @@ def test_applymap(self):
result = df.applymap(str)
for f in ['datetime', 'timedelta']:
self.assertEqual(result.loc[0, f], str(df.loc[0, f]))

# See gh-12244
def test_apply_non_numpy_dtype(self):
Contributor

Can you add a test for categorical as well (another extension type)?

Member Author

Done.
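A hedged sketch of what such a categorical test might look like (the column name and values are illustrative, not the exact test added in the PR):

```python
import pandas as pd

# Categorical is another extension dtype that is not a numpy dtype
df = pd.DataFrame({'c': pd.Categorical(['a', 'b', 'c'])})
result = df.apply(lambda x: x)
assert result.equals(df)
assert str(result['c'].dtype) == 'category'  # dtype survives apply
```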

gfyoung commented Feb 11, 2016

In light of #12291, I think there could be a better way to handle the Series dummy creation that will also address the issue in #12244 that I am testing right now.

jreback commented Feb 11, 2016

See my very first comment: there is a way to do this in _sanitize_array, but it needs to be exposed as a private function.

gfyoung commented Feb 11, 2016

See my response to that comment of yours: _sanitize_array, at least when I tried implementing it, did not do what I needed it to do, which is return an object with a valid numpy dtype.

gfyoung commented Feb 11, 2016

Also, there is a possibility that the numpy array creation is not even necessary if we can just create an empty Series object right from the get-go, rendering any of this discussion about array sanitization (which in some ways implies that the array is 'dirty' when it really isn't in this context) moot.

jreback commented Feb 11, 2016

You don't need to make this use numpy at all.

gfyoung commented Feb 11, 2016

Yes, exactly. That's what I am testing right now. Removing that np.empty dependency would be even better than what I had initially done.

# before:
empty_arr = np.empty(len(index), dtype=values.dtype)
dummy = Series(empty_arr, index=self._get_axis(axis),
               dtype=values.dtype)
# after:
dummy = Series(index=index, dtype=values.dtype)
Contributor

just use self.dtype

Member Author

I don't believe DataFrame objects have a dtype attribute, if I'm not mistaken.

Contributor

of course you have to handle

Member Author

I don't quite understand your last comment. Could you elaborate?

gfyoung commented Feb 11, 2016

Test failures are due to the fact that Series initialization without any data defaults to np.float64 for all numerical types. Reverting changes. See also my comment in #12291.
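The dtype behavior being relied on can be sketched as follows (illustrative values, not the PR's actual code): seeding the dummy Series with an explicit numpy array preserves an integer dtype, whereas, per the comment above, a data-less Series in the pandas of that era defaulted to np.float64.

```python
import numpy as np
import pandas as pd

# Seeding the dummy with an explicit (even uninitialized) array
# keeps the requested integer dtype intact
empty_arr = np.empty(3, dtype=np.int64)
dummy = pd.Series(empty_arr, index=range(3), dtype=np.int64)
assert dummy.dtype == np.int64
# By contrast, a Series built with no data at all defaulted to
# float64 at the time, which broke integer apply results
```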

@gfyoung gfyoung force-pushed the df_apply_tz_aware branch 2 times, most recently from 1bec774 to 2190480 Compare February 11, 2016 13:13
jreback commented Feb 11, 2016

you can do something like: getattr(self, 'dtype', self.values.dtype)

gfyoung commented Feb 11, 2016

I'm confused as to why I would need to do this. Isn't that just more complicated? Also, using the code provided in the issue, it just defaults to self.values.dtype, which is what I don't want.

jreback commented Feb 11, 2016

I think the issue is that the code in reduce.pyx cannot deal with non-numpy dtypes. In theory it should, but it is heavily numpy-based currently. I think #11970 will eventually have to deal with this. Not sure exactly how we want to track a test/fix like this. In other words, where it is makes sense, but I want to remember to deal with this issue.

@wesm ?

gfyoung commented Feb 11, 2016

lib.reduce can handle non-numpy dtypes as far as I can tell: I attempted to simplify the code by removing my original check for an ExtensionDtype and changing the dummy Series initialization to dummy = Series(index=index, dtype=values.dtype), and the problematic code brought up in the issue behaved just fine. It's the casting to np.float64 for integer dtypes that I mentioned earlier that prevented me from keeping it in this PR.

gfyoung commented Feb 11, 2016

Given that tests are passing and the fact that I've already dealt with several other alternative designs that couldn't quite pass through the testing phase, I think this PR should be good to merge unless there are other suggestions or input?

jreback commented Feb 11, 2016

Please add a whatsnew note; ping when green.

@jreback jreback added this to the 0.18.0 milestone Feb 11, 2016
jreback commented Feb 11, 2016

Also, I think we should add a test where we actually do something in the apply (this only makes sense with a single tz-aware dtype, but that's OK), e.g.
df.apply(lambda x: x + pd.Timedelta('1day'))

just as a confirming test
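A minimal version of such a confirming test might look like this (the column name and dates are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': pd.date_range('2016-01-01', periods=3,
                                      tz='US/Eastern')})
# Shift every timestamp by one day via apply; the tz-aware dtype
# should survive the round trip
result = df.apply(lambda x: x + pd.Timedelta('1day'))
assert result.loc[0, 'a'] == df.loc[0, 'a'] + pd.Timedelta('1day')
```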

gfyoung commented Feb 11, 2016

Changes have been made, though it looks like it will be some time before my build gets under way. On a separate note, how do I subscribe to the pandas developer list (is it the Google group or the Python email list)? Also, is there a way for me to get updates when PRs and commits have been made to the master branch?

gfyoung commented Feb 12, 2016

@jreback : Finally! Travis is happy, and changes have all been made. Ready to merge.

# as demonstrated in gh-12244
if (reduce and ExtensionDtype.is_dtype(
self.values.dtype)):
reduce = False
Contributor

Move this into the block below (which only activates if reduce).

The reason is that a mixed-dtype frame could trigger .values twice.

Member Author

Done.
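For reference, the non-numpy-dtype condition that this check targets can be exercised through today's public API (is_extension_array_dtype is the modern public spelling; the PR itself called ExtensionDtype.is_dtype internally, and this sketch is not the PR's code):

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

# Extension dtypes (categorical, tz-aware datetimes) are not numpy dtypes
assert is_extension_array_dtype(pd.Categorical(['a', 'b']))
assert is_extension_array_dtype(
    pd.Series(pd.date_range('2016-01-01', periods=2, tz='UTC')))
# A plain numpy array is not an extension type
assert not is_extension_array_dtype(pd.Series([1, 2]).to_numpy())
```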

jreback commented Feb 12, 2016

lgtm. I'll rebase when I merge. For future reference, I put random blank lines in the Bug Fixes whatsnew section; if you put the whatsnew note there, it won't have conflicts with other people.

gfyoung commented Feb 12, 2016

@jreback : Okay, good to know! Will rebase. Btw, is there any way for me to get notifications when commits (or merges) are made to master?

Fixes bug in DataFrame.apply by avoiding reducing DataFrames
whose values dtype is not a numpy dtype.

Closes gh-12244.
jreback commented Feb 12, 2016

Yes, the issue and PR will be closed, and GitHub will notify you.

gfyoung commented Feb 12, 2016

Sorry, clarification: I am talking about other people's PRs and commits merged into master. I didn't realize I had a merge conflict until I saw the GitHub notification with your comment.

jreback commented Feb 12, 2016

If you look at the PRs, it will tell you if it's clean to merge.

You can also watch the repo to get notifications; I don't think it's very granular, just on or off or mentions only.

gfyoung commented Feb 12, 2016

@jreback : Rebased and Travis gives the green light.

@jorisvandenbossche (Member)

@gfyoung to respond to your other question about the mailing list: there are indeed two. The pydata Google group (https://groups.google.com/forum/?fromgroups#!forum/pydata) is a more general mailing list, also with user questions for the wider ecosystem, and the pandas-dev mailing list (https://mail.python.org/mailman/listinfo/pandas-dev) is more focused on pandas development. You can subscribe to both on the web interface.

@jreback jreback closed this in 370f45f Feb 12, 2016
jreback commented Feb 12, 2016

thanks!
