-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Allow 'apply' to be used with non-numpy-dtype DataFrames #12284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
values = self.values | ||
# The 'values' attribute of a pandas Series | ||
# is a numpy ndarray, so the dtype will be | ||
# guaranteed to be valid when passed into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, instead of the where np.empty
is used, use pandas.core._sanitize_array.create_from_value
(prob should strip that out into a private helper function). this provides compat for things like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will not help in this case because this logic here IINM will just return the same data structure with the exact same data type that we are trying to avoid in the context of the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevertheless, as has become painfully clear with Travis just now, an alternative solution will be needed since casting to a Series
first does not work well non-1-dimensional arrays.
3af9d30
to
d01f9a5
Compare
@@ -400,3 +400,10 @@ def test_applymap(self): | |||
result = df.applymap(str) | |||
for f in ['datetime', 'timedelta']: | |||
self.assertEqual(result.loc[0, f], str(df.loc[0, f])) | |||
|
|||
# See gh-12244 | |||
def test_apply_non_numpy_dtype(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test for categorical as well (another extension type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
d01f9a5
to
9a7443a
Compare
see my very first comment |
See my response to that comment of yours. |
Also, there is a possibility that the |
you don't need to make this use numpy at sll |
Yes, exactly. That's what I am testing right now. Removing that |
empty_arr = np.empty(len(index), dtype=values.dtype) | ||
dummy = Series(empty_arr, index=self._get_axis(axis), | ||
dtype=values.dtype) | ||
dummy = Series(index=index, dtype=values.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use self.dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe DataFrame
objects have a dtype
attribute IINM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course you have to handle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand your last comment. Could you elaborate?
Test failures due to the fact that |
1bec774
to
2190480
Compare
you can do something like: |
I'm confused as to why I would need to do this. Isn't that just more complicated? Also, using the code provided in the issue, it just defaults to |
I think the issue that the code in @wesm ? |
|
Given that tests are passing and the fact that I've already dealt with several other alternative designs that couldn't quite pass through the testing phase, I think this PR should be good to merge unless there are other suggestions or input? |
pls add a whatsnew note, ping when green. |
also I think add a test where we actually do something in the apply. (this only makes sense with a single tz-aware dtype, but that's ok), e.g. just as a confirming test |
2190480
to
29c189d
Compare
Changes have been made, though it looks like it will be some time before my build gets under way. On a separate note, how do I subscribe to the |
@jreback : Finally! Travis is happy, and changes have all been made. Ready to merge. |
# as demonstrated in gh-12244 | ||
if (reduce and ExtensionDtype.is_dtype( | ||
self.values.dtype)): | ||
reduce = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this into the block below (which only activates if reduce
).
the reason is that a mixed dtype frame could trigger .values
twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
29c189d
to
51453ed
Compare
lgtm. I'll rebase when merge. For future reference, I put random blank lines in Bug Fixes. If you put the whatsnew note there it won't have conflicts with other people. |
@jreback : Okay, good to know! Will rebase. Btw, is there anyway for me to get notifications when commits (or merges) are made to |
Fixes bug in DataFrame.apply by avoiding reducing DataFrames whose values dtype is not a numpy dtype. Closes pandas-devgh-12244.
51453ed
to
b18b74f
Compare
yes issue and PR will be closed and GitHub notifies |
Sorry clarification: I am talking about other people's merged PR's and commits into |
if u look at the prs it will you if it's clean to merge u can also watch the repo to get notifications |
@jreback : Rebased and Travis gives the green light. |
@gfyoung to respond to your other question about the mailing list: there are indeed two, the pydata google groups (https://groups.google.com/forum/?fromgroups#!forum/pydata) is a more general mailing list also with user questions/for the wider ecosystem, and the pandas-dev mailing list is more focused for pandas development (https://mail.python.org/mailman/listinfo/pandas-dev). You can for both subscribe on the web interface. |
thanks! |
Addresses issue in #12244 in which a non-numpy-dtype for DataFrame.values causes a
TypeError
to be thrown in thereduce == True
case forDataFrame.apply
. Resolved by first passingDataFrame.values
throughSeries
initialization and taking itsvalues
attribute, which is anndarray
and hence will have a validdtype
. Note that the output ofapply
will still have the originaldtype
.