BUG GH23744 ufuncs on DataFrame keeps dtype sparseness #23755
Changes from 13 commits
b85bdb9
ad33f76
c39fe11
4aba3f8
bcdf01b
79be557
99c8796
0868c47
de0ecf3
491b908
f6230f6
42ca43a
ee2c462
bca539f
d8670ef
30d83a6
c15afe3
b4ab44b
d153f74
d6e22a8
8f151dc
be8750f
551ced8
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,7 +7,7 @@ | |
from pandas.util._decorators import cache_readonly | ||
|
||
from pandas.core.dtypes.common import ( | ||
is_dict_like, is_extension_type, is_list_like, is_sequence) | ||
is_dict_like, is_extension_type, is_list_like, is_sequence, is_sparse) | ||
from pandas.core.dtypes.generic import ABCSeries | ||
|
||
from pandas.io.formats.printing import pprint_thing | ||
|
@@ -131,6 +131,16 @@ def get_result(self): | |
|
||
# ufunc | ||
elif isinstance(self.f, np.ufunc): | ||
if any(is_sparse(dtype) for dtype in self.obj.dtypes): | ||
# Column-by-column construction is slow, so only use when | ||
# necessary (e.g. to preserve special dtypes) GH 23744 | ||
result = self.obj._constructor(index=self.index, | ||
copy=False) | ||
with np.errstate(all='ignore'): | ||
for col in self.columns: | ||
result[col] = self.f(self.obj[col].values) | ||
return result | ||
As I said above, don't construct an empty DataFrame; rather, use a dictionary (or just a list comprehension), then construct the DataFrame from it right before the return.
Edit: I just figured out I could pass a dict into the constructor to get the proper behaviour. Please disregard the rest of this post. @jreback I can't use the constructor in that case because it stacks a list of series horizontally and there's no axis option. so I used
|
||
|
||
with np.errstate(all='ignore'): | ||
results = self.f(self.values) | ||
return self.obj._constructor(data=results, index=self.index, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -91,3 +91,14 @@ def test_applymap(frame): | |
# just test that it works | ||
result = frame.applymap(lambda x: x * 2) | ||
assert isinstance(result, SparseDataFrame) | ||
|
||
|
||
def test_apply_keep_sparse_dtype(): | ||
# GH 23744 | ||
expected = SparseDataFrame(np.array([[0, 1, 0], [0, 0, 0], [0, 0, 1]]), | ||
Can you call this sdf? I find the repeated assignments slightly confusing here. |
||
columns=['a', 'b', 'c'], default_fill_value=1) | ||
result = DataFrame(expected) | ||
|
||
expected = expected.apply(np.exp) | ||
expected = sdft.apply(...) |
||
result = result.apply(np.exp) | ||
tm.assert_frame_equal(expected, result) |
This is going to be very inefficient. Use a list comprehension to iterate over the columns, then collect and construct the result. something like
Iterate through the series and construct the result. Construct a dict with the results instead, then do the construction at the end.
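The suggestion above (collect per-column results in a dict, then build the frame once) could look roughly like the sketch below. `apply_ufunc_columnwise` is a hypothetical helper name, not code from this PR; it assumes unique column labels and uses a dense frame only to keep the example runnable on any recent pandas.

```python
import numpy as np
import pandas as pd


def apply_ufunc_columnwise(df, ufunc):
    """Hypothetical helper: apply `ufunc` to each column, then build the frame once."""
    with np.errstate(all="ignore"):
        # One ufunc call per column; each result keeps that column's own dtype handling.
        results = {col: ufunc(df[col]) for col in df.columns}
    # A single construction at the end instead of repeated column assignment.
    return pd.DataFrame(results, index=df.index, columns=df.columns)


df = pd.DataFrame({"a": [0.0, 1.0], "b": [2.0, 3.0]})
out = apply_ufunc_columnwise(df, np.exp)
```

Passing a dict keyed by column label lets the constructor place each result as a column, which avoids growing an initially empty frame one assignment at a time.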
When given a list of series, the `DataFrame` constructor converts them all to arrays, so the columns can't all be passed into the constructor at once. To avoid inefficient construction in the common non-sparse case, how about checking whether there are any sparse columns at all: if there are, construction happens column-by-column, and if there aren't, it does what it did before?
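For reference, a small sketch of how the constructor treats a list of Series versus a dict of Series. The names `s1`, `s2`, `rows`, and `cols` are illustrative only; the point is that a list places each Series as a row, while a dict places each as a column.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], name="a")
s2 = pd.Series([4, 5, 6], name="b")

# A list of Series: each Series becomes one row, labelled by its name.
rows = pd.DataFrame([s1, s2])

# A dict of Series: each Series becomes one column, keyed by the dict key.
cols = pd.DataFrame({"a": s1, "b": s2})
```

This is why the dict form fits the column-by-column approach discussed here: no transpose or axis option is needed.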
We need to be careful here. Previously, for a homogenous DataFrame of non-extension array values, `df.apply(ufunc)` would result in one call to the ufunc. If we go columnwise, we'll have `n` calls to the ufunc. Should this be done block-wise and then the results stitched together?
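A rough illustration of the trade-off raised above, under stated assumptions: pandas' real blocks are internal, so this sketch approximates "block-wise" by grouping columns by dtype through the public API. `apply_ufunc_by_dtype` is a hypothetical helper, not something proposed in this PR, and it assumes unique column labels and plain NumPy dtypes.

```python
import numpy as np
import pandas as pd


def apply_ufunc_by_dtype(df, ufunc):
    """Hypothetical helper: one ufunc call per dtype group, results stitched together."""
    # Group column labels by dtype (an approximation of internal blocks).
    groups = {}
    for col, dtype in df.dtypes.items():
        groups.setdefault(str(dtype), []).append(col)

    pieces = []
    n_calls = 0
    with np.errstate(all="ignore"):
        for cols in groups.values():
            # One call on the 2-D array holding every column of this dtype.
            values = ufunc(df[cols].to_numpy())
            n_calls += 1
            pieces.append(pd.DataFrame(values, index=df.index, columns=cols))

    # Stitch the pieces back together in the original column order.
    result = pd.concat(pieces, axis=1)[list(df.columns)]
    return result, n_calls


df = pd.DataFrame({"a": [0, 1], "b": [0.5, 1.5], "c": [2, 3]})
out, n_calls = apply_ufunc_by_dtype(df, np.exp)
```

Here the three columns fall into two dtype groups, so the ufunc is called twice rather than three times; a fully homogeneous frame would still need only one call.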