BUG GH23744 ufuncs on DataFrame keeps dtype sparseness #23755
Conversation
Hello @JustinZhengBC! Thanks for submitting the PR.
Force-pushed from c7ea063 to b85bdb9.
pandas/core/apply.py (Outdated)

@@ -133,8 +134,14 @@ def get_result(self):
elif isinstance(self.f, np.ufunc):
    with np.errstate(all='ignore'):
        results = self.f(self.values)
return self.obj._constructor(data=results, index=self.index,
                             columns=self.columns, copy=False)
result = self.obj._constructor(data=results, index=self.index,
I think we'd like to avoid going to sparse in the first place. This seems to be taking the result and converting it back to sparse if necessary.
I haven't looked closely at where things are going wrong though, so I'm not sure where you should be looking.
The bug report seems to specifically target the case where the user chooses to store sparse columns in a normal DataFrame instead of a SparseDataFrame, for instance by calling the DataFrame constructor with a SparseDataFrame as the data.
Understood. It's a deeper issue though.
A ufunc on a SparseArray will stay sparse. It won't blow up your memory.
In [15]: df = pd.DataFrame({"A": pd.SparseArray([0, 1, 2])})
In [16]: np.exp(df.A.values)
Out[16]:
[1.0, 2.718281828459045, 7.38905609893065]
Fill: 1.0
IntIndex
Indices: array([1, 2], dtype=int32)
But a ufunc (possibly via apply) on a DataFrame or Series with sparse values will materialize the dense values:
In [18]: np.exp(df.A).dtypes
Out[18]: dtype('float64')
Just converting back to sparse so that the output dtype is Sparse[float64] isn't enough; your memory has already blown up :)
I think we need to ensure that ufuncs on Series dispatch to the underlying array.
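The dispatch being described here is what NumPy's `__array_ufunc__` protocol provides. As a minimal, standalone sketch (the `TinySparse` class below is hypothetical, not pandas code), a sparse container can intercept a ufunc and apply it only to the stored values and the fill value, never densifying:

```python
import numpy as np

class TinySparse:
    """Hypothetical sparse 1-D container: stores only non-fill values."""

    def __init__(self, sp_values, sp_index, fill_value, length):
        self.sp_values = np.asarray(sp_values, dtype=float)
        self.sp_index = np.asarray(sp_index)  # positions of sp_values
        self.fill_value = fill_value
        self.length = length

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Intercept e.g. np.exp(arr): apply the ufunc to the stored
        # values and to the fill value only -- the dense array is
        # never materialized.
        if method != "__call__" or len(inputs) != 1 or kwargs:
            return NotImplemented
        return TinySparse(ufunc(self.sp_values), self.sp_index,
                          ufunc(self.fill_value), self.length)

    def to_dense(self):
        out = np.full(self.length, self.fill_value, dtype=float)
        out[self.sp_index] = self.sp_values
        return out

arr = TinySparse([1.0, 2.0], [1, 2], fill_value=0.0, length=3)
res = np.exp(arr)  # stays a TinySparse; fill becomes exp(0) == 1.0
print(type(res).__name__, res.fill_value, res.to_dense())
```

This mirrors the `In [15]`/`In [16]` session above, where `np.exp` on a SparseArray keeps the sparse layout and the fill value becomes 1.0.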
Is this better? self.obj[col].values returns a SparseArray when the column is sparse, so the ufunc should stay sparse.
Codecov Report
@@ Coverage Diff @@
## master #23755 +/- ##
==========================================
+ Coverage 92.29% 92.31% +0.01%
==========================================
Files 161 161
Lines 51498 51483 -15
==========================================
- Hits 47530 47525 -5
+ Misses 3968 3958 -10
Continue to review full report at Codecov.
pandas/core/apply.py (Outdated)

columns=self.columns, copy=False)
result = self.obj._constructor(index=self.index, copy=False)
for col in self.columns:
    if is_sparse(self.obj.dtypes[col]):
I removed the sparse check in the last commit because it turns out calling values on normal columns is also acceptable (this should generalize to other array types as well). I do agree that it would look neater if calling values wasn't required at all, though.
Can't reproduce the one failing test, and I don't think it's related.
pandas/core/apply.py (Outdated)

columns=self.columns, copy=False)
result = self.obj._constructor(index=self.index, copy=False)
for col in self.columns:
    with np.errstate(all='ignore'):
this is going to be very inefficient. use a list comprehension to iterate over the columns, then collect and construct the result. something like

def f(c):
    with np.errstate(all='ignore'):
        return self.f(c.values)

results = [f(c) for _, c in self.obj.iteritems()]
return self.obj._constructor(results, index=self.index, columns=self.columns, copy=False)

iterate through the series and construct the result, or construct a dict with the results, then do the construction at the end.
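A standalone sketch of the dict-based variant suggested here (assumes numpy and pandas are installed; `apply_ufunc_columnwise` is an illustrative name, not the actual `pandas/core/apply.py` code): collect each column's ufunc result into a dict, then build the frame once at the end.

```python
import numpy as np
import pandas as pd

def apply_ufunc_columnwise(df, ufunc):
    # Collect per-column results first, then construct the DataFrame
    # a single time right before returning.
    with np.errstate(all="ignore"):
        results = {col: ufunc(df[col].values) for col in df.columns}
    # Passing `columns` explicitly pins the original column order.
    return pd.DataFrame(results, index=df.index, columns=df.columns)

df = pd.DataFrame({"b": [1.0, 2.0], "a": [3.0, 4.0]})
out = apply_ufunc_columnwise(df, np.exp)
print(out.columns.tolist())  # column order preserved: ['b', 'a']
```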
When given a list of series, the DataFrame constructor converts them all to arrays, so the columns can't all be passed into the constructor at once. To avoid inefficient construction in the common non-sparse case, how about checking whether there are any sparse columns at all: if there are, construct column-by-column; if there aren't, do what it did before?
We need to be careful here. Previously, for a homogeneous DataFrame of non-extension-array values, df.apply(ufunc) would result in one call to the ufunc. If we go columnwise, we'll have n calls to the ufunc. Should this be done block-wise and the results then stitched together?
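The concern in NumPy terms, as a numpy-only sketch (illustrative, not pandas internals): a blockwise application is one ufunc call over the whole 2-D buffer, while the columnwise loop invokes the ufunc once per column. The results match, but the call counts differ.

```python
import numpy as np

# One homogeneous "block" of float values, as a consolidated
# DataFrame would hold internally.
values = np.arange(12, dtype=float).reshape(4, 3)

# Blockwise: a single ufunc call over the entire 2-D buffer.
blockwise = np.exp(values)

# Columnwise: n separate ufunc calls, one per column.
columnwise = np.column_stack([np.exp(values[:, i])
                              for i in range(values.shape[1])])

# Same numerical result either way; only the number of calls differs.
assert np.array_equal(blockwise, columnwise)
```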
pandas/tests/frame/test_apply.py (Outdated)

@@ -570,6 +570,16 @@ def test_apply_dup_names_multi_agg(self):
    tm.assert_frame_equal(result, expected)

def test_apply_keep_sparse_dtype(self):
this should be in the sparse frame tests. Does this cover all of the examples in both issues?
@jreback sorry, it does not fix the second issue; edited.
can you merge master |
pandas/core/apply.py (Outdated)

@@ -131,6 +131,17 @@ def get_result(self):

# ufunc
elif isinstance(self.f, np.ufunc):
    for dtype in self.obj.dtypes:
so just do

if any(is_sparse(dtype) for dtype in self.obj.dtypes):
    ...
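That guard can be tried standalone. The sketch below (assumes numpy and pandas are installed) uses the public `pd.SparseDtype` check in place of the internal `is_sparse` helper; `has_sparse_column` is a hypothetical name for illustration:

```python
import numpy as np
import pandas as pd

def has_sparse_column(df):
    # Take the columnwise path only when some column is sparse;
    # otherwise the cheaper whole-block path can be used.
    return any(isinstance(dtype, pd.SparseDtype) for dtype in df.dtypes)

dense = pd.DataFrame({"a": [1.0, 2.0]})
mixed = pd.DataFrame({"a": [1.0, 2.0],
                      "b": pd.arrays.SparseArray([0.0, 1.0])})

print(has_sparse_column(dense), has_sparse_column(mixed))
```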
# GH 23744
df = SparseDataFrame(np.array([[0, 1, 0], [0, 0, 0], [0, 0, 1]]),
                     columns=['a', 'b', 'c'], default_fill_value=1)
df2 = DataFrame(df)
call this expected
columns=['a', 'b', 'c'], default_fill_value=1)
df2 = DataFrame(df)

df = df.apply(np.exp)
call this result
doc/source/whatsnew/v0.24.0.rst (Outdated)

@@ -1275,6 +1275,7 @@ Numeric
- Bug in :meth:`Series.rpow` with object dtype ``NaN`` for ``1 ** NA`` instead of ``1`` (:issue:`22922`).
- :meth:`Series.agg` can now handle numpy NaN-aware methods like :func:`numpy.nansum` (:issue:`19629`)
- Bug in :meth:`Series.rank` and :meth:`DataFrame.rank` when ``pct=True`` and more than 2:sup:`24` rows are present resulted in percentages greater than 1.0 (:issue:`18271`)
- Bug in :meth:`DataFrame.apply` where dtypes would lose sparseness (:issue:`23744`)
move this to the Sparse section
pandas/core/apply.py
Outdated
with np.errstate(all='ignore'): | ||
for col in self.columns: | ||
result[col] = self.f(self.obj[col].values) | ||
return result |
as I said above, don't construct an empty DataFrame; rather, use a dictionary (or just a list comprehension), then construct the DataFrame from it right before the return.
Edit: just figured out I could pass a dict into the constructor to get the proper behaviour. Please disregard the rest of this post.

@jreback I can't use the constructor in that case because it stacks a list of series horizontally and there's no axis option, so I used pd.concat. Is that okay?
>>> import pandas as pd
>>> s1 = pd.Series([1, 2])
>>> s2 = pd.Series([3, 4])
>>> pd.DataFrame([s1, s2])
0 1
0 1 2
1 3 4
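The dict workaround from the edit above can be checked standalone (assumes pandas is installed; column names `"a"`/`"b"` are illustrative): a list of Series is stacked as rows, while a dict keys each Series to a column.

```python
import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series([3, 4])

rows = pd.DataFrame([s1, s2])            # list: each Series becomes a row
cols = pd.DataFrame({"a": s1, "b": s2})  # dict: each Series becomes a column

print(rows.iloc[:, 0].tolist())  # first column of row-stacked frame: [1, 3]
print(cols["a"].tolist())        # s1 kept intact as a column: [1, 2]
```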
Considering #23875, I think we will need a for loop in
pandas/core/apply.py
Outdated
return self.obj._constructor(data=results, index=self.index, | ||
columns=self.columns, copy=False) |
You probably need to pass columns here to ensure that the columns are in the correct order. Add a test where the columns aren't sorted (like ['b', 'a', 'c']) and verify that it fails on py35 and older.
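The ordering point above can be demonstrated standalone (assumes pandas is installed): on py35 and older, dict iteration order was arbitrary, so a dict-constructed frame could come out in any column order; passing `columns` explicitly pins it regardless of the Python version.

```python
import pandas as pd

data = {"b": [1, 2], "a": [3, 4], "c": [5, 6]}

# Without `columns`, order follows dict iteration order
# (insertion order on modern Python, arbitrary on py35 and older).
implicit = pd.DataFrame(data)

# With `columns`, the order is pinned explicitly.
explicit = pd.DataFrame(data, columns=["b", "a", "c"])
print(explicit.columns.tolist())  # ['b', 'a', 'c'] on every version
```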
@jreback, general question: how firm should we be on preserving blockwise application of things? I don't think we should give that up yet (as the current PR does).
@TomAugspurger I agree generally with your point about block-wise application. Sparse, however, is I think already a single block, so there should be no difference in this approach.
My concern is about dataframes with non-sparse columns. Previously, for an all-float DataFrame,
this could actually dispatch to the block impl then (prob better). @JustinZhengBC basically you can call
Ah, I forgot we had

In [20]: df = pd.DataFrame({"A": [1, 2, 3], "B": pd.SparseArray([0, 1, 2])})
In [21]: df._data.apply('apply', func=np.sin)
Out[21]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
FloatBlock: slice(0, 1, 1), 1 x 3, dtype: float64
ExtensionBlock: slice(1, 2, 1), 1 x 3, dtype: Sparse[float64, 0.0]
@jreback you're right, it works. I also switched up the columns in the test so order preservation is tested.
tiny comment. ping when pushed / green.
def test_apply_keep_sparse_dtype():
    # GH 23744
    expected = SparseDataFrame(np.array([[0, 1, 0], [0, 0, 0], [0, 0, 1]]),
can you call this sdf? I find the repeated assignments slightly confusing here.
columns=['b', 'a', 'c'], default_fill_value=1)
result = DataFrame(expected)

expected = expected.apply(np.exp)
expected = sdf.apply(...)
@jreback green
thanks @JustinZhengBC