
BUG GH23744 ufuncs on DataFrame keeps dtype sparseness #23755


Merged: 23 commits into pandas-dev:master, Nov 27, 2018

Conversation

@JustinZhengBC (Contributor) commented Nov 17, 2018

@pep8speaks:

Hello @JustinZhengBC! Thanks for submitting the PR.

@@ -133,8 +134,14 @@ def get_result(self):
        elif isinstance(self.f, np.ufunc):
            with np.errstate(all='ignore'):
                results = self.f(self.values)
            return self.obj._constructor(data=results, index=self.index,
                                         columns=self.columns, copy=False)
            result = self.obj._constructor(data=results, index=self.index,
Contributor:

I think we'd like to avoid going to sparse in the first place. This seems to be taking the result and converting it back to sparse if necessary.

Contributor:

I haven't looked closely at where things are going wrong though, so I'm not sure where you should be looking.

Contributor (Author):

The bug report seems to specifically target the case where the user chooses to store sparse columns in a normal DataFrame instead of a SparseDataFrame, for instance by calling the DataFrame constructor with a SparseDataFrame as the data.
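
As a minimal sketch of that reported setup (assuming the 0.24-dev API this PR targets, where a pd.SparseArray column carries a Sparse[...] dtype):

import numpy as np
import pandas as pd

# sparse values held in a plain DataFrame, not a SparseDataFrame
df = pd.DataFrame({"A": pd.SparseArray([0, 1, 2])})
print(df.dtypes)          # Sparse[int64, 0]
print(np.exp(df).dtypes)  # the reported bug: comes back as dense float64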

Contributor:

Understood. It's a deeper issue though.

A ufunc on a SparseArray will stay sparse. It won't blow up your memory.

In [15]: df = pd.DataFrame({"A": pd.SparseArray([0, 1, 2])})

In [16]: np.exp(df.A.values)
Out[16]:
[1.0, 2.718281828459045, 7.38905609893065]
Fill: 1.0
IntIndex
Indices: array([1, 2], dtype=int32)

But a ufunc (possibly via apply) on a DataFrame or Series with sparse values will materialize the dense values:

In [18]: np.exp(df.A).dtypes
Out[18]: dtype('float64')

Just converting back to sparse so that the output dtype is Sparse[float64] isn't enough, your memory has already blown up :)

I think we need to ensure that ufuncs on Series dispatch to the underlying array.

@JustinZhengBC (Contributor, Author) commented Nov 17, 2018:

Is this better? self.obj[col].values returns a SparseArray when the column is sparse so the ufunc should stay sparse.
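
For reference, a small check of that claim (again assuming the 0.24-dev behaviour where .values on a sparse column returns a SparseArray):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": pd.SparseArray([0.0, 1.0, 2.0])})
arr = df["A"].values       # a SparseArray, not a dense ndarray
out = np.exp(arr)          # stays sparse, as shown earlier in the thread
print(type(out).__name__)  # SparseArray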

@codecov bot commented Nov 17, 2018:

Codecov Report

Merging #23755 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23755      +/-   ##
==========================================
+ Coverage   92.29%   92.31%   +0.01%     
==========================================
  Files         161      161              
  Lines       51498    51483      -15     
==========================================
- Hits        47530    47525       -5     
+ Misses       3968     3958      -10
Flag       Coverage Δ
#multiple  90.7% <100%> (+0.01%) ⬆️
#single    42.43% <0%> (ø) ⬆️

Impacted Files                     Coverage Δ
pandas/core/apply.py               98.61% <100%> (ø) ⬆️
pandas/plotting/_misc.py           38.68% <0%> (-0.31%) ⬇️
pandas/core/arrays/datetimes.py    98.37% <0%> (-0.14%) ⬇️
pandas/core/ops.py                 94.14% <0%> (-0.14%) ⬇️
pandas/io/formats/html.py          97.63% <0%> (-0.05%) ⬇️
pandas/core/dtypes/dtypes.py       95.59% <0%> (-0.03%) ⬇️
pandas/core/arrays/categorical.py  95.34% <0%> (-0.02%) ⬇️
pandas/core/arrays/sparse.py       91.93% <0%> (-0.02%) ⬇️
pandas/core/reshape/tile.py        94.73% <0%> (ø) ⬆️
pandas/core/frame.py               97.02% <0%> (ø) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc0674d...551ced8. Read the comment docs.

                                         columns=self.columns, copy=False)
            result = self.obj._constructor(index=self.index, copy=False)
            for col in self.columns:
                if is_sparse(self.obj.dtypes[col]):
Contributor:

This is closer, but I don't think it should be special casing sparse. The bug applies to any extension array.

#23293 is fixing ufuncs on series. So extracting .values won't be necessary. I'd recommend waiting for #23293 and seeing if you can just do result[col] = self.f(self.obj[col]).

Contributor (Author):

I removed the sparse check in the last commit because it turns out calling values on normal columns is also acceptable (should generalize to other array types as well). I do agree that it would look neater if calling values wasn't required at all though.

@JustinZhengBC changed the title from "BUG GH23744 DataFrame.apply keeps dtype sparseness" to "BUG GH23743 and GH23744 DataFrame.apply keeps dtype sparseness" Nov 17, 2018
@JustinZhengBC changed the title from "BUG GH23743 and GH23744 DataFrame.apply keeps dtype sparseness" to "BUG GH23743 and GH23744 ufuncs on DataFrame keeps dtype sparseness" Nov 17, 2018
@JustinZhengBC (Contributor, Author) commented Nov 18, 2018:

Can't reproduce the one failing test, and I don't think it's related.
Edit: never mind.

@gfyoung added the Sparse (Sparse Data Type) and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Nov 18, 2018
                                         columns=self.columns, copy=False)
            result = self.obj._constructor(index=self.index, copy=False)
            for col in self.columns:
                with np.errstate(all='ignore'):
Contributor:

this is going to be very inefficient. use a list comprehension to iterate over the columns, then collect and construct the result. something like

def f(c):
    with np.errstate(all='ignore'):
        return self.f(c.values)

results = [f(c) for col, c in self.obj.iteritems()]
return self._constructor(results, index=self.index, columns=self.columns, copy=False)

Alternatively, iterate through the Series, collect the results in a dict, then do the construction at the end.

Contributor (Author):

When given a list of Series, the DataFrame constructor converts them all to arrays, so the columns can't all be passed into the constructor at once. To avoid inefficient construction in the common non-sparse case, how about checking whether there are any sparse columns at all: if there are, construct column-by-column; if there aren't, keep the single-call path it used before?
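
A standalone sketch of that column-by-column idea (a hypothetical helper, not the pandas internals; pd.SparseArray as above):

import numpy as np
import pandas as pd

def apply_ufunc_columnwise(df, func):
    # hypothetical helper: go through .values per column so extension
    # arrays such as SparseArray keep their dtype
    data = {}
    with np.errstate(all='ignore'):
        for col in df.columns:
            data[col] = func(df[col].values)
    return pd.DataFrame(data, index=df.index, columns=df.columns)

df = pd.DataFrame({"A": [1.0, 2.0], "B": pd.SparseArray([0.0, 1.0])})
print(apply_ufunc_columnwise(df, np.exp).dtypes)  # A dense, B still sparse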

Contributor:

We need to be careful here. Previously, for a homogeneous DataFrame of non-extension-array values, df.apply(ufunc) would result in one call to the ufunc.

If we go columnwise, we'll have n calls to the ufunc.

Should this be done block-wise and then the results stitched together?

@@ -570,6 +570,16 @@ def test_apply_dup_names_multi_agg(self):

        tm.assert_frame_equal(result, expected)

    def test_apply_keep_sparse_dtype(self):
Contributor:

this should be in the sparse frame tests

@jreback (Contributor) commented Nov 18, 2018:

does this cover all of the examples in both issues?

@JustinZhengBC changed the title from "BUG GH23743 and GH23744 ufuncs on DataFrame keeps dtype sparseness" to "BUG GH23744 ufuncs on DataFrame keeps dtype sparseness" Nov 19, 2018
@JustinZhengBC (Contributor, Author):
@jreback sorry, it does not fix the second issue; edited.

@jreback (Contributor) commented Nov 23, 2018:

can you merge master

@@ -131,6 +131,17 @@ def get_result(self):

        # ufunc
        elif isinstance(self.f, np.ufunc):
            for dtype in self.obj.dtypes:
Contributor:

so just do

if any(is_sparse(dtype) for dtype in self.obj.dtypes):
    ....

        # GH 23744
        df = SparseDataFrame(np.array([[0, 1, 0], [0, 0, 0], [0, 0, 1]]),
                             columns=['a', 'b', 'c'], default_fill_value=1)
        df2 = DataFrame(df)
Contributor:

call this expected

                             columns=['a', 'b', 'c'], default_fill_value=1)
        df2 = DataFrame(df)

        df = df.apply(np.exp)
Contributor:

call this result

@@ -1275,6 +1275,7 @@ Numeric
- Bug in :meth:`Series.rpow` with object dtype ``NaN`` for ``1 ** NA`` instead of ``1`` (:issue:`22922`).
- :meth:`Series.agg` can now handle numpy NaN-aware methods like :func:`numpy.nansum` (:issue:`19629`)
- Bug in :meth:`Series.rank` and :meth:`DataFrame.rank` when ``pct=True`` and more than 2\ :sup:`24` rows are present resulted in percentages greater than 1.0 (:issue:`18271`)
- Bug in :meth:`DataFrame.apply` where dtypes would lose sparseness (:issue:`23744`)
Contributor:

move this to the Sparse section

            with np.errstate(all='ignore'):
                for col in self.columns:
                    result[col] = self.f(self.obj[col].values)
            return result
Contributor:

as I said above, don't construct an empty DataFrame; rather use a dictionary (or just a list comprehension), then construct the DataFrame from it right before returning.

@JustinZhengBC (Contributor, Author) commented Nov 23, 2018:

Edit: just figured out I could pass a dict into the constructor to get the proper behaviour. Please disregard the rest of this post.

@jreback I can't use the constructor in that case because it stacks a list of Series horizontally and there's no axis option, so I used pd.concat. Is that okay?

>>> import pandas as pd
>>> s1 = pd.Series([1, 2])
>>> s2 = pd.Series([3, 4])
>>> pd.DataFrame([s1, s2])
   0  1
0  1  2
1  3  4
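
For contrast, the dict form mentioned in the edit keeps each Series as a column (column order follows dict insertion order on Python 3.6+):

>>> pd.DataFrame({"s1": s1, "s2": s2})
   s1  s2
0   1   3
1   2   4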

@JustinZhengBC (Contributor, Author):
Considering #23875, I think we will need a for loop in get_result to check for all the different dtypes. This PR only covers the sparse case, but the categorical case could be covered as well by adding:

elif is_categorical(self.obj.dtypes[col]):
    before = self.obj.dtypes[col].categories
    after = self.f(before)
    mapping = dict(zip(before, after))
    result = self.obj[col].replace(mapping).astype('category')
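
The same category-level transform could likely also be written against the public Categorical API instead of replace (a sketch, not part of this PR):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 1, 3], dtype="category")
# apply the ufunc to the (few) categories only; the codes stay intact
result = s.cat.rename_categories(np.exp(s.cat.categories))
print(result.dtype)  # category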

            return self.obj._constructor(data=results, index=self.index,
                                         columns=self.columns, copy=False)
Contributor:

You probably need to pass columns here to ensure that the columns are in the correct order (dict ordering isn't guaranteed before Python 3.6). Add a test where the columns aren't sorted (like ['b', 'a', 'c']) and verify that it fails on py35 and older.

@TomAugspurger (Contributor) commented Nov 26, 2018:

@jreback, general question: how firm should we be on preserving blockwise application of things? I don't think we should give that up yet (as the current PR does).

@jreback (Contributor) commented Nov 26, 2018:

@TomAugspurger I agree generally with your point about block-wise application. Sparse, however, I think is already a single block, so there should be no difference in this approach.

@TomAugspurger (Contributor):
My concern is about dataframes with non-sparse columns. Previously, for an all-float DataFrame, DataFrame.apply(np.abs) would (I think) call np.abs once. Now we'll be calling it once per column, even if there are no sparse columns.

@jreback (Contributor) commented Nov 26, 2018:

> My concern is about dataframes with non-sparse columns. Previously, for an all-float DataFrame, DataFrame.apply(np.abs) would (I think) call np.abs once. Now we'll be calling it once per column, even if there are no sparse columns.

this could actually dispatch to the block impl then (prob better).

@JustinZhengBC basically you can call

self._data.apply('apply', .....) I think

@TomAugspurger (Contributor):
Ah, I forgot we had Block.apply. Yes, this should hopefully work:

In [20]: df = pd.DataFrame({"A": [1, 2, 3], "B": pd.SparseArray([0, 1, 2])})

In [21]: df._data.apply('apply', func=np.sin)
Out[21]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
FloatBlock: slice(0, 1, 1), 1 x 3, dtype: float64
ExtensionBlock: slice(1, 2, 1), 1 x 3, dtype: Sparse[float64, 0.0]

@JustinZhengBC (Contributor, Author):
@jreback you're right, it works. I also switched up the columns in the test so order preservation is tested.

@jreback added this to the 0.24.0 milestone Nov 27, 2018
@jreback (Contributor) left a comment:

tiny comment. ping when pushed / green.


def test_apply_keep_sparse_dtype():
    # GH 23744
    expected = SparseDataFrame(np.array([[0, 1, 0], [0, 0, 0], [0, 0, 1]]),
Contributor:

can you call this sdf? I find the repeated assignments slightly confusing here.

                               columns=['b', 'a', 'c'], default_fill_value=1)
    result = DataFrame(expected)

    expected = expected.apply(np.exp)
Contributor:

expected = sdf.apply(...)

@JustinZhengBC (Contributor, Author):
@jreback green

@jreback merged commit 3922947 into pandas-dev:master Nov 27, 2018
@jreback (Contributor) commented Nov 27, 2018:

thanks @JustinZhengBC

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Successfully merging this pull request may close these issues.

BUG: DataFrame.apply looses sparse dtype