
BUG GH23744 ufuncs on DataFrame keeps dtype sparseness #23755


Merged: 23 commits into pandas-dev:master, Nov 27, 2018

Conversation

@JustinZhengBC (Contributor) commented Nov 17, 2018

@pep8speaks:

Hello @JustinZhengBC! Thanks for submitting the PR.

@@ -133,8 +134,14 @@ def get_result(self):
        elif isinstance(self.f, np.ufunc):
            with np.errstate(all='ignore'):
                results = self.f(self.values)
            return self.obj._constructor(data=results, index=self.index,
                                         columns=self.columns, copy=False)
            result = self.obj._constructor(data=results, index=self.index,
Contributor:

I think we'd like to avoid going to sparse in the first place. This seems to be taking the result and converting it back to sparse if necessary.

Contributor:

I haven't looked closely at where things are going wrong though, so I'm not sure where you should be looking.

Contributor (Author):

The bug report seems to specifically target the case where the user chooses to store sparse columns in a normal DataFrame instead of a SparseDataFrame, for instance by calling the DataFrame constructor with a SparseDataFrame as the data.
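
As a minimal sketch of that reported setup (assuming the 0.24-dev API this PR targets, where a pd.SparseArray column carries a Sparse[...] dtype):

import numpy as np
import pandas as pd

# sparse values held in a plain DataFrame, not a SparseDataFrame
df = pd.DataFrame({"A": pd.SparseArray([0, 1, 2])})
print(df.dtypes)          # Sparse[int64, 0]
print(np.exp(df).dtypes)  # the reported bug: comes back as dense float64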

Contributor:

Understood. It's a deeper issue though.

A ufunc on a SparseArray will stay sparse. It won't blow up your memory.

In [15]: df = pd.DataFrame({"A": pd.SparseArray([0, 1, 2])})

In [16]: np.exp(df.A.values)
Out[16]:
[1.0, 2.718281828459045, 7.38905609893065]
Fill: 1.0
IntIndex
Indices: array([1, 2], dtype=int32)

But a ufunc (possibly via apply) on a DataFrame or Series with sparse values will materialize the dense values:

In [18]: np.exp(df.A).dtypes
Out[18]: dtype('float64')

Just converting back to sparse so that the output dtype is Sparse[float64] isn't enough, your memory has already blown up :)

I think we need to ensure that ufuncs on Series dispatch to the underlying array.

@JustinZhengBC (Contributor, Author) commented Nov 17, 2018:

Is this better? self.obj[col].values returns a SparseArray when the column is sparse so the ufunc should stay sparse.
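
For reference, a small check of that claim (again assuming the 0.24-dev behaviour where .values on a sparse column returns a SparseArray):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": pd.SparseArray([0.0, 1.0, 2.0])})
arr = df["A"].values       # a SparseArray, not a dense ndarray
out = np.exp(arr)          # stays sparse, as shown earlier in the thread
print(type(out).__name__)  # SparseArray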

@codecov bot commented Nov 17, 2018:

Codecov Report

Merging #23755 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23755      +/-   ##
==========================================
+ Coverage   92.29%   92.31%   +0.01%     
==========================================
  Files         161      161              
  Lines       51498    51483      -15     
==========================================
- Hits        47530    47525       -5     
+ Misses       3968     3958      -10
Flag       Coverage Δ
#multiple  90.7% <100%> (+0.01%) ⬆️
#single    42.43% <0%> (ø) ⬆️

Impacted Files                     Coverage Δ
pandas/core/apply.py               98.61% <100%> (ø) ⬆️
pandas/plotting/_misc.py           38.68% <0%> (-0.31%) ⬇️
pandas/core/arrays/datetimes.py    98.37% <0%> (-0.14%) ⬇️
pandas/core/ops.py                 94.14% <0%> (-0.14%) ⬇️
pandas/io/formats/html.py          97.63% <0%> (-0.05%) ⬇️
pandas/core/dtypes/dtypes.py       95.59% <0%> (-0.03%) ⬇️
pandas/core/arrays/categorical.py  95.34% <0%> (-0.02%) ⬇️
pandas/core/arrays/sparse.py       91.93% <0%> (-0.02%) ⬇️
pandas/core/reshape/tile.py        94.73% <0%> (ø) ⬆️
pandas/core/frame.py               97.02% <0%> (ø) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc0674d...551ced8. Read the comment docs.

                                         columns=self.columns, copy=False)
            result = self.obj._constructor(index=self.index, copy=False)
            for col in self.columns:
                if is_sparse(self.obj.dtypes[col]):
Contributor:

This is closer, but I don't think it should be special casing sparse. The bug applies to any extension array.

#23293 is fixing ufuncs on series. So extracting .values won't be necessary. I'd recommend waiting for #23293 and seeing if you can just do result[col] = self.f(self.obj[col]).

Contributor (Author):

I removed the sparse check in the last commit because it turns out calling values on normal columns is also acceptable (should generalize to other array types as well). I do agree that it would look neater if calling values wasn't required at all though.

@JustinZhengBC changed the title from "BUG GH23744 DataFrame.apply keeps dtype sparseness" to "BUG GH23743 and GH23744 DataFrame.apply keeps dtype sparseness" Nov 17, 2018
@JustinZhengBC changed the title from "BUG GH23743 and GH23744 DataFrame.apply keeps dtype sparseness" to "BUG GH23743 and GH23744 ufuncs on DataFrame keeps dtype sparseness" Nov 17, 2018
@JustinZhengBC (Contributor, Author) commented Nov 18, 2018:

Can't reproduce the one failing test, and I don't think it's related.
Edit: never mind.

@gfyoung added the Sparse (Sparse Data Type) and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Nov 18, 2018
                                         columns=self.columns, copy=False)
            result = self.obj._constructor(index=self.index, copy=False)
            for col in self.columns:
                with np.errstate(all='ignore'):
Contributor:

this is going to be very inefficient. use a list comprehension to iterate over the columns, then collect and construct the result. something like

def f(c):
    with np.errstate(all='ignore'):
        return self.f(c.values)

results = [f(c) for col, c in self.obj.iteritems()]
return self._constructor(results, index=self.index, columns=self.columns, copy=False)

Alternatively, iterate through the Series, collect the results in a dict, then do the construction at the end.

Contributor (Author):

When given a list of Series, the DataFrame constructor converts them all to arrays, so the columns can't all be passed into the constructor at once. To avoid inefficient construction in the common non-sparse case, how about checking whether there are any sparse columns at all: if there are, construct column-by-column; if there aren't, keep the single-call path it used before?
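
A standalone sketch of that column-by-column idea (a hypothetical helper, not the pandas internals; pd.SparseArray as above):

import numpy as np
import pandas as pd

def apply_ufunc_columnwise(df, func):
    # hypothetical helper: go through .values per column so extension
    # arrays such as SparseArray keep their dtype
    data = {}
    with np.errstate(all='ignore'):
        for col in df.columns:
            data[col] = func(df[col].values)
    return pd.DataFrame(data, index=df.index, columns=df.columns)

df = pd.DataFrame({"A": [1.0, 2.0], "B": pd.SparseArray([0.0, 1.0])})
print(apply_ufunc_columnwise(df, np.exp).dtypes)  # A dense, B still sparse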

Contributor:

We need to be careful here. Previously, for a homogeneous DataFrame of non-extension-array values, df.apply(ufunc) would result in one call to the ufunc.

If we go columnwise, we'll have n calls to the ufunc.

Should this be done block-wise and then the results stitched together?

@@ -570,6 +570,16 @@ def test_apply_dup_names_multi_agg(self):

        tm.assert_frame_equal(result, expected)

    def test_apply_keep_sparse_dtype(self):
Contributor:

this should be in the sparse frame tests

@jreback (Contributor) commented Nov 18, 2018:

does this cover all of the examples in both issues?

@JustinZhengBC changed the title from "BUG GH23743 and GH23744 ufuncs on DataFrame keeps dtype sparseness" to "BUG GH23744 ufuncs on DataFrame keeps dtype sparseness" Nov 19, 2018
@JustinZhengBC (Contributor, Author):
@jreback sorry, it does not fix the second issue; edited.

@jreback (Contributor) commented Nov 23, 2018:

can you merge master

@@ -131,6 +131,17 @@ def get_result(self):

        # ufunc
        elif isinstance(self.f, np.ufunc):
            for dtype in self.obj.dtypes:
Contributor:

so just do

if any(is_sparse(dtype) for dtype in self.obj.dtypes):
    ....

        # GH 23744
        df = SparseDataFrame(np.array([[0, 1, 0], [0, 0, 0], [0, 0, 1]]),
                             columns=['a', 'b', 'c'], default_fill_value=1)
        df2 = DataFrame(df)
Contributor:

call this expected

                             columns=['a', 'b', 'c'], default_fill_value=1)
        df2 = DataFrame(df)

        df = df.apply(np.exp)
Contributor:

call this result

@@ -1275,6 +1275,7 @@ Numeric
- Bug in :meth:`Series.rpow` with object dtype ``NaN`` for ``1 ** NA`` instead of ``1`` (:issue:`22922`).
- :meth:`Series.agg` can now handle numpy NaN-aware methods like :func:`numpy.nansum` (:issue:`19629`)
- Bug in :meth:`Series.rank` and :meth:`DataFrame.rank` when ``pct=True`` and more than 2\ :sup:`24` rows are present resulted in percentages greater than 1.0 (:issue:`18271`)
- Bug in :meth:`DataFrame.apply` where dtypes would lose sparseness (:issue:`23744`)
Contributor:

move this to the Sparse section

            with np.errstate(all='ignore'):
                for col in self.columns:
                    result[col] = self.f(self.obj[col].values)
            return result
Contributor:

as I said above, don't construct an empty DataFrame; rather use a dictionary (or just a list comprehension), then construct the DataFrame from it right before returning.

@JustinZhengBC (Contributor, Author) commented Nov 23, 2018:

Edit: just figured out I could pass a dict into the constructor to get the proper behaviour. Please disregard the rest of this post.

@jreback I can't use the constructor in that case because it stacks a list of Series horizontally and there's no axis option, so I used pd.concat. Is that okay?

>>> import pandas as pd
>>> s1 = pd.Series([1, 2])
>>> s2 = pd.Series([3, 4])
>>> pd.DataFrame([s1, s2])
   0  1
0  1  2
1  3  4
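
For contrast, the dict form mentioned in the edit keeps each Series as a column (column order follows dict insertion order on Python 3.6+):

>>> pd.DataFrame({"s1": s1, "s2": s2})
   s1  s2
0   1   3
1   2   4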

@JustinZhengBC (Contributor, Author):
Considering #23875, I think we will need a for loop in get_result to check for all the different dtypes. This PR only covers the sparse case, but the categorical case could be covered as well by adding:

elif is_categorical(self.obj.dtypes[col]):
    before = self.obj.dtypes[col].categories
    after = self.f(before)
    mapping = dict(zip(before, after))
    result = self.obj[col].replace(mapping).astype('category')
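
The same category-level transform could likely also be written against the public Categorical API instead of replace (a sketch, not part of this PR):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 1, 3], dtype="category")
# apply the ufunc to the (few) categories only; the codes stay intact
result = s.cat.rename_categories(np.exp(s.cat.categories))
print(result.dtype)  # category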

            return self.obj._constructor(data=results, index=self.index,
                                         columns=self.columns, copy=False)
Contributor:

You probably need to pass columns here to ensure that the columns are in the correct order (dict ordering isn't guaranteed before Python 3.6). Add a test where the columns aren't sorted (like ['b', 'a', 'c']) and verify that it fails on py35 and older.

@TomAugspurger (Contributor) commented Nov 26, 2018:

@jreback, general question: how firm should we be on preserving blockwise application of things? I don't think we should give that up yet (as the current PR does).

@jreback (Contributor) commented Nov 26, 2018:

@TomAugspurger I agree generally with your point about block-wise application. Sparse, however, I think is already a single block, so there should be no difference in this approach.

@TomAugspurger (Contributor):
My concern is about dataframes with non-sparse columns. Previously, for an all-float DataFrame, DataFrame.apply(np.abs) would (I think) call np.abs once. Now we'll be calling it once per column, even if there are no sparse columns.

@jreback (Contributor) commented Nov 26, 2018:

> My concern is about dataframes with non-sparse columns. Previously, for an all-float DataFrame, DataFrame.apply(np.abs) would (I think) call np.abs once. Now we'll be calling it once per column, even if there are no sparse columns.

this could actually dispatch to the block impl then (prob better).

@JustinZhengBC basically you can call

self._data.apply('apply', .....) I think

@TomAugspurger (Contributor):
Ah, I forgot we had Block.apply. Yes, this should hopefully work:

In [20]: df = pd.DataFrame({"A": [1, 2, 3], "B": pd.SparseArray([0, 1, 2])})

In [21]: df._data.apply('apply', func=np.sin)
Out[21]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
FloatBlock: slice(0, 1, 1), 1 x 3, dtype: float64
ExtensionBlock: slice(1, 2, 1), 1 x 3, dtype: Sparse[float64, 0.0]

@JustinZhengBC (Contributor, Author):
@jreback you're right, it works. I also switched up the columns in the test so order preservation is tested.

@jreback added this to the 0.24.0 milestone Nov 27, 2018
@jreback (Contributor) left a comment:

tiny comment. ping when pushed / green.


def test_apply_keep_sparse_dtype():
    # GH 23744
    expected = SparseDataFrame(np.array([[0, 1, 0], [0, 0, 0], [0, 0, 1]]),
Contributor:

can you call this sdf? I find the repeated assignments slightly confusing here.

                               columns=['b', 'a', 'c'], default_fill_value=1)
    result = DataFrame(expected)

    expected = expected.apply(np.exp)
Contributor:

expected = sdf.apply(...)

@JustinZhengBC (Contributor, Author):
@jreback green

@jreback merged commit 3922947 into pandas-dev:master Nov 27, 2018
@jreback (Contributor) commented Nov 27, 2018:

thanks @JustinZhengBC

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Successfully merging this pull request may close these issues.

BUG: DataFrame.apply looses sparse dtype