BUG: don't assume series is length > 0 #19438


Closed · wants to merge 13 commits

Conversation

hexgnu (Contributor) commented Jan 29, 2018

@pytest.mark.parametrize('deep,fill_values', [([True, False],
                                               [0, 1, np.nan, None])])
def test_memory_usage_deep(self, deep, fill_values):
    for fv in fill_values:
Contributor:

use product to do this rather than have an embedded loop
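
What the reviewer is suggesting, sketched as a standalone test (assumed shape, not the merged patch): parametrize over the cartesian product of the two option lists, so each deep/fill_value combination becomes its own test case instead of an in-test loop.

import itertools

import numpy as np
import pytest


@pytest.mark.parametrize('deep,fill_value',
                         list(itertools.product([True, False],
                                                [0, 1, np.nan, None])))
def test_memory_usage_deep(deep, fill_value):
    ...  # body unchanged; now runs once per (deep, fill_value) pair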

@@ -67,7 +67,8 @@ def memory_usage_of_objects(ndarray[object, ndim=1] arr):
    cdef int64_t s = 0

    n = len(arr)
    for i from 0 <= i < n:
        # Hrm why was this 0 <= i < n
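
For context: `for i from 0 <= i < n` is legacy Cython loop syntax, equivalent to a plain range loop; commit 5e59e9c in this PR later rewrites these loops in that style, roughly:

for i in range(n):
    s += arr[i].__sizeof__()  # assumed loop body: sum per-object sizes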
Contributor:

why do you think this is the issue?

Contributor:

This is ok. I think our definition of memory_usage should be slightly updated (in core/base.py), as it's meant for pure object dtypes, not sparse.

hexgnu (Contributor, Author):

Should it be updated?

My understanding is that memory_usage for a SparseSeries should equal the memory_usage of the index plus the memory_usage of the SparseArray.

The original bug is honestly an edge case, because it doesn't fail as long as there is at least one dense element:

import pandas as pd

ss = pd.SparseSeries(['a', 'b', 'c'], fill_value='a')
ss.memory_usage(deep=True)  # => no segfault, because the dense values have len > 0

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')
ss.memory_usage(deep=True)  # => segfault, because the dense values have len == 0

I'm not sure core/base.py should be updated. The problem with doing that is that I would have to add a length check before handing off to the lower-level pyx file. This fix gets rid of the possible segfault caused by accessing memory that isn't there.
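
The expectation above, stated as code: a sketch against the 0.23-era API after this fix, assuming Series.memory_usage(index=True) adds the index's usage to the values' usage (SparseSeries has since been removed from pandas).

import pandas as pd

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')

# Expected decomposition: total = index usage + SparseArray usage.
total = ss.memory_usage(index=True, deep=True)
parts = (ss.index.memory_usage(deep=True) +
         ss.values.memory_usage(deep=True))
assert total == parts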

hexgnu (Contributor, Author):

Also I meant to take out my thought process comment ;). I'll do that now

hexgnu commented Jan 29, 2018

Also, looks like I have a test failure; I'll look into it.

hexgnu commented Feb 2, 2018

I think I figured out the problem. One of the memory_usage defs needed to pass in only sp_values.

We'll see how the tests chooch.
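
Why the length-0 path matters here: when every element equals the fill value, nothing is actually stored, so sp_values is empty even though the series has dense length 3. A sketch against the 0.22-era SparseSeries API:

import pandas as pd

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')
len(ss)            # => 3, the dense length
len(ss.sp_values)  # => 0, no non-fill values are stored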

@@ -1069,12 +1069,18 @@ def memory_usage(self, deep=False):
    --------
    numpy.ndarray.nbytes
    """
    if hasattr(self.values, 'memory_usage'):
        return self.values.memory_usage(deep=deep)
    # Use sparse values if they exist for memory consumption
Contributor:

I would rather override this in SparseSeries/SparseArray
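
That is the shape the merged patch ends up taking: memory_usage defined on SparseArray over sp_values. A minimal sketch (the guard conditions in the merged commit may differ; is_object_dtype is pandas' public dtype check, and lib.memory_usage_of_objects is the pyx helper shown above):

from pandas._libs import lib
from pandas.api.types import is_object_dtype


def memory_usage(self, deep=False):
    # Only the stored (non-fill) values occupy memory, so measure
    # sp_values rather than the dense representation.
    values = self.sp_values
    v = values.nbytes
    if deep and is_object_dtype(values):
        v += lib.memory_usage_of_objects(values)
    return v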

@@ -59,8 +59,14 @@ def memory_usage_of_objects(ndarray[object, ndim=1] arr):
    cdef Py_ssize_t i, n
    cdef int64_t s = 0

    # The problem here is that...
Contributor:

comments are not necessary

jreback added the Bug and Sparse (Sparse Data Type) labels Feb 2, 2018
codecov bot commented Feb 5, 2018

Codecov Report

Merging #19438 into master will decrease coverage by 0.02%.
The diff coverage is 90%.


@@            Coverage Diff             @@
##           master   #19438      +/-   ##
==========================================
- Coverage   91.62%   91.59%   -0.03%     
==========================================
  Files         150      150              
  Lines       48714    48725      +11     
==========================================
- Hits        44633    44630       -3     
- Misses       4081     4095      +14
Flag        Coverage Δ
#multiple   89.96% <90%> (-0.03%) ⬇️
#single     41.75% <50%> (ø) ⬆️

Impacted Files                  Coverage Δ
pandas/core/base.py             96.76% <ø> (ø) ⬆️
pandas/core/sparse/array.py     91.4% <90%> (-0.08%) ⬇️
pandas/plotting/_converter.py   65.22% <0%> (-1.74%) ⬇️
pandas/util/testing.py          83.64% <0%> (-0.21%) ⬇️
pandas/core/groupby.py          92.17% <0%> (+0.02%) ⬆️
pandas/core/strings.py          98.31% <0%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de39a15...f9433d8.

hexgnu commented Feb 5, 2018

ping: this is green, and I think I've fixed everything pointed out.

"""

values = self.sp_values
if hasattr(values, 'memory_usage'):
Contributor:

I don't think you need this here as the values are always an ndarray

@@ -238,6 +239,41 @@ def kind(self):
        elif isinstance(self.sp_index, IntIndex):
            return 'integer'

    def memory_usage(self, deep=False):
Contributor:

can you make the doc-string generic in base and use a shared one here?
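
The shared-docstring pattern being requested, sketched with pandas' Appender decorator (a sketch only; in the merged commit the generic template lives in pandas/core/base.py and is reused by SparseArray):

from pandas.util._decorators import Appender

_shared_docs = {'memory_usage': """
    Memory usage of the %(klass)s

    Parameters
    ----------
    deep : bool
        Introspect the data deeply, interrogate
        `object` dtypes for system-level memory consumption.

    Returns
    -------
    Total bytes consumed.
    """}


class SparseArray(object):
    @Appender(_shared_docs['memory_usage'] % {'klass': 'SparseArray'})
    def memory_usage(self, deep=False):
        ...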

jreback added this to the 0.23.0 milestone Feb 6, 2018
jreback closed this in a01f74c Feb 6, 2018
jreback commented Feb 6, 2018

thanks @hexgnu keep em coming!

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
closes pandas-dev#19368

Author: Matthew Kirk <[email protected]>

Closes pandas-dev#19438 from hexgnu/segfault_memory_usage and squashes the following commits:

f9433d8 [Matthew Kirk] Use shared docstring and get rid of if condition
4ead141 [Matthew Kirk] Move whatsnew doc to Sparse
ae9f74d [Matthew Kirk] Revert base.py
cdd4141 [Matthew Kirk] Fix linting error
93a0c3d [Matthew Kirk] Merge remote-tracking branch 'upstream/master' into segfault_memory_usage
207bc74 [Matthew Kirk] Define memory_usage on SparseArray
21ae147 [Matthew Kirk] FIX: revert change to lib.pyx
3f52a44 [Matthew Kirk] Ah ha I think I got it
5e59e9c [Matthew Kirk] Use range over 0 <= for loops
e251587 [Matthew Kirk] Fix failing test with indexing
27df317 [Matthew Kirk] Merge remote-tracking branch 'upstream/master' into segfault_memory_usage
7fdd03e [Matthew Kirk] Take out comment and use product
6bd6ddd [Matthew Kirk] BUG: don't assume series is length > 0
Labels: Bug, Sparse (Sparse Data Type)

Successfully merging this pull request may close these issues:

Python crashes when executing memory_usage(deep=True) on a sparse series

2 participants