BUG: don't assume series is length > 0 #19438


Closed · wants to merge 13 commits

Conversation

hexgnu (Contributor) commented Jan 29, 2018

@pytest.mark.parametrize('deep,fill_values', [([True, False],
                                               [0, 1, np.nan, None])])
def test_memory_usage_deep(self, deep, fill_values):
    for fv in fill_values:
Contributor:

use product to do this rather than have an embedded loop
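
What the reviewer is suggesting, sketched as a standalone test (assumed shape, not the merged patch): parametrize over the cartesian product of the two option lists, so each deep/fill_value combination becomes its own test case instead of an in-test loop.

import itertools

import numpy as np
import pytest


@pytest.mark.parametrize('deep,fill_value',
                         list(itertools.product([True, False],
                                                [0, 1, np.nan, None])))
def test_memory_usage_deep(deep, fill_value):
    ...  # body unchanged; now runs once per (deep, fill_value) pair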

@@ -67,7 +67,8 @@ def memory_usage_of_objects(ndarray[object, ndim=1] arr):
    cdef int64_t s = 0

    n = len(arr)
    for i from 0 <= i < n:
        # Hrm why was this 0 <= i < n
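
For context: `for i from 0 <= i < n` is legacy Cython loop syntax, equivalent to a plain range loop; commit 5e59e9c in this PR later rewrites these loops in that style, roughly:

for i in range(n):
    s += arr[i].__sizeof__()  # assumed loop body: sum per-object sizes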
Contributor:

why do you think this is the issue?

Contributor:

This is ok. I think our definition of memory_usage should be slightly updated (in core/base.py), as it's meant for pure object dtypes, not sparse.

hexgnu (Contributor, Author):

Should it be updated?

My understanding is that memory_usage for a SparseSeries should equal the memory_usage of the index plus the memory_usage of the SparseArray.

The original bug is honestly an edge case, because it doesn't fail as long as there is at least one dense element:

import pandas as pd

ss = pd.SparseSeries(['a', 'b', 'c'], fill_value='a')
ss.memory_usage(deep=True)  # => no segfault, because the dense values have len > 0

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')
ss.memory_usage(deep=True)  # => segfault, because the dense values have len == 0

I'm not sure core/base.py should be updated. The problem with doing that is that I would have to add a length check before handing off to the lower-level pyx file. This fix gets rid of the possible segfault caused by accessing memory that isn't there.
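
The expectation above, stated as code: a sketch against the 0.23-era API after this fix, assuming Series.memory_usage(index=True) adds the index's usage to the values' usage (SparseSeries has since been removed from pandas).

import pandas as pd

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')

# Expected decomposition: total = index usage + SparseArray usage.
total = ss.memory_usage(index=True, deep=True)
parts = (ss.index.memory_usage(deep=True) +
         ss.values.memory_usage(deep=True))
assert total == parts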

hexgnu (Contributor, Author):

Also I meant to take out my thought process comment ;). I'll do that now

hexgnu commented Jan 29, 2018

Also, looks like I have a test failure; I'll look into it.

hexgnu commented Feb 2, 2018

I think I figured out the problem. One of the memory_usage defs needed to pass in only sp_values.

We'll see how the tests chooch.
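
Why the length-0 path matters here: when every element equals the fill value, nothing is actually stored, so sp_values is empty even though the series has dense length 3. A sketch against the 0.22-era SparseSeries API:

import pandas as pd

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')
len(ss)            # => 3, the dense length
len(ss.sp_values)  # => 0, no non-fill values are stored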

@@ -1069,12 +1069,18 @@ def memory_usage(self, deep=False):
    --------
    numpy.ndarray.nbytes
    """
    if hasattr(self.values, 'memory_usage'):
        return self.values.memory_usage(deep=deep)
    # Use sparse values if they exist for memory consumption
Contributor:

I would rather override this in SparseSeries/SparseArray
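
That is the shape the merged patch ends up taking: memory_usage defined on SparseArray over sp_values. A minimal sketch (the guard conditions in the merged commit may differ; is_object_dtype is pandas' public dtype check, and lib.memory_usage_of_objects is the pyx helper shown above):

from pandas._libs import lib
from pandas.api.types import is_object_dtype


def memory_usage(self, deep=False):
    # Only the stored (non-fill) values occupy memory, so measure
    # sp_values rather than the dense representation.
    values = self.sp_values
    v = values.nbytes
    if deep and is_object_dtype(values):
        v += lib.memory_usage_of_objects(values)
    return v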

@@ -59,8 +59,14 @@ def memory_usage_of_objects(ndarray[object, ndim=1] arr):
    cdef Py_ssize_t i, n
    cdef int64_t s = 0

    # The problem here is that...
Contributor:

comments are not necessary

jreback added the Bug and Sparse (Sparse Data Type) labels Feb 2, 2018
codecov bot commented Feb 5, 2018

Codecov Report

Merging #19438 into master will decrease coverage by 0.02%.
The diff coverage is 90%.


@@            Coverage Diff             @@
##           master   #19438      +/-   ##
==========================================
- Coverage   91.62%   91.59%   -0.03%     
==========================================
  Files         150      150              
  Lines       48714    48725      +11     
==========================================
- Hits        44633    44630       -3     
- Misses       4081     4095      +14
Flag        Coverage Δ
#multiple   89.96% <90%> (-0.03%) ⬇️
#single     41.75% <50%> (ø) ⬆️

Impacted Files                  Coverage Δ
pandas/core/base.py             96.76% <ø> (ø) ⬆️
pandas/core/sparse/array.py     91.4% <90%> (-0.08%) ⬇️
pandas/plotting/_converter.py   65.22% <0%> (-1.74%) ⬇️
pandas/util/testing.py          83.64% <0%> (-0.21%) ⬇️
pandas/core/groupby.py          92.17% <0%> (+0.02%) ⬆️
pandas/core/strings.py          98.31% <0%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de39a15...f9433d8.

hexgnu commented Feb 5, 2018

ping: this is green, and I think I've fixed everything pointed out.

"""

values = self.sp_values
if hasattr(values, 'memory_usage'):
Contributor:

I don't think you need this here as the values are always an ndarray

@@ -238,6 +239,41 @@ def kind(self):
        elif isinstance(self.sp_index, IntIndex):
            return 'integer'

    def memory_usage(self, deep=False):
Contributor:

can you make the doc-string generic in base and use a shared one here?
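
The shared-docstring pattern being requested, sketched with pandas' Appender decorator (a sketch only; in the merged commit the generic template lives in pandas/core/base.py and is reused by SparseArray):

from pandas.util._decorators import Appender

_shared_docs = {'memory_usage': """
    Memory usage of the %(klass)s

    Parameters
    ----------
    deep : bool
        Introspect the data deeply, interrogate
        `object` dtypes for system-level memory consumption.

    Returns
    -------
    Total bytes consumed.
    """}


class SparseArray(object):
    @Appender(_shared_docs['memory_usage'] % {'klass': 'SparseArray'})
    def memory_usage(self, deep=False):
        ...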

jreback added this to the 0.23.0 milestone Feb 6, 2018
jreback closed this in a01f74c Feb 6, 2018
jreback commented Feb 6, 2018

thanks @hexgnu keep em coming!

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
closes pandas-dev#19368

Author: Matthew Kirk <[email protected]>

Closes pandas-dev#19438 from hexgnu/segfault_memory_usage and squashes the following commits:

f9433d8 [Matthew Kirk] Use shared docstring and get rid of if condition
4ead141 [Matthew Kirk] Move whatsnew doc to Sparse
ae9f74d [Matthew Kirk] Revert base.py
cdd4141 [Matthew Kirk] Fix linting error
93a0c3d [Matthew Kirk] Merge remote-tracking branch 'upstream/master' into segfault_memory_usage
207bc74 [Matthew Kirk] Define memory_usage on SparseArray
21ae147 [Matthew Kirk] FIX: revert change to lib.pyx
3f52a44 [Matthew Kirk] Ah ha I think I got it
5e59e9c [Matthew Kirk] Use range over 0 <= for loops
e251587 [Matthew Kirk] Fix failing test with indexing
27df317 [Matthew Kirk] Merge remote-tracking branch 'upstream/master' into segfault_memory_usage
7fdd03e [Matthew Kirk] Take out comment and use product
6bd6ddd [Matthew Kirk] BUG: don't assume series is length > 0
Labels: Bug, Sparse (Sparse Data Type)

Successfully merging this pull request may close these issues:

Python crashes when executing memory_usage(deep=True) on a sparse series

2 participants