BUG: don't assume series is length > 0 #19438
Conversation
```python
@pytest.mark.parametrize('deep,fill_values', [([True, False],
                                               [0, 1, np.nan, None])])
def test_memory_usage_deep(self, deep, fill_values):
    for fv in fill_values:
```
Use `product` to do this rather than have an embedded loop.
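For illustration, a minimal sketch of that suggestion, assuming `itertools.product` expands the combinations so each `(deep, fill_value)` pair becomes its own test case (parameter names mirror the test above; the body is elided):

```python
from itertools import product

import numpy as np
import pytest


@pytest.mark.parametrize('deep,fill_value',
                         product([True, False], [0, 1, np.nan, None]))
def test_memory_usage_deep(deep, fill_value):
    # pytest generates one test per (deep, fill_value) combination,
    # so the embedded loop over fill values goes away
    ...
```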
pandas/_libs/lib.pyx (Outdated)

```
@@ -67,7 +67,8 @@ def memory_usage_of_objects(ndarray[object, ndim=1] arr):
    cdef int64_t s = 0

    n = len(arr)
    for i from 0 <= i < n:
        # Hrm why was this 0 <= i < n
```
why do you think this is the issue?
This is ok. I think our definition of memory_usage should be slightly updated (in core/base.py), as it's meant for pure object dtypes, not sparse.
Should it be updated?

My understanding is that memory_usage for a SparseSeries should equal the memory_usage of the index plus the memory_usage of the SparseArray.

The original bug is honestly an edge case, because it doesn't fail if you have more than one dense element:

```python
import pandas as pd

ss = pd.SparseSeries(['a', 'b', 'c'], fill_value='a')
ss.memory_usage(deep=True)  # => no segfault: the dense values have len > 0

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')
ss.memory_usage(deep=True)  # => segfault: the dense values have len == 0
```

I'm not sure core/base.py should be updated. The problem with doing that is that I would have to do a length check before handing off to the lower-level pyx file. This fix gets rid of the possible segfault from accessing memory that isn't there.
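To make that expectation concrete, here is a hypothetical check; the exact equality, and that SparseArray grows a `memory_usage` method as this PR proposes, are my assumptions rather than anything quoted from the PR's tests:

```python
import pandas as pd

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')

# index bytes + sparse-values bytes; ss.values is the underlying SparseArray
expected = (ss.index.memory_usage(deep=True) +
            ss.values.memory_usage(deep=True))
assert ss.memory_usage(deep=True) == expected  # hypothetical equality
```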
Also, I meant to take out my thought-process comment ;) I'll do that now.

Also, looks like I have a test failure; I'll look into it.
I think I figured out the problem. One of the memory_usage defs needed to pass in only sp_values. We'll see how the tests chooch.
pandas/core/base.py (Outdated)

```
@@ -1069,12 +1069,18 @@ def memory_usage(self, deep=False):
        --------
        numpy.ndarray.nbytes
        """
        if hasattr(self.values, 'memory_usage'):
            return self.values.memory_usage(deep=deep)
        # Use sparse values if they exist for memory consumption
```
I would rather override this in SparseSeries/SparseArray.
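A rough sketch of such an override on SparseArray, in the spirit of the final commits below ("Define memory_usage on SparseArray"); the exact guard on object dtype is my assumption, not a quote of the merged code:

```python
from pandas._libs import lib
from pandas.core.dtypes.common import is_object_dtype


def memory_usage(self, deep=False):
    # Only the materialized sparse values (sp_values) are stored, so they
    # are all that should be measured; positions equal to fill_value cost
    # nothing.
    values = self.sp_values
    v = values.nbytes
    if deep and is_object_dtype(values):
        # per-element introspection only applies to object arrays
        v += lib.memory_usage_of_objects(values)
    return v
```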
pandas/_libs/lib.pyx (Outdated)

```
@@ -59,8 +59,14 @@ def memory_usage_of_objects(ndarray[object, ndim=1] arr):
    cdef Py_ssize_t i, n
    cdef int64_t s = 0

    # The problem here is that...
```
comments are not necessary
Codecov Report

```
@@            Coverage Diff             @@
##           master   #19438      +/-   ##
==========================================
- Coverage   91.62%   91.59%   -0.03%
==========================================
  Files         150      150
  Lines       48714    48725      +11
==========================================
- Hits        44633    44630       -3
- Misses       4081     4095      +14
```

Continue to review the full report at Codecov.
Ping: this is green, and I think I fixed everything pointed out.
pandas/core/sparse/array.py (Outdated)

```
        """
        values = self.sp_values
        if hasattr(values, 'memory_usage'):
```
I don't think you need this here as the values are always an ndarray
pandas/core/sparse/array.py (Outdated)

```
@@ -238,6 +239,41 @@ def kind(self):
        elif isinstance(self.sp_index, IntIndex):
            return 'integer'

    def memory_usage(self, deep=False):
```
can you make the doc-string generic in base and use a shared one here?
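For reference, a hedged sketch of the shared-docstring pattern pandas uses (`pandas.util._decorators.Appender` plus a `_shared_docs` dict); the docstring text here is illustrative, not the one that landed:

```python
from pandas.util._decorators import Appender

_shared_docs = dict()
_shared_docs['memory_usage'] = """
    Memory usage of the values.

    Parameters
    ----------
    deep : bool, default False
        Introspect the data deeply by interrogating object dtypes
        for system-level memory consumption.

    Returns
    -------
    bytes used : int

    See Also
    --------
    numpy.ndarray.nbytes
    """


class SparseArray:
    # the same docstring can be appended in base and here, keeping one copy
    @Appender(_shared_docs['memory_usage'])
    def memory_usage(self, deep=False):
        ...
```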
Thanks @hexgnu, keep 'em coming!
closes pandas-dev#19368

Author: Matthew Kirk <[email protected]>

Closes pandas-dev#19438 from hexgnu/segfault_memory_usage and squashes the following commits:

```
f9433d8 [Matthew Kirk] Use shared docstring and get rid of if condition
4ead141 [Matthew Kirk] Move whatsnew doc to Sparse
ae9f74d [Matthew Kirk] Revert base.py
cdd4141 [Matthew Kirk] Fix linting error
93a0c3d [Matthew Kirk] Merge remote-tracking branch 'upstream/master' into segfault_memory_usage
207bc74 [Matthew Kirk] Define memory_usage on SparseArray
21ae147 [Matthew Kirk] FIX: revert change to lib.pyx
3f52a44 [Matthew Kirk] Ah ha I think I got it
5e59e9c [Matthew Kirk] Use range over 0 <= for loops
e251587 [Matthew Kirk] Fix failing test with indexing
27df317 [Matthew Kirk] Merge remote-tracking branch 'upstream/master' into segfault_memory_usage
7fdd03e [Matthew Kirk] Take out comment and use product
6bd6ddd [Matthew Kirk] BUG: don't assume series is length > 0
```
```
git diff upstream/master -u -- "*.py" | flake8 --diff
```