Skip to content

BUG #10228: segfault due to out-of-bounds in binning #10337

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

Garrett-R
Copy link
Contributor

Closes #10228. I also deleted some duplicated code while I was at it.

So, I wasn't sure if I should be including a unit test for this. This issue was a segfault, which was happening when you do:

s = pd.Series([], index=pd.DatetimeIndex([]), dtype=np.object)
s.resample('d', how='count')

so I suppose I could add this code into a test, and just verify it doesn't segfault, but that seemed like a bad idea. It would be testing for something that's undefined behavior and therefore be a non-deterministic test.

@jreback
Copy link
Contributor

jreback commented Jun 12, 2015

of course you need a test, otherwise how can you tell if your fix works? The test can be as simple as asserting that it runs.

@@ -9110,6 +9110,8 @@ def group_count_bin_float64(ndarray[float64_t, ndim=2] out,
ndarray[int64_t, ndim=2] nobs = np.zeros((out.shape[0], out.shape[1]),
dtype=np.int64)

if len(bins) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is by definition generated code, edits here will be lost. rather make the fix in src/generate_code.py, thenpython generate_code.py creates the generated.pyx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh crap, my bad! I did see the name, but then I looked at git history and saw what appeared to be people editing it (but I get it now).

Do you think it's worth adding a warning message at the top for new devs like NumPy does? Or should it be obvious from the name?

@Garrett-R
Copy link
Contributor Author

@jreback, sure I'll write a test. I have a piece of code that reliably segfaults for me:

ss = pd.Series([], index=pd.DatetimeIndex([]), dtype=np.object)
ss.resample('d', how='count')

That being said, this doesn't mean it'll reliably segfault for other machines since out-of-bounds access are undefined behavior. One way would be to re-enable the bounds-checking on these Cython functions just for the unit tests. (I actually posed this question on Programmers Stack Exchange and they seemed surprised that the disabling of bounds-checking was hard-coded rather than something that can be switched on and off)

@jreback
Copy link
Contributor

jreback commented Jun 14, 2015

@Garrett-R then that is a reasonable test. Even if it doesn't segfault on some platforms that is ok. Note that you can change the directive at run-time, see here, e.g. you would have a flag that is passed in to re-enable this. But to be honest it is overkill, and not necessary to tests with that. Since you have an example that fails, that can serve as a smoke tests (e.g. the test is simply running that code, if it doesn't segfault then the test passes).

@jreback
Copy link
Contributor

jreback commented Jun 14, 2015

@Garrett-R No objection to a separate PR for adding a warning to generated.pyx (and more importantly prob a note in internals.rst about how/what to change. thanks!

@jreback jreback added Bug Resample resample method labels Jun 14, 2015
# These were sometimes causing a segfault (for the functions with
# bounds-checking disabled) or an IndexError. We just run them to
# ensure they no longer do. (#10228)
index = pd.DatetimeIndex([])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually make a couple of loops here, iterate over all of the index types, then all of the possible dtypes (object,float,datetime if you are adventurous - only works for some of the hows), then all of the 'hows' and exhaustivly test the possibilities

I would actually define these functions in utils.testing (right below the make index) functions

def all_index_factory():
  return [ tm.makeIntIndex, tm.makeFloatIndex,
           tm.makeStringIndex, tm.makeUnicodeIndex,
           tm.makeDateIndex, tm.makePeriodIndex, 
           tm.makeTimedeltaIndex, tm.makeBoolIndex, 
           tm.makeCategoricalIndex]

def datetimelike_index_factory():
     return [  tm.makeDateIndex, tm.makePeriodIndex, 
               tm.makeTimedeltaIndex ]

obviously use the 2nd one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I ended up making these generators that returned the index instances, which I thought would keep the test cleaner. Let me know if you prefer to just return a list of functions though and I can change it.

@Garrett-R
Copy link
Contributor Author

@jreback, good call about adding more comprehensive testing. This helped me find some more out-of-bounds bugs (which were in algos.pyx).

I'll send a PR soon to supply a "DO NOT EDIT THIS GENERATED FILE...." warning for other fellow noobs.

@jreback jreback added this to the 0.17.0 milestone Jun 26, 2015
@jreback
Copy link
Contributor

jreback commented Jun 26, 2015

merged via 25fc49d

thanks!

nice fix.

btw if you'd like to update the docs w.r.t. the generate_code.pyx would be gr8. pls put in internals.rst

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Resample resample method
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants