intermittent segfault after take error #2892


Closed
dsm054 opened this issue Feb 18, 2013 · 40 comments
@dsm054
Contributor

dsm054 commented Feb 18, 2013

Was playing around and managed to get a segfault, in several versions. It's a little intermittent, and it may have to do with how things are left after an error is encountered. The same code without the faulty take line doesn't crash for me.

>>> import pandas as pd, sys
>>> sys.version
'2.7.3 (default, Aug  1 2012, 05:16:07) \n[GCC 4.6.3]'
>>> pd.__version__
'0.11.0.dev-14a04dd'
>>> df = pd.DataFrame({'A': {1.0: 0.0, 2.0: 6.0, 3.0: 12.0}, 'C': {1.0: 2.0, 2.0: 8.0, 3.0: 14.0}, 'B': {1.0: 1.0, 2.0: 7.0, 3.0: 13.0}})
>>> df
    A   B   C
1   0   1   2
2   6   7   8
3  12  13  14
>>> 
>>> df.take(df.unstack())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0.dev_14a04dd-py2.7-linux-i686.egg/pandas/core/frame.py", line 2951, in take
    new_index = self.index.take(indices)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0.dev_14a04dd-py2.7-linux-i686.egg/pandas/core/index.py", line 416, in take
    taken = self.view(np.ndarray).take(indexer)
IndexError: index out of range for array
>>> df.unstack()
A  2     0
   2     6
   3    12
B  2     1
   2     7
   3    13
C  2     2
   2     8
   3    14
dtype: float64
>>> df.take(df.unstack())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0.dev_14a04dd-py2.7-linux-i686.egg/pandas/core/frame.py", line 2951, in take
    new_index = self.index.take(indices)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0.dev_14a04dd-py2.7-linux-i686.egg/pandas/core/index.py", line 416, in take
    taken = self.view(np.ndarray).take(indexer)
IndexError: index out of range for array
>>> df.unstack()
Segmentation fault (core dumped)
@stephenwlin
Contributor

Not 100% sure, but it looks like a numpy issue:

In [1]: import pandas as pd, sys

In [2]: import numpy as np

In [3]: df=pd.DataFrame({'A': {1.0: 0.0, 2.0: 6.0, 3.0: 12.0},
   ...:                  'C': {1.0: 2.0, 2.0: 8.0, 3.0: 14.0},
   ...:                  'B': {1.0: 1.0, 2.0: 7.0, 3.0: 13.0}})

In [4]: i = df.index.view(np.ndarray)

In [5]: u = df.unstack().astype(np.int_).view(np.ndarray)

In [6]: i
Out[6]: array([1.0, 2.0, 3.0], dtype=object)

In [7]: i.take(u)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-7-8293b97d15cc> in <module>()
----> 1 i.take(u)

IndexError: index out of range for array

In [8]: i
Segmentation fault (core dumped)

Line 7 only involves numpy types, although the underlying buffers have been touched by pandas (so it's possible it's happening because pandas violated a contract somewhere). i.take doesn't seem to have any reason to affect the value of i though, since it returns a new object instead of modifying itself: I'm guessing numpy's implementation is modifying i for some optimization purpose but it wasn't coded in an exception-safe manner...leaving i in some invalid state.
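As a stopgap, the buggy error path could be avoided entirely by validating the indices before ndarray.take ever sees them. A rough sketch (safe_take is a hypothetical name, not actual pandas code):

```python
import numpy as np

def safe_take(arr, indexer):
    """Bounds-check an indexer before delegating to ndarray.take.

    If validation happens up front, ndarray.take never raises, so the
    error path that apparently corrupts object-dtype arrays is never hit.
    """
    indexer = np.asarray(indexer, dtype=np.int_)
    n = arr.shape[0]
    if ((indexer >= n) | (indexer < -n)).any():
        raise IndexError("index out of range for array of length %d" % n)
    return arr.take(indexer)
```

With this, the out-of-range call raises cleanly and the original array stays intact afterwards.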

Will dig in a bit more.

@dsm054
Contributor Author

dsm054 commented Feb 18, 2013

Note that the unstack results after the first take are already off (it shouldn't be 2,2,3). On a different system I can get

>>> df.unstack()
A  1358910081     0
   2              6
   3             12
B  1358910081     1
   2              7
   3             13
C  1358910081     2
   2              8
   3             14

instead.

@stephenwlin
Contributor

Changing line 5 to u = np.asarray(df.unstack().astype(np.int_).copy()).copy() still results in a segfault, but changing line 4 to either i = np.asarray(df.index.copy()) or i = np.asarray(df.index).copy() seems to make it go away. So it seems either Index is violating some contract in subclassing ndarray, ndarray.take is buggy, or both...

EDIT: actually, it seems like changing line 4 to either option above still segfaults (at least for me), but it requires multiple calls to ndarray.take to do so.

@stephenwlin
Contributor

@dsm054 yes, it looks like the index contents are being affected by ndarray.take, which seems like a numpy optimization gone awry (since ndarray.take doesn't semantically mutate its self argument)

@stephenwlin
Contributor

definitely a numpy bug (here's a reproduction with dev numpy)

In [1]: import numpy as np

In [2]: u = np.asarray([ 0, 6, 12, 1, 7, 13, 2, 8, 14])

In [3]: i = np.asarray([1.0, 2.0, 3.0]).astype(object)

In [4]: i
Out[4]: array([1.0, 2.0, 3.0], dtype=object)

In [5]: i.take(u)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-8293b97d15cc> in <module>()
----> 1 i.take(u)

IndexError: index 6 is out of bounds for axis 0 with size 3

In [6]: i
Segmentation fault (core dumped)

@dsm054
Contributor Author

dsm054 commented Feb 18, 2013

Should we close this, then, and bump it upstream?

@stephenwlin
Contributor

well, maybe pandas should work around it by checking index bounds itself first for object arrays (at some minor efficiency cost...)? in which case we might want to leave the issue open for that.

it might be important since Index arrays are usually object-based (unless specialized via a subclass)
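Concretely, the check could live in the subclass itself. A minimal sketch of an Index-like ndarray subclass that pre-checks bounds before calling ndarray.take (SafeIndex is illustrative only, not the pandas Index implementation):

```python
import numpy as np

class SafeIndex(np.ndarray):
    """Index-like ndarray subclass that checks index bounds itself
    before delegating to ndarray.take (illustrative sketch only)."""

    def take(self, indexer, axis=None, out=None, mode='raise'):
        indexer = np.asarray(indexer, dtype=np.int_)
        n = len(self)
        if ((indexer >= n) | (indexer < -n)).any():
            raise IndexError("indices out of bounds for length %d" % n)
        return np.ndarray.take(self, indexer, axis=axis, out=out, mode=mode)

# an object-dtype index like the one that triggers the numpy bug
idx = np.array([1.0, 2.0, 3.0], dtype=object).view(SafeIndex)
```

An out-of-range indexer now raises before numpy's buggy error path can run, so the underlying buffer is never left in an invalid state.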

@wesm
Member

wesm commented Feb 19, 2013

why don't you replace the usage of ndarray.take in Index.take with take_nd (which should not corrupt memory)?

@stephenwlin
Contributor

yeah, that works too...thanks! (it might be marginally faster too because ndarray.take is coded for the general nd-case but we have cython specializations for most 1-d cases)

@wesm
Member

wesm commented Mar 12, 2013

What's the status here?

@stephenwlin
Contributor

sorry for the lack of update, this turned out to be somewhat of a rabbit hole. Basically, we're exposed to this same numpy bug everywhere we call ndarray.take with user-supplied indices, which turns out to be a lot of places. It doesn't make much sense to convert just one call arbitrarily, so I started doing a blanket conversion...but that turns out to be the wrong thing, because ndarray.take has different semantics for negative indices than our cythonic takes (i.e. the normal from-the-end semantics, rather than -1 being a dummy value for fill behavior). Lots of tests were broken in subtle ways before I realized the issue, and regardless it's unclear whether anyone is already depending on the normal semantics (since there's no documentation either way that this is officially supported, afaict).

i think the right thing to do is probably the following:

  1. audit the code and try to disambiguate uses of user-supplied indices and algorithm-generated indices throughout
  2. convert the former to either a bug-free cythonic take that does normal negative indices, or a bounds-checking wrapper around ndarray.take
  3. (optional) if new cythonic take is used, then get rid of the whole back and forth "platform int <-> int64" conversion mess, which is basically only there to support ndarray.take.

but that's a pretty big change...i don't have free cycles for it right now but if someone else wants to take a look they're free to
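For reference, the semantic difference in point 2 is easy to demonstrate with plain numpy: ndarray.take treats -1 as "last element", while the cythonic takes described above treat -1 as a missing-value marker. A rough pure-numpy emulation of the fill behavior (take_fill is a hypothetical helper, not the pandas implementation):

```python
import numpy as np

arr = np.array([10.0, 20.0, 30.0])
indexer = np.array([0, 2, -1])

# ndarray.take: -1 means "last element" (normal Python-style indexing)
wrapped = arr.take(indexer)

# the cython takes instead treat -1 as a dummy "missing" marker; a rough
# pure-numpy emulation of that fill behavior:
def take_fill(values, indexer, fill_value=np.nan):
    mask = indexer == -1
    safe = np.where(mask, 0, indexer)  # keep the sentinel out of indexing
    out = values.take(safe).astype(np.float64)
    out[mask] = fill_value
    return out

filled = take_fill(arr, indexer)
```

The two results differ only in the last position: wrapped ends with the last element, filled ends with NaN, which is exactly the ambiguity that broke the blanket conversion.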

@stephenwlin
Contributor

(btw, in case someone picks this up...there's no bounds checking in take_nd in most cases, which might be ok if we can ensure that no user-supplied indices are ever passed to it, but maybe it should be addressed to be safe; it's probably too slow to turn on cython's built-in bounds check, though, since that checks every access. A better approach would be to add the minimal checks manually, for a row/column of calls at a time.)

@jreback
Contributor

jreback commented Mar 12, 2013

this is the change I made in _ixs in frame.py (this was the old icol), as negative user-supplied indices were handled incorrectly (the change was all wrapped up in the .iloc changes). I am sure, as @stephenwlin points out, there are more user-facing cases like this (probably not too many, and most probably hit this code block anyhow)

can anyone think of a place where there are user-facing possibilities of negative indices except:

  • _ixs/icol
  • frame/take

other uses of negative indices denote labels (or hit _ixs) AFAICT
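The negative-index conversion at those entry points amounts to something like the following sketch (sanitize_indexer is a hypothetical name, presumably resembling what _ixs now does):

```python
import numpy as np

def sanitize_indexer(indexer, length):
    """Normalize user-facing negative indices (-1 meaning the last row)
    to their non-negative equivalents, raising on anything out of range.

    Hypothetical helper: after this, the result is safe to hand to
    internal take routines that reserve -1 as a fill marker.
    """
    indexer = np.asarray(indexer, dtype=np.int_)
    out = np.where(indexer < 0, indexer + length, indexer)
    if ((out < 0) | (out >= length)).any():
        raise IndexError("indices out of bounds for axis of length %d" % length)
    return out
```

For example, on a frame of length 8 an indexer of [0, 1, -1] becomes [0, 1, 7] before being passed to the internal take.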

@ghost

ghost commented Mar 12, 2013

no bounds checking occasionally bites with a kindly segfault: #3011, #2775/#2778.
are all those cython functions really in the inner loop?

@stephenwlin
Contributor

well, if all the indices are generated internally and the algorithms that generate them are not buggy, then it's technically superfluous to check--there's just no guarantee that that's the case

@ghost

ghost commented Mar 12, 2013

obviously that wasn't the case in those examples.

@jreback
Contributor

jreback commented Mar 12, 2013

so why don't we do negative index conversion in frame/take, make that the only user-facing entry point (and callable from _ixs), and then @stephenwlin can fix the take bugs independently? could also have a flag (for safety's sake?)

@stephenwlin
Contributor

@y-p of course :) just saying there's a case for why it wasn't there to begin with (i didn't take it out, it was turned off before I started working on it)

@ghost

ghost commented Mar 12, 2013

generate_code.py uses @cython.boundscheck(False) across the board. maybe
this can be revamped with the conversion to fused types.

@stephenwlin
Contributor

@jreback maybe, i don't know how many user-facing functions there are to look at, actually; it's a bit hard to tell what's user-facing and what isn't in python sometimes...it might be less of a problem than i think it is

@stephenwlin
Contributor

@y-p i don't think we should pay for per-access bounds checking though, that'll be O(prod(shape)) overhead, unless cython is smart enough to factor out the checks

@ghost

ghost commented Mar 12, 2013

Wouldn't it be a function of access count rather than strictly the shape of the array? probably most
cython code does do significant work per access, so point taken in any case.

I'm just thinking of reducing bug surface area, perhaps integrating a boundscheck preamble into the
cython code would be better.

@jreback
Contributor

jreback commented Mar 12, 2013

So user-facing take is broken as well for negative indices (because take_nd doesn't like negative indices), and the two cases below differ in whether the internals method take or the indexers' take is called:

In [5]: df = pd.DataFrame(np.random.rand(8,3))

In [6]: df.take([0,1,-1])
Out[6]: 
          0         1         2
0  0.610429  0.185237  0.174880
1  0.059846  0.290267  0.012082
7       NaN       NaN       NaN

In [7]: df
Out[7]: 
          0         1         2
0  0.610429  0.185237  0.174880
1  0.059846  0.290267  0.012082
2  0.719515  0.322478  0.953638
3  0.450850  0.374526  0.781923
4  0.588323  0.951343  0.995791
5  0.385074  0.276861  0.720896
6  0.149348  0.553080  0.652363
7  0.554311  0.635818  0.148375

In [8]: df.take([0,1,-1],axis=1)
Out[8]: 
          0         1   2
0  0.610429  0.185237 NaN
1  0.059846  0.290267 NaN
2  0.719515  0.322478 NaN
3  0.450850  0.374526 NaN
4  0.588323  0.951343 NaN
5  0.385074  0.276861 NaN
6  0.149348  0.553080 NaN
7  0.554311  0.635818 NaN

but it works for mixed dtypes

In [9]: df['test'] = 'foo'

In [10]: df.take([0,1,-1],axis=1)
Out[10]: 
          0         1 test
0  0.610429  0.185237  foo
1  0.059846  0.290267  foo
2  0.719515  0.322478  foo
3  0.450850  0.374526  foo
4  0.588323  0.951343  foo
5  0.385074  0.276861  foo
6  0.149348  0.553080  foo
7  0.554311  0.635818  foo

but not for mixed dtypes with axis=0

In [3]: df = pd.DataFrame(dict(A = np.random.rand(8), B = 'foo'), index=range(8))

In [4]: df.take([-1],axis=0)
Exception: Indices must be nonzero and less than the axis length

@stephenwlin
Contributor

so should negative indices be allowed everywhere where the user can supply indices, or just through .ix and friends?

anyway, regardless of what we allow or don't allow, we need to make sure any user-supplied indices are sanitized before being passed into functions that assume properly bounded indices, so disambiguating the two is important.

@y-p maybe just add an option to generate.py to add bounds checking as a debug tool? also, i'm pretty sure it's only the take_* functions that are a potential problem; there's almost certainly no user-supplied indices going into pad_*, backfill_*, etc.

@ghost

ghost commented Mar 12, 2013

Sure, but that's useful as a way to track down segfaults after users report them.

sanitizing is good, but it's probably less effort to "head them off at the pass" in the
cython function preamble than to track down all the code paths that lead there.

#3011 was indeed groupby_int64 delegating to take_1d in numpy_helper.

@stephenwlin
Contributor

ok, i guess that's ok, but if it's just take_* etc., we can at least minimize the performance hit by writing the bounds checks manually so they don't repeat checks unnecessarily, rather than relying on the cython ones.

@ghost

ghost commented Mar 12, 2013

+1

@stephenwlin
Contributor

oh actually, i guess that's what you meant by doing it in the preamble...yeah that's fine; actually, you can be a bit more efficient than that (for the normal case) by doing it as you iterate, instead of iterating twice...i'll do that for now until we resolve the semantics/sanitization issue

@ghost

ghost commented Mar 12, 2013

In #3011 take_1d was in the inner loop, and it's really the caller that should have done the
check in this case. Though it was another cython function.

Probably need to do case-by-case unfortunately.

@stephenwlin
Contributor

@y-p yeah, in take_1d there's only one loop so it needs to check every index anyway, but take_2d_* etc. can do it column-by-column or row-by-row during the outer loop
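In other words, for the 2-d takes the check can be hoisted out of the inner copy loop. An illustrative pure-Python version of the pattern (the real implementations are cython specializations; this function name and shape are assumptions for the sketch):

```python
import numpy as np

def take_2d_axis0(values, indexer, fill_value=np.nan):
    """Take rows from a 2-d array, validating each requested row index
    once in the outer loop rather than per element in the inner copy.

    Treats -1 as a fill sentinel, matching the cython takes discussed
    above (illustrative sketch, not pandas code).
    """
    n, k = values.shape
    out = np.empty((len(indexer), k), dtype=np.float64)
    for i, idx in enumerate(indexer):
        if idx == -1:                 # fill sentinel
            out[i, :] = fill_value
        elif 0 <= idx < n:            # one check covers the whole row copy
            out[i, :] = values[idx, :]
        else:
            raise IndexError("row index %d out of bounds" % idx)
    return out
```

The cost is one comparison per requested row instead of one per copied element, which is why the 2-d case is cheaper to guard than take_1d.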

@stephenwlin
Contributor

anyway, it's still an open question where we want to allow negative indexing. it'll probably be simpler to only allow it through .ix and friends, to reduce the number of cases we need to support and test...and with bounds checks in the cython take implementations we can probably get away with not sanitizing elsewhere (although the ideal would be to sanitize in any user-facing function and avoid the bounds checks internally; not sure how attainable that is at this point without a thorough review)

@wesm
Member

wesm commented Mar 12, 2013

The -1 business was done as a matter of convenience. That whole mess should be replaced with a proper INT64_MIN or INT32_MIN depending on the platform / byte width. Sigh
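The problem with -1 is that it is both the fill marker and a legal from-the-end index, whereas a platform-minimum integer can never be a valid index and so works as an unambiguous sentinel. A small sketch of the idea (FILL is illustrative, not the eventual pandas constant):

```python
import numpy as np

# -1 is ambiguous: it is both the fill marker and a legal "last element"
# index. The minimum platform-width integer can never be a valid index,
# so it makes an unambiguous sentinel.
FILL = np.iinfo(np.intp).min

indexer = np.array([0, FILL, 2], dtype=np.intp)
is_fill = indexer == FILL  # cleanly identifies fill positions
```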

@jreback
Contributor

jreback commented Mar 12, 2013

it is tested/works with .ix and friends, and take (though I would be open to ONLY allowing it with .iloc, but that's a different issue)

@stephenwlin
Contributor

so are we willing to break code that uses negative indices with Index.take, DataFrame.take, etc.? guessing there's not a whole lot of it, and it's broken already for some cases (as shown by @jreback's example)

@jreback
Contributor

jreback commented Mar 12, 2013

@stephenwlin DataFrame.take is fixed (for negative indices), as are negative indices for .ix (which worked before); see #3027. I don't think Index.take is broken?

I was just musing whether we should restrict .ix to non-negative indices (but maybe there is no good reason to)

@stephenwlin
Contributor

yeah, DataFrame.take was broken because it mixes ndarray.take and take_nd calls; Index.take presumably isn't (other than possibly triggering the numpy segfault)...but it would probably make things easier to just not support negative indices except at a few defined entry points, rather than supporting them inconsistently?

@jreback
Contributor

jreback commented Mar 12, 2013

@stephenwlin i just merged #3027 to master. I cannot reproduce this bug (but I didn't do anything to 'fix' it); maybe the bounds checking fixed this somehow? weird

correction: it hits the exception before it can segfault...because the 3 from the unstack was out-of-bounds...but didn't you discover some other bug here?

@stephenwlin
Contributor

no, that's pretty much it, that should take care of it; the only problem is if we haven't covered all the cases where we might be passing out-of-range indices into ndarray.take

@jreback
Contributor

jreback commented Mar 12, 2013

ok great

it's funny that the take didn't blow up with an illegal access on the first call

should we close this issue, then?
(as you moved everything else to a new issue)

@stephenwlin
Contributor

sure, i split off another issue in case we want to do a bigger overhaul

@jreback jreback closed this as completed Mar 13, 2013