BUG: groupby segfaults when passed a function of a timestamp which raises a TypeError #3035


Closed
dhirschfeld opened this issue Mar 13, 2013 · 2 comments

@dhirschfeld

It is a dumb thing to do, but pandas probably shouldn't segfault regardless:

In [1]: dates = pd.date_range('01-Jan-2013', periods=12, freq='MS')
   ...: ts = pd.TimeSeries(randn(12), index=dates)

In [2]: ts.groupby(lambda key: key[0:4]).first()
It seems the kernel died unexpectedly. Use 'Restart kernel' to continue using this console.

Tested with pandas 0.10.1 on win32/win64, Python 2.7, and pandas 0.11.0.dev-2f7b0e4 on win32, Python 2.7.
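For reference, the likely intended grouping coerces the Timestamp to a string before slicing (a sketch using the modern `pd.Series` spelling; `pd.TimeSeries` was the era's alias for it):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2013-01-01', periods=12, freq='MS')
ts = pd.Series(np.random.randn(12), index=dates)

# str(key) turns the Timestamp into e.g. '2013-01-01 00:00:00',
# so [0:4] yields the year string '2013' for every element
grouped = ts.groupby(lambda key: str(key)[0:4]).first()
print(grouped)
```

All twelve dates fall in 2013, so this produces a single group labelled `'2013'`.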


ghost commented Mar 13, 2013

Short answer: added a fix to #3031.

In [11]: N=12
    ...: dates = pd.date_range('01-Jan-2013', periods=N, freq='MS')
    ...: ts = pd.TimeSeries(randn(N), index=dates)
    ...: ts.groupby(lambda key: key[0:4]).first()
---------------------------------------------------------------------------

AssertionError: Grouper result violates len(labels) == len(data)
result: [2013-01-01 00:00:00, 2013-02-01 00:00:00, 2013-03-01 00:00:00, 2013-04-01 00:00:00]

Very long answer:
I think this is a case where pandas is too clever by half.

When your grouper is a function, it's applied using index.map.
index.map tries really hard to guess what you want, so if the function
throws an exception when applied to individual elements, it's tried again on the entire
index as a sequence, and that result is used as the grouping result.

Presumably, lambda key: str(key)[0:4] is what you wanted, but you're using key[0:4] when
key is a Timestamp, so that fails, the fallback path is triggered, and you end
up getting the first 4 elements of the index rather than a sequence of the first
4 chars of each element.
That produces a label/data length mismatch, which goes uncaught by the cython code
down the road, which borks (similar to #3011 from a few days ago, but a different code path).

The fix in #3011 still doesn't address the corner case where an erroneous lambda throws an exception
on the element but returns a result of the correct length when applied to the entire index:

In [17]: N=5
    ...: dates = pd.date_range('01-Jan-2013', periods=N, freq='MS')
    ...: ts = pd.TimeSeries(randn(N), index=dates)
    ...: ts.groupby(lambda key: key[0:len(ts.index)]).count()
Out[17]: 
2013-01-01    1
2013-02-01    1
2013-03-01    1
2013-04-01    1
2013-05-01    1
dtype: int64

compare with lambda key: str(key)[0:len(ts.index)]:

In [16]: N=5
    ...: dates = pd.date_range('01-Jan-2013', periods=N, freq='MS')
    ...: ts = pd.TimeSeries(randn(N), index=dates)
    ...: ts.groupby(lambda key: str(key)[0:len(ts.index)]).count()
Out[16]: 
2013-    5
dtype: int64

That could give someone a nasty surprise.
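One way to sidestep the lambda fallback entirely is to group on an attribute of the index rather than a function of each key (a sketch using the modern `pd.Series` spelling):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2013-01-01', periods=5, freq='MS')
ts = pd.Series(np.random.randn(5), index=dates)

# ts.index.year is an integer array of years, one per element,
# so there is no per-element function call that can fail
by_year = ts.groupby(ts.index.year).count()
print(by_year)
```

Since all five dates fall in 2013, this yields a single group keyed by the integer 2013.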

Comments in the code suggest that, in the hands of a trained professional,
this is a feature, not a bug. Well, ok then.

EDIT: this behaviour is specific to DatetimeIndex, and it's actually the other way around: first
the index itself is tried, then each element (code).


ghost commented Mar 17, 2013

Fixed in master.

@ghost ghost closed this as completed Mar 17, 2013