-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/TST: transform and filter on non-unique index, closes #4620 #5375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
filtered = self.obj.take([]) # because np.concatenate would fail | ||
else: | ||
filtered = self.obj.take(np.sort(np.concatenate(indexers))) | ||
filtered = self.obj.take(np.sort(np.concatenate(indices))) | ||
if dropna: | ||
return filtered | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could filtered have a non-unique index here? (and if dropna is False will fail)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. That was careless on my part.
I fixed the problem @jreback noticed, and I added tests with |
you can also do this (which is in effect the same thing)
not sure why you have the |
indexers = [self.obj.index.get_indexer(group.index) \ | ||
if true_and_notnull(group) else [] \ | ||
for _ , group in self] | ||
indices = [self.indices[name] if true_and_notnull(group) else [] \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI you don't need the \ in a list comprehension
(also pep8 prefers parens over \ almost always) :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad to know that. Thanks!
I'm in favor of less indirection and leaving as-is - groupby code can be complex enough as it is. If it continued to grow in size, then maybe my inclination would change. |
True, I would slightly worry that one will be updated and not the other. But meh 'til then. :) Also, good spot to use indices rather than label. Could you add in a test which has integer index (which overlap with the expected filter)? |
@jreback This is why I think
To reiterative, this problem arises in the final step where I want to do |
ok the theory is that the mask is generally an align able itself (eg a frame) it's fine - but self.obj was only ever a series in your tests is there a case where it's a frame? (need a test if there is) |
I believe
@hayd , I agree there's some risk of lurking label/location confusion in this change. I added this test to target that possibility. I may be lacking imagination. Any other suggestions?
If you are good with this, it's ready to merge. |
Maybe try various actual Index types or ndarrays? Have you tested non-Int64Index indexes? |
else: | ||
|
||
# in theory you could do .all() on the boolean result ? | ||
raise TypeError("the filter must return a boolean result") | ||
|
||
if len(indexers) == 0: | ||
filtered = self.obj.take([]) # because np.concatenate would fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of this section looks like it could be refactored out to a function in the Groupby
object and just called (from NDFrameGroupby
and the Groupby
to avoid the code dup? (for the transform/filter) cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that should do it.
Unexpectedly, non-Int64Indexes do not behave well under this change. I'm looking into it. Is there a deadline on 0.13? |
as soon as you can.... @wesm is cutting the RC tomorrow, but this can go in even after that |
K, will push on this and str_match/str_extract. |
Series are fine; DataFrames are in trouble. As far as I can tell, If we were wiling to sort non-monotonic frames (and probably series too, for consistency) there would be an easy way out. But doing this right might really hard. Any help understanding what is happening with slicing would be appreciated, but in the meantime I will keep at it. |
I have discovered a more serious problem with groupby, independent of non-unique indexes, transform, or filter. This happens on the current master:
Full Traceback is long so I leave to you to reproduce and read. If, instead of Help? I'm having a hard time believing this has never come up before. I hope I am just confused. |
We should see if this occurs in 0.12. If it doesn't, then it's the new Float64Index. |
No problem on 0.12.0, so you must be right. I don't think I have time to dig into Float64Index. I'lll be standing by to finish this though. |
Yeah, other Jeff is probably the one who needs to deal with that. If you have time to finish up the str stuff, that'd be great! |
@danielballan ok....#5393 fixes...pretty straightforward....will merge in a minute |
@danielballan rebase and you should be good to go! |
You are a champion. I'll be back on this late tonight. |
Looks like that was the whole issue. Latest push includes tests for transform and filter on Series and DataFrames with dropna True and False for non-unique integer, float, datetime, and string indexes. Anything else we should check? (Aside: I did move the duplicated code into GroupBy. Good call.) |
Wow, that is a lot of tests, good effort! |
def test_filter_and_transform_with_non_unique_float_index(self): | ||
# GH4620 | ||
df = DataFrame({'pid' : [1,1,1,2,2,3,3,3], | ||
'tag' : [23,45,62,24,45,34,25,62]}, index=[0] * 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe adjust some of these to have two unique values?
Took @jtratner 's suggestion above: indexes in tests contain both repeated and (two) unique values. They are also nonmonotonic. |
@danielballan we'll have to call you the test master after this! gr8! |
Stand by, just two more for good measure...multiple nonunique groups... |
OK, thanks as ever for the prompt and patient support, team. I feel that this is ready, but I'm still willing to hear your suggestions. |
+1 |
BUG/TST: transform and filter on non-unique index, closes #4620
closes #4620
This is a relatively minor change. Previously,
filter
onSeriesGroupBy
andDataFrameGroupby
andtransform
on onlySeriesGroupBy
referred to the index of each group's object.This cannot work with repeated indexes, because some repeated
indexes occur between more than one group. Instead, use
{Series, DataFrame}GroupBy.indices
array, an array oflocations, not labels.
This was marked for 0.14, but I would really like to start using it.