
BUG/TST: transform and filter on non-unique index, closes #4620 #5375


Merged
merged 1 commit into pandas-dev:master on Nov 1, 2013

Conversation

danielballan
Contributor

closes #4620

This is a relatively minor change. Previously, filter on
SeriesGroupBy and DataFrameGroupBy, and transform on
SeriesGroupBy only, referred to the index of each group's object.

That cannot work with a repeated index, because the same label
can occur in more than one group. Instead, use
{Series,DataFrame}GroupBy.indices, an array of
locations, not labels.

This was marked for 0.14, but I would really like to start using it.
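For illustration, a minimal sketch of the label/location distinction with hypothetical data: `GroupBy.indices` maps each group key to integer positions, which stay unambiguous even when index labels repeat across groups.

```python
import pandas as pd

# A Series whose index labels all repeat; the label 0 appears in both groups.
s = pd.Series([1, 1, 2, 2], index=[0, 0, 0, 0])
g = s.groupby([1, 1, 2, 2])

# .indices maps group key -> integer *positions*, not labels.
print(g.indices)

# Positional take() selects exactly one group's rows, whereas the
# label-based s.loc[0] would return every row in the Series.
print(s.take(g.indices[1]))
```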

        filtered = self.obj.take([])  # because np.concatenate would fail
    else:
-       filtered = self.obj.take(np.sort(np.concatenate(indexers)))
+       filtered = self.obj.take(np.sort(np.concatenate(indices)))
    if dropna:
        return filtered
    else:
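A minimal sketch of the user-facing dropna behavior this branch implements, using hypothetical data on a non-unique index:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2]}, index=[0, 0, 0])  # non-unique index
g = df.groupby('A')

# dropna=True (the default): filtered-out rows are dropped entirely.
print(g.filter(lambda x: len(x) > 1))

# dropna=False: filtered-out rows are kept but NaN-filled.
print(g.filter(lambda x: len(x) > 1, dropna=False))
```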
Contributor

could filtered have a non-unique index here? (and if dropna is False this will fail)

Contributor Author

Good catch. That was careless on my part.

@danielballan
Contributor Author

I fixed the problem @jreback noticed, and I added tests with dropna=True for SeriesGroupBy and DataFrameGroupBy, rebased into my commit.

@jreback
Contributor

jreback commented Oct 29, 2013

you can also do this (which is in effect the same thing)

    x = np.empty(len(self.obj.index), dtype=bool)
    x.fill(False)
    x[indices.astype(int)] = True

    self.obj.where(x)

    0   NaN
    0     1
    0     1
    0   NaN
    0     2
    0   NaN
    0   NaN
    0     3
    Name: pid, dtype: float64
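jreback's suggestion, expanded into a self-contained sketch; the data and the `indices` values are hypothetical, chosen to reproduce the output shown above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 3, 3, 3], index=[0] * 8, name='pid')
indices = np.array([1, 2, 4, 7])  # positions that pass the filter

# Build a full-length boolean mask: True at the kept positions only.
mask = np.empty(len(s.index), dtype=bool)
mask.fill(False)
mask[indices.astype(int)] = True

# where() keeps the masked-in values and NaN-fills the rest,
# i.e. the dropna=False behavior.
print(s.where(mask))
```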

not sure why you have the np.tile there; the passed-in condition will broadcast if needed (and I don't see any example where self.obj is not a Series in any event), is that right?

-   indexers = [self.obj.index.get_indexer(group.index) \
-               if true_and_notnull(group) else [] \
-               for _, group in self]
+   indices = [self.indices[name] if true_and_notnull(group) else [] \
Contributor

FYI you don't need the \ in a list comprehension

(also pep8 prefers parens over \ almost always) :)
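For illustration, a generic comprehension spanning several lines with no continuation characters; the brackets it already has allow implicit line continuation:

```python
# Inside brackets, parentheses, or braces, Python continues the
# statement across lines, so trailing backslashes are unnecessary.
squares = [n * n
           for n in range(5)
           if n % 2 == 0]
print(squares)  # [0, 4, 16]
```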

Contributor Author

I'm glad to know that. Thanks!

@jtratner
Contributor

hayd: I wonder if this chunk (from len(indices) == 0) should be a function/method since it's exactly the same as above... I'm not sure where good place for it would be though.
danielballan: Agreed. Between "don't duplicate" and "don't obfuscate," I think "don't obfuscate" wins this round. There was a similar situation with _define_paths, but in this case the duplicated code is shared by two classes, so it would have to sit "far away" in the module. I'm open to being swayed, though....

I'm in favor of less indirection and leaving as-is - groupby code can be complex enough as it is. If it continued to grow in size, then maybe my inclination would change.

@hayd
Contributor

hayd commented Oct 29, 2013

True, I would slightly worry that one will be updated and not the other. But meh 'til then. :)

Also, good spot to use indices rather than labels. Could you add a test which has an integer index (one which overlaps with the expected filter result)?

@danielballan
Contributor Author

@jreback This is why I think np.tile is necessary in preparation for DataFrame.where. Can you reproduce and/or explain? Is this a separate issue?

In [5]: df = DataFrame(np.zeros((2,2)))

In [6]: df
Out[6]: 
   0  1
0  0  0
1  0  0

In [7]: df.shape
Out[7]: (2, 2)

In [8]: mask = np.array([True, False])

In [9]: mask.shape
Out[9]: (2,)

In [10]: df.where(mask)
ValueError: Array conditional must be same shape as self

In [11]: df.where(mask.reshape(-1, 1)) # shape is now (2, 1)
ValueError: Array conditional must be same shape as self

In [12]: df.where(mask.reshape(1, -1)) # shape is now (1, 2)
ValueError: Array conditional must be same shape as self

To reiterate, this problem arises in the final step, where I want to do self.obj.where(mask) and self.obj may be a DataFrame that we are filtering.
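A sketch of the np.tile workaround under discussion, with hypothetical data: broadcast the per-row mask to the frame's full shape before calling where.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 2)))
mask = np.array([True, False])  # one flag per row

# DataFrame.where rejects array conditions whose shape differs from
# the frame, so tile the column vector across all columns first.
full_mask = np.tile(mask.reshape(-1, 1), (1, df.shape[1]))
print(full_mask.shape)  # (2, 2)
print(df.where(full_mask))  # row 1 becomes NaN
```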

@jreback
Contributor

jreback commented Oct 30, 2013

ok, the theory is that the mask is generally alignable itself (e.g. a frame),
so if you provide a raw array then it needs to be already broadcast,
so I see why you are doing what you are doing

it's fine - but self.obj was only ever a Series in your tests

is there a case where it's a frame? (need a test if there is)

@danielballan
Contributor Author

I believe self.obj is a frame in this case, which is in there.

    actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False)
    expected = df.copy()
    expected.iloc[[0, 3, 5, 6]] = np.nan
    assert_frame_equal(actual, expected)

@hayd , I agree there's some risk of lurking label/location confusion in this change. I added this test to target that possibility. I may be lacking imagination. Any other suggestions?

def test_index_label_overlaps_location(self):
    # checking we don't have any label/location confusion in
    # the wake of GH5375
    df = DataFrame(list('ABCDE'), index=[2, 0, 2, 1, 1])
    g = df.groupby(list('ababb'))
    actual = g.filter(lambda x: len(x) > 2)
    expected = df.iloc[[1, 3, 4]]
    assert_frame_equal(actual, expected)

    ser = df[0]
    g = ser.groupby(list('ababb'))
    actual = g.filter(lambda x: len(x) > 2)
    expected = ser.take([1, 3, 4])
    assert_series_equal(actual, expected)

If you are good with this, it's ready to merge.

@jtratner
Contributor

Maybe try various actual Index types or ndarrays? Have you tested non-Int64Index indexes?

    else:

        # in theory you could do .all() on the boolean result ?
        raise TypeError("the filter must return a boolean result")

    if len(indexers) == 0:
        filtered = self.obj.take([])  # because np.concatenate would fail
Contributor

most of this section looks like it could be refactored out to a function on the GroupBy object and just called (from NDFrameGroupBy and the GroupBy) to avoid the code dup for the transform/filter cases?

Contributor Author

Yeah, that should do it.

@danielballan
Contributor Author

Unexpectedly, non-Int64Indexes do not behave well under this change. I'm looking into it. Is there a deadline on 0.13?

@jreback
Contributor

jreback commented Oct 30, 2013

as soon as you can.... @wesm is cutting the RC tomorrow, but this can go in even after that

@danielballan
Contributor Author

K, will push on this and str_match/str_extract.

@danielballan
Contributor Author

Series are fine; DataFrames are in trouble. As far as I can tell, SeriesSplitter has no problem with non-unique, non-monotonic slices, but FrameSplitter can't handle them. I'm really down in the weeds in unfamiliar parts of the code.

If we were willing to sort non-monotonic frames (and probably series too, for consistency) there would be an easy way out. But doing this right might be really hard. Any help understanding what is happening with slicing would be appreciated; in the meantime I will keep at it.

@danielballan
Contributor Author

I have discovered a more serious problem with groupby, independent of non-unique indexes, transform, or filter. This happens on the current master:

In [1]: for name, group in DataFrame([1,2], index=[2.,1.]).groupby(list('ab')):
   ...:     pass
KeyError: 0

In [2]: pd.__version__
Out[2]: '0.12.0-1000-gea97682'

Full Traceback is long so I leave to you to reproduce and read.

If, instead of index=[2., 1.], I use a monotonically increasing index [2., 3.] or an integer index [2, 1], all works as expected.

Help? I'm having a hard time believing this has never come up before. I hope I am just confused.

@jtratner
Contributor

We should see if this occurs in 0.12. If it doesn't, then it's the new Float64Index.

@danielballan
Contributor Author

No problem on 0.12.0, so you must be right. I don't think I have time to dig into Float64Index. I'll be standing by to finish this, though.

@jtratner
Contributor

Yeah, other Jeff is probably the one who needs to deal with that. If you have time to finish up the str stuff, that'd be great!

@jreback
Contributor

jreback commented Oct 31, 2013

@danielballan ok....#5393 fixes...pretty straightforward....will merge in a minute

@jreback
Contributor

jreback commented Oct 31, 2013

@danielballan rebase and you should be good to go!

@danielballan
Contributor Author

You are a champion. I'll be back on this late tonight.

@danielballan
Contributor Author

Looks like that was the whole issue.

Latest push includes tests for transform and filter on Series and DataFrames with dropna True and False for non-unique integer, float, datetime, and string indexes. Anything else we should check?

(Aside: I did move the duplicated code into GroupBy. Good call.)

@hayd
Contributor

hayd commented Oct 31, 2013

Wow, that is a lot of tests, good effort!

    def test_filter_and_transform_with_non_unique_float_index(self):
        # GH4620
        df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3],
                        'tag': [23, 45, 62, 24, 45, 34, 25, 62]},
                       index=[0] * 8)
Contributor

maybe adjust some of these to have two unique values?

@danielballan
Contributor Author

Took @jtratner's suggestion above: indexes in tests contain both repeated and (two) unique values. They are also non-monotonic.

@jreback
Contributor

jreback commented Oct 31, 2013

@danielballan we'll have to call you the test master after this! gr8!

@danielballan
Contributor Author

Stand by, just two more for good measure...multiple nonunique groups...

@danielballan
Contributor Author

OK, thanks as ever for the prompt and patient support, team. I feel that this is ready, but I'm still willing to hear your suggestions.

@jreback
Contributor

jreback commented Oct 31, 2013

I am fine with this, @jtratner, @hayd ?

@hayd
Contributor

hayd commented Nov 1, 2013

+1

jreback added a commit that referenced this pull request Nov 1, 2013
BUG/TST: transform and filter on non-unique index, closes #4620
@jreback jreback merged commit 7df68f6 into pandas-dev:master Nov 1, 2013
@danielballan danielballan deleted the filter-nonunique branch November 1, 2013 18:42
Successfully merging this pull request may close these issues.

Groupby filter doesn't work with repeated index
4 participants