
BUG/TST: transform and filter on non-unique index, closes #4620 #5375


Merged
merged 1 commit into pandas-dev:master on Nov 1, 2013

Conversation

danielballan
Contributor

closes #4620

This is a relatively minor change. Previously, filter on
SeriesGroupBy and DataFrameGroupBy, and transform on
SeriesGroupBy only, referred to the index of each group's object.

That cannot work with a repeated index, because the same label
can occur in more than one group. Instead, use
{Series,DataFrame}GroupBy.indices, an array of
locations, not labels.

This was marked for 0.14, but I would really like to start using it.
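For illustration, a minimal sketch of the label/location distinction with hypothetical data: `GroupBy.indices` maps each group key to integer positions, which stay unambiguous even when index labels repeat across groups.

```python
import pandas as pd

# A Series whose index labels all repeat; the label 0 appears in both groups.
s = pd.Series([1, 1, 2, 2], index=[0, 0, 0, 0])
g = s.groupby([1, 1, 2, 2])

# .indices maps group key -> integer *positions*, not labels.
print(g.indices)

# Positional take() selects exactly one group's rows, whereas the
# label-based s.loc[0] would return every row in the Series.
print(s.take(g.indices[1]))
```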

        filtered = self.obj.take([])  # because np.concatenate would fail
    else:
-       filtered = self.obj.take(np.sort(np.concatenate(indexers)))
+       filtered = self.obj.take(np.sort(np.concatenate(indices)))
    if dropna:
        return filtered
    else:
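A minimal sketch of the user-facing dropna behavior this branch implements, using hypothetical data on a non-unique index:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2]}, index=[0, 0, 0])  # non-unique index
g = df.groupby('A')

# dropna=True (the default): filtered-out rows are dropped entirely.
print(g.filter(lambda x: len(x) > 1))

# dropna=False: filtered-out rows are kept but NaN-filled.
print(g.filter(lambda x: len(x) > 1, dropna=False))
```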
Contributor

could filtered have a non-unique index here? (and if dropna is False this will fail)

Contributor Author

Good catch. That was careless on my part.

@danielballan
Contributor Author

I fixed the problem @jreback noticed, and I added tests with dropna=True for SeriesGroupBy and DataFrameGroupBy, rebased into my commit.

@jreback
Contributor

jreback commented Oct 29, 2013

you can also do this (which is in effect the same thing)

    x = np.empty(len(self.obj.index), dtype=bool)
    x.fill(False)
    x[indices.astype(int)] = True

    self.obj.where(x)

    0   NaN
    0     1
    0     1
    0   NaN
    0     2
    0   NaN
    0   NaN
    0     3
    Name: pid, dtype: float64
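jreback's suggestion, expanded into a self-contained sketch; the data and the `indices` values are hypothetical, chosen to reproduce the output shown above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 3, 3, 3], index=[0] * 8, name='pid')
indices = np.array([1, 2, 4, 7])  # positions that pass the filter

# Build a full-length boolean mask: True at the kept positions only.
mask = np.empty(len(s.index), dtype=bool)
mask.fill(False)
mask[indices.astype(int)] = True

# where() keeps the masked-in values and NaN-fills the rest,
# i.e. the dropna=False behavior.
print(s.where(mask))
```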

not sure why you have the np.tile there; the passed-in condition will broadcast if needed (and I don't see any example where self.obj is not a Series in any event), is that right?

-   indexers = [self.obj.index.get_indexer(group.index) \
-               if true_and_notnull(group) else [] \
-               for _, group in self]
+   indices = [self.indices[name] if true_and_notnull(group) else [] \
Contributor

FYI you don't need the \ in a list comprehension

(also pep8 prefers parens over \ almost always) :)
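For illustration, a generic comprehension spanning several lines with no continuation characters; the brackets it already has allow implicit line continuation:

```python
# Inside brackets, parentheses, or braces, Python continues the
# statement across lines, so trailing backslashes are unnecessary.
squares = [n * n
           for n in range(5)
           if n % 2 == 0]
print(squares)  # [0, 4, 16]
```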

Contributor Author

I'm glad to know that. Thanks!

@jtratner
Contributor

hayd: I wonder if this chunk (from len(indices) == 0) should be a function/method since it's exactly the same as above... I'm not sure where good place for it would be though.
danielballan: Agreed. Between "don't duplicate" and "don't obfuscate," I think "don't obfuscate" wins this round. There was a similar situation with _define_paths, but in this case the duplicated code is shared by two classes, so it would have to sit "far away" in the module. I'm open to being swayed, though....

I'm in favor of less indirection and leaving as-is - groupby code can be complex enough as it is. If it continued to grow in size, then maybe my inclination would change.

@hayd
Contributor

hayd commented Oct 29, 2013

True, I would slightly worry that one will be updated and not the other. But meh 'til then. :)

Also, good spot to use indices rather than labels. Could you add a test which has an integer index (one which overlaps with the expected filter result)?

@danielballan
Contributor Author

@jreback This is why I think np.tile is necessary in preparation for DataFrame.where. Can you reproduce and/or explain? Is this a separate issue?

In [5]: df = DataFrame(np.zeros((2,2)))

In [6]: df
Out[6]: 
   0  1
0  0  0
1  0  0

In [7]: df.shape
Out[7]: (2, 2)

In [8]: mask = np.array([True, False])

In [9]: mask.shape
Out[9]: (2,)

In [10]: df.where(mask)
ValueError: Array conditional must be same shape as self

In [11]: df.where(mask.reshape(-1, 1)) # shape is now (2, 1)
ValueError: Array conditional must be same shape as self

In [12]: df.where(mask.reshape(1, -1)) # shape is now (1, 2)
ValueError: Array conditional must be same shape as self

To reiterate, this problem arises in the final step, where I want to do self.obj.where(mask) and self.obj may be a DataFrame that we are filtering.
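A sketch of the np.tile workaround under discussion, with hypothetical data: broadcast the per-row mask to the frame's full shape before calling where.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 2)))
mask = np.array([True, False])  # one flag per row

# DataFrame.where rejects array conditions whose shape differs from
# the frame, so tile the column vector across all columns first.
full_mask = np.tile(mask.reshape(-1, 1), (1, df.shape[1]))
print(full_mask.shape)  # (2, 2)
print(df.where(full_mask))  # row 1 becomes NaN
```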

@jreback
Contributor

jreback commented Oct 30, 2013

ok, the theory is that the mask is generally alignable itself (e.g. a frame),
so if you provide a raw array then it needs to be already broadcast,
so I see why you are doing what you are doing

it's fine - but self.obj was only ever a Series in your tests

is there a case where it's a frame? (need a test if there is)

@danielballan
Contributor Author

I believe self.obj is a frame in this case, which is in there.

    actual = grouped_df.filter(lambda x: len(x) > 1, dropna=False)
    expected = df.copy()
    expected.iloc[[0, 3, 5, 6]] = np.nan
    assert_frame_equal(actual, expected)

@hayd , I agree there's some risk of lurking label/location confusion in this change. I added this test to target that possibility. I may be lacking imagination. Any other suggestions?

def test_index_label_overlaps_location(self):
    # checking we don't have any label/location confusion in
    # the wake of GH5375
    df = DataFrame(list('ABCDE'), index=[2, 0, 2, 1, 1])
    g = df.groupby(list('ababb'))
    actual = g.filter(lambda x: len(x) > 2)
    expected = df.iloc[[1, 3, 4]]
    assert_frame_equal(actual, expected)

    ser = df[0]
    g = ser.groupby(list('ababb'))
    actual = g.filter(lambda x: len(x) > 2)
    expected = ser.take([1, 3, 4])
    assert_series_equal(actual, expected)

If you are good with this, it's ready to merge.

@jtratner
Contributor

Maybe try various actual Index types or ndarrays? Have you tested non-Int64Index indexes?

    else:

        # in theory you could do .all() on the boolean result ?
        raise TypeError("the filter must return a boolean result")

    if len(indexers) == 0:
        filtered = self.obj.take([])  # because np.concatenate would fail
Contributor

most of this section looks like it could be refactored out to a function on the GroupBy object and just called (from NDFrameGroupBy and the GroupBy) to avoid the code dup for the transform/filter cases?

Contributor Author

Yeah, that should do it.

@danielballan
Contributor Author

Unexpectedly, non-Int64Indexes do not behave well under this change. I'm looking into it. Is there a deadline on 0.13?

@jreback
Contributor

jreback commented Oct 30, 2013

as soon as you can.... @wesm is cutting the RC tomorrow, but this can go in even after that

@danielballan
Contributor Author

K, will push on this and str_match/str_extract.

@danielballan
Contributor Author

Series are fine; DataFrames are in trouble. As far as I can tell, SeriesSplitter has no problem with non-unique, non-monotonic slices, but FrameSplitter can't handle them. I'm really down in the weeds in unfamiliar parts of the code.

If we were willing to sort non-monotonic frames (and probably series too, for consistency) there would be an easy way out. But doing this right might be really hard. Any help understanding what is happening with slicing would be appreciated; in the meantime I will keep at it.

@danielballan
Contributor Author

I have discovered a more serious problem with groupby, independent of non-unique indexes, transform, or filter. This happens on the current master:

In [1]: for name, group in DataFrame([1,2], index=[2.,1.]).groupby(list('ab')):
   ...:     pass
KeyError: 0

In [2]: pd.__version__
Out[2]: '0.12.0-1000-gea97682'

Full Traceback is long so I leave to you to reproduce and read.

If, instead of index=[2., 1.], I use a monotonically increasing index [2., 3.] or an integer index [2, 1], all works as expected.

Help? I'm having a hard time believing this has never come up before. I hope I am just confused.

@jtratner
Contributor

We should see if this occurs in 0.12. If it doesn't, then it's the new Float64Index.

@danielballan
Contributor Author

No problem on 0.12.0, so you must be right. I don't think I have time to dig into Float64Index. I'll be standing by to finish this, though.

@jtratner
Contributor

Yeah, other Jeff is probably the one who needs to deal with that. If you have time to finish up the str stuff, that'd be great!

@jreback
Contributor

jreback commented Oct 31, 2013

@danielballan ok....#5393 fixes...pretty straightforward....will merge in a minute

@jreback
Contributor

jreback commented Oct 31, 2013

@danielballan rebase and you should be good to go!

@danielballan
Contributor Author

You are a champion. I'll be back on this late tonight.

@danielballan
Contributor Author

Looks like that was the whole issue.

Latest push includes tests for transform and filter on Series and DataFrames with dropna True and False for non-unique integer, float, datetime, and string indexes. Anything else we should check?

(Aside: I did move the duplicated code into GroupBy. Good call.)

@hayd
Contributor

hayd commented Oct 31, 2013

Wow, that is a lot of tests, good effort!

    def test_filter_and_transform_with_non_unique_float_index(self):
        # GH4620
        df = DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3],
                        'tag': [23, 45, 62, 24, 45, 34, 25, 62]},
                       index=[0] * 8)
Contributor

maybe adjust some of these to have two unique values?

@danielballan
Contributor Author

Took @jtratner's suggestion above: indexes in tests contain both repeated and (two) unique values. They are also non-monotonic.

@jreback
Contributor

jreback commented Oct 31, 2013

@danielballan we'll have to call you the test master after this! gr8!

@danielballan
Contributor Author

Stand by, just two more for good measure...multiple nonunique groups...

@danielballan
Contributor Author

OK, thanks as ever for the prompt and patient support, team. I feel that this is ready, but I'm still willing to hear your suggestions.

@jreback
Contributor

jreback commented Oct 31, 2013

I am fine with this, @jtratner, @hayd ?

@hayd
Contributor

hayd commented Nov 1, 2013

+1

jreback added a commit that referenced this pull request Nov 1, 2013
BUG/TST: transform and filter on non-unique index, closes #4620
@jreback jreback merged commit 7df68f6 into pandas-dev:master Nov 1, 2013
@danielballan danielballan deleted the filter-nonunique branch November 1, 2013 18:42
Successfully merging this pull request may close these issues.

Groupby filter doesn't work with repeated index
4 participants