get_group sometimes throws an exception when using an index of tuples with different lengths #8121

dwiel · 2014-08-27T12:59:55Z

Here is a simple test case that exposes the problem:

    import pandas as pd

    df = pd.DataFrame(pd.Series([(1,), (1,2), (1,), (1, 2)]), columns=['ids'])
    gb = df.groupby('ids')
    for i in gb.size().index:
        print(i)
        gb.get_group(i)

The issue is that in _get_index of GroupBy, the lines below assume that if there is a tuple in the index, then the index is a multi-index, which isn't true in the test case above. Maybe there is some other way to detect that the values come from a multi-index? Or should pandas explicitly not support tuples in this situation (in an index of a groupby)?

    pandas/core/groupby.py:

    sample = next(iter(self.indices))
    if isinstance(sample, tuple):
        if not isinstance(name, tuple):
            raise ValueError("must supply a tuple to get_group with multiple grouping keys")
        if not len(name) == len(sample):
            raise ValueError("must supply a a same-length tuple to get_group with multiple grouping keys")
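One possible way to tell the two cases apart, sketched here purely as an illustration and not as the change that ended up in pandas: decide based on how many grouping keys were passed to groupby, instead of sampling one key and checking its type. The function name `check_group_key` and the `n_groupers` parameter are hypothetical, and `indices` is a plain dict standing in for `self.indices`.

```python
def check_group_key(indices, name, n_groupers):
    """Look up `name` in `indices`, validating its shape first.

    `n_groupers` (hypothetical) is the number of grouping keys given to
    groupby. Tuple checks only apply when there is more than one.
    """
    if n_groupers > 1:
        if not isinstance(name, tuple):
            raise ValueError(
                "must supply a tuple to get_group with multiple grouping keys"
            )
        if len(name) != n_groupers:
            raise ValueError(
                "must supply a same-length tuple to get_group "
                "with multiple grouping keys"
            )
    # With a single grouping key, a tuple is just another hashable value,
    # so tuples of different lengths can coexist as group names.
    return indices[name]


# Single grouping key whose values happen to be tuples of mixed length
single = {(1,): [0, 2], (1, 2): [1, 3]}
rows_short = check_group_key(single, (1,), n_groupers=1)
rows_long = check_group_key(single, (1, 2), n_groupers=1)

# Two grouping keys: names must be 2-tuples
multi = {(1, "a"): [0, 2], (1, "b"): [1, 3]}
rows_multi = check_group_key(multi, (1, "b"), n_groupers=2)
```

The sampling approach fails here because the first key it sees, for example `(1,)`, fixes the expected length for every other name, so `(1, 2)` is rejected even though it is a valid group.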

TomAugspurger · 2014-08-27T14:19:37Z

I'll take a look. While you're probably right that this shouldn't throw an exception, storing containers in DataFrames is usually frowned upon. Something like

In [14]: gr = df.groupby(pd.factorize(df.ids)[0])

In [15]: for i in gr.size().index:
   ....:     print(i)
   ....:     gr.get_group(i)
   ....:     
0
1

is usually better (faster and I think clearer). pd.factorize also returns the labels if you need those.
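If the original tuples are still needed after grouping on the integer codes, the second return value of `pd.factorize` can map each group back to its tuple. This is a minimal sketch; the names `codes`, `uniques`, and `groups` are only illustrative, and the column is passed as a NumPy array on the assumption that the distinct values then come back as a plain array.

```python
import pandas as pd

df = pd.DataFrame({"ids": pd.Series([(1,), (1, 2), (1,), (1, 2)])})

# factorize returns one integer code per row plus the distinct values
codes, uniques = pd.factorize(df["ids"].to_numpy())

# Group on the integer codes instead of on the tuples themselves
gr = df.groupby(codes)

# Map each integer group key back to the tuple it stands for
groups = {uniques[code]: gr.get_group(code) for code in gr.size().index}

for key, frame in groups.items():
    print(key, list(frame.index))
```

Here `get_group` only ever sees plain integers, so the tuple-length check is never triggered.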

TomAugspurger · 2014-08-27T14:41:16Z

@dwiel This is what you're expecting, right?

In [1]: good = pd.DataFrame([[1, 1, 1, 1], ['a', 'b', 'a', 'b']]).T

In [2]: bad = pd.DataFrame(pd.Series([(1,), (1,2), (1,), (1, 2)]), columns = ['ids'])

In [3]: gg = good.groupby([0, 1])

In [4]: gb = bad.groupby('ids')

In [5]: good
Out[5]: 
   0  1
0  1  a
1  1  b
2  1  a
3  1  b

In [6]: bad
Out[6]: 
      ids
0    (1,)
1  (1, 2)
2    (1,)
3  (1, 2)

In [9]: def run(gr):
   ...:     for i in gr.size().index:
   ...:         print(i)
   ...:         print(gr.get_group(i))
   ...:

In [10]: run(gg)
(1, 'a')
   0  1
0  1  a
2  1  a
(1, 'b')
   0  1
1  1  b
3  1  b

In [11]: run(gb)
(1,)
    ids
0  (1,)
2  (1,)
(1, 2)
      ids
1  (1, 2)
3  (1, 2)

dwiel · 2014-08-27T15:32:47Z

The factorize code does appear to do what I want.

As for your second comment: yes, that is how I would expect it to work.

TomAugspurger · 2014-08-28T01:59:35Z

Should be fixed now. Like I said, you're probably better off factorizing and then grouping in this case.

Thanks for the report!

dwiel · 2014-08-28T02:17:47Z

Thanks!

TomAugspurger mentioned this issue Aug 27, 2014

BUG: fix groupby with tuple bug #8123

Merged

TomAugspurger closed this as completed in #8123 Aug 28, 2014
