Broken nunique on Series group by #11077

cpcloud · 2015-09-12T18:04:09Z

The following code works in 0.16.2 and not in latest master:

import pandas as pd

data = pd.DataFrame(
    [[100, 1, 'Alice'],
     [200, 2, 'Bob'],
     [300, 3, 'Charlie'],
     [-400, 4, 'Dan'],
     [500, 5, 'Edith']],
    columns=['amount', 'id', 'name']
)

expected = data.groupby(['id', 'amount'])['name'].nunique()

Going to bisect this today unless someone beats me to it.

jreback · 2015-09-12T18:08:27Z

#10894

cpcloud · 2015-09-12T18:16:34Z

Can we just revert that change and keep the test?

cpcloud · 2015-09-12T18:17:15Z

The only other solution I see is to coerce val to str if it's an object dtype

jreback · 2015-09-12T18:21:27Z

if it's not int64, factorize it

ipdb> p np.lexsort((pd.factorize(val)[0],ids))
array([0, 1, 2, 3, 4])

cpcloud · 2015-09-12T18:22:20Z

ok

jreback · 2015-09-12T18:23:39Z

or rather it has a TypeError, though this shouldn't have gotten this far if that was the case....

cpcloud · 2015-09-12T18:35:27Z

Using factorize won't give the same results and is therefore incorrect:

In [8]: np.lexsort(([1, 2, 3], list('cba')))
Out[8]: array([2, 1, 0])

In [9]: np.lexsort(([1, 2, 3], pd.factorize(list('cba'))[0]))
Out[9]: array([0, 1, 2])
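
The difference above comes from `pd.factorize` assigning codes in order of first appearance by default. A minimal sketch, assuming the goal is only to reproduce the ordering of the raw strings: passing `sort=True` makes the codes follow the sorted order of the unique values, so `np.lexsort` on the codes then agrees with sorting the strings themselves. The variable names here are illustrative and not taken from the pandas source.

```python
import numpy as np
import pandas as pd

letters = list('cba')
secondary = [1, 2, 3]

# Default: codes follow first appearance, so 'c' -> 0, 'b' -> 1, 'a' -> 2.
codes_unsorted, _ = pd.factorize(letters)

# sort=True: codes follow the sorted uniques, so 'a' -> 0, 'b' -> 1, 'c' -> 2.
codes_sorted, uniques = pd.factorize(letters, sort=True)

# The last key passed to lexsort is the primary sort key.
order_unsorted = np.lexsort((secondary, codes_unsorted))
order_sorted = np.lexsort((secondary, codes_sorted))

print(codes_unsorted, order_unsorted)
print(codes_sorted, order_sorted)
```

Whether the ordering of the codes matters depends on what the sorted result is used for; it matters for anything order-sensitive and does not for pure equality grouping.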

jreback · 2015-09-12T18:37:23Z

might be faster to just .astype(str) an object array

use is_object_dtype
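
A rough sketch of the coercion idea suggested here, not the change that was eventually merged: check for object dtype with `is_object_dtype` and convert to a fixed-width string array before calling `np.lexsort`. The helper name `sort_order` is hypothetical. One caveat of this approach is that `.astype(str)` maps distinct objects with the same string form (for example `1` and `'1'`) to the same value.

```python
import numpy as np
from pandas.api.types import is_object_dtype

def sort_order(ids, val):
    """Return indices that sort by ids first, then by val (hypothetical helper)."""
    val = np.asarray(val)
    if is_object_dtype(val):
        # Coerce Python objects to a NumPy string dtype before sorting.
        val = val.astype(str)
    # The last key passed to lexsort is the primary sort key.
    return np.lexsort((val, ids))

ids = np.array([0, 0, 1, 1])
val = np.array(['b', 'a', 'd', 'c'], dtype=object)
order = sort_order(ids, val)
print(order)
```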

cpcloud · 2015-09-12T18:37:32Z

yep

behzadnouri · 2015-09-12T20:22:11Z

> Using factorize won't give the same results and is therefore incorrect:

factorize does work there.
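
To illustrate why factorized codes can be enough for a distinct count: a sort-based count only needs equal values to end up adjacent within each group, and their relative order is irrelevant. The sketch below is an assumption about the general technique, not pandas' actual implementation; `nunique_per_group` is a hypothetical helper, and it neither special-cases missing values nor handles empty input.

```python
import numpy as np
import pandas as pd

def nunique_per_group(ids, values):
    """Count distinct values per group id via sort + adjacent comparison."""
    ids = np.asarray(ids)
    # Integer codes: equal values get equal codes, in first-appearance order.
    codes, _ = pd.factorize(np.asarray(values, dtype=object))
    # Sort by ids first, then by codes within each id.
    order = np.lexsort((codes, ids))
    ids_s, codes_s = ids[order], codes[order]
    # A new (id, value) pair starts wherever either key changes.
    changed = (ids_s[1:] != ids_s[:-1]) | (codes_s[1:] != codes_s[:-1])
    new_pair = np.r_[True, changed]
    uniq_ids, counts = np.unique(ids_s[new_pair], return_counts=True)
    return dict(zip(uniq_ids.tolist(), counts.tolist()))

result = nunique_per_group([0, 0, 0, 1, 1], ['c', 'b', 'c', 'a', 'a'])
print(result)
```

For comparison, `pd.Series(['c', 'b', 'c', 'a', 'a']).groupby([0, 0, 0, 1, 1]).nunique()` should report the same counts on a pandas version where the grouped `nunique` works for object dtype.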

> Can we just revert that change and keep the test?

if you do not understand the algo, it would be better to ping whoever wrote the code rather than suggesting to revert it

cpcloud · 2015-09-12T20:37:10Z

@behzadnouri This is unrelated to my understanding of the algorithm. My suggestion to revert it was based on the fact that it broke existing code.

behzadnouri · 2015-09-12T20:42:40Z

> My suggestion to revert it was based on the fact that it broke existing code.

existing code is not forced to update to master

jreback · 2015-09-12T20:43:40Z

@behzadnouri no this DID break things. We just weren't fully testing it.

cpcloud · 2015-09-12T20:43:55Z

> existing code is not forced to update to master

That's certainly true. However this broke existing pandas code.

cpcloud added the Bug and Blocker labels Sep 12, 2015

cpcloud self-assigned this Sep 12, 2015

cpcloud added this to the 0.17.0 milestone Sep 12, 2015

cpcloud mentioned this issue Sep 12, 2015

BUG: Fix Series nunique groupby with object dtype #11079

Merged

cpcloud closed this as completed in #11079 Sep 14, 2015

jreback mentioned this issue Sep 15, 2015

SeriesGroupBy.count() failing with TypeError #11101

Closed
