Skip to content

Broken nunique on Series group by #11077

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cpcloud opened this issue Sep 12, 2015 · 14 comments · Fixed by #11079
Closed

Broken nunique on Series group by #11077

cpcloud opened this issue Sep 12, 2015 · 14 comments · Fixed by #11079
Assignees
Labels
Blocker Blocking issue or pull request for an upcoming release Bug
Milestone

Comments

@cpcloud
Copy link
Member

cpcloud commented Sep 12, 2015

The following code works in 0.16.2 and not in latest master:

data = pd.DataFrame(
    [[100, 1, 'Alice'],
     [200, 2, 'Bob'],
     [300, 3, 'Charlie'],
     [-400, 4, 'Dan'],
     [500, 5, 'Edith']],
    columns=['amount', 'id', 'name']
)

expected = data.groupby(['id', 'amount'])['name'].nunique()

Going to bisect this today unless someone beats me to it.

@cpcloud cpcloud added Bug Blocker Blocking issue or pull request for an upcoming release labels Sep 12, 2015
@cpcloud cpcloud self-assigned this Sep 12, 2015
@cpcloud cpcloud added this to the 0.17.0 milestone Sep 12, 2015
@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

#10894

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2015

Can we just revert that change and keep the test?

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2015

The only other solution I see is to coerce val to str if it's an object dtype

@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

if its not int64 factorize it

ipdb> p np.lexsort((pd.factorize(val)[0],ids))
array([0, 1, 2, 3, 4])

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2015

ok

@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

or rather it has a TypeError, though this shouldn't have gotten this far if that was the case....

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2015

Using factorize won't give the same results and is therefore incorrect:

In [8]: np.lexsort(([1, 2, 3], list('cba')))
Out[8]: array([2, 1, 0])

In [9]: np.lexsort(([1, 2, 3], pd.factorize(list('cba'))[0]))
Out[9]: array([0, 1, 2])

@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

might be faster to just .astype(str) an object array

use is_object_dtype

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2015

yep

@behzadnouri
Copy link
Contributor

Using factorize won't give the same results and is therefore incorrect:

factorize does work there.

Can we just revert that change and keep the test?

if you do not understand the algo, it would be better to ping whoever wrote the code rather than suggesting to revert it

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2015

@behzadnouri This is unrelated to my understanding of the algorithm. My suggestion to revert it was based on the fact that it broke existing code.

@behzadnouri
Copy link
Contributor

My suggestion to revert it was based on the fact that it broke existing code.

existing code is not forced to update to master

@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

@behzadnouri no this DID break things. We just weren't fully testing it.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2015

existing code is not forced to update to master

That's certainly true. However this broke existing pandas code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants