Skip to content

BUG: Fix Series nunique groupby with object dtype #11079

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 14, 2015
Merged

BUG: Fix Series nunique groupby with object dtype #11079

merged 1 commit into from
Sep 14, 2015

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Sep 12, 2015

closes #11077

@cpcloud cpcloud self-assigned this Sep 12, 2015
@cpcloud cpcloud added this to the 0.17.0 milestone Sep 12, 2015
@cpcloud cpcloud added Bug Blocker Blocking issue or pull request for an upcoming release labels Sep 12, 2015
@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

cc @behzadnouri

or you want to use do a factorize?

@behzadnouri
Copy link
Contributor

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index f34fd6e..39be706 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -2565,13 +2565,21 @@ class SeriesGroupBy(GroupBy):
         ids, _, _ = self.grouper.group_info
         val = self.obj.get_values()

-        sorter = np.lexsort((val, ids))
+        try:
+            sorter = np.lexsort((val, ids))
+        except TypeError:
+            val, _ = algos.factorize(val, sort=False)
+            sorter = np.lexsort((val, ids))
+            isnull = lambda a: a == -1
+        else:
+            isnull = com.isnull
+
         ids, val = ids[sorter], val[sorter]

         # group boundries are where group ids change
         # unique observations are where sorted values change
-        idx = com._ensure_int64(np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]])
-        inc = com._ensure_int64(np.r_[1, val[1:] != val[:-1]])
+        idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]
+        inc = np.r_[1, val[1:] != val[:-1]]

         # 1st item of each group is a new unique observation
         mask = isnull(val)

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2015

@behzadnouri If you submit that as a PR to my fork, I'll happily merge it in

@behzadnouri
Copy link
Contributor

the error says TypeError: merge sort not available for item 0 even though numpy does have merge sort for objects (both np.argsort and np.sort). so it may be a bug on numpy side. so, it is better to implement try catch in case future numpy releases do implement lexsort for this case

once, values are factorized, original values are not needed and it is more efficient to work with integer factors, as long as isnull function is adjusted accordingly.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 13, 2015

@jreback getting a somewhat strange travis failure here that seems unrelated to my change: https://travis-ci.org/cpcloud/pandas/jobs/80036566

@cpcloud
Copy link
Member Author

cpcloud commented Sep 13, 2015

any ideas what that might be?

@jreback
Copy link
Contributor

jreback commented Sep 13, 2015

hmm someone else reported that as well
try restarting it and see

try:
sorter = np.lexsort((val, ids))
except TypeError:
val, _ = algos.factorize(val, sort=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this means object dtype? Maybe add an assert?

@cpcloud
Copy link
Member Author

cpcloud commented Sep 13, 2015

this is passing on my travis ci fork

try:
sorter = np.lexsort((val, ids))
except TypeError:
assert val.dtype == object, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here that this catches object dtypes

@jreback
Copy link
Contributor

jreback commented Sep 13, 2015

couple comments, squash and merge ok

@@ -1135,3 +1135,4 @@ Bug Fixes
- Bug in ``DatetimeIndex`` cannot infer negative freq (:issue:`11018`)
- Remove use of some deprecated numpy comparison operations, mainly in tests. (:issue:`10569`)
- Bug in ``Index`` dtype may not applied properly (:issue:`11017`)
- Bug in Series groupby when calling nunique on an object dtype (:issue:`11077`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to where the other nunique comment is (just put in in the list)

@cpcloud cpcloud changed the title Fix Series nunique groupby with object dtype BUG: Fix Series nunique groupby with object dtype Sep 13, 2015
ids, val = ids[sorter], val[sorter]

# group boundries are where group ids change
# unique observations are where sorted values change
idx = com._ensure_int64(np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]])
inc = com._ensure_int64(np.r_[1, val[1:] != val[:-1]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert these 2 lines, these don't pass on windows ATM. this can possibly be changed, but on another PR/issue

@jreback
Copy link
Contributor

jreback commented Sep 14, 2015

@cpcloud can you update (e.g. revert those 2 lines). ok to merge on green after that.

cpcloud added a commit that referenced this pull request Sep 14, 2015
…h-object

BUG: Fix Series nunique groupby with object dtype
@cpcloud cpcloud merged commit 5ee3a4f into pandas-dev:master Sep 14, 2015
@cpcloud cpcloud deleted the fix-series-nunique-groupby-with-object branch September 14, 2015 22:09
@jreback
Copy link
Contributor

jreback commented Sep 14, 2015

ty sir!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Broken nunique on Series group by
4 participants