BUG: Fix Series nunique groupby with object dtype #11079

cpcloud · 2015-09-12T18:31:35Z

jreback · 2015-09-12T20:25:41Z

or do you want to do a factorize?

behzadnouri · 2015-09-12T21:37:11Z

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index f34fd6e..39be706 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -2565,13 +2565,21 @@ class SeriesGroupBy(GroupBy):
         ids, _, _ = self.grouper.group_info
         val = self.obj.get_values()

-        sorter = np.lexsort((val, ids))
+        try:
+            sorter = np.lexsort((val, ids))
+        except TypeError:
+            val, _ = algos.factorize(val, sort=False)
+            sorter = np.lexsort((val, ids))
+            isnull = lambda a: a == -1
+        else:
+            isnull = com.isnull
+
         ids, val = ids[sorter], val[sorter]

         # group boundries are where group ids change
         # unique observations are where sorted values change
-        idx = com._ensure_int64(np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]])
-        inc = com._ensure_int64(np.r_[1, val[1:] != val[:-1]])
+        idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]
+        inc = np.r_[1, val[1:] != val[:-1]]

         # 1st item of each group is a new unique observation
         mask = isnull(val)

cpcloud · 2015-09-12T21:41:22Z

@behzadnouri If you submit that as a PR to my fork, I'll happily merge it in

behzadnouri · 2015-09-12T21:44:02Z

the error says "TypeError: merge sort not available for item 0", even though numpy does have merge sort for objects (both np.argsort and np.sort), so it may be a bug on the numpy side. so it is better to use a try/except here, in case future numpy releases do implement lexsort for this case

once the values are factorized, the original values are not needed, and it is more efficient to work with the integer factors, as long as the isnull function is adjusted accordingly.
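
A minimal sketch of the fallback described above, for readers who want to try it outside pandas. It assumes a hypothetical standalone helper, grouped_nunique, which is not part of pandas, and it swaps in the public pd.factorize and pd.isnull for the internal algos.factorize and com.isnull used in the diff. It only covers the drop-missing-values case.

```python
import numpy as np
import pandas as pd


def grouped_nunique(ids, val):
    """Count distinct non-null values per group id.

    Hypothetical helper for illustration; not the pandas implementation.
    `ids` are integer group labels, `val` are the values to count.
    """
    ids = np.asarray(ids)
    val = np.asarray(val)
    try:
        # Sort by group id first, then by value within each group.
        sorter = np.lexsort((val, ids))
        isnull = pd.isnull
    except TypeError:
        # lexsort may reject object arrays (as reported in this thread),
        # so fall back to integer codes; missing values get the code -1.
        val, _ = pd.factorize(val, sort=False)
        sorter = np.lexsort((val, ids))
        isnull = lambda a: a == -1

    ids, val = ids[sorter], val[sorter]

    # Group boundaries are where the sorted group ids change.
    idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]
    # Unique observations are where the sorted values change.
    inc = np.r_[1, val[1:] != val[:-1]].astype(np.int64)

    inc[idx] = 1          # first item of each group is a new observation
    inc[isnull(val)] = 0  # missing values are not counted
    return np.add.reduceat(inc, idx)


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    values = np.array(["a", "b", "a", "c", "c", "d"], dtype=object)
    print(grouped_nunique(labels, values))
```

Because the except branch replaces val with integer codes, every later step (sorting, change detection, the null mask) works on integers, which is the efficiency point made in the comment above.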

cpcloud · 2015-09-13T00:18:27Z

@jreback getting a somewhat strange travis failure here that seems unrelated to my change: https://travis-ci.org/cpcloud/pandas/jobs/80036566

cpcloud · 2015-09-13T00:18:36Z

any ideas what that might be?

jreback · 2015-09-13T00:24:56Z

hmm someone else reported that as well
try restarting it and see

shoyer · 2015-09-13T08:15:21Z

pandas/core/groupby.py

+        try:
+            sorter = np.lexsort((val, ids))
+        except TypeError:
+            val, _ = algos.factorize(val, sort=False)


Note that this means object dtype? Maybe add an assert?

cpcloud · 2015-09-13T17:45:10Z

this is passing on my travis ci fork

jreback · 2015-09-13T17:55:31Z

pandas/core/groupby.py

+        try:
+            sorter = np.lexsort((val, ids))
+        except TypeError:
+            assert val.dtype == object, \


add a comment here that this catches object dtypes

jreback · 2015-09-13T17:55:57Z

couple comments, squash and merge ok

jreback · 2015-09-13T17:56:20Z

doc/source/whatsnew/v0.17.0.txt

@@ -1135,3 +1135,4 @@ Bug Fixes
 - Bug in ``DatetimeIndex`` cannot infer negative freq (:issue:`11018`)
 - Remove use of some deprecated numpy comparison operations, mainly in tests. (:issue:`10569`)
 - Bug in ``Index`` dtype may not applied properly (:issue:`11017`)
+- Bug in Series groupby when calling nunique on an object dtype (:issue:`11077`)


move this to where the other nunique comment is (just put it in the list)

jreback · 2015-09-13T18:28:54Z

pandas/core/groupby.py

        ids, val = ids[sorter], val[sorter]

        # group boundries are where group ids change
        # unique observations are where sorted values change
-        idx = com._ensure_int64(np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]])
-        inc = com._ensure_int64(np.r_[1, val[1:] != val[:-1]])


revert these 2 lines, these don't pass on windows ATM. this can possibly be changed, but on another PR/issue
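
The thread does not say why the two lines fail on Windows. One plausible reason, stated here as an assumption, is that np.r_ and np.nonzero return platform-dependent integer dtypes, and NumPy's default integer was 32-bit on Windows at the time, while code further down expects int64. The sketch below uses plain astype as a stand-in for the internal com._ensure_int64 and simply shows where the dtypes come from.

```python
import numpy as np

# Already-sorted toy data: group ids and values (illustrative only).
ids = np.array([0, 0, 1, 1, 2])
val = np.array([3, 3, 4, 5, 5])

idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]
inc = np.r_[1, val[1:] != val[:-1]]

# These dtypes depend on the platform and NumPy version.
print(idx.dtype, inc.dtype)

# Casting explicitly gives the same dtype everywhere.
idx64 = idx.astype(np.int64, copy=False)
inc64 = inc.astype(np.int64, copy=False)
print(idx64.dtype, inc64.dtype)
```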

jreback · 2015-09-14T19:51:50Z

@cpcloud can you update (e.g. revert those 2 lines). ok to merge on green after that.

jreback · 2015-09-14T22:11:28Z

ty sir!

cpcloud self-assigned this Sep 12, 2015

cpcloud added this to the 0.17.0 milestone Sep 12, 2015

cpcloud added Bug Blocker Blocking issue or pull request for an upcoming release labels Sep 12, 2015

shoyer reviewed Sep 13, 2015

jreback reviewed Sep 13, 2015

cpcloud changed the title ~~Fix Series nunique groupby with object dtype~~ BUG: Fix Series nunique groupby with object dtype Sep 13, 2015

jreback reviewed Sep 13, 2015

Fix Series.nunique groupby with object

f9e6c3d

cpcloud added a commit that referenced this pull request Sep 14, 2015

Merge pull request #11079 from cpcloud/fix-series-nunique-groupby-wit…

5ee3a4f

…h-object BUG: Fix Series nunique groupby with object dtype

cpcloud merged commit 5ee3a4f into pandas-dev:master Sep 14, 2015

cpcloud deleted the fix-series-nunique-groupby-with-object branch September 14, 2015 22:09

jreback mentioned this pull request Sep 30, 2015

groupby, as_index=False, with pandas.Series.count() as an agg #8381

Closed

jorisvandenbossche mentioned this pull request Nov 18, 2015

BUG: groupby nunique with Categorical and missing categories gives ValueError #11635

Closed
