-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix Series nunique groupby with object dtype #11079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix Series nunique groupby with object dtype #11079
Conversation
cc @behzadnouri or you want to use do a factorize? |
diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index f34fd6e..39be706 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -2565,13 +2565,21 @@ class SeriesGroupBy(GroupBy):
ids, _, _ = self.grouper.group_info
val = self.obj.get_values()
- sorter = np.lexsort((val, ids))
+ try:
+ sorter = np.lexsort((val, ids))
+ except TypeError:
+ val, _ = algos.factorize(val, sort=False)
+ sorter = np.lexsort((val, ids))
+ isnull = lambda a: a == -1
+ else:
+ isnull = com.isnull
+
ids, val = ids[sorter], val[sorter]
# group boundries are where group ids change
# unique observations are where sorted values change
- idx = com._ensure_int64(np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]])
- inc = com._ensure_int64(np.r_[1, val[1:] != val[:-1]])
+ idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]
+ inc = np.r_[1, val[1:] != val[:-1]]
# 1st item of each group is a new unique observation
mask = isnull(val) |
@behzadnouri If you submit that as a PR to my fork, I'll happily merge it in |
the error says once, values are factorized, original values are not needed and it is more efficient to work with integer factors, as long as |
@jreback getting a somewhat strange travis failure here that seems unrelated to my change: https://travis-ci.org/cpcloud/pandas/jobs/80036566 |
any ideas what that might be? |
hmm someone else reported that as well |
try: | ||
sorter = np.lexsort((val, ids)) | ||
except TypeError: | ||
val, _ = algos.factorize(val, sort=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this means object dtype? Maybe add an assert?
this is passing on my travis ci fork |
try: | ||
sorter = np.lexsort((val, ids)) | ||
except TypeError: | ||
assert val.dtype == object, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment here that this catches object
dtypes
couple comments, squash and merge ok |
@@ -1135,3 +1135,4 @@ Bug Fixes | |||
- Bug in ``DatetimeIndex`` cannot infer negative freq (:issue:`11018`) | |||
- Remove use of some deprecated numpy comparison operations, mainly in tests. (:issue:`10569`) | |||
- Bug in ``Index`` dtype may not applied properly (:issue:`11017`) | |||
- Bug in Series groupby when calling nunique on an object dtype (:issue:`11077`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this to where the other nunique comment is (just put in in the list)
ids, val = ids[sorter], val[sorter] | ||
|
||
# group boundries are where group ids change | ||
# unique observations are where sorted values change | ||
idx = com._ensure_int64(np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]) | ||
inc = com._ensure_int64(np.r_[1, val[1:] != val[:-1]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert these 2 lines, these don't pass on windows ATM. this can possibly be changed, but on another PR/issue
@cpcloud can you update (e.g. revert those 2 lines). ok to merge on green after that. |
…h-object BUG: Fix Series nunique groupby with object dtype
ty sir! |
closes #11077