-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: GH13629 Binned groupby median function calculates median on empt… #14225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Current coverage is 85.24% (diff: 100%)@@ master #14225 diff @@
==========================================
Files 140 140
Lines 50559 50560 +1
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43099 43101 +2
+ Misses 7460 7459 -1
Partials 0 0
|
@@ -4429,7 +4429,13 @@ def _groupby_indices(values): | |||
# bit better than factorizing again | |||
reverse = dict(enumerate(values.categories)) | |||
codes = values.codes.astype('int64') | |||
_, counts = _hash.value_count_int64(codes, False) | |||
keys, counts = _hash.value_count_int64(codes, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than adjusting the hash table value counts, I think you could use np.bincount
here? - see Categorical.value_counts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per your suggestion, I borrowed some of the logic from Categorical.value_counts. The mask.all() check seemed to make the code run about as fast (or slower) in this case, so I left it out.
|
||
for op, targop in ops: | ||
result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op) | ||
expected = df.groupby(pd.cut(df[0], grps)).agg(targop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because agg
intercepts methods (i.e. np.median
is interpreted as 'median'
), this doesn't fail on master - wrap the targop in a lambda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a test that hits the public api (e.g. df.groupby(...).median()
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...
109fda8
to
5baa875
Compare
…y bins and outputs random numbers
5baa875
to
42affd5
Compare
lgtm, @jreback? |
_, counts = _hash.value_count_int64(codes, False) | ||
keys, counts = _hash.value_count_int64(codes, False) | ||
|
||
if not counts.size == values.categories.size: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of this?
when is this hit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _algos.groupby_indices function expects a counts array that is the same length as the categories. If one of the categories has no values in it (empty bins), the _hash.value_count_int64 function would just return a shorter count array (with no 0 entries for empty categories/bins). My initial approach was to pad the array with 0s in the right places, but per chris-b1's suggestion, I threw out this code and used np.bincount instead.
lgtm merge away |
* commit 'v0.19.0rc1-25-ga7469cf': (471 commits) ENH: Add divmod to series and index. (pandas-dev#14208) Fix generator tests to run (pandas-dev#14245) BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225) TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240) DOC: added example to Series.map showing use of na_action parameter (GH14231) DOC: split docstring into multiple lines in excel.py (pandas-dev#14073) MAINT: Use __module__ in _DeprecatedModule. (pandas-dev#14181) ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002) DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211) BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797 BUG: add check for infinity in __call__ of EngFormatter In gbq.to_gbq allow the DataFrame column order to differ from schema BLD: require cython if tempita is needed DOC: add source links to api docs (pandas-dev#14200) BUG: compat with Stata ver 111 Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198) BLD: Test for Python 3.5 with C locale BUG: DatetimeTZBlock can't assign values near dst boundary BUG: union_categorical with Series and cat idx BUG: fix str.contains for series containing only nan values ...
git diff upstream/master | flake8 --diff