BUG: GH13629 Binned groupby median function calculates median on empt… #14225

paul-mannino · 2016-09-15T01:57:41Z

[ x] closes BUG: Binned groupby median function calculates median on empty bins and outputs random numbers #13629
[ x] tests added / passed
[ x] passes git diff upstream/master | flake8 --diff
[x ] whatsnew entry

codecov-io · 2016-09-15T02:24:52Z

Current coverage is 85.24% (diff: 100%)

Merging #14225 into master will increase coverage by <.01%

@@             master     #14225   diff @@
==========================================
  Files           140        140          
  Lines         50559      50560     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43099      43101     +2   
+ Misses         7460       7459     -1   
  Partials          0          0

Powered by Codecov. Last update e8357a1...42affd5

chris-b1 · 2016-09-17T13:07:09Z

pandas/core/groupby.py

@@ -4429,7 +4429,13 @@ def _groupby_indices(values):
        # bit better than factorizing again
        reverse = dict(enumerate(values.categories))
        codes = values.codes.astype('int64')
-        _, counts = _hash.value_count_int64(codes, False)
+        keys, counts = _hash.value_count_int64(codes, False)


Rather than adjusting the hash table value counts, I think you could use np.bincount here? - see Categorical.value_counts

Per your suggestion, I borrowed some of the logic from Categorical.value_counts. The mask.all() check seemed to make the code run about as fast (or slower) in this case, so I left it out.

chris-b1 · 2016-09-17T13:12:47Z

pandas/tests/test_groupby.py

+
+        for op, targop in ops:
+            result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
+            expected = df.groupby(pd.cut(df[0], grps)).agg(targop)


Because agg intercepts methods (i.e. np.median is interpreted as 'median'), this doesn't fail on master - wrap the targop in a lambda.

Can you also add a test that hits the public api (e.g. df.groupby(...).median())?

chris-b1

...

…y bins and outputs random numbers

chris-b1 · 2016-09-18T15:48:42Z

lgtm, @jreback?

jreback · 2016-09-15T02:03:00Z

pandas/core/groupby.py

-        _, counts = _hash.value_count_int64(codes, False)
+        keys, counts = _hash.value_count_int64(codes, False)
+
+        if not counts.size == values.categories.size:


what is the purpose of this?
when is this hit?

The _algos.groupby_indices function expects a counts array that is the same length as the categories. If one of the categories has no values in it (empty bins), the _hash.value_count_int64 function would just return a shorter count array (with no 0 entries for empty categories/bins). My initial approach was to pad the array with 0s in the right places, but per chris-b1's suggestion, I threw out this code and used np.bincount instead.

jreback · 2016-09-18T15:54:35Z

lgtm

merge away

* commit 'v0.19.0rc1-25-ga7469cf': (471 commits) ENH: Add divmod to series and index. (pandas-dev#14208) Fix generator tests to run (pandas-dev#14245) BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225) TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240) DOC: added example to Series.map showing use of na_action parameter (GH14231) DOC: split docstring into multiple lines in excel.py (pandas-dev#14073) MAINT: Use __module__ in _DeprecatedModule. (pandas-dev#14181) ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002) DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211) BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797 BUG: add check for infinity in __call__ of EngFormatter In gbq.to_gbq allow the DataFrame column order to differ from schema BLD: require cython if tempita is needed DOC: add source links to api docs (pandas-dev#14200) BUG: compat with Stata ver 111 Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198) BLD: Test for Python 3.5 with C locale BUG: DatetimeTZBlock can't assign values near dst boundary BUG: union_categorical with Series and cat idx BUG: fix str.contains for series containing only nan values ...

jreback added Bug Groupby Categorical Categorical Data Type labels Sep 15, 2016

chris-b1 reviewed Sep 17, 2016

View reviewed changes

chris-b1 suggested changes Sep 17, 2016

View reviewed changes

paul-mannino force-pushed the master branch from 109fda8 to 5baa875 Compare September 18, 2016 14:17

BUG: GH13629 Binned groupby median function calculates median on empt…

42affd5

…y bins and outputs random numbers

paul-mannino force-pushed the master branch from 5baa875 to 42affd5 Compare September 18, 2016 14:23

chris-b1 approved these changes Sep 18, 2016

View reviewed changes

jreback approved these changes Sep 18, 2016

View reviewed changes

chris-b1 merged commit db9dc65 into pandas-dev:master Sep 18, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG: GH13629 Binned groupby median function calculates median on empt… #14225

BUG: GH13629 Binned groupby median function calculates median on empt… #14225

Uh oh!

paul-mannino commented Sep 15, 2016

Uh oh!

codecov-io commented Sep 15, 2016 •

edited

Loading

Uh oh!

chris-b1 Sep 17, 2016

Uh oh!

paul-mannino Sep 18, 2016 •

edited

Loading

Uh oh!

chris-b1 Sep 17, 2016 •

edited

Loading

Uh oh!

chris-b1 Sep 17, 2016

Uh oh!

chris-b1 left a comment

Uh oh!

chris-b1 commented Sep 18, 2016

Uh oh!

jreback Sep 15, 2016

Uh oh!

paul-mannino Sep 18, 2016 •

edited

Loading

Uh oh!

jreback commented Sep 18, 2016

Uh oh!

Uh oh!

Uh oh!

BUG: GH13629 Binned groupby median function calculates median on empt… #14225

BUG: GH13629 Binned groupby median function calculates median on empt… #14225

Uh oh!

Conversation

paul-mannino commented Sep 15, 2016

Uh oh!

codecov-io commented Sep 15, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current coverage is 85.24% (diff: 100%)

Uh oh!

chris-b1 Sep 17, 2016

Choose a reason for hiding this comment

Uh oh!

paul-mannino Sep 18, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chris-b1 Sep 17, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chris-b1 Sep 17, 2016

Choose a reason for hiding this comment

Uh oh!

chris-b1 left a comment

Choose a reason for hiding this comment

Uh oh!

chris-b1 commented Sep 18, 2016

Uh oh!

jreback Sep 15, 2016

Choose a reason for hiding this comment

Uh oh!

paul-mannino Sep 18, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Sep 18, 2016

Uh oh!

Uh oh!

codecov-io commented Sep 15, 2016 •

edited

Loading

paul-mannino Sep 18, 2016 •

edited

Loading

chris-b1 Sep 17, 2016 •

edited

Loading

paul-mannino Sep 18, 2016 •

edited

Loading