Skip to content

BUG: GH13629 Binned groupby median function calculates median on empt… #14225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 18, 2016

Conversation

paul-mannino
Copy link
Contributor

@codecov-io
Copy link

codecov-io commented Sep 15, 2016

Current coverage is 85.24% (diff: 100%)

Merging #14225 into master will increase coverage by <.01%

@@             master     #14225   diff @@
==========================================
  Files           140        140          
  Lines         50559      50560     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43099      43101     +2   
+ Misses         7460       7459     -1   
  Partials          0          0          

Powered by Codecov. Last update e8357a1...42affd5

@jreback jreback added Bug Groupby Categorical Categorical Data Type labels Sep 15, 2016
@@ -4429,7 +4429,13 @@ def _groupby_indices(values):
# bit better than factorizing again
reverse = dict(enumerate(values.categories))
codes = values.codes.astype('int64')
_, counts = _hash.value_count_int64(codes, False)
keys, counts = _hash.value_count_int64(codes, False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adjusting the hash table value counts, I think you could use np.bincount here? - see Categorical.value_counts

Copy link
Contributor Author

@paul-mannino paul-mannino Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per your suggestion, I borrowed some of the logic from Categorical.value_counts. The mask.all() check seemed to make the code run about as fast (or slower) in this case, so I left it out.


for op, targop in ops:
result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
expected = df.groupby(pd.cut(df[0], grps)).agg(targop)
Copy link
Contributor

@chris-b1 chris-b1 Sep 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because agg intercepts methods (i.e. np.median is interpreted as 'median'), this doesn't fail on master - wrap the targop in a lambda.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test that hits the public api (e.g. df.groupby(...).median())?

Copy link
Contributor

@chris-b1 chris-b1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...

@chris-b1
Copy link
Contributor

lgtm, @jreback?

_, counts = _hash.value_count_int64(codes, False)
keys, counts = _hash.value_count_int64(codes, False)

if not counts.size == values.categories.size:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this?
when is this hit?

Copy link
Contributor Author

@paul-mannino paul-mannino Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _algos.groupby_indices function expects a counts array that is the same length as the categories. If one of the categories has no values in it (empty bins), the _hash.value_count_int64 function would just return a shorter count array (with no 0 entries for empty categories/bins). My initial approach was to pad the array with 0s in the right places, but per chris-b1's suggestion, I threw out this code and used np.bincount instead.

@jreback
Copy link
Contributor

jreback commented Sep 18, 2016

lgtm

merge away

@chris-b1 chris-b1 merged commit db9dc65 into pandas-dev:master Sep 18, 2016
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Oct 13, 2016
* commit 'v0.19.0rc1-25-ga7469cf': (471 commits)
  ENH: Add divmod to series and index. (pandas-dev#14208)
  Fix generator tests to run (pandas-dev#14245)
  BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225)
  TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240)
  DOC: added example to Series.map showing use of na_action parameter (GH14231)
  DOC: split docstring into multiple lines in excel.py (pandas-dev#14073)
  MAINT: Use __module__ in _DeprecatedModule. (pandas-dev#14181)
  ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002)
  DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211)
  BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797
  BUG: add check for infinity in __call__ of EngFormatter
  In gbq.to_gbq allow the DataFrame column order to differ from schema
  BLD: require cython if tempita is needed
  DOC: add source links to api docs (pandas-dev#14200)
  BUG: compat with Stata ver 111
  Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198)
  BLD: Test for Python 3.5 with C locale
  BUG: DatetimeTZBlock can't assign values near dst boundary
  BUG: union_categorical with Series and cat idx
  BUG: fix str.contains for series containing only nan values
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Binned groupby median function calculates median on empty bins and outputs random numbers
4 participants