Skip to content

BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 #21957

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

topper-123
Copy link
Contributor

See #21956 for details.

@topper-123 topper-123 changed the title BUG: fix bug where np.bincount default arg minlength must be None for np<1.13 BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 Jul 17, 2018
def test_count_with_only_nans_in_first_group(self):
# GH21956
df = DataFrame({'A': [np.nan, np.nan], 'B': ['a', 'b'], 'C': [1, 2]})
result = df.groupby(['A', 'B']).C.count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u do an assert_produces_warning()

here (should not show a warning)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not produce any warnings for me.

What should look for in travis? Grep "SeriesGroupBy.count"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just look at the warnings in the 3.6 log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will look in the morning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can you assert that this produces no warning (e.g. it IS producing a warning now in numpy < 1.13)

@@ -471,7 +471,7 @@ Groupby/Resample/Rolling

- Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` with ``as_index=False`` leading to the loss of timezone information (:issue:`15884`)
- Bug in :meth:`DatetimeIndex.resample` when downsampling across a DST boundary (:issue:`8531`)
-
- Bug in :func:`pandas.core.groupby.GroupBy.count` when using numpy < 1.13 and ngroups=0 (:issue:`21956`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a note is not necessary this is not user visible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug shown in #21956 is user visible...

@@ -1207,7 +1208,10 @@ def count(self):

mask = (ids != -1) & ~isna(val)
ids = ensure_platform_int(ids)
out = np.bincount(ids[mask], minlength=ngroups or 0)
minlength = ngroups or 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u simplify this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the " or 0" part?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this a one-liner in the newest update.

@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch 2 times, most recently from 2d4ab97 to d65615b Compare July 17, 2018 23:20
@codecov
Copy link

codecov bot commented Jul 17, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@495a0c1). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #21957   +/-   ##
=========================================
  Coverage          ?   92.02%           
=========================================
  Files             ?      170           
  Lines             ?    50710           
  Branches          ?        0           
=========================================
  Hits              ?    46664           
  Misses            ?     4046           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.42% <100%> (?)
#single 42.31% <0%> (?)
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 86.8% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 495a0c1...32a5fcf. Read the comment docs.

@gfyoung gfyoung added Groupby Compat pandas objects compatability with Numpy or Python functions labels Jul 18, 2018
@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch from 91343df to 97ead75 Compare July 18, 2018 10:31
@@ -1207,7 +1208,8 @@ def count(self):

mask = (ids != -1) & ~isna(val)
ids = ensure_platform_int(ids)
out = np.bincount(ids[mask], minlength=ngroups or 0)
minlength = ngroups or (None if _np_version_under1p13 else 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I meant is can you just pass None always

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right, th doc string says "minlength : int, optional" and it accepts None for numpy >1.13. I'll change it.

@@ -31,6 +31,7 @@ Bug Fixes
**Groupby/Resample/Rolling**

- Bug where calling :func:`DataFrameGroupBy.agg` with a list of functions including ``ohlc`` as the non-initial element would raise a ``ValueError`` (:issue:`21716`)
- Bug in :func:`pandas.core.groupby.SeriesGroupBy.count` when using numpy < 1.13 and ngroups=0 (:issue:`21956`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there anything user visible atm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as the per the example in #21956:

>>> df = pd.DataFrame({'A': [np.nan, np.nan], 'B': ['a', 'b'], 'C':[1,2]})
>>> df.groupby(['A', 'B']).C.count()
ValueError: minlength must be positive  # when numpy <1.13

@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch 2 times, most recently from 2b202e0 to b250bce Compare July 25, 2018 10:40
@@ -32,6 +32,7 @@ Bug Fixes

- Bug where calling :func:`DataFrameGroupBy.agg` with a list of functions including ``ohlc`` as the non-initial element would raise a ``ValueError`` (:issue:`21716`)
- Bug in ``roll_quantile`` caused a memory leak when calling ``.rolling(...).quantile(q)`` with ``q`` in (0,1) (:issue:`21965`)
- Bug in :func:`pandas.core.groupby.SeriesGroupBy.count` when using numpy < 1.13 and ngroups=0 (:issue:`21956`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, is this actually a user facing change? (not against the note), just this is not understandable to user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, new explanation uploaded.

@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch from b250bce to eea3c76 Compare July 25, 2018 12:34
def test_count_with_only_nans_in_first_group(self):
# GH21956
df = DataFrame({'A': [np.nan, np.nan], 'B': ['a', 'b'], 'C': [1, 2]})
result = df.groupby(['A', 'B']).C.count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can you assert that this produces no warning (e.g. it IS producing a warning now in numpy < 1.13)

@topper-123
Copy link
Contributor Author

@jreback I got a straight up ValueError in master, so no warning. That ValueError is not in this PR, obviously

I've got a feeling I don't understand you. Could you give an example where a warning is emitted?

@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch from eea3c76 to 4e59555 Compare July 25, 2018 18:36
@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch from 4e59555 to 5c978e2 Compare July 25, 2018 20:53
@@ -32,6 +32,8 @@ Bug Fixes

- Bug where calling :func:`DataFrameGroupBy.agg` with a list of functions including ``ohlc`` as the non-initial element would raise a ``ValueError`` (:issue:`21716`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to 0.24.0, as this wont' backport cleanly.

@jreback jreback added this to the 0.24.0 milestone Jul 26, 2018
@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch from 5c978e2 to 49f30c2 Compare July 26, 2018 12:53
@pep8speaks
Copy link

pep8speaks commented Jul 26, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 26, 2018 at 12:58 Hours UTC

@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch from 49f30c2 to 17ad763 Compare July 26, 2018 12:54
@topper-123 topper-123 force-pushed the groupby_count_with_np_under_v1.13 branch from 17ad763 to 32a5fcf Compare July 26, 2018 12:58
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. ping on green.

@topper-123
Copy link
Contributor Author

Something's wrong with travis, else green:

Worker information
hostname: 1773d14a-144b-4458-b6fe-9060bb57620b@1.production-2-worker-org-09-packet
version: v4.0.0 https://github.com/travis-ci/worker/tree/e5cb567e10c0eefe380e81c9a2229ac8fb6a16ce
instance: 2a708b5 travisci/ci-garnet:packer-1512502276-986baf0 (via amqp)
startup: 2.24796375s
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

@jreback jreback merged commit bb451e8 into pandas-dev:master Jul 28, 2018
@jreback
Copy link
Contributor

jreback commented Jul 28, 2018

thanks @topper-123

minggli added a commit to minggli/pandas that referenced this pull request Jul 28, 2018
* master:
  BENCH: asv csv reading benchmarks no longer read StringIO objects off the end (pandas-dev#21807)
  BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 (pandas-dev#21224)
  BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 (pandas-dev#21957)
  CLN: Vbench to asv conversion script (pandas-dev#22089)
  consistent docstring (pandas-dev#22066)
  TST: skip pytables test with not-updated pytables conda package (pandas-dev#22099)
  CLN: Remove Legacy MultiIndex Index Compatibility (pandas-dev#21740)
  DOC: Reword doc for filepath_or_buffer in read_csv (pandas-dev#22058)
  BUG: rolling with MSVC 2017 build (pandas-dev#21813)
@jreback
Copy link
Contributor

jreback commented Aug 1, 2018

so these are the warnings: https://travis-ci.org/pandas-dev/pandas/jobs/410445444

/home/travis/build/pandas-dev/pandas/pandas/core/groupby/generic.py:1209: DeprecationWarning: 0 should be passed as minlength instead of None; this will error in future.
    out = np.bincount(ids[mask], minlength=ngroups or None)

can you PR to fix this. I guess we have to switch on the version and actually pass 0 rather than None for > 113

@topper-123 topper-123 deleted the groupby_count_with_np_under_v1.13 branch August 1, 2018 10:43
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: GroupBy.count arg minlength passed to np.bincount must be None rather than 0 for np<1.13
4 participants