-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: pd.cut with bins=1 and input all 0s #15437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pls add #15428 as a tests case as well |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -587,3 +587,5 @@ Bug Fixes | |||
- Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`) | |||
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`) | |||
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't put whatsnew at the very end as they cause conflicts. the point of the empty space elsewhere is for you to use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I was actually wondering what was the best place to add the new entry
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -587,3 +587,5 @@ Bug Fixes | |||
- Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`) | |||
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`) | |||
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`) | |||
|
|||
- Bug in ``pd.cut()`` (:issue:`15428`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a useful message
pandas/tests/tools/test_tile.py
Outdated
@@ -297,6 +297,10 @@ def test_single_bin(self): | |||
result = cut(s, 1, labels=False) | |||
tm.assert_series_equal(result, expected) | |||
|
|||
s = Series([0., 0.]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add as a new test function
add the issue number as a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is related to that test. The test checks that pd.cut works fine when the input array contains identical values and bins=1. That is the general scenario, while this PR covers the specific case of an array contains all 0s. it would be misleading to move the test to another function.
pandas/tests/tools/test_tile.py
Outdated
@@ -297,6 +297,11 @@ def test_single_bin(self): | |||
result = cut(s, 1, labels=False) | |||
tm.assert_series_equal(result, expected) | |||
|
|||
# issue 15428 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this for also 2 equal values that are not 0, but with a single bin, unless that is already covered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's already covered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok great
its the same thing. yes please add it. |
I am a little confused, it it ok now or is there something still missing? |
I am referring to #15431, do you want that issue solved in this PR too, even though it's a different bug? |
@luca-s yes its virtually the same issue. |
Codecov Report
@@ Coverage Diff @@
## master #15437 +/- ##
==========================================
- Coverage 91.02% 90.99% -0.03%
==========================================
Files 143 143
Lines 49295 49295
==========================================
- Hits 44869 44857 -12
- Misses 4426 4438 +12
Continue to review full report at Codecov.
|
pandas/tools/tile.py
Outdated
@@ -189,6 +189,13 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'): | |||
else: | |||
quantiles = q | |||
bins = algos.quantile(x, quantiles) | |||
|
|||
# fix special case: q=1 and all identical values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems a bit hacky, any way to figure this out a-prior?
What about this approach? I added an 'allow' option to _bins_to_cuts function. This might actually be useful as a qcut option too |
pandas/tools/tile.py
Outdated
@@ -186,9 +186,13 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'): | |||
|
|||
if is_integer(q): | |||
quantiles = np.linspace(0, 1, q + 1) | |||
|
|||
if q == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too narrow of a solution as well
pandas/tools/tile.py
Outdated
raise ValueError('Cannot qcut empty array') | ||
|
||
rng = (nanops.nanmin(x), nanops.nanmax(x)) | ||
if rng[0] == rng[1] and q == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of using allow
I would just adjust the quantiles right here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can do that but this seems cleaner than increasing the edges by 0.1% for this specific case only (q=1 and identical values). That seems fine with pd.cut but not with pd.qcut, considering that we might be returning those bins (it's not hidden to the user).
Also, I thought that with a little extra work we could make 'allow' a top level pd.qcut option. Compared to 'drop' this option would maintain the total number of quantiles (some of them empty in the case of duplicated bin edges).
What do you think? If you don't like it I'll go with the 0.1% edges increment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you show an example of what allow
would do and why its useful? (apart from the internal use here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason it could be useful is to keep the number of quantiles as per user request. This is the only difference compare to 'drop'. I understand this might not be super useful, so it's up to you. Still I would keep the option for internal use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luca-s I still find this a bit odd, can you do this by simply adjusting the values (and not using this kw, even though its only internal)?
you'll want to rebase on master as travis was having some issues |
The special case of running pd.cut() qith bins=1 an input containing all 0s raises a ValueError
BUG: pd.qcut with q=1 and input with identical values
What about this new fix attempt? |
@luca-s I like this a lot more! |
All good then? Or is there something missing? |
thanks @luca-s keep em coming! |
The special case of running pd.cut() qith bins=1 an input containing all 0s raises a ValueError closes pandas-dev#15428 closes pandas-dev#15431 Author: Luca Scarabello <[email protected]> Author: Luca <[email protected]> Closes pandas-dev#15437 from luca-s/issue_15428 and squashes the following commits: 1248987 [Luca] rebased on master def84ba [Luca] Yet another implementation attempt 692503a [Luca Scarabello] Improved solution: using same approach as pd.cut b7d92dc [Luca] Added 'allow' duplicates option to _bins_to_cuts f56a27f [Luca Scarabello] Issue pandas-dev#15431 55806cf [Luca Scarabello] BUG: pd.cut with bins=1 and input all 0s
The special case of running pd.cut() qith bins=1 an input containing all
0s raises a ValueError
closes BUG: pd.qcut with q=1 and input with identical values #15431
git diff upstream/master | flake8 --diff