Skip to content

BUG: pd.cut with bins=1 and input all 0s #15437

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

luca-s
Copy link
Contributor

@luca-s luca-s commented Feb 17, 2017

The special case of running pd.cut() qith bins=1 an input containing all
0s raises a ValueError

@jreback
Copy link
Contributor

jreback commented Feb 17, 2017

pls add #15428 as a tests case as well

@@ -587,3 +587,5 @@ Bug Fixes
- Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`)
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't put whatsnew at the very end as they cause conflicts. the point of the empty space elsewhere is for you to use

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I was actually wondering what was the best place to add the new entry

@@ -587,3 +587,5 @@ Bug Fixes
- Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`)
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)

- Bug in ``pd.cut()`` (:issue:`15428`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a useful message

@@ -297,6 +297,10 @@ def test_single_bin(self):
result = cut(s, 1, labels=False)
tm.assert_series_equal(result, expected)

s = Series([0., 0.])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add as a new test function
add the issue number as a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is related to that test. The test checks that pd.cut works fine when the input array contains identical values and bins=1. That is the general scenario, while this PR covers the specific case of an array contains all 0s. it would be misleading to move the test to another function.

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 17, 2017
@@ -297,6 +297,11 @@ def test_single_bin(self):
result = cut(s, 1, labels=False)
tm.assert_series_equal(result, expected)

# issue 15428
Copy link
Contributor

@jreback jreback Feb 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this for also 2 equal values that are not 0, but with a single bin, unless that is already covered

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's already covered

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great

@jreback
Copy link
Contributor

jreback commented Feb 17, 2017

Regarding the issue #15428 I can add that fix here too but It's a different code bug and it's related to another function (qcut vs cut).

its the same thing. yes please add it.

@luca-s
Copy link
Contributor Author

luca-s commented Feb 17, 2017

I am a little confused, it it ok now or is there something still missing?

@luca-s
Copy link
Contributor Author

luca-s commented Feb 17, 2017

I am referring to #15431, do you want that issue solved in this PR too, even though it's a different bug?

@jreback
Copy link
Contributor

jreback commented Feb 17, 2017

@luca-s yes its virtually the same issue.

@codecov-io
Copy link

codecov-io commented Feb 17, 2017

Codecov Report

Merging #15437 into master will decrease coverage by -0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15437      +/-   ##
==========================================
- Coverage   91.02%   90.99%   -0.03%     
==========================================
  Files         143      143              
  Lines       49295    49295              
==========================================
- Hits        44869    44857      -12     
- Misses       4426     4438      +12
Impacted Files Coverage Δ
pandas/tools/tile.py 96.85% <100%> (ø)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 91.76% <0%> (-0.35%)
pandas/core/common.py 90.96% <0%> (-0.34%)
pandas/core/frame.py 97.87% <0%> (-0.1%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 648ae4f...1248987. Read the comment docs.

@@ -189,6 +189,13 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):
else:
quantiles = q
bins = algos.quantile(x, quantiles)

# fix special case: q=1 and all identical values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems a bit hacky, any way to figure this out a-prior?

@luca-s
Copy link
Contributor Author

luca-s commented Feb 18, 2017

What about this approach? I added an 'allow' option to _bins_to_cuts function. This might actually be useful as a qcut option too

@@ -186,9 +186,13 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):

if is_integer(q):
quantiles = np.linspace(0, 1, q + 1)

if q == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too narrow of a solution as well

raise ValueError('Cannot qcut empty array')

rng = (nanops.nanmin(x), nanops.nanmax(x))
if rng[0] == rng[1] and q == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of using allow I would just adjust the quantiles right here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that but this seems cleaner than increasing the edges by 0.1% for this specific case only (q=1 and identical values). That seems fine with pd.cut but not with pd.qcut, considering that we might be returning those bins (it's not hidden to the user).

Also, I thought that with a little extra work we could make 'allow' a top level pd.qcut option. Compared to 'drop' this option would maintain the total number of quantiles (some of them empty in the case of duplicated bin edges).

What do you think? If you don't like it I'll go with the 0.1% edges increment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any update?

Copy link
Contributor

@jreback jreback Mar 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show an example of what allow would do and why its useful? (apart from the internal use here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason it could be useful is to keep the number of quantiles as per user request. This is the only difference compare to 'drop'. I understand this might not be super useful, so it's up to you. Still I would keep the option for internal use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luca-s I still find this a bit odd, can you do this by simply adjusting the values (and not using this kw, even though its only internal)?

@jreback
Copy link
Contributor

jreback commented Mar 1, 2017

you'll want to rebase on master as travis was having some issues

luca-s and others added 6 commits March 8, 2017 00:43
@luca-s
Copy link
Contributor Author

luca-s commented Mar 7, 2017

What about this new fix attempt?

@jreback
Copy link
Contributor

jreback commented Mar 8, 2017

@luca-s I like this a lot more!

@luca-s
Copy link
Contributor Author

luca-s commented Mar 8, 2017

All good then? Or is there something missing?

@jreback jreback added this to the 0.20.0 milestone Mar 8, 2017
@jreback jreback closed this in d32acaa Mar 8, 2017
@jreback
Copy link
Contributor

jreback commented Mar 8, 2017

thanks @luca-s

keep em coming!

@luca-s luca-s deleted the issue_15428 branch March 8, 2017 13:49
@luca-s luca-s restored the issue_15428 branch March 8, 2017 13:49
@luca-s luca-s deleted the issue_15428 branch March 8, 2017 13:51
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
The special case of running pd.cut() qith bins=1 an input containing
all  0s raises a ValueError

closes pandas-dev#15428
closes pandas-dev#15431

Author: Luca Scarabello <[email protected]>
Author: Luca <[email protected]>

Closes pandas-dev#15437 from luca-s/issue_15428 and squashes the following commits:

1248987 [Luca] rebased on master
def84ba [Luca] Yet another implementation attempt
692503a [Luca Scarabello] Improved solution: using same approach as pd.cut
b7d92dc [Luca] Added 'allow' duplicates option to _bins_to_cuts
f56a27f [Luca Scarabello] Issue pandas-dev#15431
55806cf [Luca Scarabello] BUG: pd.cut with bins=1 and input all 0s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.qcut with q=1 and input with identical values BUG: pd.cut with bins=1 and input all 0s
3 participants