qcut can fail for highly discontinuous data distributions #15069
Comments
There has recently been some improvement regarding this. With master:
So there is an option to deal with duplicate edges, but the behavior chosen here is to use fewer bins instead of assigning all values to one of the bins.
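To make the trade-off concrete, here is a minimal sketch using the `duplicates` parameter of `pd.qcut`. The data is hypothetical (it is not the sample from the original report), chosen so that the 25%, 50% and 75% quantiles all equal 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one 0, eight 1s, one 2, so three quartile edges coincide at 1.
values = np.array([0] + [1] * 8 + [2])

# Default behavior (duplicates="raise"): repeated bin edges are an error.
try:
    pd.qcut(values, 4)
    raised = False
except ValueError:
    raised = True

# duplicates="drop" discards the repeated edges, which leaves fewer bins
# than the four that were requested.
binned = pd.qcut(values, 4, duplicates="drop")
n_bins = len(binned.categories)

print(raised)   # whether the default call raised
print(n_bins)   # number of bins actually produced
print(binned.value_counts())
```

With this input, dropping duplicates collapses the edges to 0, 1 and 2, so the 0 and all the 1s share the lower bin and only the 2 lands in the upper one.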
Effectively, the
I looked at the behavior after #15000. I'm going to leave this issue open for now. We should look into the quantile algorithms in other statistical packages. For example, we have:
I think having duplicate bin edges is fine as long as we have a convention about which bin to assign the data to. I would argue that in this case the correct sample quantiles are
Can someone review the commit I just made? This is the first time I am contributing to pandas. Do I need to write a test for it? Also, I need to know what to do with the duplicates parameter. Thank you!
Yes please to writing a test - this is often a good first step @puneet29
Code Sample, a copy-pastable example if possible
This code fails for any `K`:
Problem description
With pandas 0.19.2, I have:
Expected Output
We need some kind of option to decide how to assign values to a quantile bucket in the event that two quantiles have the same value prior to the `searchsorted` call. In this case, the appropriate behavior may be to assign all `1` values to the 50% quantile bucket.
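As an illustration of why a convention is needed, the sketch below calls `np.searchsorted` directly on a set of quartile edges that contain a repeated value. The edges and values are hypothetical, and this is not pandas' internal code path; it only shows how the `side` argument changes where tied values land.

```python
import numpy as np

# Hypothetical quartile edges [min, 25%, 50%, 75%, max] with a repeated value,
# as would arise from data that is mostly 1s.
edges = np.array([0.0, 1.0, 1.0, 1.0, 2.0])
values = np.array([0.0, 1.0, 2.0])

# Only the interior edges separate buckets; the four buckets are numbered 0..3.
interior = edges[1:-1]

# side="left": a value equal to the repeated edge goes to the lowest tied bucket.
left_ids = np.searchsorted(interior, values, side="left")

# side="right": a value equal to the repeated edge goes to the highest tied bucket.
right_ids = np.searchsorted(interior, values, side="right")

print(left_ids)
print(right_ids)
```

Neither choice of `side` places the tied values in a middle bucket, so a rule such as "assign all `1` values to the 50% quantile bucket" would need extra logic on top of `searchsorted`.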