BUG: floats cannot be ranked with tolerance #8379
Conversation
hmm this was done for a reason but I don't remember exactly
my impression is that the correct way to compare floats is with a tolerance. At the very least, we need to consider this further - it's not as simple as this example.
but clearly the tolerance is too big somehow; the actual behavior is clearly wrong, but we want to replace it with something that is definitely more correct.
As for the actual tolerance value, one thing that I know for sure about floats is that all tolerances should be configurable, because whatever sane tolerance value and type (absolute, relative, ULP) you pick, it won't fit everyone. In this case, even though strictly speaking one doesn't need equality comparisons to rank data, one might want to do some "joining" knowing that the numbers come from some algorithm that produces a limited number of significant digits.
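A small illustration of why the tolerance type matters (my own example, not from the thread; values are made up): the same pair of numbers can be "close" under an absolute tolerance but not a relative one, and vice versa.

```python
import numpy as np

a, b = 1000.0, 1000.05
# An absolute tolerance of 0.1 treats them as equal...
print(np.isclose(a, b, rtol=0.0, atol=0.1))    # True
# ...but a relative tolerance of 1e-6 does not.
print(np.isclose(a, b, rtol=1e-6, atol=0.0))   # False

c, d = 1e-8, 2e-8
# For tiny numbers the verdicts flip: relative says "far", absolute says "close".
print(np.isclose(c, d, rtol=1e-6, atol=0.0))   # False
print(np.isclose(c, d, rtol=0.0, atol=1e-6))   # True
```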
-1- did you notice that no matter how small the tolerance is, one can construct examples where two numbers being equally ranked or not depends on the presence of a third number (i.e. a small step in between)? the problem is using tolerance in ranking, not the size of the tolerance.
-2- a user knows his data better than the library can guess. if he wants tolerance he can simply apply it himself before ranking (see the sketch below).
-3- there is also cross consistency with other popular libraries; #8365 is comparing with the output of another popular library. in certain domains, there is undeniable value in sticking to conventions, even if one can argue an alternative is more technically correct.
-4- then, there is internal consistency; results should be semantically consistent. currently on master, whether two values rank as tied depends on whether such a small step is present in between (see the example in the PR description).
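If the consensus is "let the user apply the tolerance", that could look roughly like this (a sketch with made-up data; rounding and fixed-step binning are just two of several possible user-chosen discretizations):

```python
import numpy as np
import pandas as pd

s = pd.Series([1001.0000, 1001.0002, 3.14159, 2.71828])

# Rank on the raw values: strict comparisons, no hidden tolerance.
print(s.rank())

# Rank after an explicit, user-chosen discretization, e.g. 3 decimal places...
print(s.round(3).rank())

# ...or binning to a fixed step, which makes the equivalence classes explicit.
step = 0.001
print(pd.Series(np.floor(s / step) * step).rank())
```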
see #6886 which was put in place to handle really small floats.
@jreback yes, i had seen that commit before making my code changes, and i think the right thing to do back then was to remove tolerance in ranking floats. if a user wants tolerance, let him apply it himself; ranking floats with absolute/relative tolerance has the logical issue mentioned above.
@behzadnouri no good. need to accommodate both situations. The original problem is just as valid. Is it possible to normalize the numbers first (if needed, or does this cause more issues)? e.g. it's not a problem if it takes a slower loop when precision is an issue (as long as the fast path stays the same).
@behzadnouri also, pls, pls, pls add tests when you put up a PR
The problem with rounding is that the boundaries between equivalence classes become fixed, i.e. if you want 0.1 absolute tolerance and simply round to 1 place after the decimal point, then two values less than 0.1 apart can still land on opposite sides of a rounding boundary and end up in different classes.
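A quick illustration of that boundary effect (values chosen for the example, not from the thread):

```python
# 1.04 and 1.06 are only 0.02 apart, yet rounding to one decimal
# place puts them in different equivalence classes.
print(round(1.04, 1), round(1.06, 1))   # 1.0 1.1

# Meanwhile 1.06 and 1.14 are 0.08 apart and end up in the same class.
print(round(1.06, 1), round(1.14, 1))   # 1.1 1.1
```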
care to weigh in? cc @seth-p
@immerrr and how do you rank that, then? seriously, do you not see the logical issue here?
I agree with @behzadnouri. The problem is that the tolerance-based comparison is not transitive. I agree that the ranking function should be applied based on strict (in)equality, possibly after the user has transformed the data (e.g. by rounding it first). I'm tempted to suggest giving the user the option (which would not be the default) to specify the current behavior with a user-specified tolerance -- with lots of warnings -- but I think it's just handing them a loaded gun, and is probably best avoided.
@behzadnouri, neither of
@jreback, no matter how small the threshold, you can't avoid the problem that is_close(a,b) and is_close(b,c) does not imply is_close(a,c). I would put the onus on the user to map the "raw" values to standardized discrete values, e.g. by rounding to a fixed number of decimal places.
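For completeness, the non-transitivity is easy to demonstrate (made-up values; `math.isclose` stands in for any tolerance-based comparison):

```python
from math import isclose

a, b, c = 1.00, 1.06, 1.12
tol = 0.1

print(isclose(a, b, abs_tol=tol, rel_tol=0.0))  # True:  |1.00 - 1.06| <= 0.1
print(isclose(b, c, abs_tol=tol, rel_tol=0.0))  # True:  |1.06 - 1.12| <= 0.1
print(isclose(a, c, abs_tol=tol, rel_tol=0.0))  # False: |1.00 - 1.12| >  0.1
```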
That would indeed simplify things.
@immerrr, can you elaborate? I don't think I understand what you mean.
Currently
The problem is that the current behavior is not equivalent to discretization. So you really can't do it. I haven't looked at the current behavior in detail, but on quick glance it looks similar to (but not the same as)
Ok, the current algorithm is broken, I've forgotten about that. Perhaps it was supposed to
As for whether this is discretization or not: it does map elements from an almost-infinite set of floats to a finite set of floats that are farther apart than the threshold. It is not guaranteed to be unique, but that sounds like discretization to me.
The current algorithm would; the one that I described would not. Like I said, it's broken now.
Hmm, you're correct, at some point I got confused thinking it returns a series.
How would your proposed algorithm bin [1.04, 1.05, 1.14, 1.15, 1.24, 1.25, ...] with a threshold of 0.1?
That's sort of arbitrary, since you're mapping 1.05 and 1.14 to different bins even though they're <0.1 apart. (The current behavior would regard the whole sequence as a single bin, if I understand it correctly.) So you end up with something that is (a) still arbitrary, and (b) different from the current behavior. Perhaps a better example would be { 1.04, 1.13, 1.22, 1.31, ... } -- i.e. a sequence of numbers 0.09 apart. I see only two plausible binnings: (i) every item is its own bin (which seems odd given that successive items are <0.1 apart); or (ii) all items in a single bin (which is the current behavior, but is odd since most pairs of items are >0.1 apart).
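To make the chaining effect concrete, here is a rough sketch of "merge anything within the threshold of its neighbour" binning (my approximation of the tolerance behaviour being discussed, not pandas' actual code) applied to that 0.09-spaced sequence:

```python
def chain_bins(values, threshold):
    """Group sorted values, merging neighbours closer than `threshold`."""
    values = sorted(values)
    bins, current = [], [values[0]]
    for v in values[1:]:
        if v - current[-1] < threshold:   # close to the previous element...
            current.append(v)             # ...so the chain keeps growing
        else:
            bins.append(current)
            current = [v]
    bins.append(current)
    return bins

# Consecutive gaps are 0.09 < 0.1, so everything collapses into one bin,
# even though the endpoints are 0.27 apart.
print(chain_bins([1.04, 1.13, 1.22, 1.31], threshold=0.1))
# [[1.04, 1.13, 1.22, 1.31]]
```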
I'm all for fixing the current approach. A few thoughts:

- Before I forget, I'd like to reference issue #6945.
- I'm not sure what's wrong with giving users an option of using some sort of precision in the comparisons -- the original code, before I messed with it, had a hard-coded precision limit of 1e-13, so clearly it was an issue for someone when there was no float-precision handling. I don't care if this is a default or an extra option, but I think most people trying to sort floats with ties will hit this issue. (Please, please, please make your tests include ties -- if you're going to change the definition of a tie, please make sure the updated tests deal with ties properly; a sketch follows below.)
- Finally, I believe these algorithms affect both the Series and DataFrame tests, although the PR only updated the frame tests, so check that both have the expected behavior (whatever everyone decides the expected behavior is!).
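On the "tests must include ties" point, a minimal sketch of what such tests could look like (names, values, and the modern `pandas.testing` imports are illustrative, not the actual test suite):

```python
import pandas as pd
from pandas.testing import assert_series_equal, assert_frame_equal

def test_rank_ties_series():
    s = pd.Series([1.0, 2.0, 2.0, 3.0])
    expected = pd.Series([1.0, 2.5, 2.5, 4.0])   # exact ties share an average rank
    assert_series_equal(s.rank(), expected)

def test_rank_ties_frame():
    df = pd.DataFrame({"a": [1.0, 2.0, 2.0, 3.0]})
    expected = pd.DataFrame({"a": [1.0, 2.5, 2.5, 4.0]})
    assert_frame_equal(df.rank(), expected)
```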
how about we make a reasonable default that handles most cases, and maybe a high-precision option? e.g. see for example what is happening for reading float precision in 0.15.0 here
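For reference, the 0.15.0 float-precision option being alluded to is, as far as I can tell, the `float_precision` argument of the CSV readers; usage looks roughly like this (file contents are made up):

```python
from io import StringIO
import pandas as pd

data = "x\n1.0000000000000002\n1.0\n"

# The default (fast) C parser can differ in the last digits for some inputs;
# 'high' (or 'round_trip') trades a bit of speed for accuracy.
fast = pd.read_csv(StringIO(data))
precise = pd.read_csv(StringIO(data), float_precision="high")
print(fast["x"].tolist(), precise["x"].tolist())
```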
Just want to throw in some context here to make sure everyone's on the same page:
@jreback tests added
@seth-p I shouldn't have attempted to devise clever algorithms after work, the wording I had presented is wrong in several places. Essentially I was trying to achieve a "maximum independent set" binning on a graph with an edge between every pair of values that lie within the threshold of each other.
If I understand you correctly, you're basically saying that there exists another variant of binning of the same array of values in which these two numbers are in the same bin. That is correct: since a maximum independent set is not guaranteed to be unique, I'm choosing one arbitrarily, because any of them is good enough.
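A rough sketch of that idea (greedy left-to-right selection of representatives, which is optimal for this 1-D "closer than threshold" graph; attaching the remaining values to the nearest representative is an arbitrary choice of mine):

```python
def mis_bins(values, threshold):
    """Pick representatives pairwise >= threshold apart, then attach the rest."""
    values = sorted(values)
    reps = [values[0]]
    for v in values[1:]:
        if v - reps[-1] >= threshold:     # far enough from the last representative
            reps.append(v)
    # Attach every value to its nearest representative.
    return {v: min(reps, key=lambda r: abs(v - r)) for v in values}

# 1.05 and 1.14 end up attached to different representatives,
# matching the binning discussed above.
print(mis_bins([1.04, 1.05, 1.14, 1.15, 1.24, 1.25], threshold=0.1))
```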
@behzadnouri care to summarize the current state of this PR and enumerate issues raised above?
@jreback i think removing tolerance altogether is the way to go (i.e. what this pr is doing). the point is that, as explained in this comment, "a is close to b" and "b is close to c" does not imply that "a is close to c", i.e. it is not transitive. so, ranking with tolerance is not well defined; as in the example in the first comment, this can cause inconsistent behaviour.
```diff
 iranks = iseries.rank()
-assert_series_equal(iranks, exp)
+assert_series_equal(iranks, exp, check_dtype=False)
```
make the expected of the correct dtype
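What "make the expected of the correct dtype" could look like in practice (a hypothetical sketch; `iseries` and `exp` are the names from the diff above, with made-up values):

```python
import pandas as pd
from pandas.testing import assert_series_equal

iseries = pd.Series([1, 3, 2])       # integer input
exp = pd.Series([1.0, 3.0, 2.0])     # rank() returns floats, so build exp as float64
iranks = iseries.rank()
assert_series_equal(iranks, exp)     # no need for check_dtype=False
```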
@jreback made the changes
FWIW, I (still) agree with @behzadnouri.
@behzadnouri thanks!
closes #8365
using tolerance in ranking floats can result in inconsistent behavior; currently on master:
so, in effect 1001 == 1001.0002 if one takes small steps in between.

on branch:

original issue (#8365):
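The stripped code blocks aside, the gist of the example can be reconstructed roughly like this (my own sketch of the reported behaviour; the printed results reflect the strict, post-fix ranking):

```python
import pandas as pd

# Without an intermediate value, 1001 and 1001.0002 get distinct ranks.
print(pd.Series([1001.0, 1001.0002]).rank().tolist())
# [1.0, 2.0]

# The reported bug: on master at the time, inserting a "small step" in between
# chained the three values together via the tolerance, effectively making
# 1001 == 1001.0002. With strict comparisons (this PR), the extra value
# only shifts the ranks:
print(pd.Series([1001.0, 1001.0001, 1001.0002]).rank().tolist())
# [1.0, 2.0, 3.0]
```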