BUG: floats cannot be ranked with tolerance #8379


Merged
merged 1 commit into pandas-dev:master from the rank-tol branch on Mar 3, 2015

Conversation

behzadnouri
Contributor

closes #8365

using tolerance in ranking floats can result in inconsistent behavior; currently on master:

>>> pd.Series([1001, 1001.0002]).rank()
0    1
1    2
dtype: float64
>>> pd.Series([1001, 1001.0001, 1001.0002]).rank()
0    2
1    2
2    2
dtype: float64

so, in effect 1001 == 1001.0002 if one takes small steps in between.

on branch:

>>> pd.Series([1001, 1001.0002]).rank()
0    1
1    2
dtype: float64
>>> pd.Series([1001, 1001.0001, 1001.0002]).rank()
0    1
1    2
2    3
dtype: float64

original issue (#8365):

>>> from scipy import stats
>>> ts = pd.Series([1000.000669 , 1000.000041 , 1000.000059 , 1000.000063 , 1000.000121 , 1000.000104 , 1000.000040 , 1000.000062 , 1000.000095 , 1000.000091 , 1000.000050 , 1000.000074 , 1000.000063 , 1000.000076 , 1000.000083 , 1000.000061 , 1000.000030 , 1000.000069 , 1000.000090 , 1000.000116 , 1000.000058 , 1000.000074 , 1000.000035 , 1000.000084 , 1000.000067 , 1000.000072 , 1000.000105 , 1000.000091 , 1000.000077 , 1000.000040 , 1000.000108 , 1000.000117 , 1000.000114 , 1000.000117 , 1000.000099 , 1000.000039 , 1000.000046 , 1000.000105 , 1000.000057])
>>> np.all(ts.rank() == stats.rankdata(ts))
True

@jreback
Contributor

jreback commented Sep 24, 2014

hmm this was done for a reason but I don't remember exactly

@behzadnouri behzadnouri changed the title ranking floats cannot be done with tolerance BUG: ranking floats cannot be done with tolerance Sep 24, 2014
@behzadnouri behzadnouri changed the title BUG: ranking floats cannot be done with tolerance BUG: floats cannot be ranked with tolerance Sep 24, 2014
@jtratner
Contributor

my impression is that the correct way to compare floats is with a tolerance, because != can give inconsistent results on different systems (due to how floats are represented). Quick googling for 'comparing floating point numbers' gives a couple of different explanations and suggests that relative tolerance is a good middle ground.

At the very least, we need to consider this further - it's not as simple as this example.

@jtratner
Contributor

but clearly the tolerance is too big somehow; the actual behavior is clearly wrong, but we want to replace it with something that is definitely more correct.

@immerrr
Contributor

immerrr commented Sep 24, 2014

1.001E3 ~= 1.001002E3 (up to 1e-4) is a bug in the equivalence-cluster calculation: the code should compare cur_value to cluster_min instead of prev_value, to make sure that float_is_diff(cur_value, a_i) is False for each a_i in the cluster. Setting the tolerance to zero just masks the issue by making cluster_min == prev_element.

As for the actual tolerance value, one thing that I know for sure about floats is that all tolerances should be configurable, because whatever sane tolerance value and type (absolute, relative, ULP) you pick, it won't fit everyone. In this case, even though strictly speaking one doesn't need equality comparisons to rank data, one might want to do some "joining" knowing that the numbers come from some algorithm that produces a limited number of significant digits.
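
A minimal sketch of the two comparison strategies described above; the helper names and the 1e-7 relative tolerance are illustrative stand-ins, not the actual Cython code:

import numpy as np

def are_diff(a, b, tol=1e-7):
    # illustrative relative-tolerance check, roughly in the spirit of float64_are_diff
    return abs(a - b) > tol * max(abs(a), abs(b))

def cluster_ids(values, compare_to="prev"):
    # assign an equivalence-cluster id to each value of the sorted array
    values = np.sort(np.asarray(values, dtype=float))
    ids = np.zeros(len(values), dtype=int)
    cluster_min = values[0]
    for i in range(1, len(values)):
        ref = values[i - 1] if compare_to == "prev" else cluster_min
        if are_diff(values[i], ref):
            ids[i] = ids[i - 1] + 1
            cluster_min = values[i]
        else:
            ids[i] = ids[i - 1]
    return ids

print(cluster_ids([1001, 1001.0001, 1001.0002], "prev"))         # [0 0 0] -- 1001 and 1001.0002 chained into one cluster
print(cluster_ids([1001, 1001.0001, 1001.0002], "cluster_min"))  # [0 0 1] -- the chaining is broken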

@behzadnouri
Contributor Author

@jtratner

-1- did you notice the [1001, 1001.0001, 1001.0002] example? this is a logical problem, not a numerical problem.

no matter how small the tolerance is, one can construct such examples where whether two numbers are ranked equally depends on the presence of a third number (i.e. a small step in between). the problem is using tolerance in ranking, not the size of the tolerance.

-2- a user knows his data better than the library can guess. if he wants to use tolerance he can simply do ts.round(10).rank() or something similar; as easy as that (see the sketch after this comment). but ranking data with tolerance has a logical problem even if the tolerance is configurable.

-3- there is also cross-consistency with other popular libraries. #8365 is comparing with scipy.stats.rankdata. such libraries build the conventions and users' expectations of how their numbers should work, and an inconsistent, surprising behavior is a bug. (plus regression tests)

In certain domains, there is an undeniable value in sticking to conventions, even if one can argue an alternative is more technically correct.

-4- then, there is internal consistency: results should be semantically consistent. currently on master:

>>> ts = pd.Series([1001.0002, 1001, 1001.0001])
>>> ts.rank() # method='average' default value
0    2
1    2
2    2
dtype: float64
>>> ts.rank(method='first')
0    3
1    1
2    2
dtype: float64 

if with method='average' all ranks are equal, then method='first' should return 1, 2, 3.
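
For reference, a short sketch of the division of labour argued for in point -2-, run on a pandas version with this PR's strict-comparison behaviour; the 3-decimal rounding is an arbitrary, user-chosen precision, not a pandas default:

import pandas as pd

ts = pd.Series([1001, 1001.0001, 1001.0002])

# strict comparisons: every distinct float gets its own rank
print(ts.rank().tolist())           # [1.0, 2.0, 3.0]

# a user who wants tolerance discretizes explicitly before ranking
print(ts.round(3).rank().tolist())  # [2.0, 2.0, 2.0] (average of the tied ranks)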

@jreback
Contributor

jreback commented Sep 24, 2014

see #6886 which was put in place to handle really small floats.

@behzadnouri
Contributor Author

@jreback yes, I had seen that commit before making my code changes, and I think the right thing to do back then was to remove the tolerance in ranking floats. if a user wants tolerance, let him .round his numbers to his desired precision.

ranking floats with absolute/relative tolerance has the logical issue mentioned above.

@jreback
Contributor

jreback commented Sep 24, 2014

@behzadnouri no good. need to accommodate both situations. The original problem is just as valid. Is it possible to normalize the numbers first (if needed, or does this cause more issues)? e.g. not a problem if it takes a slower loop when precision is an issue (as long as the fast path stays the same).

@jreback
Contributor

jreback commented Sep 24, 2014

@behzadnouri also, pls, pls, pls add tests when you put up a PR

@immerrr
Contributor

immerrr commented Sep 24, 2014

he can simply do ts.round(10).rank() or something similar

The problem with rounding is that the boundaries between equivalence classes become fixed, i.e. if you want 0.1 absolute tolerance and simply round to 1 place after the decimal point, [1.04, 1.05] will be rounded to [1.0, 1.1] and ranking will assign different ranks to these values, even though the difference between them is less than 0.1 and they should be ranked the same.
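
A small illustration of that fixed-boundary effect; the values below are chosen to straddle the 1.05 rounding boundary unambiguously, since [1.04, 1.05] itself can land on either side depending on the rounding rule:

>>> import pandas as pd
>>> # only 0.002 apart, yet they round to different values and get different ranks
>>> pd.Series([1.049, 1.051]).round(1).rank()
0    1.0
1    2.0
dtype: float64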

@jreback
Contributor

jreback commented Sep 24, 2014

care to weigh in?

cc @seth-p
cc @jaimefrio

@behzadnouri
Contributor Author

@immerrr and how do you rank [1.04, 1.10, 1.16] with .1 tolerance?

does 1.10 go with 1.16 or 1.04? or both?

seriously, do you not see the logical issue here?

@seth-p
Contributor

seth-p commented Sep 24, 2014

I agree with @behzadnouri. The problem is that the float64_are_diff() function (or any "is-not-within-tolerance" function) isn't transitive: float64_are_diff(a,b)==False and float64_are_diff(b,c)==False does not imply that float64_are_diff(a,c)==False. (This would be worded more cleanly in terms of (not float64_are_diff(,))==True...) As such, it shouldn't be used as an equivalence relation, which is the property we want in the definition of "a and b are of the same rank".

I agree that the ranking function should be applied based on strict (in)equality, possibly after the user has transformed the data, e.g. ts.round(10).rank() or log(ts).round(10).rank(). Anything else is just asking for trouble and confusion.

I'm tempted to suggest giving the user the option (which would not be the default) to specify the current behavior with a user-specified tolerance -- with lots of warnings -- but I think it's just handing them a loaded gun, and is probably best avoided.

CC'ing @nspies, who submitted #6886.
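
A tiny illustration of the non-transitivity; is_close here is a hypothetical absolute-tolerance check used only to make the point, not anything in pandas:

>>> def is_close(a, b, tol=0.06):
...     return abs(a - b) <= tol
...
>>> a, b, c = 1.00, 1.05, 1.10
>>> is_close(a, b), is_close(b, c), is_close(a, c)
(True, True, False)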

@jreback
Contributor

jreback commented Sep 24, 2014

@seth-p is it worth testing whether 2 values are equal when they are really, really small (and so the behavior can be different for the #6886 case)?

@immerrr
Contributor

immerrr commented Sep 24, 2014

@behzadnouri, neither the threshold-based nor the adaptive-threshold-based clustering approach will guarantee a unique solution, but the one with 'adaptive' boundaries may produce fewer clusters than fixed boundaries.

@seth-p
Contributor

seth-p commented Sep 24, 2014

@jreback, no matter how small the threshold, you can't avoid the problem that is_close(a,b) and is_close(b,c) does not imply is_close(a,c). I would put the onus on the user to map the "raw" values to standardized discrete values, e.g. ts.round(14).rank() -- only they know what they want "same rank" to mean.

@immerrr
Contributor

immerrr commented Sep 24, 2014

I would put the onus on the user to map the "raw" values to standardized discrete values

That would indeed simplify rank's contract, but we might consider adding a standard algorithm to perform rounding to the next smallest value to compensate for the stripped functionality.

@seth-p
Contributor

seth-p commented Sep 24, 2014

@immerrr, can you elaborate? I don't think I understand what you mean.

@immerrr
Contributor

immerrr commented Sep 24, 2014

Currently rank does effectively two things: it does some threshold-based discretization AND performs ranking on the result. I think that if we remove the first step and make rank only responsible for the second (which is good), there may be users that miss the old behaviour and will be annoyed having to reimplement it (which is bad). To avoid that we might want to consider adding a standard implementation of this discretization routine.

@seth-p
Contributor

seth-p commented Sep 24, 2014

The problem is that the current behavior is not equivalent to discretization. So you really can't do it.

I haven't looked at the current behavior in detail, but at a quick glance it looks similar to (but not the same as) log(ts).round(7). I'm not sure it makes sense to create a function to do that.

@jreback jreback added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Sep 24, 2014
@immerrr
Contributor

immerrr commented Sep 24, 2014

Ok, the current algorithm is broken, I'd forgotten about that.

Perhaps it was supposed to separate all numbers with distance >= tolerance into different bins and put the rest of the numbers (those closer than that to at least one other number) into the next smallest bin (UPD: this wording is incorrect; the idea was to find a maximum independent set). It has two benefits over plain rounding that may be desirable:

  • it will result in fewer bins in cases like [1.04, 1.05] with threshold 0.1
  • it won't alter the values if all of them are sufficiently far away from each other

@immerrr
Contributor

immerrr commented Sep 24, 2014

As for whether this is discretization or not, it does map elements from the (almost infinite) set of floats to a finite set of floats that are farther than the threshold from each other. It is not guaranteed to be unique, but that sounds like discretization to me.

@seth-p
Contributor

seth-p commented Sep 24, 2014

  • true, though that comes at the cost of also binning [1.04, 1.05, 1.14, 1.15, 1.24, 1.25, ...] into a single bin.
  • whether or not it "alters" values is purely an implementation detail; it wouldn't alter the original series

@immerrr
Contributor

immerrr commented Sep 24, 2014

true, though that comes at the cost of also binning [1.04, 1.05, 1.14, 1.15, 1.24, 1.25, ...] into a single bin

Current algorithm would, the one that I described would not. Like I said, it's broken now.

whether or not it "alters" values is purely an implementation details; it wouldn't alter the original series

Hmm, you're correct, at some point I got confused thinking it returns a series.

@seth-p
Contributor

seth-p commented Sep 24, 2014

How would your proposed algorithm bin [1.04, 1.05, 1.14, 1.15, 1.24, 1.25, ...] with a threshold of 0.1?

@immerrr
Contributor

immerrr commented Sep 24, 2014

  • find a set of all numbers with distance >= tolerance from one another, one set of such numbers is:
{1.04, 1.14, 1.24, ...}
  • put all other numbers to the next smallest bin:
{1.04: {1.04, 1.05}, 1.14: {1.14, 1.15}, 1.24: {1.24, 1.25}, ...}
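
A greedy sketch of that binning, under the (possibly naive) reading that a new bin starts whenever the current value is at least the tolerance away from the smallest value in the current bin; the function name and structure are illustrative:

def greedy_bins(values, tol):
    # sort, then start a new bin whenever the current value is >= tol away
    # from the smallest value in the current bin
    bins = {}
    bin_min = None
    for v in sorted(values):
        if bin_min is None or v - bin_min >= tol:
            bin_min = v
            bins[bin_min] = []
        bins[bin_min].append(v)
    return bins

print(greedy_bins([1.04, 1.05, 1.14, 1.15, 1.24, 1.25], 0.1))
# {1.04: [1.04, 1.05], 1.14: [1.14, 1.15], 1.24: [1.24, 1.25]}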

@seth-p
Contributor

seth-p commented Sep 24, 2014

That's sort of arbitrary, since you're mapping 1.05 and 1.14 to different bins even though they're <0.1 apart. (The current behavior would regard the whole sequence as a single bin, if I understand it correctly.) So you end up with something that is (a) still arbitrary, and (b) different from the current behavior.

Perhaps a better example would be { 1.04, 1.13, 1.22, 1.31, ... } -- i.e. a sequence of numbers 0.09 apart. I see only two plausible binnings: (i) every item is its own bin (which seems odd given that successive items are <0.1 apart); or (ii) all items in a single bin (which is the current behavior, but is odd since most pairs of items are >0.1 apart).

@nspies
Contributor

nspies commented Sep 24, 2014

I'm all for fixing the current approach. A few thoughts:

Before I forget, I'd like to reference issue #6945: np.nan and np.inf don't get sorted properly. Would be nice if whoever fixes the precision issue fixes that too!

I'm not sure what's wrong with giving the users an option of using some sort of precision in doing the comparisons -- the original code, before I messed with it, had a hard-coded precision limit of 1e-13, so clearly it was an issue for someone when there was no float precision handling. I don't care if this is a default or an extra option, but I think most people trying to sort floats with ties will hit this issue.

(Please, please, please make your tests include ties -- if you're going to change the definition of a tie, please make sure the updated tests deal with ties properly.)

Finally, I believe these algorithms affect both the Series and DataFrame tests, although the PR only updated the frame tests, so check that both have the expected behaviors (whatever everyone decides is the expected behavior!).

@jreback
Contributor

jreback commented Sep 24, 2014

how about we make a reasonable default that handles most cases, and a high-precision option (maybe float_precision='high') for the original case (really tiny numbers). Most people won't care as long as it works, and the folks who really need precision can pass the option

e.g. see for example what is happening for reading float precision in 0.15.0 here

@jtratner
Contributor

Just want to throw in some context here to make sure everyone's on the same page:

  • scipy.stats.rankdata does do a C-level != check just like this PR suggests, so there's certainly precedent here and it's been that way since at least 2010.
  • the floats_are_diff(a, b) here was calculated based on a relative tolerance of a vs. b (i.e., is (a-b) > (1e-7*b)?), which makes the behavior a little more subtle, and it's not sufficient to do a single round across the entire array
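
To illustrate the second point: with a relative tolerance the width of a "tie" scales with the magnitude of the values, so no single fixed decimal rounding can reproduce it. The check below is an approximate stand-in for the removed comparison, not the exact Cython code:

>>> tol = 1e-7
>>> def were_diff(a, b):
...     return abs(a - b) > tol * abs(b)
...
>>> were_diff(1.0, 1.0 + 1e-6)     # 1e-6 apart counts as different at magnitude 1
True
>>> were_diff(1e6, 1e6 + 0.05)     # but 0.05 apart counts as a tie at magnitude 1e6
False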

@behzadnouri behzadnouri force-pushed the rank-tol branch 2 times, most recently from 0216d56 to ef48a43 Compare September 24, 2014 22:13
@behzadnouri
Contributor Author

@jreback tests added

@immerrr
Contributor

immerrr commented Sep 25, 2014

@seth-p I shouldn't have attempted to devise clever algorithms after work; the wording I presented is wrong in several places. Essentially I was trying to achieve a "maximum independent set" binning on a graph with an edge between every a and b for which float_are_close(a, b). And this somewhat supports my point about having a standard "bug-free" implementation of it somewhere.

That's sort of arbitrary, since you're mapping 1.05 and 1.14 to different bins even though they're <0.1 apart.

If I understand you correctly, you're basically saying that there exists another variant of binning of the same array of values in which these two numbers are in the same bin. That is correct; since a maximum independent set is not guaranteed to be unique, I'm choosing one arbitrarily because any of them is good enough.

@jreback
Contributor

jreback commented Jan 18, 2015

@behzadnouri care to summarize the current state of this PR and enumerate issues raised above?

@behzadnouri
Contributor Author

@jreback I think removing tolerance altogether is the way to go (i.e. what this PR is doing),

the point is that, as explained in this comment, "a is close to b" and "b is close to c" do not imply that "a is close to c", i.e. it is not transitive. so ranking with tolerance is not well defined; as in the example in the first comment, this can cause inconsistent behaviour.

iranks = iseries.rank()
assert_series_equal(iranks, exp)
assert_series_equal(iranks, exp, check_dtype=False)
Contributor


make the expected Series of the correct dtype (rather than passing check_dtype=False)

@jreback jreback added this to the 0.16.0 milestone Jan 19, 2015
@behzadnouri behzadnouri force-pushed the rank-tol branch 3 times, most recently from 0f265ca to 5395a98 Compare January 19, 2015 19:51
@behzadnouri
Contributor Author

@jreback made the changes

@seth-p
Contributor

seth-p commented Jan 22, 2015

FWIW, I (still) agree with @behzadnouri.

jreback added a commit that referenced this pull request Mar 3, 2015
BUG: floats cannot be ranked with tolerance
@jreback jreback merged commit 0d35dd4 into pandas-dev:master Mar 3, 2015
@jreback
Contributor

jreback commented Mar 3, 2015

@behzadnouri thanks!

@behzadnouri behzadnouri deleted the rank-tol branch March 5, 2015 12:47
Successfully merging this pull request may close these issues.

Series/DataFrame.rank() doesn't handle certain floats properly