BUG: floats cannot be ranked with tolerance #8379


Merged
merged 1 commit into pandas-dev:master from the rank-tol branch on Mar 3, 2015

Conversation

behzadnouri
Contributor

closes #8365

using tolerance in ranking floats can result in inconsistent behavior; currently on master:

>>> pd.Series([1001, 1001.0002]).rank()
0    1
1    2
dtype: float64
>>> pd.Series([1001, 1001.0001, 1001.0002]).rank()
0    2
1    2
2    2
dtype: float64

so, in effect 1001 == 1001.0002 if one takes small steps in between.

on branch:

>>> pd.Series([1001, 1001.0002]).rank()
0    1
1    2
dtype: float64
>>> pd.Series([1001, 1001.0001, 1001.0002]).rank()
0    1
1    2
2    3
dtype: float64

original issue (#8365):

>>> from scipy import stats
>>> ts = pd.Series([1000.000669 , 1000.000041 , 1000.000059 , 1000.000063 , 1000.000121 , 1000.000104 , 1000.000040 , 1000.000062 , 1000.000095 , 1000.000091 , 1000.000050 , 1000.000074 , 1000.000063 , 1000.000076 , 1000.000083 , 1000.000061 , 1000.000030 , 1000.000069 , 1000.000090 , 1000.000116 , 1000.000058 , 1000.000074 , 1000.000035 , 1000.000084 , 1000.000067 , 1000.000072 , 1000.000105 , 1000.000091 , 1000.000077 , 1000.000040 , 1000.000108 , 1000.000117 , 1000.000114 , 1000.000117 , 1000.000099 , 1000.000039 , 1000.000046 , 1000.000105 , 1000.000057])
>>> np.all(ts.rank() == stats.rankdata(ts))
True

@jreback
Contributor

jreback commented Sep 24, 2014

hmm this was done for a reason but I don't remember exactly

@behzadnouri behzadnouri changed the title ranking floats cannot be done with tolerance BUG: ranking floats cannot be done with tolerance Sep 24, 2014
@behzadnouri behzadnouri changed the title BUG: ranking floats cannot be done with tolerance BUG: floats cannot be ranked with tolerance Sep 24, 2014
@jtratner
Contributor

my impression is that the correct way to compare floats is with a tolerance, because != can give inconsistent results on different systems (due to how floats are represented). Quick googling for 'comparing floating point numbers' gives a couple of different explanations and suggests that relative tolerance is a good middle ground.

At the very least, we need to consider this further - it's not as simple as this example.

@jtratner
Contributor

but clearly the tolerance is too big somehow; the actual behavior is clearly wrong, but we want to replace it with something that is definitely more correct.

@immerrr
Contributor

immerrr commented Sep 24, 2014

1.001E3 ~= 1.001002E3 (up to 1e-4) is a bug in the equivalence-cluster calculation: the code should compare cur_value to cluster_min instead of prev_value, to make sure that float_is_diff(cur_value, a_i) is False for each a_i in the cluster. Setting the tolerance to zero just masks the issue by making cluster_min == prev_element.

As for the actual tolerance value, one thing that I know for sure about floats is that all tolerances should be configurable, because whatever sane tolerance value and type (absolute, relative, ULP) you pick, it won't fit everyone. In this case, even though strictly speaking one doesn't need equality comparisons to rank data, one might want to do some "joining" knowing that the numbers come from some algorithm that produces a limited number of significant digits.
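
A minimal sketch of the two comparison strategies described above; the helper names and the 1e-7 relative tolerance are illustrative stand-ins, not the actual Cython code:

import numpy as np

def are_diff(a, b, tol=1e-7):
    # illustrative relative-tolerance check, roughly in the spirit of float64_are_diff
    return abs(a - b) > tol * max(abs(a), abs(b))

def cluster_ids(values, compare_to="prev"):
    # assign an equivalence-cluster id to each value of the sorted array
    values = np.sort(np.asarray(values, dtype=float))
    ids = np.zeros(len(values), dtype=int)
    cluster_min = values[0]
    for i in range(1, len(values)):
        ref = values[i - 1] if compare_to == "prev" else cluster_min
        if are_diff(values[i], ref):
            ids[i] = ids[i - 1] + 1
            cluster_min = values[i]
        else:
            ids[i] = ids[i - 1]
    return ids

print(cluster_ids([1001, 1001.0001, 1001.0002], "prev"))         # [0 0 0] -- 1001 and 1001.0002 chained into one cluster
print(cluster_ids([1001, 1001.0001, 1001.0002], "cluster_min"))  # [0 0 1] -- the chaining is broken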

@behzadnouri
Contributor Author

@jtratner

-1- did you notice the [1001, 1001.0001, 1001.0002] example? this is a logical problem, not a numerical problem.

no matter how small the tolerance is, one can construct such examples where whether two numbers are ranked equally depends on the presence of a third number (i.e. a small step in between). the problem is using tolerance in ranking, not the size of the tolerance.

-2- a user knows his data better than the library can guess. if he wants to use tolerance he can simply do ts.round(10).rank() or something similar; as easy as that (see the sketch after this comment). but ranking data with tolerance has a logical problem even if the tolerance is configurable.

-3- there is also cross-consistency with other popular libraries. #8365 is comparing with scipy.stats.rankdata. such libraries build the conventions and users' expectations of how their numbers should work, and an inconsistent, surprising behavior is a bug. (plus regression tests)

In certain domains, there is an undeniable value in sticking to conventions, even if one can argue an alternative is more technically correct.

-4- then, there is internal consistency: results should be semantically consistent. currently on master:

>>> ts = pd.Series([1001.0002, 1001, 1001.0001])
>>> ts.rank() # method='average' default value
0    2
1    2
2    2
dtype: float64
>>> ts.rank(method='first')
0    3
1    1
2    2
dtype: float64 

if with method='average' all ranks are equal, then method='first' should return 1, 2, 3.
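
For reference, a short sketch of the division of labour argued for in point -2-, run on a pandas version with this PR's strict-comparison behaviour; the 3-decimal rounding is an arbitrary, user-chosen precision, not a pandas default:

import pandas as pd

ts = pd.Series([1001, 1001.0001, 1001.0002])

# strict comparisons: every distinct float gets its own rank
print(ts.rank().tolist())           # [1.0, 2.0, 3.0]

# a user who wants tolerance discretizes explicitly before ranking
print(ts.round(3).rank().tolist())  # [2.0, 2.0, 2.0] (average of the tied ranks)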

@jreback
Contributor

jreback commented Sep 24, 2014

see #6886 which was put in place to handle really small floats.

@behzadnouri
Contributor Author

@jreback yes, I had seen that commit before making my code changes, and I think the right thing to do back then was to remove the tolerance in ranking floats. if a user wants tolerance, let him .round his numbers to his desired precision.

ranking floats with absolute/relative tolerance has the logical issue mentioned above.

@jreback
Contributor

jreback commented Sep 24, 2014

@behzadnouri no good. need to accommodate both situations. The original problem is just as valid. Is it possible to normalize the numbers first (if needed, or does this cause more issues)? e.g. not a problem if it takes a slower loop when precision is an issue (as long as the fast path stays the same).

@jreback
Contributor

jreback commented Sep 24, 2014

@behzadnouri also, pls, pls, pls add tests when you put up a PR

@immerrr
Contributor

immerrr commented Sep 24, 2014

he can simply do ts.round(10).rank() or something similar

The problem with rounding is that the boundaries between equivalence classes become fixed, i.e. if you want 0.1 absolute tolerance and simply round to 1 place after the decimal point, [1.04, 1.05] will be rounded to [1.0, 1.1] and ranking will assign different ranks to these values, even though the difference between them is less than 0.1 and they should be ranked the same.
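
A small illustration of that fixed-boundary effect; the values below are chosen to straddle the 1.05 rounding boundary unambiguously, since [1.04, 1.05] itself can land on either side depending on the rounding rule:

>>> import pandas as pd
>>> # only 0.002 apart, yet they round to different values and get different ranks
>>> pd.Series([1.049, 1.051]).round(1).rank()
0    1.0
1    2.0
dtype: float64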

@jreback
Contributor

jreback commented Sep 24, 2014

care to weigh in?

cc @seth-p
cc @jaimefrio

@behzadnouri
Contributor Author

@immerrr and how do you rank [1.04, 1.10, 1.16] with .1 tolerance?

does 1.10 go with 1.16 or 1.04? or both?

seriously, do you not see the logical issue here?

@seth-p
Contributor

seth-p commented Sep 24, 2014

I agree with @behzadnouri. The problem is that the float64_are_diff() function (or any "is-not-within-tolerance" function) isn't transitive: float64_are_diff(a,b)==False and float64_are_diff(b,c)==False does not imply that float64_are_diff(a,c)==False. (This would be worded more cleanly in terms of (not float64_are_diff(,))==True...) As such, it shouldn't be used as an equivalence relation, which is the property we want in the definition of "a and b are of the same rank".

I agree that the ranking function should be applied based on strict (in)equality, possibly after the user has transformed the data, e.g. ts.round(10).rank() or log(ts).round(10).rank(). Anything else is just asking for trouble and confusion.

I'm tempted to suggest giving the user the option (which would not be the default) to specify the current behavior with a user-specified tolerance -- with lots of warnings -- but I think it's just handing them a loaded gun, and is probably best avoided.

CC'ing @nspies, who submitted #6886.
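
A tiny illustration of the non-transitivity; is_close here is a hypothetical absolute-tolerance check used only to make the point, not anything in pandas:

>>> def is_close(a, b, tol=0.06):
...     return abs(a - b) <= tol
...
>>> a, b, c = 1.00, 1.05, 1.10
>>> is_close(a, b), is_close(b, c), is_close(a, c)
(True, True, False)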

@jreback
Contributor

jreback commented Sep 24, 2014

@seth-p is it worth testing whether 2 values are equal when they are really, really small (and so the behavior can be different for the #6886 case)?

@immerrr
Contributor

immerrr commented Sep 24, 2014

@behzadnouri, neither the threshold-based nor the adaptive-threshold-based clustering approach will guarantee a unique solution, but the one with 'adaptive' boundaries may produce fewer clusters than fixed boundaries.

@seth-p
Contributor

seth-p commented Sep 24, 2014

@jreback, no matter how small the threshold, you can't avoid the problem that is_close(a,b) and is_close(b,c) does not imply is_close(a,c). I would put the onus on the user to map the "raw" values to standardized discrete values, e.g. ts.round(14).rank() -- only they know what they want "same rank" to mean.

@immerrr
Contributor

immerrr commented Sep 24, 2014

I would put the onus on the user to map the "raw" values to standardized discrete values

That would indeed simplify rank's contract, but we might consider adding a standard algorithm to perform rounding to the next smallest value to compensate for the stripped functionality.

@seth-p
Contributor

seth-p commented Sep 24, 2014

@immerrr, can you elaborate? I don't think I understand what you mean.

@immerrr
Contributor

immerrr commented Sep 24, 2014

Currently rank does effectively two things: it does some threshold-based discretization AND performs ranking on the result. I think that if we remove the first step and make rank only responsible for the second (which is good), there may be users that miss the old behaviour and will be annoyed having to reimplement it (which is bad). To avoid that we might want to consider adding a standard implementation of this discretization routine.

@seth-p
Contributor

seth-p commented Sep 24, 2014

The problem is that the current behavior is not equivalent to discretization. So you really can't do it.

I haven't looked at the current behavior in detail, but at a quick glance it looks similar to (but not the same as) log(ts).round(7). I'm not sure it makes sense to create a function to do that.

@jreback jreback added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Sep 24, 2014
@immerrr
Contributor

immerrr commented Sep 24, 2014

Ok, the current algorithm is broken, I'd forgotten about that.

Perhaps it was supposed to separate all numbers with distance >= tolerance into different bins and put the rest of the numbers (those closer than that to at least one other number) into the next smallest bin (UPD: this wording is incorrect; the idea was to find a maximum independent set). It has two benefits over plain rounding that may be desirable:

  • it will result in fewer bins in cases like [1.04, 1.05] with threshold 0.1
  • it won't alter the values if all of them are sufficiently far away from each other

@immerrr
Contributor

immerrr commented Sep 24, 2014

As for whether this is discretization or not, it does map elements from the (almost infinite) set of floats to a finite set of floats that are farther than the threshold from each other. It is not guaranteed to be unique, but that sounds like discretization to me.

@seth-p
Contributor

seth-p commented Sep 24, 2014

  • true, though that comes at the cost of also binning [1.04, 1.05, 1.14, 1.15, 1.24, 1.25, ...] into a single bin.
  • whether or not it "alters" values is purely an implementation detail; it wouldn't alter the original series

@immerrr
Contributor

immerrr commented Sep 24, 2014

true, though that comes at the cost of also binning [1.04, 1.05, 1.14, 1.15, 1.24, 1.25, ...] into a single bin

Current algorithm would, the one that I described would not. Like I said, it's broken now.

whether or not it "alters" values is purely an implementation details; it wouldn't alter the original series

Hmm, you're correct, at some point I got confused thinking it returns a series.

@seth-p
Contributor

seth-p commented Sep 24, 2014

How would your proposed algorithm bin [1.04, 1.05, 1.14, 1.15, 1.24, 1.25, ...] with a threshold of 0.1?

@immerrr
Contributor

immerrr commented Sep 24, 2014

  • find a set of all numbers with distance >= tolerance from one another, one set of such numbers is:
{1.04, 1.14, 1.24, ...}
  • put all other numbers to the next smallest bin:
{1.04: {1.04, 1.05}, 1.14: {1.14, 1.15}, 1.24: {1.24, 1.25}, ...}
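
A greedy sketch of that binning, under the (possibly naive) reading that a new bin starts whenever the current value is at least the tolerance away from the smallest value in the current bin; the function name and structure are illustrative:

def greedy_bins(values, tol):
    # sort, then start a new bin whenever the current value is >= tol away
    # from the smallest value in the current bin
    bins = {}
    bin_min = None
    for v in sorted(values):
        if bin_min is None or v - bin_min >= tol:
            bin_min = v
            bins[bin_min] = []
        bins[bin_min].append(v)
    return bins

print(greedy_bins([1.04, 1.05, 1.14, 1.15, 1.24, 1.25], 0.1))
# {1.04: [1.04, 1.05], 1.14: [1.14, 1.15], 1.24: [1.24, 1.25]}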

@seth-p
Contributor

seth-p commented Sep 24, 2014

That's sort of arbitrary, since you're mapping 1.05 and 1.14 to different bins even though they're <0.1 apart. (The current behavior would regard the whole sequence as a single bin, if I understand it correctly.) So you end up with something that is (a) still arbitrary, and (b) different from the current behavior.

Perhaps a better example would be { 1.04, 1.13, 1.22, 1.31, ... } -- i.e. a sequence of numbers 0.09 apart. I see only two plausible binnings: (i) every item is its own bin (which seems odd given that successive items are <0.1 apart); or (ii) all items in a single bin (which is the current behavior, but is odd since most pairs of items are >0.1 apart).

@nspies
Contributor

nspies commented Sep 24, 2014

I'm all for fixing the current approach. A few thoughts:

Before I forget, I'd like to reference issue #6945: np.nan and np.inf don't get sorted properly. Would be nice if whoever fixes the precision issue fixes that too!

I'm not sure what's wrong with giving the users an option of using some sort of precision in doing the comparisons -- the original code, before I messed with it, had a hard-coded precision limit of 1e-13, so clearly it was an issue for someone when there was no float precision handling. I don't care if this is a default or an extra option, but I think most people trying to sort floats with ties will hit this issue.

(Please, please, please make your tests include ties -- if you're going to change the definition of a tie, please make sure the updated tests deal with ties properly.)

Finally, I believe these algorithms affect both the Series and DataFrame tests, although the PR only updated the frame tests, so check that both have the expected behaviors (whatever everyone decides is the expected behavior!).

@jreback
Contributor

jreback commented Sep 24, 2014

how about we make a reasonable default that handles most cases, and a high-precision option (maybe float_precision='high') for the original case (really tiny numbers). Most people won't care as long as it works, and the folks who really need precision can pass the option

e.g. see for example what is happening for reading float precision in 0.15.0 here

@jtratner
Contributor

Just want to throw in some context here to make sure everyone's on the same page:

  • scipy.stats.rankdata does do a C-level != check just like this PR suggests, so there's certainly precedent here and it's been that way since at least 2010.
  • the floats_are_diff(a, b) here was calculated based on a relative tolerance of a vs. b (i.e., is (a-b) > (1e-7*b)?), which makes the behavior a little more subtle, and it's not sufficient to do a single round across the entire array
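
To illustrate the second point: with a relative tolerance the width of a "tie" scales with the magnitude of the values, so no single fixed decimal rounding can reproduce it. The check below is an approximate stand-in for the removed comparison, not the exact Cython code:

>>> tol = 1e-7
>>> def were_diff(a, b):
...     return abs(a - b) > tol * abs(b)
...
>>> were_diff(1.0, 1.0 + 1e-6)     # 1e-6 apart counts as different at magnitude 1
True
>>> were_diff(1e6, 1e6 + 0.05)     # but 0.05 apart counts as a tie at magnitude 1e6
False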

@behzadnouri behzadnouri force-pushed the rank-tol branch 2 times, most recently from 0216d56 to ef48a43 Compare September 24, 2014 22:13
@behzadnouri
Contributor Author

@jreback tests added

@immerrr
Contributor

immerrr commented Sep 25, 2014

@seth-p I shouldn't have attempted to devise clever algorithms after work; the wording I presented is wrong in several places. Essentially I was trying to achieve a "maximum independent set" binning on a graph with an edge between every a and b for which float_are_close(a, b). And this somewhat supports my point about having a standard "bug-free" implementation of it somewhere.

That's sort of arbitrary, since you're mapping 1.05 and 1.14 to different bins even though they're <0.1 apart.

If I understand you correctly, you're basically saying that there exists another variant of binning of the same array of values in which these two numbers are in the same bin. That is correct; since a maximum independent set is not guaranteed to be unique, I'm choosing one arbitrarily because any of them is good enough.

@jreback
Contributor

jreback commented Jan 18, 2015

@behzadnouri care to summarize the current state of this PR and enumerate issues raised above?

@behzadnouri
Contributor Author

@jreback I think removing tolerance altogether is the way to go (i.e. what this PR is doing),

the point is that, as explained in this comment, "a is close to b" and "b is close to c" do not imply that "a is close to c", i.e. it is not transitive. so ranking with tolerance is not well defined; as in the example in the first comment, this can cause inconsistent behaviour.

iranks = iseries.rank()
assert_series_equal(iranks, exp)
assert_series_equal(iranks, exp, check_dtype=False)
Contributor


make the expected Series of the correct dtype (rather than passing check_dtype=False)

@jreback jreback added this to the 0.16.0 milestone Jan 19, 2015
@behzadnouri behzadnouri force-pushed the rank-tol branch 3 times, most recently from 0f265ca to 5395a98 Compare January 19, 2015 19:51
@behzadnouri
Contributor Author

@jreback made the changes

@seth-p
Contributor

seth-p commented Jan 22, 2015

FWIW, I (still) agree with @behzadnouri.

jreback added a commit that referenced this pull request Mar 3, 2015
BUG: floats cannot be ranked with tolerance
@jreback jreback merged commit 0d35dd4 into pandas-dev:master Mar 3, 2015
@jreback
Contributor

jreback commented Mar 3, 2015

@behzadnouri thanks!

@behzadnouri behzadnouri deleted the rank-tol branch March 5, 2015 12:47
Successfully merging this pull request may close these issues.

Series/DataFrame.rank() doesn't handle certain floats properly