Series.rank() doesn't handle small floats correctly #6868

nspies · 2014-04-11T00:09:10Z

Floats below 1e-10 seem to all be receiving the same rank, incorrectly:

In [1]: import pandas

In [3]: import numpy

In [4]: series = pandas.Series([1e-100, 1e-25, 1e-20, 1e-15, 1e-10, 
                                1e-5, 1e-4, 1e-3, 1e-2, 1e-1])

In [5]: series
Out[5]: 
0    1.000000e-100
1     1.000000e-25
2     1.000000e-20
3     1.000000e-15
4     1.000000e-10
5     1.000000e-05
6     1.000000e-04
7     1.000000e-03
8     1.000000e-02
9     1.000000e-01
dtype: float64

In [6]: series.rank()
Out[6]: 
0     2.5
1     2.5
2     2.5
3     2.5
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9    10.0
dtype: float64

In [7]: from scipy import stats

In [8]: stats.rankdata(series)
Out[8]: array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

In [13]: pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Darwin
OS-release: 10.8.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.19.1
numpy: 1.8.0
scipy: 0.12.0.dev-1d5c886
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.0
scikits.timeseries: None
dateutil: 2.2
pytz: 2013b
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.2.0
openpyxl: 1.5.7
xlrd: 0.7.1
xlwt: None
xlsxwriter: None
sqlalchemy: 0.6.6
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None

danielballan · 2014-04-11T13:47:52Z

The reason is here:

https://github.com/pydata/pandas/blob/master/pandas/algos.pyx#L9
https://github.com/pydata/pandas/blob/master/pandas/algos.pyx#L189

where any number below 1e-13 is considered to be below machine precision.

As you demonstrated, it seems scipy.stats.rankdata does not limit itself to this level of precision. From a practical data analysis standpoint, declaring numbers that low "too close to call" seems like a reasonable design decision to me, but since pandas' implementation is purportedly a nan-friendly version of rankdata, it should at least be documented.
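To make the effect concrete, here is a minimal pure-Python sketch of average ranking with an absolute tolerance. It is not the pandas implementation: the function name `rank_average_abs_tol` is hypothetical, and it assumes the 1e-13 threshold is applied to the difference between neighbouring sorted values, which this thread does not spell out. Under that assumption it reproduces the ranks reported above.

```python
FP_ERR = 1e-13  # assumed absolute tolerance, taken from the discussion above


def rank_average_abs_tol(values, tol=FP_ERR):
    """Average-rank `values`, treating sorted neighbours within `tol` as tied."""
    n = len(values)
    # Indices of `values` in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0  # sorted position where the current group of ties begins
    for pos in range(n):
        cur = values[order[pos]]
        # Close the group when the next sorted value differs by more than tol.
        if pos == n - 1 or abs(values[order[pos + 1]] - cur) > tol:
            # Mean of the 1-based ranks start+1 .. pos+1.
            avg = (start + pos + 2) / 2.0
            for k in range(start, pos + 1):
                ranks[order[k]] = avg
            start = pos + 1
    return ranks


data = [1e-100, 1e-25, 1e-20, 1e-15, 1e-10, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
abs_tol_ranks = rank_average_abs_tol(data)         # first four values tie at 2.5
exact_ranks = rank_average_abs_tol(data, tol=0.0)  # every value gets its own rank
```

With `tol=0.0` the same data ranks 1 through 10, matching the `scipy.stats.rankdata` output in the report.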

nspies · 2014-04-12T19:01:59Z

How about replacing these comparisons with something like what numpy.isclose() does?

http://docs.scipy.org/doc/numpy/reference/generated/numpy.isclose.html

For my particular situation, I could easily roll something that does what I think is correct (in my case a log transformation would work fine), but people should be able to trust the output of the algorithms built into pandas. The 1e-13 threshold seems very arbitrary, and I'm sure I'm not the only person who uses floats that are substantially closer to zero than that. I think a relative tolerance is much more flexible than the absolute tolerance pandas currently uses here. (As a note, numpy.testing.assert_allclose uses a relative tolerance of 1e-07 and an absolute tolerance of 0.)
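A rough sketch of what a relative-tolerance comparison could look like, using `numpy.isclose` with `atol=0.0` so that only the relative term applies. The function name `rank_average_rel_tol` and the choice of `rtol=1e-7` (borrowed from the `assert_allclose` default mentioned above) are assumptions for illustration; this is not necessarily what any eventual pandas change does.

```python
import numpy as np


def rank_average_rel_tol(values, rtol=1e-7):
    """Average-rank `values`, tying sorted neighbours that are relatively close."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")  # stable sort
    sorted_vals = values[order]
    n = len(sorted_vals)
    ranks = np.empty(n, dtype=float)
    start = 0  # sorted position where the current group of ties begins
    for i in range(n):
        is_last = i == n - 1
        # atol=0.0 disables the absolute term, so tiny values are not lumped together.
        if is_last or not np.isclose(sorted_vals[i + 1], sorted_vals[i],
                                     rtol=rtol, atol=0.0):
            # Mean of the 1-based ranks start+1 .. i+1.
            ranks[order[start:i + 1]] = (start + i + 2) / 2.0
            start = i + 1
    return ranks


small = [1e-100, 1e-25, 1e-20, 1e-15, 1e-10, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
rel_ranks = rank_average_rel_tol(small)                     # all values distinct
near_ties = rank_average_rel_tol([1.0, 1.0 + 1e-12, 2.0])   # first two values tie
```

Note that the default `atol` of `numpy.isclose` is 1e-08, so leaving it unset would reintroduce an absolute threshold, and a larger one than 1e-13.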

danielballan · 2014-04-14T20:42:43Z

I agree that would be better. But let's get the blessing of one of the Collaborators before anyone puts effort in.... @jreback?

jreback · 2014-04-14T21:28:21Z

I don't really know why this was put in in the first place.

You're welcome to try changing it and see what happens.

jtratner · 2014-04-14T21:47:57Z

We just want to keep performance in mind here. Noah, if you could put something together and then run some of the benchmarks on it (if they exist), that would be great :) If there is a perf impact, we'll figure out how to handle it from there.

jreback · 2014-04-23T23:24:50Z

closed via #6886

jreback added Algos labels Apr 11, 2014

jreback added this to the 0.15.0 milestone Apr 11, 2014

nspies mentioned this issue Apr 15, 2014

Series.rank() doesn't handle small floats correctly #6886

Closed

jreback added Bug and removed Docs labels Apr 21, 2014

jreback modified the milestones: 0.14.0, 0.15.0 Apr 21, 2014

nspies pushed a commit to nspies/pandas that referenced this issue Apr 23, 2014

BUG: Series/DataFrame.rank() doesn't handle small floats correctly pa…

85fdc79

…ndas-dev#6868

jreback pushed a commit that referenced this issue Apr 23, 2014

BUG: Series/DataFrame.rank() doesn't handle small floats correctly #6868

d96fd80

adding test for ranking with np.inf Added release note #6886 Fixing float conversions in test_rank()

jreback closed this as completed Apr 23, 2014
