Series.rank() doesn't handle small floats correctly #6868

Closed
nspies opened this issue Apr 11, 2014 · 6 comments
Labels: Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, Numeric Operations (Arithmetic, Comparison, and Logical operations)
@nspies
Contributor

nspies commented Apr 11, 2014

Floats below 1e-10 seem to all be receiving the same rank, incorrectly:

In [1]: import pandas

In [3]: import numpy

In [4]: series = pandas.Series([1e-100, 1e-25, 1e-20, 1e-15, 1e-10, 
                                1e-5, 1e-4, 1e-3, 1e-2, 1e-1])

In [5]: series
Out[5]: 
0    1.000000e-100
1     1.000000e-25
2     1.000000e-20
3     1.000000e-15
4     1.000000e-10
5     1.000000e-05
6     1.000000e-04
7     1.000000e-03
8     1.000000e-02
9     1.000000e-01
dtype: float64

In [6]: series.rank()
Out[6]: 
0     2.5
1     2.5
2     2.5
3     2.5
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9    10.0
dtype: float64

In [7]: from scipy import stats

In [8]: stats.rankdata(series)
Out[8]: array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

In [13]: pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Darwin
OS-release: 10.8.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.19.1
numpy: 1.8.0
scipy: 0.12.0.dev-1d5c886
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.0
scikits.timeseries: None
dateutil: 2.2
pytz: 2013b
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.2.0
openpyxl: 1.5.7
xlrd: 0.7.1
xlwt: None
xlsxwriter: None
sqlalchemy: 0.6.6
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
@danielballan
Contributor

The reason is here:

https://github.com/pydata/pandas/blob/master/pandas/algos.pyx#L9
https://github.com/pydata/pandas/blob/master/pandas/algos.pyx#L189

where any number below 1e-13 is considered to be below machine precision.

As you demonstrated, it seems scipy.stats.rankdata does not limit itself to this level of precision. From a practical data analysis standpoint, declaring numbers that low "too close to call" seems like a reasonable design decision to me, but since pandas' implementation is purportedly a nan-friendly version of rankdata, it should at least be documented.
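To make the effect of that threshold concrete, here is a minimal pure-Python sketch of average ranking with an absolute tolerance. It is not the pandas source: the function name `rank_average_abs_tol` is hypothetical, and it assumes ties are detected by comparing the difference between adjacent sorted values against the 1e-13 constant mentioned above. Under that assumption it reproduces the ranks reported at the top of this issue.

```python
FP_ERR = 1e-13  # absolute tolerance discussed in this thread


def rank_average_abs_tol(values, tol=FP_ERR):
    """Average ranks, treating adjacent sorted values within `tol` as tied.

    Illustrative only; assumes tie detection on adjacent sorted values.
    """
    n = len(values)
    # Indices that would sort the input (sorted() is stable).
    order = sorted(range(n), key=lambda k: values[k])
    ranks = [0.0] * n
    start = 0  # first sorted position of the current tie group
    for i in range(n):
        is_last = i == n - 1
        # Close the tie group when the next value is more than `tol` away.
        if is_last or abs(values[order[i + 1]] - values[order[i]]) > tol:
            avg = (start + 1 + i + 1) / 2.0  # mean of the 1-based ranks
            for j in range(start, i + 1):
                ranks[order[j]] = avg
            start = i + 1
    return ranks


series = [1e-100, 1e-25, 1e-20, 1e-15, 1e-10, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
print(rank_average_abs_tol(series))
# [2.5, 2.5, 2.5, 2.5, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

The first four values differ from their neighbours by less than 1e-13, so they collapse into one tie group with average rank 2.5; passing `tol=0.0` gives the ranks 1 through 10 that `scipy.stats.rankdata` reports.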

@jreback jreback added this to the 0.15.0 milestone Apr 11, 2014
@nspies
Contributor Author

nspies commented Apr 12, 2014

How about replacing these comparisons with something like what numpy.isclose() does?

http://docs.scipy.org/doc/numpy/reference/generated/numpy.isclose.html

For my particular situation, I could easily roll something that does what I think is correct (in my case a log transformation would work fine), but people should be able to trust the output of the algorithms built into pandas. 1e-13 seems very arbitrary, and I'm sure I'm not the only person who uses floats that are substantially closer to zero than that. I think a relative tolerance is much more flexible than the absolute tolerance pandas currently uses here. (As a note, numpy.testing.assert_allclose uses a relative tolerance of 1e-07 and an absolute tolerance of 0.)
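As a rough illustration of the relative-tolerance idea, the sketch below uses the same closeness formula that the numpy.isclose documentation gives, `|a - b| <= atol + rtol * |b|`, with the `rtol=1e-7`, `atol=0` values quoted above. The helper names `is_close` and `rank_average_rel_tol` are hypothetical, and this is not necessarily the change that was eventually merged.

```python
def is_close(a, b, rtol=1e-7, atol=0.0):
    """Closeness test in the form documented for numpy.isclose."""
    return abs(a - b) <= atol + rtol * abs(b)


def rank_average_rel_tol(values, rtol=1e-7, atol=0.0):
    """Average ranks, tying adjacent sorted values only if relatively close."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    ranks = [0.0] * n
    start = 0  # first sorted position of the current tie group
    for i in range(n):
        is_last = i == n - 1
        if is_last or not is_close(values[order[i]], values[order[i + 1]],
                                   rtol, atol):
            avg = (start + 1 + i + 1) / 2.0  # mean of the 1-based ranks
            for j in range(start, i + 1):
                ranks[order[j]] = avg
            start = i + 1
    return ranks


series = [1e-100, 1e-25, 1e-20, 1e-15, 1e-10, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

print(is_close(1e-25, 1e-20))                        # False
print(is_close(1e-25, 1e-20, rtol=0.0, atol=1e-13))  # True
print(rank_average_rel_tol(series))
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

With a purely relative tolerance, values that differ by orders of magnitude stay distinct no matter how close to zero they are, while values such as 1.0 and 1.0 + 1e-9 are still treated as tied.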

@danielballan
Contributor

I agree that would be better. But let's get the blessing of one of the Collaborators before anyone puts effort in.... @jreback?

@jreback
Contributor

jreback commented Apr 14, 2014

I don't really know why this was put in in the first place.

You're welcome to try changing it and see what happens.

@jtratner
Contributor

We just want to keep performance in mind here. Noah, if you could put
something together and then run some of the benchmarks on it (if they
exist), that would be great :) If there is a perf impact, we'll figure out
how to handle it from there.

@jreback jreback added Bug and removed Docs labels Apr 21, 2014
@jreback jreback modified the milestones: 0.14.0, 0.15.0 Apr 21, 2014
nspies pushed a commit to nspies/pandas that referenced this issue Apr 23, 2014
…ndas-dev#6868

cleaning up comments

BUG: Series/DataFrame.rank() doesn't handle small floats correctly pandas-dev#6868

adding test for ranking with np.inf

Added release note pandas-dev#6886

Fixing float conversions in test_rank()
jreback pushed a commit that referenced this issue Apr 23, 2014


adding test for ranking with np.inf

Added release note #6886

Fixing float conversions in test_rank()
@jreback
Contributor

jreback commented Apr 23, 2014

closed via #6886

@jreback jreback closed this as completed Apr 23, 2014
jeffreystarr pushed a commit to jeffreystarr/pandas that referenced this issue Apr 28, 2014
…ndas-dev#6868

adding test for ranking with np.inf

Added release note pandas-dev#6886

Fixing float conversions in test_rank()