Series.rank() doesn't handle small floats correctly #6886

nspies · 2014-04-15T16:24:16Z

Okay, this fixes #6868 but does so with a bit of a performance penalty. The current pandas version performs just a tad slower than scipy.stats.rankdata(), but after these changes, it's about 2-3x slower. On the plus side, it does (what I think is) the right thing.

I'm no cython expert, so there may well be things that can be done to improve the speed.

Here's some quick benchmarking code (couldn't figure out how to work with the vbench suite):

import timeit
setup2 = """import pandas
import scipy.stats
import numpy
numpy.random.seed(154)
s = pandas.Series(numpy.random.normal(size=10000))"""

print "pandas:", timeit.repeat(stmt='s.rank()', setup=setup2, repeat=3, number=10000)
print "scipy:", timeit.repeat(stmt='scipy.stats.rankdata(s)', 
    setup=setup2, repeat=3, number=10000)

cpcloud · 2014-04-15T16:31:42Z

pandas/algos.pyx

@@ -132,6 +133,16 @@ cdef _take_2d_object(ndarray[object, ndim=2] values,
    return result


+cdef inline float64_are_diff(float64_t left, float64_t right):
+    if right == np.inf or right == -np.inf:
+        if left == right:


just say return left != right here.

you can also say cdef inline bint to indicate to cython that you want to return a C type rather than the default of object (when no return type is specified)

Those changes both make sense but don't end up changing the performance substantially.

On Apr 15, 2014, at 9:33 AM, Phillip Cloud wrote:

In pandas/algos.pyx:

@@ -132,6 +133,16 @@ cdef _take_2d_object(ndarray[object, ndim=2] values,
return result

+cdef inline float64_are_diff(float64_t left, float64_t right):

if right == np.inf or right == -np.inf:

if left == right:
you can also say cdef inline bint to indicate to cython that you want to return a C type rather than the default of object (when no return type is specified)

—
Reply to this email directly or view it on GitHub.

jtratner · 2014-04-16T02:37:38Z

okay, so I've been looking at this (cython -a is really useful here, but you have to copy all the pyx files that are under pandas/ to pandas/src). Adding the bint declaration + changing the inf comparisons to use c-level declarations seems to bring pandas closer to scipy:

In [14]: %timeit s.rank()
1000 loops, best of 3: 1.17 ms per loop

In [15]: %timeit scipy.stats.rankdata(s)
1000 loops, best of 3: 915 µs per loop

I can't figure out how to get the last line not to go up to Python level though.

cdef inline bint float64_are_diff(float64_t left, float64_t right):
    cdef double abs_diff, allowed
    if right == MAXfloat64 or right == -MAXfloat64:
        if left == right:
            return False
        else:
            return True
    else:
        abs_diff = fabs(left - right)
        allowed = REL_TOL * fabs(right)
        return abs_diff > allowed

Here's cython -a output (complete white means entirely at C-level, bright yellow means entirely at python level):

jtratner · 2014-04-16T02:48:45Z

Just for completeness, that's a reduction from 52cafd0, not just machine differences:

# 52cafd0
In [4]: %timeit s.rank()
%te100 loops, best of 3: 4.3 ms per loop

In [5]: %timeit scipy.stats.rankdata(s)
1000 loops, best of 3: 953 µs per loop
# small changes to use Cythonized inf comparison

so at the least we need to make sure that we are handling infs correctly

nspies · 2014-04-16T18:52:57Z

The changes you've made have brought the speed pretty much up to par with the original (master?).

jtratner · 2014-04-17T05:32:21Z

Cool well it was just tweaking what you had. Would you be able to put together a bench mark + a test that includes infs (unless we have one already) ? Then we could get this merged.

nspies · 2014-04-17T23:29:59Z

This is fabulous: no, there was no test for ranking with infs, and no, neither the new nor the original version passes the test that I put in. The problem is that all nan values get clobbered with either inf or -inf, and hence the loop doesn't distinguish between inputs that had an (-)inf and inputs that had a nan. Don't have time right now to think about how to reimplement things to fix this, but hopefully shouldn't be too difficult.

As for benchmarking the ranking, I've found several in stat_ops. 52172e9 which incorporates your changes shows performance comparable to the master.

tl;dr we've fixed the float precision issue but uncovered a pre-existing inf issue.

jtratner · 2014-04-18T02:48:42Z

Awesome! Great to find hidden bugs!

jreback · 2014-04-21T17:28:26Z

@nspies can you add a vbench for this ? (in vb_suite) somewhere

need a release note referecing the bugfix issue

also, don't merge in master to your branch, pls rebase this away

see here: https://github.com/pydata/pandas/wiki/Using-Git

jreback · 2014-04-22T15:35:32Z

@nspies ping me when you have updated

nspies · 2014-04-22T17:25:42Z

I did the fetch upstream/rebase upstream/master but it won't let me push (says the tip of my current branch is behind its remote counterpart). A little more help? Sorry, still a git newbie :)

Also, there're a bunch of vbench benchmarks already which I think should suffice: http://pandas.pydata.org/pandas-docs/vbench/vb_stat_ops.html#stats-rank-average. Let me know if you think I should add something more.

What do you want to do about the inf ranking issue -- report as a separate bug? Right now, the unit test I added doesn't pass on travis CI.

On Apr 21, 2014, at 10:28 AM, jreback wrote:

@nspies can you add a vbench for this ? (in vb_suite) somewhere

need a release note referecing the bugfix issue

also, don't merge in master to your branch, pls rebase this away

see here: https://github.com/pydata/pandas/wiki/Using-Git

—
Reply to this email directly or view it on GitHub.

jreback · 2014-04-22T17:48:27Z

after rebasing you need to force push

eg

git push myfork thisbranch -f

nspies · 2014-04-22T17:55:50Z

Okay, got it pushed. Let me know if you want me to do something else here.

On Apr 22, 2014, at 10:48 AM, jreback wrote:

after rebasing you need to force push

eg

git push myfork thisbranch -f

—
Reply to this email directly or view it on GitHub.

jtratner · 2014-04-22T18:04:40Z

@nspies - in regards to inf fix - either way is fine (but add a skip to the
test if it isn't fixed in this issue)

jreback · 2014-04-22T19:58:20Z

pandas/tests/test_series.py

+        iranks = iseries.rank()
+        assert_series_equal(iranks, exp)
+
+        values = np.array([-np.inf, -50, -1, -1e-20, -1e-25, -1e-50, 0, 1e-40, 1e-20, 1e-10, 2, 40, np.inf])


pass dtype='float64' here (this otherwise will fail on windows as numpy arrays default to float32 their), while pandas always uses float64 unless otherwise specified (but numpy is platform specific)

jreback · 2014-04-23T21:39:07Z

can you rebase and push again?

nspies · 2014-04-23T22:01:53Z

Okay, done.

jreback · 2014-04-23T22:07:30Z

gr8.

can you post the perf tests (that already exist)? (test_perf.sh -b master -t HEAD -r rank)

I don't believe their is a issue about the inf ranking issue? can you create one (ref this issue
and put a pointer to the tests (after this is merged) so that when someone fixes it the tests will be obvious)

squash if you can to a single commit

…ndas-dev#6868 cleaning up comments BUG: Series/DataFrame.rank() doesn't handle small floats correctly pandas-dev#6868 adding test for ranking with np.inf Added release note pandas-dev#6886 Fixing float conversions in test_rank()

nspies · 2014-04-23T22:48:22Z

I've opened #6945 regarding the inf issue.

Benchmarking

Invoked with :
--ncalls: 3
--repeats: 3

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
stats_rank_average                           |  36.8500 |  37.0790 |   0.9938 |
stats_rank_average_int                       |  25.7943 |  25.9283 |   0.9948 |
stats_rank2d_axis1_average                   |  14.7780 |  14.5180 |   1.0179 |
stats_rank2d_axis0_average                   |  26.8903 |  26.0603 |   1.0318 |
stats_rank_pct_average                       |  38.9826 |  34.4080 |   1.1330 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

I did the squash (but it still shows the individual commit messages -- is there some way to get rid of them?).

adding test for ranking with np.inf Added release note #6886 Fixing float conversions in test_rank()

jreback · 2014-04-23T23:24:14Z

merged via d96fd80

thanks for the contribution!

…ndas-dev#6868 adding test for ranking with np.inf Added release note pandas-dev#6886 Fixing float conversions in test_rank()

cpcloud reviewed Apr 15, 2014
View reviewed changes

jreback added Algos labels Apr 21, 2014

jreback modified the milestones: 0.15.0, 0.14.0 Apr 21, 2014

jreback added the Bug label Apr 21, 2014

jreback reviewed Apr 22, 2014
View reviewed changes

nspies pushed a commit to nspies/pandas that referenced this pull request Apr 23, 2014

Added release note pandas-dev#6886

a94073b

nspies mentioned this pull request Apr 23, 2014

Series/DataFrame.rank() doesn't handle np.inf properly #6945

Closed

jreback pushed a commit that referenced this pull request Apr 23, 2014

BUG: Series/DataFrame.rank() doesn't handle small floats correctly #6868

d96fd80

adding test for ranking with np.inf Added release note #6886 Fixing float conversions in test_rank()

jreback closed this Apr 23, 2014

jreback mentioned this pull request Apr 23, 2014

Series.rank() doesn't handle small floats correctly #6868

Closed

This was referenced Sep 24, 2014

Series/DataFrame.rank() doesn't handle certain floats properly #8365

Closed

BUG: floats cannot be ranked with tolerance #8379

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Series.rank() doesn't handle small floats correctly #6886

Series.rank() doesn't handle small floats correctly #6886

nspies commented Apr 15, 2014

cpcloud Apr 15, 2014

cpcloud Apr 15, 2014

nspies Apr 15, 2014

jtratner commented Apr 16, 2014

jtratner commented Apr 16, 2014

nspies commented Apr 16, 2014

jtratner commented Apr 17, 2014

nspies commented Apr 17, 2014

jtratner commented Apr 18, 2014

jreback commented Apr 21, 2014

jreback commented Apr 22, 2014

nspies commented Apr 22, 2014

jreback commented Apr 22, 2014

nspies commented Apr 22, 2014

jtratner commented Apr 22, 2014

jreback Apr 22, 2014

jreback commented Apr 23, 2014

nspies commented Apr 23, 2014

jreback commented Apr 23, 2014

nspies commented Apr 23, 2014

jreback commented Apr 23, 2014

Series.rank() doesn't handle small floats correctly #6886

Series.rank() doesn't handle small floats correctly #6886

Conversation

nspies commented Apr 15, 2014

cpcloud Apr 15, 2014

Choose a reason for hiding this comment

cpcloud Apr 15, 2014

Choose a reason for hiding this comment

nspies Apr 15, 2014

Choose a reason for hiding this comment

jtratner commented Apr 16, 2014

jtratner commented Apr 16, 2014

nspies commented Apr 16, 2014

jtratner commented Apr 17, 2014

nspies commented Apr 17, 2014

jtratner commented Apr 18, 2014

jreback commented Apr 21, 2014

jreback commented Apr 22, 2014

nspies commented Apr 22, 2014

jreback commented Apr 22, 2014

nspies commented Apr 22, 2014

jtratner commented Apr 22, 2014

jreback Apr 22, 2014

Choose a reason for hiding this comment

jreback commented Apr 23, 2014

nspies commented Apr 23, 2014

jreback commented Apr 23, 2014

nspies commented Apr 23, 2014

jreback commented Apr 23, 2014