Skip to content

Series.rank() doesn't handle small floats correctly #6886

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

nspies
Copy link
Contributor

@nspies nspies commented Apr 15, 2014

Okay, this fixes #6868 but does so with a bit of a performance penalty. The current pandas version performs just a tad slower than scipy.stats.rankdata(), but after these changes, it's about 2-3x slower. On the plus side, it does (what I think is) the right thing.

I'm no cython expert, so there may well be things that can be done to improve the speed.

Here's some quick benchmarking code (couldn't figure out how to work with the vbench suite):

import timeit
setup2 = """import pandas
import scipy.stats
import numpy
numpy.random.seed(154)
s = pandas.Series(numpy.random.normal(size=10000))"""

print "pandas:", timeit.repeat(stmt='s.rank()', setup=setup2, repeat=3, number=10000)
print "scipy:", timeit.repeat(stmt='scipy.stats.rankdata(s)', 
    setup=setup2, repeat=3, number=10000)

@@ -132,6 +133,16 @@ cdef _take_2d_object(ndarray[object, ndim=2] values,
return result


cdef inline float64_are_diff(float64_t left, float64_t right):
if right == np.inf or right == -np.inf:
if left == right:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just say return left != right here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also say cdef inline bint to indicate to cython that you want to return a C type rather than the default of object (when no return type is specified)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those changes both make sense but don't end up changing the performance substantially.

On Apr 15, 2014, at 9:33 AM, Phillip Cloud wrote:

In pandas/algos.pyx:

@@ -132,6 +133,16 @@ cdef _take_2d_object(ndarray[object, ndim=2] values,
return result

+cdef inline float64_are_diff(float64_t left, float64_t right):

  • if right == np.inf or right == -np.inf:
  •    if left == right:
    
    you can also say cdef inline bint to indicate to cython that you want to return a C type rather than the default of object (when no return type is specified)


Reply to this email directly or view it on GitHub.

@jtratner
Copy link
Contributor

okay, so I've been looking at this (cython -a is really useful here, but you have to copy all the pyx files that are under pandas/ to pandas/src). Adding the bint declaration + changing the inf comparisons to use c-level declarations seems to bring pandas closer to scipy:

In [14]: %timeit s.rank()
1000 loops, best of 3: 1.17 ms per loop

In [15]: %timeit scipy.stats.rankdata(s)
1000 loops, best of 3: 915 µs per loop

I can't figure out how to get the last line not to go up to Python level though.

cdef inline bint float64_are_diff(float64_t left, float64_t right):
    cdef double abs_diff, allowed
    if right == MAXfloat64 or right == -MAXfloat64:
        if left == right:
            return False
        else:
            return True
    else:
        abs_diff = fabs(left - right)
        allowed = REL_TOL * fabs(right)
        return abs_diff > allowed

Here's cython -a output (complete white means entirely at C-level, bright yellow means entirely at python level):

cython-a output

@jtratner
Copy link
Contributor

Just for completeness, that's a reduction from 52cafd0, not just machine differences:

# 52cafd0
In [4]: %timeit s.rank()
%te100 loops, best of 3: 4.3 ms per loop

In [5]: %timeit scipy.stats.rankdata(s)
1000 loops, best of 3: 953 µs per loop
# small changes to use Cythonized inf comparison

so at the least we need to make sure that we are handling infs correctly

@nspies
Copy link
Contributor Author

nspies commented Apr 16, 2014

The changes you've made have brought the speed pretty much up to par with the original (master?).

@jtratner
Copy link
Contributor

Cool well it was just tweaking what you had. Would you be able to put together a bench mark + a test that includes infs (unless we have one already) ? Then we could get this merged.

@nspies
Copy link
Contributor Author

nspies commented Apr 17, 2014

This is fabulous: no, there was no test for ranking with infs, and no, neither the new nor the original version passes the test that I put in. The problem is that all nan values get clobbered with either inf or -inf, and hence the loop doesn't distinguish between inputs that had an (-)inf and inputs that had a nan. Don't have time right now to think about how to reimplement things to fix this, but hopefully shouldn't be too difficult.

As for benchmarking the ranking, I've found several in stat_ops. 52172e9 which incorporates your changes shows performance comparable to the master.

tl;dr we've fixed the float precision issue but uncovered a pre-existing inf issue.

@jtratner
Copy link
Contributor

Awesome! Great to find hidden bugs!

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 21, 2014
@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

@nspies can you add a vbench for this ? (in vb_suite) somewhere

need a release note referecing the bugfix issue

also, don't merge in master to your branch, pls rebase this away

see here: https://github.com/pydata/pandas/wiki/Using-Git

@jreback jreback added the Bug label Apr 21, 2014
@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

@nspies ping me when you have updated

@nspies
Copy link
Contributor Author

nspies commented Apr 22, 2014

I did the fetch upstream/rebase upstream/master but it won't let me push (says the tip of my current branch is behind its remote counterpart). A little more help? Sorry, still a git newbie :)

Also, there're a bunch of vbench benchmarks already which I think should suffice: http://pandas.pydata.org/pandas-docs/vbench/vb_stat_ops.html#stats-rank-average. Let me know if you think I should add something more.

What do you want to do about the inf ranking issue -- report as a separate bug? Right now, the unit test I added doesn't pass on travis CI.

On Apr 21, 2014, at 10:28 AM, jreback wrote:

@nspies can you add a vbench for this ? (in vb_suite) somewhere

need a release note referecing the bugfix issue

also, don't merge in master to your branch, pls rebase this away

see here: https://github.com/pydata/pandas/wiki/Using-Git


Reply to this email directly or view it on GitHub.

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

after rebasing you need to force push

eg

git push myfork thisbranch -f

@nspies
Copy link
Contributor Author

nspies commented Apr 22, 2014

Okay, got it pushed. Let me know if you want me to do something else here.

On Apr 22, 2014, at 10:48 AM, jreback wrote:

after rebasing you need to force push

eg

git push myfork thisbranch -f


Reply to this email directly or view it on GitHub.

@jtratner
Copy link
Contributor

@nspies - in regards to inf fix - either way is fine (but add a skip to the
test if it isn't fixed in this issue)

iranks = iseries.rank()
assert_series_equal(iranks, exp)

values = np.array([-np.inf, -50, -1, -1e-20, -1e-25, -1e-50, 0, 1e-40, 1e-20, 1e-10, 2, 40, np.inf])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass dtype='float64' here (this otherwise will fail on windows as numpy arrays default to float32 their), while pandas always uses float64 unless otherwise specified (but numpy is platform specific)

@jreback
Copy link
Contributor

jreback commented Apr 23, 2014

can you rebase and push again?

nspies pushed a commit to nspies/pandas that referenced this pull request Apr 23, 2014
@nspies
Copy link
Contributor Author

nspies commented Apr 23, 2014

Okay, done.

@jreback
Copy link
Contributor

jreback commented Apr 23, 2014

gr8.

can you post the perf tests (that already exist)? (test_perf.sh -b master -t HEAD -r rank)

I don't believe their is a issue about the inf ranking issue? can you create one (ref this issue
and put a pointer to the tests (after this is merged) so that when someone fixes it the tests will be obvious)

squash if you can to a single commit

…ndas-dev#6868

cleaning up comments

BUG: Series/DataFrame.rank() doesn't handle small floats correctly pandas-dev#6868

adding test for ranking with np.inf

Added release note pandas-dev#6886

Fixing float conversions in test_rank()
@nspies
Copy link
Contributor Author

nspies commented Apr 23, 2014

I've opened #6945 regarding the inf issue.

Benchmarking

Invoked with :
--ncalls: 3
--repeats: 3

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
stats_rank_average                           |  36.8500 |  37.0790 |   0.9938 |
stats_rank_average_int                       |  25.7943 |  25.9283 |   0.9948 |
stats_rank2d_axis1_average                   |  14.7780 |  14.5180 |   1.0179 |
stats_rank2d_axis0_average                   |  26.8903 |  26.0603 |   1.0318 |
stats_rank_pct_average                       |  38.9826 |  34.4080 |   1.1330 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

I did the squash (but it still shows the individual commit messages -- is there some way to get rid of them?).

jreback pushed a commit that referenced this pull request Apr 23, 2014


adding test for ranking with np.inf

Added release note #6886

Fixing float conversions in test_rank()
@jreback
Copy link
Contributor

jreback commented Apr 23, 2014

merged via d96fd80

thanks for the contribution!

@jreback jreback closed this Apr 23, 2014
jeffreystarr pushed a commit to jeffreystarr/pandas that referenced this pull request Apr 28, 2014
…ndas-dev#6868

adding test for ranking with np.inf

Added release note pandas-dev#6886

Fixing float conversions in test_rank()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series.rank() doesn't handle small floats correctly
4 participants