-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Series.rank() doesn't handle small floats correctly #6886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -132,6 +133,16 @@ cdef _take_2d_object(ndarray[object, ndim=2] values, | |||
return result | |||
|
|||
|
|||
cdef inline float64_are_diff(float64_t left, float64_t right): | |||
if right == np.inf or right == -np.inf: | |||
if left == right: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just say return left != right
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also say cdef inline bint
to indicate to cython that you want to return a C type rather than the default of object
(when no return type is specified)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those changes both make sense but don't end up changing the performance substantially.
On Apr 15, 2014, at 9:33 AM, Phillip Cloud wrote:
In pandas/algos.pyx:
@@ -132,6 +133,16 @@ cdef _take_2d_object(ndarray[object, ndim=2] values,
return result+cdef inline float64_are_diff(float64_t left, float64_t right):
- if right == np.inf or right == -np.inf:
you can also say cdef inline bint to indicate to cython that you want to return a C type rather than the default of object (when no return type is specified)if left == right:
—
Reply to this email directly or view it on GitHub.
okay, so I've been looking at this (cython -a is really useful here, but you have to copy all the pyx files that are under pandas/ to pandas/src). Adding the
I can't figure out how to get the last line not to go up to Python level though. cdef inline bint float64_are_diff(float64_t left, float64_t right):
cdef double abs_diff, allowed
if right == MAXfloat64 or right == -MAXfloat64:
if left == right:
return False
else:
return True
else:
abs_diff = fabs(left - right)
allowed = REL_TOL * fabs(right)
return abs_diff > allowed Here's cython -a output (complete white means entirely at C-level, bright yellow means entirely at python level): |
Just for completeness, that's a reduction from 52cafd0, not just machine differences: # 52cafd0
In [4]: %timeit s.rank()
%te100 loops, best of 3: 4.3 ms per loop
In [5]: %timeit scipy.stats.rankdata(s)
1000 loops, best of 3: 953 µs per loop
# small changes to use Cythonized inf comparison so at the least we need to make sure that we are handling infs correctly |
The changes you've made have brought the speed pretty much up to par with the original (master?). |
Cool well it was just tweaking what you had. Would you be able to put together a bench mark + a test that includes infs (unless we have one already) ? Then we could get this merged. |
This is fabulous: no, there was no test for ranking with infs, and no, neither the new nor the original version passes the test that I put in. The problem is that all nan values get clobbered with either As for benchmarking the ranking, I've found several in stat_ops. 52172e9 which incorporates your changes shows performance comparable to the master. tl;dr we've fixed the float precision issue but uncovered a pre-existing inf issue. |
Awesome! Great to find hidden bugs! |
@nspies can you add a vbench for this ? (in vb_suite) somewhere need a release note referecing the bugfix issue also, don't merge in master to your branch, pls rebase this away see here: https://github.com/pydata/pandas/wiki/Using-Git |
@nspies ping me when you have updated |
I did the fetch upstream/rebase upstream/master but it won't let me push (says the tip of my current branch is behind its remote counterpart). A little more help? Sorry, still a git newbie :) Also, there're a bunch of vbench benchmarks already which I think should suffice: http://pandas.pydata.org/pandas-docs/vbench/vb_stat_ops.html#stats-rank-average. Let me know if you think I should add something more. What do you want to do about the inf ranking issue -- report as a separate bug? Right now, the unit test I added doesn't pass on travis CI. On Apr 21, 2014, at 10:28 AM, jreback wrote:
|
after rebasing you need to force push eg git push myfork thisbranch -f |
Okay, got it pushed. Let me know if you want me to do something else here. On Apr 22, 2014, at 10:48 AM, jreback wrote:
|
@nspies - in regards to inf fix - either way is fine (but add a skip to the |
iranks = iseries.rank() | ||
assert_series_equal(iranks, exp) | ||
|
||
values = np.array([-np.inf, -50, -1, -1e-20, -1e-25, -1e-50, 0, 1e-40, 1e-20, 1e-10, 2, 40, np.inf]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass dtype='float64'
here (this otherwise will fail on windows as numpy arrays default to float32
their), while pandas always uses float64
unless otherwise specified (but numpy is platform specific)
can you rebase and push again? |
Okay, done. |
gr8. can you post the perf tests (that already exist)? ( I don't believe their is a issue about the squash if you can to a single commit |
…ndas-dev#6868 cleaning up comments BUG: Series/DataFrame.rank() doesn't handle small floats correctly pandas-dev#6868 adding test for ranking with np.inf Added release note pandas-dev#6886 Fixing float conversions in test_rank()
I've opened #6945 regarding the Benchmarking
I did the squash (but it still shows the individual commit messages -- is there some way to get rid of them?). |
adding test for ranking with np.inf Added release note #6886 Fixing float conversions in test_rank()
merged via d96fd80 thanks for the contribution! |
…ndas-dev#6868 adding test for ranking with np.inf Added release note pandas-dev#6886 Fixing float conversions in test_rank()
Okay, this fixes #6868 but does so with a bit of a performance penalty. The current pandas version performs just a tad slower than
scipy.stats.rankdata()
, but after these changes, it's about 2-3x slower. On the plus side, it does (what I think is) the right thing.I'm no cython expert, so there may well be things that can be done to improve the speed.
Here's some quick benchmarking code (couldn't figure out how to work with the vbench suite):