ENH: tolerance for Float64Index including join / reindex-nearest #9817

Open
hsuominen opened this issue Apr 5, 2015 · 7 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement, Index (Related to the Index class or subclasses)

Comments

@hsuominen

When trying to intersect two Index objects containing floats I get the following unexpected behavior:

>>> import numpy as np
>>> import pandas as pd
>>> new_index = pd.Index(np.arange(0.0, 1.0, 0.1), dtype='float64')
>>> new_index2 = pd.Index(np.arange(0.5, 1.0, 0.1), dtype='float64')
>>> intersection = new_index.intersection(new_index2)
>>> new_index
Float64Index([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], dtype='float64')
>>> new_index2
Float64Index([0.5, 0.6, 0.7, 0.8, 0.9], dtype='float64')
>>> intersection
Float64Index([0.5], dtype='float64')

I would expect the intersection to equal new_index2.

Pandas version string below:

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.0
nose: 1.3.4
Cython: 0.21.2
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 2.4.1
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

@dsm054
Contributor

dsm054 commented Apr 6, 2015

This is basically suggesting we introduce a float tolerance for alignment, because your two values aren't the same:

>>> repr(new_index.values[6])
'0.60000000000000009'
>>> repr(new_index2.values[1])
'0.59999999999999998'
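To make the mismatch concrete, here is a minimal sketch that rebuilds the two arrays from the report and compares the elements that both print as 0.6. The 1e-9 threshold is only an illustrative assumption, not a value pandas uses.

```python
import numpy as np

# Same two ranges as in the report; both contain a value displayed as 0.6.
a = np.arange(0.0, 1.0, 0.1)
b = np.arange(0.5, 1.0, 0.1)

# Exact comparison of the two "0.6" entries.
exactly_equal = bool(a[6] == b[1])

# Size of the discrepancy, and a comparison under an assumed tolerance.
gap = abs(a[6] - b[1])
close_enough = bool(gap <= 1e-9)  # 1e-9 is an arbitrary illustrative threshold

print(repr(a[6]), repr(b[1]), exactly_equal, gap, close_enough)
```

The gap is on the order of one unit in the last place, which is why an exact intersection drops the value while any reasonable tolerance would keep it.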

@shoyer
Member

shoyer commented Apr 6, 2015

The problem here is unfortunately rather inherent in the nature of floating point numbers. In general, producing the same set of floating point numbers two different ways will produce numbers that are not exactly equal.

We could do more in pandas to handle floating point tolerance automatically, though the exact implementation remains to be worked out and there are some potential performance issues. See here for more discussion: #9530

@jreback
Contributor

jreback commented Apr 6, 2015

Here's what's getting called, as these are both monotonic. It would need a tol, as @shoyer and @dsm054 point out, for these types of comparisons.

Not a bad idea, but would require a bit of effort.

@jreback
Contributor

jreback commented Apr 6, 2015

You can do this to get, in effect, what you want. This would need a tolerance as well (in the reindexer) to make it robust.

In [28]: target, indexer = new_index.reindex(new_index2,method='nearest')

In [29]: target
Out[29]: Float64Index([0.5, 0.6, 0.7, 0.8, 0.9], dtype='float64')

In [30]: indexer
Out[30]: array([5, 6, 7, 8, 9])

@jreback jreback added Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Apr 6, 2015
@jreback jreback modified the milestones: 0.17.0, Next Major Release Apr 6, 2015
@jreback jreback changed the title Index.intersection strange behaviour with floats ENH: tolerance for Float64Index including join / reindex-nearest Apr 6, 2015
@shoyer
Member

shoyer commented Jun 26, 2015

In the process of working this out in #10411.

Is there any safe default threshold to use when aligning float indexes?

Some possibilities:

  1. A fixed constant, e.g., 1e-9
  2. A constant that depends on index values, e.g., 1e-9 * (idx.max() - idx.min()).
  3. A user settable tolerance in the Float64Index constructor, e.g., Float64Index(values, tol=1e-9)

For 2 and 3, how do we handle indexes with different tolerances? Just use the larger one, I guess?

My sense is that this may be unsolvable -- probably better to force users to be explicit and supply a tolerance manually. Unfortunately, automatic alignment comes up all the time in pandas, and there's no easy way to control the tolerances in these cases.
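As a rough illustration of option 2 combined with the "use the larger one" rule, the sketch below computes a range-scaled tolerance per index and picks the maximum. `scaled_tolerance` and `shared_tolerance` are hypothetical helpers written for this example; nothing like them exists in pandas.

```python
import numpy as np
import pandas as pd


def scaled_tolerance(idx, rel=1e-9):
    """Hypothetical helper: tolerance proportional to the index's value range."""
    return rel * (idx.max() - idx.min())


def shared_tolerance(left, right, rel=1e-9):
    """Hypothetical helper: when aligning two indexes, use the larger tolerance."""
    return max(scaled_tolerance(left, rel), scaled_tolerance(right, rel))


left = pd.Index(np.arange(0.0, 1.0, 0.1))
right = pd.Index(np.arange(0.5, 1.0, 0.1))

tol = shared_tolerance(left, right)
print(tol)
```

One caveat with this scheme: an index with a single element (or all-equal values) has a range of zero, so the tolerance collapses to exact matching.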

@Dimchord

Have a look at numpy.isclose(). There are actually two tolerances, a relative (rtol, default 1e-05) and an absolute tolerance (atol, default 1e-08). From the documentation:

For finite values, isclose uses the following equation to test whether two floating point values are equivalent.

absolute(a - b) <= (atol + rtol * absolute(b))
The above equation is not symmetric in a and b, so that isclose(a, b) might be different from isclose(b, a) in some rare cases.
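A short sketch of that formula and its asymmetry follows. The values and the deliberately large `rtol=0.1` are chosen only to make the effect visible; `isclose_manual` is a hypothetical name for a hand-written copy of the documented equation.

```python
import numpy as np


def isclose_manual(a, b, rtol=1e-05, atol=1e-08):
    """Hand-written version of the documented test for finite values."""
    return abs(a - b) <= (atol + rtol * abs(b))


# With the default tolerances, the two "0.6" values from this issue match.
x = np.arange(0.0, 1.0, 0.1)[6]
y = np.arange(0.5, 1.0, 0.1)[1]
default_match = bool(np.isclose(x, y))

# Asymmetry: the relative term is scaled by the second argument only.
forward = bool(np.isclose(100.0, 110.5, rtol=0.1, atol=0.0))   # 10.5 <= 0.1 * 110.5
backward = bool(np.isclose(110.5, 100.0, rtol=0.1, atol=0.0))  # 10.5 <= 0.1 * 100.0

print(default_match, forward, backward)
```

The asymmetry matters for index alignment because the result of a tolerant join could then depend on which index is treated as the left operand.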

@shoyer
Member

shoyer commented Feb 25, 2016

@Dimchord we added a tolerance argument to reindexing in the above-mentioned pull requests -- it's in the latest version of pandas. We still haven't added a default tolerance for floating point indexes, though.
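For reference, a minimal sketch of how the explicit tolerance can stand in for a float-aware intersection, using `Index.get_indexer` with `method="nearest"`. The 1e-9 value is an assumption chosen for this data, since there is no default.

```python
import numpy as np
import pandas as pd

left = pd.Index(np.arange(0.0, 1.0, 0.1))
right = pd.Index(np.arange(0.5, 1.0, 0.1))

# Position in `left` of the nearest label for each label in `right`;
# -1 marks labels with no neighbour within the tolerance.
indexer = left.get_indexer(right, method="nearest", tolerance=1e-9)

# Labels of `right` that have a match in `left`: a tolerant "intersection".
matched = right[indexer != -1]

# A label that is genuinely absent is still rejected.
probe = left.get_indexer(pd.Index([0.6, 0.65]), method="nearest", tolerance=1e-9)

print(indexer, list(matched), probe)
```

This only covers explicit lookups; automatic alignment (for example, arithmetic between two Series) still compares labels exactly.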

@shoyer shoyer mentioned this issue Jul 24, 2018
4 tasks
@toobaz toobaz added Index Related to the Index class or subclasses and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 28, 2019
@mroeschke mroeschke removed the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 18, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022