sym_diff failure on 3.4? #6444

Closed
dsm054 opened this issue Feb 22, 2014 · 15 comments · Fixed by #6453
Labels
Testing pandas testing functions or related to the test suite

@dsm054
Contributor

dsm054 commented Feb 22, 2014

test_symmetric_diff fails when I try pandas on a fresh Python 3.4 pull. It passes on 3.3 with the same numpy trunk version. It seems to have to do with the nan section:

>>> import pandas as pd
>>> from pandas import Index
>>> import numpy as np
>>> idx1 = Index([1, 2, np.nan])
>>> idx2 = Index([0, 1, np.nan])
>>> result = idx1.sym_diff(idx2)
>>> expected = Index([0.0, np.nan, 2.0, np.nan])  # oddness with nans
>>> nans = pd.isnull(expected)
>>> result
Float64Index([0.0, nan, nan, 2.0], dtype='object')
>>> expected
Float64Index([0.0, nan, 2.0, nan], dtype='object')
>>> nans
array([False,  True, False,  True], dtype=bool)
>>> result[nans]
Float64Index([nan, 2.0], dtype='object')

Version info (basically a fresh 3.4 build with only numpy dev installed):

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.0.candidate.1
python-bits: 32
OS: Linux
OS-release: 3.8.0-35-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.13.1-282-g9564ead
Cython: None
numpy: 1.9.0.dev-2d6ea6e
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 2.2
pytz: 2013.9
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Feb 22, 2014

cc @TomAugspurger

yeh...Tom will have a look when he tries out 3.4

it has to do with nan ordering. weird

@jreback jreback added this to the 0.14.0 milestone Feb 22, 2014
@jreback
Contributor

jreback commented Feb 22, 2014

@dsm054 you can take a look if you'd like as well!

right now I only test 3.4 on windows...and this has been coming up since this was merged in

#6016

@dsm054
Contributor Author

dsm054 commented Feb 22, 2014

Not sure how I missed that in the search. Should we close this as a dup?

@jreback
Contributor

jreback commented Feb 22, 2014

wasn't a 'direct' issue, just a comment...so this is fine

@TomAugspurger
Contributor

I haven't been able to get a 3.4 virtualenv running yet. Does the machine with 3.4 have a newer version of numpy as well?


@dsm054
Contributor Author

dsm054 commented Feb 22, 2014

What's the "expected" (= desired, really) behaviour for sorting an array with nans?

@jreback
Contributor

jreback commented Feb 22, 2014

@TomAugspurger you can use numpy 1.8.

@jreback
Contributor

jreback commented Feb 22, 2014

hmm...if you sort with a stable sort (mergesort), it should leave the nans alone (and you can't tell if 2 nans switch places). though IIRC there was an issue on the sort ordering....

@dsm054
Contributor Author

dsm054 commented Feb 22, 2014

I had another look at this and I'm starting to think there's nothing wrong here. It's simply that in Python 3.3 we have

>>> set(Float64Index([0.0, np.nan, np.nan, 2.0], dtype='object'))
{0.0, nan, 2.0, nan}

and in 3.4 we have

>>> set(Float64Index([0.0, np.nan, np.nan, 2.0], dtype='object'))
{0.0, nan, nan, 2.0}

and we were never promised otherwise. If sorting preserves nan location (although annoyingly, the presence of nans breaks the sorting even of the non-nan elements in both lists and ndarrays), then since sets are unordered, we have no reason to expect our expected result. If we want to impose Series-style "push nans to the back" sorting, we can, but right now I think it's just that the test is too sensitive.
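A minimal sketch of the point above, using plain Python floats rather than a pandas Index (the variable names are illustrative and not taken from the test suite). Because nan never compares equal to itself, two distinct nan objects both survive in a set, and the language makes no promise about the order in which the set yields them, so only order-independent properties are checked:

```python
import math

# Two distinct nan objects: nan != nan, so a set keeps both of them.
a = float("nan")
b = float("nan")
values = {0.0, a, 2.0, b}

# Iteration order of a set is an implementation detail, so only
# order-independent properties are computed here.
nan_count = sum(1 for v in values if math.isnan(v))
non_nan = sorted(v for v in values if not math.isnan(v))

print(len(values), nan_count, non_nan)
```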

@TomAugspurger
Contributor

Agreed. I think I'll rewrite the test to make sure the count of the nans is correct, and ignore the order.
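One way such an order-insensitive check could look, as a rough sketch only. `same_values_ignoring_order` is a hypothetical helper written for illustration; it is not part of pandas and not necessarily what the eventual fix does. It compares the nan counts and the sorted non-nan values separately:

```python
import math


def same_values_ignoring_order(result, expected):
    """Compare two float sequences without relying on element order.

    Hypothetical helper for illustration; not part of pandas.
    """
    result = [float(x) for x in result]
    expected = [float(x) for x in expected]
    if len(result) != len(expected):
        return False
    # The number of nans must match ...
    if sum(math.isnan(x) for x in result) != sum(math.isnan(x) for x in expected):
        return False
    # ... and so must the sorted non-nan values.
    return (sorted(x for x in result if not math.isnan(x))
            == sorted(x for x in expected if not math.isnan(x)))


nan = float("nan")
# The two orderings reported in this issue compare as equivalent.
print(same_values_ignoring_order([0.0, nan, nan, 2.0], [0.0, nan, 2.0, nan]))
```

Since the helper only iterates its arguments, it should also accept an Index or ndarray of floats, though that is an assumption and is not exercised here.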

@jreback
Contributor

jreback commented Feb 23, 2014

ok

I suspect maybe 3.4 changed some sort of hashing scheme though
eg hashes are now more pseudo random whereas in 3.3 it is not turned on

this is some sort of security thing I think

you can prob test this by seeing if the ordering of dict keys is the same in 3.3 vs 3.4
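A rough sketch of how that comparison might be scripted, to be run under each interpreter and diffed by eye. `hash_settings` is a hypothetical name, and a small set of strings is used instead of dict keys as an assumption, since newer Python versions keep dicts in insertion order. Note that hash randomization salts str and bytes hashes, not numeric ones, so this probes the general hypothesis rather than the nan case directly:

```python
import sys


def hash_settings():
    """Collect interpreter details that can influence set iteration order."""
    return {
        "version": sys.version_info[:3],
        # 1 when str/bytes hash randomization is enabled for this process.
        "hash_randomization": sys.flags.hash_randomization,
        # sys.hash_info.algorithm was added in Python 3.4.
        "algorithm": getattr(sys.hash_info, "algorithm", "unknown"),
        # Order in which this interpreter happens to yield a small set.
        "set_order": list({"a", "b", "c", "d"}),
    }


print(hash_settings())
```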

@jreback
Contributor

jreback commented Feb 23, 2014

http://bugs.python.org/issue19183

is prob the culprit

@dsm054
Contributor Author

dsm054 commented Feb 23, 2014

I haven't been following the Hash Randomization Wars(tm), but if we're implicitly relying on a fixed set order, we've made a wrong step regardless of whether or not we actually observe a failure.

@jreback
Contributor

jreback commented Feb 23, 2014

the issue is that a Float64Index doesn't sort the same when it has NaNs, and the set operations sort the results

  • need to fix this test so that it doesn't rely on the index sort order
  • need to doc that NaNs in a Float64Index can change the order
  • think about whether NaNs should actually not be used in a Float64Index, and instead use a NaT-like value to represent them

@jreback
Contributor

jreback commented Feb 23, 2014

related is #6194
