COMPAT: different orderings in value_counts on 32-bit platforms #11227

Closed
jreback opened this issue Oct 3, 2015 · 7 comments · Fixed by #39009
Labels
32bit 32-bit systems Testing pandas testing functions or related to the test suite

@jreback
Contributor

jreback commented Oct 3, 2015

On 32-bit Linux, the hashtable returns a slightly different ordering. My only guess is that the indexing is Py_ssize_t, which gets hashed and has different values there. So the test should be slightly different for those platforms.

see test skipping here: d6c7a3a

Not a big deal, but here's the question: should we guarantee these types of orderings, IOW, use an int64 instead of Py_ssize_t for indexing (on all platforms)?

@jreback jreback added Testing pandas testing functions or related to the test suite Compat pandas objects compatability with Numpy or Python functions labels Oct 3, 2015
@jreback jreback added this to the Next Major Release milestone Oct 3, 2015
@jreback
Contributor Author

jreback commented Oct 3, 2015

cc @behzadnouri

@behzadnouri
Contributor

@jreback what does it return on 32-bit linux?

@jreback
Contributor Author

jreback commented Oct 3, 2015

`pd.Series([2, 1, 1], index=[5., np.nan, 10.3])`

the orderings of the nan and 10.3 are switched (compared to the 64-bit ones) coming out of: hashtable.pyx/value_count_scalar64
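One way a test could tolerate the switched ties is to compare key/count pairs after putting both sides into a canonical order. The following is only a sketch: `normalize` and `same_counts` are hypothetical helpers, not pandas functions, and the two orderings are written out by hand from the description above rather than taken from real output.

```python
import math


def _is_nan(key):
    # NaN never compares equal to itself, so it needs an explicit check.
    return isinstance(key, float) and math.isnan(key)


def normalize(pairs):
    """Sort (key, count) pairs by count descending, then by key, NaN last."""
    def sort_key(pair):
        key, count = pair
        nan = _is_nan(key)
        return (-count, nan, 0.0 if nan else key)

    return sorted(pairs, key=sort_key)


def same_counts(left, right):
    """True if both sides hold the same key/count pairs, ignoring tie order."""
    left, right = normalize(left), normalize(right)
    if len(left) != len(right):
        return False
    for (lk, lc), (rk, rc) in zip(left, right):
        keys_match = (_is_nan(lk) and _is_nan(rk)) or lk == rk
        if lc != rc or not keys_match:
            return False
    return True


# Hand-written stand-ins for the two platform orderings (assumed, not measured).
ordering_a = [(5.0, 2), (float("nan"), 1), (10.3, 1)]
ordering_b = [(5.0, 2), (10.3, 1), (float("nan"), 1)]

print(same_counts(ordering_a, ordering_b))  # True
```

With a real result, the pairs could come from `result.items()` on the Series returned by `value_counts`.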

@jreback jreback added the 32bit 32-bit systems label Mar 29, 2017
@mroeschke mroeschke removed the Compat pandas objects compatability with Numpy or Python functions label Apr 3, 2020
@jreback
Contributor Author

jreback commented Dec 31, 2020

this might be fixed by consistent hashing cc @realead

@realead
Contributor

realead commented Jan 1, 2021

@jreback I think in general, we cannot guarantee the same hash for different platforms, e.g. for Python-objects:

>>> hash(10.5)
1152921504606846986

The above holds on 64-bit. On 32-bit the hash will be something different (as the 64-bit result above cannot be stored in 32 bits).
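To make the difference concrete, here is a rough sketch of CPython's documented hashing rule for numeric types, restricted to finite positive floats. `positive_float_hash` is a hypothetical helper written for illustration. The modulus `2**31 - 1` is what the Python documentation gives for 32-bit builds; it is passed in explicitly here rather than measured on such a platform. `pow(n, -1, modulus)` needs Python 3.8 or later.

```python
import sys


def positive_float_hash(x, modulus):
    """Hash of a finite, positive float x = m / n, reduced modulo a prime."""
    m, n = x.as_integer_ratio()  # n is a power of two, so it is invertible
    return (m % modulus) * pow(n, -1, modulus) % modulus


M61 = 2**61 - 1  # modulus documented for 64-bit builds
M31 = 2**31 - 1  # modulus documented for 32-bit builds

print(positive_float_hash(10.5, M61))  # 1152921504606846986
print(positive_float_hash(10.5, M31))  # 1073741834

# On the running interpreter, the helper should agree with the built-in hash.
print(positive_float_hash(10.5, sys.hash_info.modulus) == hash(10.5))  # True
```

The same value therefore lands in a different bucket order depending on the modulus, which is enough to reorder keys in a hash-ordered result.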

The order of key-count pairs provided by `value_count_{{dtype}}` (`cpdef value_count_{{dtype}}(ndarray[{{dtype}}] values, bint dropna):`) is arbitrary (see #12679) and depends on the hash function, so the order of keys with the same count is also arbitrary after the sort (and likewise depends on the hash function).

However, for float64, we have the same hash function for 32 and 64 bit (at least at the moment). I guess back then Python's hash was used for doubles, and thus the hashes differed between 32 and 64 bit (see the example above), which explains the different order.

So the example should be fine now, but... I must confess, I have changed this test some time ago 4cfa97a#diff-f2bbc83024c5767a6b4afec26ad8efa194d6dd8f140276a22bcf6b5e7bd37102L1197

From my point of view, this is not a bug in the first place: hashes can be different for different platforms (for whatever reasons) thus the order is arbitrary. A way to make order non-arbitrary would be e.g. to enforce insertion order in the result of value_count_{{dtype}}, but this would also mean a negative performance impact (as discussed here #12679 (comment) ).

@jreback
Contributor Author

jreback commented Jan 3, 2021

ok i agree. hashtables are reproducible but on that platform only. would you be able to add that test above for 32-bit (in the different ordering) so we can close this issue

@jreback jreback modified the milestones: Contributions Welcome, 1.3 Jan 3, 2021
@realead
Contributor

realead commented Jan 6, 2021

Once #39009 is merged, this issue would no longer exist, because the order would depend on the original ordering and not hash-functions (and thus would be independent of the platform).
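The idea of ordering ties by first appearance rather than by hash can be sketched in plain Python. This is not the pandas implementation from #39009, only an illustration: `value_counts_insertion_order` is a hypothetical helper that relies on `Counter` keeping first-insertion order and on `sorted` being stable, including with `reverse=True`.

```python
from collections import Counter


def value_counts_insertion_order(values):
    """Count values; ties keep the order in which keys first appeared."""
    counts = Counter(values)  # keys are stored in first-insertion order
    # A stable sort on the count alone leaves equal counts in that order.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


print(value_counts_insertion_order(["b", "a", "a", "c"]))
# [('a', 2), ('b', 1), ('c', 1)]
```

Because nothing here depends on hash values, the result is the same on any platform for the same input order.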

4 participants