BUG, TST: Check uint64 behaviour in algorithms.py #14934

gfyoung · 2016-12-20T22:40:55Z

First of a series of PR's to patch and test uint64 behaviour in core/algorithms.py. In this PR, the following functions are checked:

duplicated() : robust but now has test to confirm
mode() : robust but now has test to confirm
unique() : non-robust but patched and tested

jreback · 2016-12-20T23:02:04Z

pandas/types/common.py

@@ -155,6 +155,18 @@ def is_integer_dtype(arr_or_dtype):
            not issubclass(tipo, (np.datetime64, np.timedelta64)))


+def is_signed_integer_dtype(arr_or_dtype):
+    tipo = _get_dtype_type(arr_or_dtype)
+    return (issubclass(tipo, np.signedinteger) and


ok to add to public api (pandas.api)

jreback · 2016-12-20T23:39:29Z

pandas/core/algorithms.py

@@ -508,6 +512,7 @@ def duplicated(values, keep='first'):
    duplicated : ndarray
    """

+    values = com._asarray_tuplesafe(values)


hmm, why are you doing this?

Your comment below answers this question. Fair enough. Will remove.

yep I figure out why you are doing it (but still not sure its necessary). I'd rather have a more strict requirement here for internal API's (unless good reason otherwise)

Yeah, that's perfectly reasonable. I'll update the doc to make it clear to avoid confusion.

jreback · 2016-12-20T23:40:42Z

pandas/core/algorithms.py

@@ -548,6 +567,9 @@ def mode(values):

    dtype = values.dtype
    if is_integer_dtype(values):
+        # This also works if dtype is uint64 because there is a 1-1
+        # correspondence between int64 and uint64, and we will cast
+        # all elements back to their correct uint64 values.
        values = _ensure_int64(values)


I would rather make this less magical and just select based on dtype. A read will be confused by this; better to do it on dtype.

jreback · 2016-12-20T23:41:57Z

pandas/tests/test_algos.py

@@ -772,6 +779,13 @@ def test_unique_index(self):
            tm.assert_numpy_array_equal(case.duplicated(),
                                        np.array([False, False, False]))

+    def test_duplicated_non_dtype(self):
+        # Make sure we can pass in array-likes with no dtype attribute.
+        cases = [[1, 2, 3], tuple([1, 2, 3])]


I don't think we need to support this. These are internal functions and you can expect a ndarray-like (and not a plain list). factorize is public and an exception.

codecov-io · 2016-12-21T07:00:06Z

Current coverage is 84.65% (diff: 100%)

Merging #14934 into master will increase coverage by <.01%

@@             master     #14934   diff @@
==========================================
  Files           144        144          
  Lines         51030      51043    +13   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43199      43213    +14   
+ Misses         7831       7830     -1   
  Partials          0          0

Powered by Codecov. Last update 74de478...6d31057

gfyoung · 2016-12-21T07:26:57Z

@jreback : Tests are green. Ready to merge if there is nothing else.

jreback · 2016-12-21T11:29:57Z

pandas/hashtable.pyx

@@ -236,6 +236,41 @@ def mode_int64(int64_t[:] values):

 @cython.wraparound(False)
 @cython.boundscheck(False)
+def mode_uint64(uint64_t[:] values):


can we move these to tempita? (for mode)

Certainly. Done.

jreback · 2016-12-21T11:30:29Z

pandas/tests/test_algos.py

@@ -1202,6 +1209,20 @@ def test_int64_add_overflow():
                                       b_mask=np.array([False, True]))


+def test_mode_uint64():


can you add a test class (and consolidate the tests)

jreback · 2016-12-21T22:39:13Z

pandas/tests/test_algos.py

@@ -1202,6 +1209,57 @@ def test_int64_add_overflow():
                                       b_mask=np.array([False, True]))


+class TestMode(tm.TestCase):
+
+    def test_basic(self):


looks like there is a test for mode in series/test_analytics.py can you make sure that we are looping over all integer, uint, float32, float64, and datelike types? (most are there, just making sure).

Also ok with moving all the test_mode test from series and putting them here (but then, leave the original test with a skip and a comment that says these are tested in algorithms).

alternatively, put a note here that these are tested in the series routine.

either one is ok.

I duplicated tests between Series and algos but also wrote additional tests for algos to account for other array-like objects. I see no reason to couple testing together.

gfyoung · 2016-12-22T05:54:35Z

@jreback : Addressed all comments, and everything is green and passing. Ready to merge if there are no other concerns.

jreback · 2016-12-22T11:41:02Z

pandas/core/algorithms.py

@@ -547,10 +566,12 @@ def mode(values):
        constructor = Series

    dtype = values.dtype
-    if is_integer_dtype(values):
+    if is_signed_integer_dtype(values):
        values = _ensure_int64(values)
        result = constructor(sorted(htable.mode_int64(values)), dtype=dtype)


side note, not sure whey sorted is used here, this does a conversion to list (then back when it is constructed).

can you do separate PR to fix this? (use np.sort)

and it can be at the end of the function (and actually you can move the boiler plate to the end of the function)
(e.g. the construct / sort)

or can you do in this PR

And how is that any different from np.sort, which converts to ndarray before we then convert to Series? We might as well just use Series.sort_values then?

its not, yes you could do a .sort_values() (I just don't want sorted)

Can't quite move to the end (categorical doesn't sort). However, can just replace sorted with np.sort in the code.

jreback · 2016-12-22T11:43:06Z

pandas/src/hashtable_func_helper.pxi.in

+    table = kh_init_{{table_type}}()
+
+    {{if dtype == 'object'}}
+    build_count_table_{{dtype}}(values, mask, table)


hmm, any reason NOT to make the signatures the same for build_count_table_*?

build_count_table_object takes a mask argument, whereas the others do null detection internally and take a dropna parameter instead. ~~It might be possible to standardize, but as all of these functions are tied to Python-facing functions (e.g. value_counts), I don't think it would be a good idea to investigate in this PR.~~

hmm, ok, can you make an issue about this then?

this PR still need some work, so might be simpler to do that first FYI.

Actually, it doesn't seem as difficult as I thought previous. I'm going to try to make the change here in this PR.

ok, lmk then.

So looking into this further, I think I understand why object was special-cased from the rest. It appears that Python's hashing for object (PyObject_Hash) does not work very well with null values. Thus, when you insert a value like nan into such a hashtable multiple times, they will be interpreted as different keys, which is obviously wrong.

I have successfully refactored the code to align signatures nonetheless.

jreback · 2016-12-22T11:44:43Z

few comments

1) duplicated() Updates documentation to describe the "values" parameter in the signature, adds tests for uint64, and refactors to use duplicated_uint64. 2) mode() Updates documentation to describe the "values" parameter in the signature, adds tests for uint64, and reactors to use mode_uint64. 3) unique() Uses UInt64HashTable to patch a uint64 overflow bug analogous to that seen in Series.unique (patched in pandas-devgh-14915). 4) Types API Introduces "is_signed_integer_dtype" and "is_unsigned _integer_dtype" to the public API. Used in refactoring/ patching of 1-3.

jreback · 2016-12-23T11:15:30Z

thanks!

First of a series of PR's to patch and test `uint64` behaviour in `core/algorithms.py`. In this PR, the following functions are checked: 1. `duplicated()` : robust but now has test to confirm 2. `mode()` : robust but now has test to confirm 3. `unique()` : non- robust but patched and tested Author: gfyoung <[email protected]> Closes pandas-dev#14934 from gfyoung/core-algorithms-uint64 and squashes the following commits: 6d31057 [gfyoung] DOC, TST, BUG: Improve uint64 core/algos behavior

1) Patches handling of `uint64` in `value_counts` 2) Adds tests for handling of `uint64` in `factorize` xref #14934 Author: gfyoung <[email protected]> Closes #15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits: 1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize

1) Patches handling of `uint64` in `value_counts` 2) Adds tests for handling of `uint64` in `factorize` xref pandas-dev#14934 Author: gfyoung <[email protected]> Closes pandas-dev#15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits: 1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize

jreback reviewed Dec 20, 2016

View reviewed changes

jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Dec 20, 2016

jreback reviewed Dec 20, 2016

View reviewed changes

gfyoung force-pushed the core-algorithms-uint64 branch from d9b7c84 to b51f879 Compare December 21, 2016 02:02

gfyoung mentioned this pull request Dec 21, 2016

BUG: Patch rank() uint64 behavior #14935

Closed

jreback added this to the 0.20.0 milestone Dec 21, 2016

jreback reviewed Dec 21, 2016

View reviewed changes

gfyoung force-pushed the core-algorithms-uint64 branch 2 times, most recently from b47353a to 9f2563c Compare December 21, 2016 22:24

jreback reviewed Dec 21, 2016

View reviewed changes

gfyoung force-pushed the core-algorithms-uint64 branch 3 times, most recently from 588f03b to b454731 Compare December 22, 2016 05:24

jreback reviewed Dec 22, 2016

View reviewed changes

gfyoung force-pushed the core-algorithms-uint64 branch from b454731 to bbdb77d Compare December 22, 2016 17:45

gfyoung force-pushed the core-algorithms-uint64 branch from bbdb77d to 6d31057 Compare December 23, 2016 08:23

jreback closed this in 5710947 Dec 23, 2016

gfyoung deleted the core-algorithms-uint64 branch December 23, 2016 13:46

gfyoung mentioned this pull request Jan 19, 2017

BUG, TST: Patch handling of uint64 in algorithms #15162

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG, TST: Check uint64 behaviour in algorithms.py #14934

BUG, TST: Check uint64 behaviour in algorithms.py #14934

gfyoung commented Dec 20, 2016 •

edited

Loading

jreback Dec 20, 2016

gfyoung Dec 21, 2016

jreback Dec 20, 2016

gfyoung Dec 21, 2016

jreback Dec 21, 2016

gfyoung Dec 21, 2016 •

edited

Loading

jreback Dec 20, 2016

jreback Dec 20, 2016

codecov-io commented Dec 21, 2016 •

edited

Loading

gfyoung commented Dec 21, 2016

jreback Dec 21, 2016

gfyoung Dec 21, 2016

jreback Dec 21, 2016

gfyoung Dec 21, 2016

jreback Dec 21, 2016 •

edited

Loading

gfyoung Dec 22, 2016 •

edited

Loading

gfyoung commented Dec 22, 2016

jreback Dec 22, 2016

jreback Dec 22, 2016

jreback Dec 22, 2016

gfyoung Dec 22, 2016

jreback Dec 22, 2016

gfyoung Dec 22, 2016 •

edited

Loading

jreback Dec 22, 2016

gfyoung Dec 22, 2016 •

edited

Loading

jreback Dec 22, 2016

gfyoung Dec 22, 2016 •

edited

Loading

jreback Dec 23, 2016

gfyoung Dec 23, 2016 •

edited

Loading

jreback commented Dec 22, 2016

jreback commented Dec 23, 2016

		@@ -1202,6 +1209,20 @@ def test_int64_add_overflow():
		b_mask=np.array([False, True]))


		def test_mode_uint64():

BUG, TST: Check uint64 behaviour in algorithms.py #14934

BUG, TST: Check uint64 behaviour in algorithms.py #14934

Conversation

gfyoung commented Dec 20, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Dec 21, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented Dec 21, 2016 • edited Loading

Current coverage is 84.65% (diff: 100%)

gfyoung commented Dec 21, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback Dec 21, 2016 • edited Loading

Choose a reason for hiding this comment

gfyoung Dec 22, 2016 • edited Loading

Choose a reason for hiding this comment

gfyoung commented Dec 22, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Dec 22, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Dec 22, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Dec 22, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Dec 23, 2016 • edited Loading

Choose a reason for hiding this comment

jreback commented Dec 22, 2016

jreback commented Dec 23, 2016

gfyoung commented Dec 20, 2016 •

edited

Loading

gfyoung Dec 21, 2016 •

edited

Loading

codecov-io commented Dec 21, 2016 •

edited

Loading

jreback Dec 21, 2016 •

edited

Loading

gfyoung Dec 22, 2016 •

edited

Loading

gfyoung Dec 22, 2016 •

edited

Loading

gfyoung Dec 22, 2016 •

edited

Loading

gfyoung Dec 22, 2016 •

edited

Loading

gfyoung Dec 23, 2016 •

edited

Loading