Skip to content

BUG, TST: Check uint64 behaviour in algorithms.py #14934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Dec 20, 2016

First of a series of PR's to patch and test uint64 behaviour in core/algorithms.py. In this PR, the following functions are checked:

  1. duplicated() : robust but now has test to confirm
  2. mode() : robust but now has test to confirm
  3. unique() : non-robust but patched and tested

@@ -155,6 +155,18 @@ def is_integer_dtype(arr_or_dtype):
not issubclass(tipo, (np.datetime64, np.timedelta64)))


def is_signed_integer_dtype(arr_or_dtype):
tipo = _get_dtype_type(arr_or_dtype)
return (issubclass(tipo, np.signedinteger) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to add to public api (pandas.api)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Dec 20, 2016
@@ -508,6 +512,7 @@ def duplicated(values, keep='first'):
duplicated : ndarray
"""

values = com._asarray_tuplesafe(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, why are you doing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your comment below answers this question. Fair enough. Will remove.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep I figure out why you are doing it (but still not sure its necessary). I'd rather have a more strict requirement here for internal API's (unless good reason otherwise)

Copy link
Member Author

@gfyoung gfyoung Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's perfectly reasonable. I'll update the doc to make it clear to avoid confusion.

@@ -548,6 +567,9 @@ def mode(values):

dtype = values.dtype
if is_integer_dtype(values):
# This also works if dtype is uint64 because there is a 1-1
# correspondence between int64 and uint64, and we will cast
# all elements back to their correct uint64 values.
values = _ensure_int64(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather make this less magical and just select based on dtype. A read will be confused by this; better to do it on dtype.

@@ -772,6 +779,13 @@ def test_unique_index(self):
tm.assert_numpy_array_equal(case.duplicated(),
np.array([False, False, False]))

def test_duplicated_non_dtype(self):
# Make sure we can pass in array-likes with no dtype attribute.
cases = [[1, 2, 3], tuple([1, 2, 3])]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to support this. These are internal functions and you can expect a ndarray-like (and not a plain list). factorize is public and an exception.

@codecov-io
Copy link

codecov-io commented Dec 21, 2016

Current coverage is 84.65% (diff: 100%)

Merging #14934 into master will increase coverage by <.01%

@@             master     #14934   diff @@
==========================================
  Files           144        144          
  Lines         51030      51043    +13   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43199      43213    +14   
+ Misses         7831       7830     -1   
  Partials          0          0          

Powered by Codecov. Last update 74de478...6d31057

@gfyoung
Copy link
Member Author

gfyoung commented Dec 21, 2016

@jreback : Tests are green. Ready to merge if there is nothing else.

@jreback jreback added this to the 0.20.0 milestone Dec 21, 2016
@@ -236,6 +236,41 @@ def mode_int64(int64_t[:] values):

@cython.wraparound(False)
@cython.boundscheck(False)
def mode_uint64(uint64_t[:] values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move these to tempita? (for mode)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly. Done.

@@ -1202,6 +1209,20 @@ def test_int64_add_overflow():
b_mask=np.array([False, True]))


def test_mode_uint64():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test class (and consolidate the tests)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung gfyoung force-pushed the core-algorithms-uint64 branch 2 times, most recently from b47353a to 9f2563c Compare December 21, 2016 22:24
@@ -1202,6 +1209,57 @@ def test_int64_add_overflow():
b_mask=np.array([False, True]))


class TestMode(tm.TestCase):

def test_basic(self):
Copy link
Contributor

@jreback jreback Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like there is a test for mode in series/test_analytics.py can you make sure that we are looping over all integer, uint, float32, float64, and datelike types? (most are there, just making sure).

Also ok with moving all the test_mode test from series and putting them here (but then, leave the original test with a skip and a comment that says these are tested in algorithms).

alternatively, put a note here that these are tested in the series routine.

either one is ok.

Copy link
Member Author

@gfyoung gfyoung Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I duplicated tests between Series and algos but also wrote additional tests for algos to account for other array-like objects. I see no reason to couple testing together.

@gfyoung gfyoung force-pushed the core-algorithms-uint64 branch 3 times, most recently from 588f03b to b454731 Compare December 22, 2016 05:24
@gfyoung
Copy link
Member Author

gfyoung commented Dec 22, 2016

@jreback : Addressed all comments, and everything is green and passing. Ready to merge if there are no other concerns.

@@ -547,10 +566,12 @@ def mode(values):
constructor = Series

dtype = values.dtype
if is_integer_dtype(values):
if is_signed_integer_dtype(values):
values = _ensure_int64(values)
result = constructor(sorted(htable.mode_int64(values)), dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note, not sure whey sorted is used here, this does a conversion to list (then back when it is constructed).

can you do separate PR to fix this? (use np.sort)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and it can be at the end of the function (and actually you can move the boiler plate to the end of the function)
(e.g. the construct / sort)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or can you do in this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And how is that any different from np.sort, which converts to ndarray before we then convert to Series? We might as well just use Series.sort_values then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not, yes you could do a .sort_values() (I just don't want sorted)

Copy link
Member Author

@gfyoung gfyoung Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't quite move to the end (categorical doesn't sort). However, can just replace sorted with np.sort in the code.

table = kh_init_{{table_type}}()

{{if dtype == 'object'}}
build_count_table_{{dtype}}(values, mask, table)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, any reason NOT to make the signatures the same for build_count_table_*?

Copy link
Member Author

@gfyoung gfyoung Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_count_table_object takes a mask argument, whereas the others do null detection internally and take a dropna parameter instead. It might be possible to standardize, but as all of these functions are tied to Python-facing functions (e.g. value_counts), I don't think it would be a good idea to investigate in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok, can you make an issue about this then?

this PR still need some work, so might be simpler to do that first FYI.

Copy link
Member Author

@gfyoung gfyoung Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it doesn't seem as difficult as I thought previous. I'm going to try to make the change here in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, lmk then.

Copy link
Member Author

@gfyoung gfyoung Dec 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking into this further, I think I understand why object was special-cased from the rest. It appears that Python's hashing for object (PyObject_Hash) does not work very well with null values. Thus, when you insert a value like nan into such a hashtable multiple times, they will be interpreted as different keys, which is obviously wrong.

I have successfully refactored the code to align signatures nonetheless.

@jreback
Copy link
Contributor

jreback commented Dec 22, 2016

few comments

@gfyoung gfyoung force-pushed the core-algorithms-uint64 branch from b454731 to bbdb77d Compare December 22, 2016 17:45
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
@gfyoung gfyoung force-pushed the core-algorithms-uint64 branch from bbdb77d to 6d31057 Compare December 23, 2016 08:23
@jreback jreback closed this in 5710947 Dec 23, 2016
@jreback
Copy link
Contributor

jreback commented Dec 23, 2016

thanks!

@gfyoung gfyoung deleted the core-algorithms-uint64 branch December 23, 2016 13:46
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
First of a series of PR's to patch and test `uint64` behaviour in
`core/algorithms.py`.  In this PR, the following functions are
checked:    1. `duplicated()` : robust but now has test to confirm  2.
`mode()` : robust but now has test to confirm  3. `unique()` : non-
robust but patched and tested

Author: gfyoung <[email protected]>

Closes pandas-dev#14934 from gfyoung/core-algorithms-uint64 and squashes the following commits:

6d31057 [gfyoung] DOC, TST, BUG: Improve uint64 core/algos behavior
jreback pushed a commit that referenced this pull request Jan 22, 2017
1) Patches handling of `uint64` in `value_counts`
2) Adds tests for handling of `uint64` in `factorize`

xref #14934

Author: gfyoung <[email protected]>

Closes #15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits:

1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
1) Patches handling of `uint64` in `value_counts`
2) Adds tests for handling of `uint64` in `factorize`

xref pandas-dev#14934

Author: gfyoung <[email protected]>

Closes pandas-dev#15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits:

1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants