-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG, TST: Check uint64 behaviour in algorithms.py #14934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -155,6 +155,18 @@ def is_integer_dtype(arr_or_dtype): | |||
not issubclass(tipo, (np.datetime64, np.timedelta64))) | |||
|
|||
|
|||
def is_signed_integer_dtype(arr_or_dtype): | |||
tipo = _get_dtype_type(arr_or_dtype) | |||
return (issubclass(tipo, np.signedinteger) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok to add to public api (pandas.api)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
@@ -508,6 +512,7 @@ def duplicated(values, keep='first'): | |||
duplicated : ndarray | |||
""" | |||
|
|||
values = com._asarray_tuplesafe(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, why are you doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your comment below answers this question. Fair enough. Will remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep I figure out why you are doing it (but still not sure its necessary). I'd rather have a more strict requirement here for internal API's (unless good reason otherwise)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's perfectly reasonable. I'll update the doc to make it clear to avoid confusion.
@@ -548,6 +567,9 @@ def mode(values): | |||
|
|||
dtype = values.dtype | |||
if is_integer_dtype(values): | |||
# This also works if dtype is uint64 because there is a 1-1 | |||
# correspondence between int64 and uint64, and we will cast | |||
# all elements back to their correct uint64 values. | |||
values = _ensure_int64(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather make this less magical and just select based on dtype. A read will be confused by this; better to do it on dtype.
@@ -772,6 +779,13 @@ def test_unique_index(self): | |||
tm.assert_numpy_array_equal(case.duplicated(), | |||
np.array([False, False, False])) | |||
|
|||
def test_duplicated_non_dtype(self): | |||
# Make sure we can pass in array-likes with no dtype attribute. | |||
cases = [[1, 2, 3], tuple([1, 2, 3])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to support this. These are internal functions and you can expect a ndarray-like (and not a plain list). factorize
is public and an exception.
d9b7c84
to
b51f879
Compare
Current coverage is 84.65% (diff: 100%)@@ master #14934 diff @@
==========================================
Files 144 144
Lines 51030 51043 +13
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43199 43213 +14
+ Misses 7831 7830 -1
Partials 0 0
|
@jreback : Tests are green. Ready to merge if there is nothing else. |
@@ -236,6 +236,41 @@ def mode_int64(int64_t[:] values): | |||
|
|||
@cython.wraparound(False) | |||
@cython.boundscheck(False) | |||
def mode_uint64(uint64_t[:] values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move these to tempita? (for mode)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly. Done.
@@ -1202,6 +1209,20 @@ def test_int64_add_overflow(): | |||
b_mask=np.array([False, True])) | |||
|
|||
|
|||
def test_mode_uint64(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test class (and consolidate the tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
b47353a
to
9f2563c
Compare
@@ -1202,6 +1209,57 @@ def test_int64_add_overflow(): | |||
b_mask=np.array([False, True])) | |||
|
|||
|
|||
class TestMode(tm.TestCase): | |||
|
|||
def test_basic(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like there is a test for mode in series/test_analytics.py can you make sure that we are looping over all integer, uint, float32, float64, and datelike types? (most are there, just making sure).
Also ok with moving all the test_mode test from series and putting them here (but then, leave the original test with a skip and a comment that says these are tested in algorithms).
alternatively, put a note here that these are tested in the series routine.
either one is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I duplicated tests between Series
and algos
but also wrote additional tests for algos
to account for other array-like objects. I see no reason to couple testing together.
588f03b
to
b454731
Compare
@jreback : Addressed all comments, and everything is green and passing. Ready to merge if there are no other concerns. |
@@ -547,10 +566,12 @@ def mode(values): | |||
constructor = Series | |||
|
|||
dtype = values.dtype | |||
if is_integer_dtype(values): | |||
if is_signed_integer_dtype(values): | |||
values = _ensure_int64(values) | |||
result = constructor(sorted(htable.mode_int64(values)), dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note, not sure whey sorted
is used here, this does a conversion to list (then back when it is constructed).
can you do separate PR to fix this? (use np.sort)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it can be at the end of the function (and actually you can move the boiler plate to the end of the function)
(e.g. the construct / sort)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or can you do in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And how is that any different from np.sort
, which converts to ndarray
before we then convert to Series
? We might as well just use Series.sort_values
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not, yes you could do a .sort_values()
(I just don't want sorted
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't quite move to the end (categorical
doesn't sort). However, can just replace sorted
with np.sort
in the code.
table = kh_init_{{table_type}}() | ||
|
||
{{if dtype == 'object'}} | ||
build_count_table_{{dtype}}(values, mask, table) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, any reason NOT to make the signatures the same for build_count_table_*?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
build_count_table_object
takes a mask
argument, whereas the others do null
detection internally and take a dropna
parameter instead. It might be possible to standardize, but as all of these functions are tied to Python-facing functions (e.g. value_counts
), I don't think it would be a good idea to investigate in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, ok, can you make an issue about this then?
this PR still need some work, so might be simpler to do that first FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it doesn't seem as difficult as I thought previous. I'm going to try to make the change here in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, lmk then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So looking into this further, I think I understand why object
was special-cased from the rest. It appears that Python's hashing for object
(PyObject_Hash
) does not work very well with null
values. Thus, when you insert a value like nan
into such a hashtable multiple times, they will be interpreted as different keys, which is obviously wrong.
I have successfully refactored the code to align signatures nonetheless.
few comments |
b454731
to
bbdb77d
Compare
1) duplicated() Updates documentation to describe the "values" parameter in the signature, adds tests for uint64, and refactors to use duplicated_uint64. 2) mode() Updates documentation to describe the "values" parameter in the signature, adds tests for uint64, and reactors to use mode_uint64. 3) unique() Uses UInt64HashTable to patch a uint64 overflow bug analogous to that seen in Series.unique (patched in pandas-devgh-14915). 4) Types API Introduces "is_signed_integer_dtype" and "is_unsigned _integer_dtype" to the public API. Used in refactoring/ patching of 1-3.
bbdb77d
to
6d31057
Compare
thanks! |
First of a series of PR's to patch and test `uint64` behaviour in `core/algorithms.py`. In this PR, the following functions are checked: 1. `duplicated()` : robust but now has test to confirm 2. `mode()` : robust but now has test to confirm 3. `unique()` : non- robust but patched and tested Author: gfyoung <[email protected]> Closes pandas-dev#14934 from gfyoung/core-algorithms-uint64 and squashes the following commits: 6d31057 [gfyoung] DOC, TST, BUG: Improve uint64 core/algos behavior
1) Patches handling of `uint64` in `value_counts` 2) Adds tests for handling of `uint64` in `factorize` xref #14934 Author: gfyoung <[email protected]> Closes #15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits: 1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize
1) Patches handling of `uint64` in `value_counts` 2) Adds tests for handling of `uint64` in `factorize` xref pandas-dev#14934 Author: gfyoung <[email protected]> Closes pandas-dev#15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits: 1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize
First of a series of PR's to patch and test
uint64
behaviour incore/algorithms.py
. In this PR, the following functions are checked:duplicated()
: robust but now has test to confirmmode()
: robust but now has test to confirmunique()
: non-robust but patched and tested