-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Convert uint64 in maybe_convert_numeric #15005
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Convert uint64 in maybe_convert_numeric #15005
Conversation
Sadly, this function gets even more complex because of all the different ways in which you can hit those two behavior cases I mentioned. In addition, you have to check those cases in so many different places (hence my decision to write two inner functions to avoid duplicate code). If there are ways in which the logic can be simplified without sacrificing robustness, that would be great! |
a83b51b
to
e15620f
Compare
Current coverage is 84.78% (diff: 100%)@@ master #15005 diff @@
==========================================
Files 145 145
Lines 51090 51090
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
Hits 43315 43315
Misses 7775 7775
Partials 0 0
|
bint seen_float = False | ||
bint seen_complex = False | ||
bint seen_int = False | ||
bint seen_bool = False | ||
object val | ||
float64_t fval | ||
|
||
def check_uint64_nan(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these need to be marked
cdef bint inline check_uint64_nan
(and function below), otherwise this will slow way......down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch. Done.
Whether or not we should return the original input array to avoid | ||
data truncation. | ||
""" | ||
if seen_null and seen_uint: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to pass in typed parameters as well (bint seen_null, bint seen_uint)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
return_values : bool | ||
Whether or not we should return the original input array to avoid | ||
data truncation. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitly have a look at this with cython -a
to see any hot spots. and profile as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Though how exactly do you run it with a file like this? Given that it's interleaved with other files (and not standalone), I get a lot of compiler errors when running this:
cython -a pandas/src/inference.pyx
e15620f
to
c10449c
Compare
try this (you can't do this directly on inference.pyx because its only part of a file (IOW it doesn't have all of the proper declarations), but lib.pyx does (which is the congolomeration).
at the top of lib.pyx, you can recompile with this = True, then %prun works nicely
|
@jreback : Ran Some timing info (note I setup the array so that we hit as many of the potential pain points as possible): In [1]: from pandas import lib, np
In [2]: n = 1000000
In [3]:
In [4]: arr = np.repeat([2**63], n)
In [5]: arr = arr + np.arange(1000000).astype('uint64')
In [6]: arr = np.array([arr[i] if i%2 == 0 else str(arr[i]) for i in range(n)],
dtype=object)
In [7]:
In [8]: arr[-1] = -1
In [9]: %timeit -n 100 lib.maybe_convert_objects(arr, set())
100 loops, best of 3: 44 µs per loop |
can u add an asv to that effect? |
Add handling for uint64 elements in an array with the follow behavior specifications: 1) If uint64 and NaN are both detected, the original input will be returned if coerce_numeric is False. Otherwise, an Exception is raised. 2) If uint64 and negative numbers are both detected, the original input be returned if coerce_numeric is False. Otherwise, an Exception is raised. Closes pandas-devgh-14982. Partial fix for pandas-devgh-14983.
c10449c
to
c3bd28a
Compare
@jreback : Added |
@gfyoung You updated Can you also show the results of the asv tests for I was wondering, regarding your question on code complexity, you now call the |
@jorisvandenbossche : To address your points:
For (2) and (3), I don't expect perf to be impacted too much, but we'll see. |
@jorisvandenbossche : I just checked (1), and it appears more work will be needed (xref #14901). Could we put that off as a follow-up? I can tackle (2) and (3) for this PR. |
if not seen_float: | ||
if maybe_int: | ||
as_int = int(val) | ||
if maybe_int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when does maybe_int get turned off? this is very confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : maybe_int
gets turned on / off when we call floatify
right above. This is not super confusing IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, maybe its the diff
@@ -655,22 +731,60 @@ def maybe_convert_numeric(object[:] values, set na_values, | |||
val = values[i] | |||
|
|||
if val.__hash__ is not None and val in na_values: | |||
seen_null = True | |||
if _check_uint64_nan(seen_uint, seen_null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this might be a bit simpler if instead of individual values we have a structure with those values instead, e.g.
struct Seen {
bint null,
bint uint,
.....
} seen;
then you can pass this to functions as needed (so the args to those checking functions will be consistent),
e.g. you don't have seen_uint, seen_null (and other args in another case).
can be done in a followup as well (if it makes sense)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we leave as a follow-up for the time being. I'm not sure yet about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, that's fine. This is just amazingly complicated code (the number of paths). If you have time / energy to simplify would be great (and a structure to hold state), and/or cython class might be just the thing here.
This is really a state machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄 - indeed. Unfortunate but necessary. I agree that something could be simplified, but given that it is very performant at this point and most likely could be applied to maybe_convert_objects
, it would be best for a future PR.
thanks! so if you follow up with fixes for |
1) Patches `float` handling by reducing the "closeness" level when checking conversions. 2) Patches `uint` handling by allowing casts to `uint` dtypes of equal or lesser size to `int64` (when values are less than `INT64_MAX` Closes #14941. Follow-up to #15005. Author: gfyoung <[email protected]> Closes #15024 from gfyoung/to-numeric-uint and squashes the following commits: 9e35819 [gfyoung] BUG: Patch float and uint handling in to_numeric
Creates a Seen class to abstract objects encountered when doing type conversions. Follow-up to #15005. Author: gfyoung <[email protected]> Closes #15018 from gfyoung/maybe-convert-refactor and squashes the following commits: 531a361 [gfyoung] MAINT: Refactor logic in maybe_convert_*
Add handling for
uint64
elements in an array with the follow behavior specifications:If
uint64
andNaN
are both detected, the original input will be returned ifcoerce_numeric
is
False
. Otherwise, anException
is raised.If
uint64
and negative numbers are both detected, the original input be returned ifcoerce_numeric
isFalse
. Otherwise, anException
is raised.Closes #14982.
Partial fix for #14983.