Skip to content

BUG: Convert uint64 in maybe_convert_numeric #15005

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Dec 28, 2016

Add handling for uint64 elements in an array with the follow behavior specifications:

  1. If uint64 and NaN are both detected, the original input will be returned if coerce_numeric
    is False. Otherwise, an Exception is raised.

  2. If uint64 and negative numbers are both detected, the original input be returned if coerce_numeric is False. Otherwise, an Exception is raised.

Closes #14982.
Partial fix for #14983.

@gfyoung
Copy link
Member Author

gfyoung commented Dec 28, 2016

Sadly, this function gets even more complex because of all the different ways in which you can hit those two behavior cases I mentioned. In addition, you have to check those cases in so many different places (hence my decision to write two inner functions to avoid duplicate code). If there are ways in which the logic can be simplified without sacrificing robustness, that would be great!

@gfyoung gfyoung force-pushed the maybe-convert-numeric-uint64 branch from a83b51b to e15620f Compare December 28, 2016 21:58
@codecov-io
Copy link

codecov-io commented Dec 28, 2016

Current coverage is 84.78% (diff: 100%)

Merging #15005 into master will not change coverage

@@             master     #15005   diff @@
==========================================
  Files           145        145          
  Lines         51090      51090          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43315      43315          
  Misses         7775       7775          
  Partials          0          0          

Powered by Codecov. Last update 0d3ecfa...c3bd28a

bint seen_float = False
bint seen_complex = False
bint seen_int = False
bint seen_bool = False
object val
float64_t fval

def check_uint64_nan():
Copy link
Contributor

@jreback jreback Dec 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these need to be marked

cdef bint inline check_uint64_nan (and function below), otherwise this will slow way......down

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch. Done.

Whether or not we should return the original input array to avoid
data truncation.
"""
if seen_null and seen_uint:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to pass in typed parameters as well (bint seen_null, bint seen_uint)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

return_values : bool
Whether or not we should return the original input array to avoid
data truncation.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitly have a look at this with cython -a to see any hot spots. and profile as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Though how exactly do you run it with a file like this? Given that it's interleaved with other files (and not standalone), I get a lot of compiler errors when running this:

cython -a pandas/src/inference.pyx

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Dec 29, 2016
@gfyoung gfyoung force-pushed the maybe-convert-numeric-uint64 branch from e15620f to c10449c Compare December 29, 2016 10:30
@jreback
Copy link
Contributor

jreback commented Dec 29, 2016

try this (you can't do this directly on inference.pyx because its only part of a file (IOW it doesn't have all of the proper declarations), but lib.pyx does (which is the congolomeration).

(pandas) bash-3.2$ cython -a pandas/lib.pyx -I pandas/src
(pandas) bash-3.2$ git st
On branch weightby
Your branch is ahead of 'agg2' by 2 commits.
  (use "git push" to publish your local commits)
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        pandas/inference.html
        pandas/lib.html
        pandas/properties.html
        pandas/reduce.html

nothing added to commit but untracked files present (use "git add" to track)

at the top of lib.pyx, you can recompile with this = True, then %prun works nicely

# cython: profile=False

@gfyoung
Copy link
Member Author

gfyoung commented Dec 29, 2016

@jreback : Ran cython -a as requested, and no hotspots to report. Your cdef inline bint suggestion really made a difference as you already said. 😄

Some timing info (note I setup the array so that we hit as many of the potential pain points as possible):

In [1]: from pandas import lib, np
In [2]: n = 1000000
In [3]:
In [4]: arr = np.repeat([2**63], n)
In [5]: arr = arr + np.arange(1000000).astype('uint64')
In [6]: arr = np.array([arr[i] if i%2 == 0 else str(arr[i]) for i in range(n)],
                        dtype=object)
In [7]:
In [8]: arr[-1] = -1
In [9]: %timeit -n 100 lib.maybe_convert_objects(arr, set())
100 loops, best of 3: 44 µs per loop

@jreback
Copy link
Contributor

jreback commented Dec 29, 2016

can u add an asv to that effect?

Add handling for uint64 elements in an array
with the follow behavior specifications:

1) If uint64 and NaN are both detected, the
original input will be returned if coerce_numeric
is False. Otherwise, an Exception is raised.

2) If uint64 and negative numbers are both
detected, the original input be returned if
coerce_numeric is False. Otherwise, an
Exception is raised.

Closes pandas-devgh-14982.
Partial fix for pandas-devgh-14983.
@gfyoung gfyoung force-pushed the maybe-convert-numeric-uint64 branch from c10449c to c3bd28a Compare December 29, 2016 23:03
@gfyoung
Copy link
Member Author

gfyoung commented Dec 30, 2016

@jreback : Added asv test. Ready to merge if there are no other concerns.

@gfyoung gfyoung closed this Dec 30, 2016
@gfyoung gfyoung reopened this Dec 30, 2016
@jorisvandenbossche
Copy link
Member

@gfyoung You updated maybe_convert_numeric to work with uints, but maybe_convert_numeric is used in to_numeric, so the result is also that to_numeric now can handle those?
If so, can you add tests for this as well? (as this is the API for users)

Can you also show the results of the asv tests for to_numeric before/after to check the perf impact.

I was wondering, regarding your question on code complexity, you now call the _check_uint64_nan and _check_uint64_int64_conflict in many places throughout the for loop. But could it also be a choice to only call those after the for loop? This would give worse performance for cases with conflicts (as the original values are not returned / error raised on first occurence of a conflict, but only after parsing all values), but would improve code complexity slightly and give possibly slightly better performance for cases without conflicts (for arrays that actually get converted)

@gfyoung
Copy link
Member Author

gfyoung commented Dec 30, 2016

@jorisvandenbossche : To address your points:

  1. to_numeric : Yes! It can handle uint64 now. I will add test cases.
  2. Checking at the end certainly is plausible. Given how fast it is, it couldn't hurt.
  3. Benchmarks for to_numeric can be provided once (1) and (2) have been resolved.

For (2) and (3), I don't expect perf to be impacted too much, but we'll see.

@gfyoung
Copy link
Member Author

gfyoung commented Dec 30, 2016

@jorisvandenbossche : I just checked (1), and it appears more work will be needed (xref #14901). Could we put that off as a follow-up? I can tackle (2) and (3) for this PR.

if not seen_float:
if maybe_int:
as_int = int(val)
if maybe_int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when does maybe_int get turned off? this is very confusing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : maybe_int gets turned on / off when we call floatify right above. This is not super confusing IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, maybe its the diff

@@ -655,22 +731,60 @@ def maybe_convert_numeric(object[:] values, set na_values,
val = values[i]

if val.__hash__ is not None and val in na_values:
seen_null = True
if _check_uint64_nan(seen_uint, seen_null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this might be a bit simpler if instead of individual values we have a structure with those values instead, e.g.

struct Seen {
    bint null,
    bint uint, 
   .....
} seen;

then you can pass this to functions as needed (so the args to those checking functions will be consistent),
e.g. you don't have seen_uint, seen_null (and other args in another case).

can be done in a followup as well (if it makes sense)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we leave as a follow-up for the time being. I'm not sure yet about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, that's fine. This is just amazingly complicated code (the number of paths). If you have time / energy to simplify would be great (and a structure to hold state), and/or cython class might be just the thing here.

This is really a state machine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄 - indeed. Unfortunate but necessary. I agree that something could be simplified, but given that it is very performant at this point and most likely could be applied to maybe_convert_objects, it would be best for a future PR.

@jreback jreback closed this in 17d7ddb Dec 30, 2016
@jreback jreback added this to the 0.20.0 milestone Dec 30, 2016
@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

thanks!

so if you follow up with fixes for .to_numeric and change this to a state machine (e.g. really a class with methods), then this would be great.

@gfyoung gfyoung deleted the maybe-convert-numeric-uint64 branch December 30, 2016 18:54
jreback pushed a commit that referenced this pull request Dec 31, 2016
1) Patches `float` handling by reducing the "closeness" level when
checking conversions.  2) Patches `uint` handling by allowing casts to
`uint` dtypes of equal or lesser size to `int64` (when values are less
than `INT64_MAX`    Closes #14941.  Follow-up to #15005.

Author: gfyoung <[email protected]>

Closes #15024 from gfyoung/to-numeric-uint and squashes the following commits:

9e35819 [gfyoung] BUG: Patch float and uint handling in to_numeric
jreback pushed a commit that referenced this pull request Jan 3, 2017
Creates a Seen class to abstract objects encountered when doing type
conversions.    Follow-up to #15005.

Author: gfyoung <[email protected]>

Closes #15018 from gfyoung/maybe-convert-refactor and squashes the following commits:

531a361 [gfyoung] MAINT: Refactor logic in maybe_convert_*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: maybe_convert_numeric fails with uint64
4 participants