BUG: Convert uint64 in maybe_convert_numeric #15005

gfyoung · 2016-12-28T19:54:53Z

Add handling for uint64 elements in an array with the follow behavior specifications:

If uint64 and NaN are both detected, the original input will be returned if coerce_numeric
is False. Otherwise, an Exception is raised.
If uint64 and negative numbers are both detected, the original input be returned if coerce_numeric is False. Otherwise, an Exception is raised.

Closes #14982.
Partial fix for #14983.

gfyoung · 2016-12-28T19:57:09Z

Sadly, this function gets even more complex because of all the different ways in which you can hit those two behavior cases I mentioned. In addition, you have to check those cases in so many different places (hence my decision to write two inner functions to avoid duplicate code). If there are ways in which the logic can be simplified without sacrificing robustness, that would be great!

codecov-io · 2016-12-28T21:58:34Z

Current coverage is 84.78% (diff: 100%)

Merging #15005 into master will not change coverage

@@             master     #15005   diff @@
==========================================
  Files           145        145          
  Lines         51090      51090          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43315      43315          
  Misses         7775       7775          
  Partials          0          0

Powered by Codecov. Last update 0d3ecfa...c3bd28a

jreback · 2016-12-29T01:44:06Z

pandas/src/inference.pyx

        bint seen_float = False
        bint seen_complex = False
        bint seen_int = False
        bint seen_bool = False
        object val
        float64_t fval

+    def check_uint64_nan():


these need to be marked

cdef bint inline check_uint64_nan (and function below), otherwise this will slow way......down

Ah, good catch. Done.

jreback · 2016-12-29T01:44:35Z

pandas/src/inference.pyx

+            Whether or not we should return the original input array to avoid
+            data truncation.
+        """
+        if seen_null and seen_uint:


you need to pass in typed parameters as well (bint seen_null, bint seen_uint)

jreback · 2016-12-29T01:48:04Z

pandas/src/inference.pyx

+        return_values : bool
+            Whether or not we should return the original input array to avoid
+            data truncation.
+        """


definitly have a look at this with cython -a to see any hot spots. and profile as well.

Sounds good. Though how exactly do you run it with a file like this? Given that it's interleaved with other files (and not standalone), I get a lot of compiler errors when running this:

cython -a pandas/src/inference.pyx

jreback · 2016-12-29T11:00:13Z

try this (you can't do this directly on inference.pyx because its only part of a file (IOW it doesn't have all of the proper declarations), but lib.pyx does (which is the congolomeration).

(pandas) bash-3.2$ cython -a pandas/lib.pyx -I pandas/src
(pandas) bash-3.2$ git st
On branch weightby
Your branch is ahead of 'agg2' by 2 commits.
  (use "git push" to publish your local commits)
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        pandas/inference.html
        pandas/lib.html
        pandas/properties.html
        pandas/reduce.html

nothing added to commit but untracked files present (use "git add" to track)

at the top of lib.pyx, you can recompile with this = True, then %prun works nicely

# cython: profile=False

gfyoung · 2016-12-29T19:15:21Z

@jreback : Ran cython -a as requested, and no hotspots to report. Your cdef inline bint suggestion really made a difference as you already said. 😄

Some timing info (note I setup the array so that we hit as many of the potential pain points as possible):

In [1]: from pandas import lib, np
In [2]: n = 1000000
In [3]:
In [4]: arr = np.repeat([2**63], n)
In [5]: arr = arr + np.arange(1000000).astype('uint64')
In [6]: arr = np.array([arr[i] if i%2 == 0 else str(arr[i]) for i in range(n)],
                        dtype=object)
In [7]:
In [8]: arr[-1] = -1
In [9]: %timeit -n 100 lib.maybe_convert_objects(arr, set())
100 loops, best of 3: 44 µs per loop

jreback · 2016-12-29T22:31:23Z

can u add an asv to that effect?

Add handling for uint64 elements in an array with the follow behavior specifications: 1) If uint64 and NaN are both detected, the original input will be returned if coerce_numeric is False. Otherwise, an Exception is raised. 2) If uint64 and negative numbers are both detected, the original input be returned if coerce_numeric is False. Otherwise, an Exception is raised. Closes pandas-devgh-14982. Partial fix for pandas-devgh-14983.

gfyoung · 2016-12-30T09:29:24Z

@jreback : Added asv test. Ready to merge if there are no other concerns.

jorisvandenbossche · 2016-12-30T10:36:37Z

@gfyoung You updated maybe_convert_numeric to work with uints, but maybe_convert_numeric is used in to_numeric, so the result is also that to_numeric now can handle those?
If so, can you add tests for this as well? (as this is the API for users)

Can you also show the results of the asv tests for to_numeric before/after to check the perf impact.

I was wondering, regarding your question on code complexity, you now call the _check_uint64_nan and _check_uint64_int64_conflict in many places throughout the for loop. But could it also be a choice to only call those after the for loop? This would give worse performance for cases with conflicts (as the original values are not returned / error raised on first occurence of a conflict, but only after parsing all values), but would improve code complexity slightly and give possibly slightly better performance for cases without conflicts (for arrays that actually get converted)

gfyoung · 2016-12-30T10:47:49Z

@jorisvandenbossche : To address your points:

to_numeric : Yes! It can handle uint64 now. I will add test cases.
Checking at the end certainly is plausible. Given how fast it is, it couldn't hurt.
Benchmarks for to_numeric can be provided once (1) and (2) have been resolved.

For (2) and (3), I don't expect perf to be impacted too much, but we'll see.

gfyoung · 2016-12-30T10:55:00Z

@jorisvandenbossche : I just checked (1), and it appears more work will be needed (xref #14901). Could we put that off as a follow-up? I can tackle (2) and (3) for this PR.

jreback · 2016-12-30T13:21:45Z

pandas/src/inference.pyx

-                if not seen_float:
-                    if maybe_int:
-                        as_int = int(val)
+                if maybe_int:


when does maybe_int get turned off? this is very confusing

@jreback : maybe_int gets turned on / off when we call floatify right above. This is not super confusing IMO.

ok, maybe its the diff

jreback · 2016-12-30T18:44:19Z

pandas/src/inference.pyx

@@ -655,22 +731,60 @@ def maybe_convert_numeric(object[:] values, set na_values,
        val = values[i]

        if val.__hash__ is not None and val in na_values:
+            seen_null = True
+            if _check_uint64_nan(seen_uint, seen_null,


maybe this might be a bit simpler if instead of individual values we have a structure with those values instead, e.g.

struct Seen { bint null, bint uint, ..... } seen;

then you can pass this to functions as needed (so the args to those checking functions will be consistent),
e.g. you don't have seen_uint, seen_null (and other args in another case).

can be done in a followup as well (if it makes sense)

Why don't we leave as a follow-up for the time being. I'm not sure yet about this.

yep, that's fine. This is just amazingly complicated code (the number of paths). If you have time / energy to simplify would be great (and a structure to hold state), and/or cython class might be just the thing here.

This is really a state machine.

😄 - indeed. Unfortunate but necessary. I agree that something could be simplified, but given that it is very performant at this point and most likely could be applied to maybe_convert_objects, it would be best for a future PR.

jreback · 2016-12-30T18:54:11Z

thanks!

so if you follow up with fixes for .to_numeric and change this to a state machine (e.g. really a class with methods), then this would be great.

1) Patches `float` handling by reducing the "closeness" level when checking conversions. 2) Patches `uint` handling by allowing casts to `uint` dtypes of equal or lesser size to `int64` (when values are less than `INT64_MAX` Closes #14941. Follow-up to #15005. Author: gfyoung <[email protected]> Closes #15024 from gfyoung/to-numeric-uint and squashes the following commits: 9e35819 [gfyoung] BUG: Patch float and uint handling in to_numeric

Creates a Seen class to abstract objects encountered when doing type conversions. Follow-up to #15005. Author: gfyoung <[email protected]> Closes #15018 from gfyoung/maybe-convert-refactor and squashes the following commits: 531a361 [gfyoung] MAINT: Refactor logic in maybe_convert_*

gfyoung force-pushed the maybe-convert-numeric-uint64 branch from a83b51b to e15620f Compare December 28, 2016 21:58

jreback reviewed Dec 29, 2016

View reviewed changes

jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Dec 29, 2016

gfyoung force-pushed the maybe-convert-numeric-uint64 branch from e15620f to c10449c Compare December 29, 2016 10:30

gfyoung force-pushed the maybe-convert-numeric-uint64 branch from c10449c to c3bd28a Compare December 29, 2016 23:03

gfyoung closed this Dec 30, 2016

gfyoung reopened this Dec 30, 2016

jreback reviewed Dec 30, 2016

View reviewed changes

jreback closed this in 17d7ddb Dec 30, 2016

jreback added this to the 0.20.0 milestone Dec 30, 2016

gfyoung deleted the maybe-convert-numeric-uint64 branch December 30, 2016 18:54

This was referenced Dec 30, 2016

MAINT: Refactor logic in maybe_convert_* #15018

Closed

BUG: Patch float and uint handling in to_numeric #15024

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Convert uint64 in maybe_convert_numeric #15005

BUG: Convert uint64 in maybe_convert_numeric #15005

gfyoung commented Dec 28, 2016

gfyoung commented Dec 28, 2016

codecov-io commented Dec 28, 2016 •

edited

Loading

jreback Dec 29, 2016 •

edited

Loading

gfyoung Dec 29, 2016

jreback Dec 29, 2016

gfyoung Dec 29, 2016

jreback Dec 29, 2016

gfyoung Dec 29, 2016

jreback commented Dec 29, 2016

gfyoung commented Dec 29, 2016 •

edited

Loading

jreback commented Dec 29, 2016

gfyoung commented Dec 30, 2016

jorisvandenbossche commented Dec 30, 2016

gfyoung commented Dec 30, 2016 •

edited

Loading

gfyoung commented Dec 30, 2016

jreback Dec 30, 2016

gfyoung Dec 30, 2016

jreback Dec 30, 2016

jreback Dec 30, 2016

gfyoung Dec 30, 2016

jreback Dec 30, 2016

gfyoung Dec 30, 2016

jreback commented Dec 30, 2016

BUG: Convert uint64 in maybe_convert_numeric #15005

BUG: Convert uint64 in maybe_convert_numeric #15005

Conversation

gfyoung commented Dec 28, 2016

gfyoung commented Dec 28, 2016

codecov-io commented Dec 28, 2016 • edited Loading

Current coverage is 84.78% (diff: 100%)

jreback Dec 29, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Dec 29, 2016

gfyoung commented Dec 29, 2016 • edited Loading

jreback commented Dec 29, 2016

gfyoung commented Dec 30, 2016

jorisvandenbossche commented Dec 30, 2016

gfyoung commented Dec 30, 2016 • edited Loading

gfyoung commented Dec 30, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Dec 30, 2016

codecov-io commented Dec 28, 2016 •

edited

Loading

jreback Dec 29, 2016 •

edited

Loading

gfyoung commented Dec 29, 2016 •

edited

Loading

gfyoung commented Dec 30, 2016 •

edited

Loading