-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Parse uint64 in read_csv #15020
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e4c204f
to
e4e3c10
Compare
Current coverage is 84.77% (diff: 100%)@@ master #15020 diff @@
==========================================
Files 145 145
Lines 51131 51131
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
Hits 43345 43345
Misses 7786 7786
Partials 0 0
|
e4e3c10
to
0e11b64
Compare
pls run a perf check (maybe need some additonal asvs) |
Adds behavior to allow for parsing of uint64 data in read_csv. Also ensures that they are properly handled along with NaN and negative values. Closes pandas-devgh-14983.
0e11b64
to
d018e38
Compare
@jreback : Added one benchmark, and performance does not seem impacted significantly (if at all). |
@@ -288,6 +288,7 @@ Bug Fixes | |||
- Bug in ``Index`` power operations with reversed operands (:issue:`14973`) | |||
- Bug in ``TimedeltaIndex`` addition where overflow was being allowed without error (:issue:`14816`) | |||
- Bug in ``DataFrame`` construction in which unsigned 64-bit integer elements were being converted to objects (:issue:`14881`) | |||
- Bug in ``pd.read_csv()`` in which unsigned 64-bit integer elements were being improperly converted to the wrong data types (:issue:`14983`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also call this an enhancement that read_csv now supports reading uint64 values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is a bug because read_csv
should be able to handle all data types equally. In addition, both the issue and this PR have already been tagged as a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't really matter whether it is a bug fix or not (or how we tag the PR), what I just wanted to say is that we can highlight it more by putting it in the enhancement section, if we think it is worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fair enough. I just realized though: #14937 (see my changes to thewhatsnew
) might actually resolve this discussion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, indeed, we can group all uint enhancement/bug fixes together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so we can leave that for #14937 then. 😄
@jreback : Any other comments about this PR? Otherwise, seems ready to merge. |
@gfyoung haven't looked yet. soon. |
@@ -1750,6 +1772,78 @@ cdef inline int _try_double_nogil(parser_t *parser, int col, | |||
|
|||
return 0 | |||
|
|||
cdef _try_uint64(parser_t *parser, int col, int line_start, int line_end, | |||
bint na_filter, kh_str_t *na_hashset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob for another PR, but we should generate these try_* routines with tempita (if possible)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible I imagine but not as nice since the implementations are definitely customized to the dtype
(e.g. note how I did not pass in any NA
value for uint64
, but the int64
implementation does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, if it seems worth it
@@ -1876,3 +1886,88 @@ int64_t str_to_int64(const char *p_item, int64_t int_min, int64_t int_max, | |||
*error = 0; | |||
return number; | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe generate some code here too (for another PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, possible but not as nice because the implementations are definitely customized to the dtype
(e.g. compare the difference in handling for "negative numbers" for int64
and uint64
).
thanks @gfyoung my comments are for future code generation. |
@jreback : Yep, got it. Thanks! |
Adds behavior to allow for parsing of
uint64
data inread_csv
. Also ensuresthat they are properly handled along with
NaN
and negative values.Closes #14983.