BUG: Parse uint64 in read_csv #15020

gfyoung · 2016-12-31T10:11:45Z

Adds support for parsing uint64 data in read_csv. Also ensures that uint64
values are handled properly when they appear alongside NaN and negative values.

Closes #14983.
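As a rough illustration of the dtype resolution described above, here is a minimal pure-Python sketch. It is not the pandas implementation: `infer_integer_dtype` is a hypothetical helper, and the fallback to `object` for columns that fit neither integer range is an assumption made for the example. NaN handling is left out of the sketch.

```python
# Illustrative sketch only; not pandas code.
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MAX = 2**64 - 1


def infer_integer_dtype(tokens):
    """Hypothetical helper: pick a dtype name for a column of integer strings."""
    values = [int(tok) for tok in tokens]
    # Prefer int64 whenever every value fits in the signed range.
    if all(INT64_MIN <= v <= INT64_MAX for v in values):
        return "int64"
    # Otherwise try uint64, which requires every value to be non-negative.
    if all(0 <= v <= UINT64_MAX for v in values):
        return "uint64"
    # Assumption: anything else (negatives mixed with large values, or
    # values beyond the uint64 range) is kept as generic objects.
    return "object"
```

For example, `infer_integer_dtype(["9223372036854775808"])` picks `"uint64"` because the value is one past the int64 maximum, while adding `"-1"` to that column makes both integer ranges too narrow.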

codecov-io · 2016-12-31T10:56:59Z

Current coverage is 84.77% (diff: 100%)

Merging #15020 into master will not change coverage

@@             master     #15020   diff @@
==========================================
  Files           145        145          
  Lines         51131      51131          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43345      43345          
  Misses         7786       7786          
  Partials          0          0

Powered by Codecov. Last update 3662413...d018e38

jreback · 2016-12-31T15:46:11Z

pls run a perf check (maybe need some additional asvs)

gfyoung · 2016-12-31T21:11:47Z

@jreback : Added one benchmark, and performance does not seem impacted significantly (if at all).

jorisvandenbossche · 2017-01-02T09:14:39Z

doc/source/whatsnew/v0.20.0.txt

@@ -288,6 +288,7 @@ Bug Fixes
 - Bug in ``Index`` power operations with reversed operands (:issue:`14973`)
 - Bug in ``TimedeltaIndex`` addition where overflow was being allowed without error (:issue:`14816`)
 - Bug in ``DataFrame`` construction in which unsigned 64-bit integer elements were being converted to objects (:issue:`14881`)
+- Bug in ``pd.read_csv()`` in which unsigned 64-bit integer elements were being improperly converted to the wrong data types (:issue:`14983`)


jorisvandenbossche: We could also call this an enhancement, since read_csv now supports reading uint64 values?

gfyoung: IMO this is a bug because read_csv should be able to handle all data types equally. In addition, both the issue and this PR have already been tagged as a bug.

jorisvandenbossche: It doesn't really matter whether it is a bug fix or not (or how we tag the PR). What I wanted to say is that we can highlight it more by putting it in the enhancement section, if we think it is worth it.

gfyoung: Ah, fair enough. I just realized though: #14937 (see my changes to the whatsnew) might actually resolve this discussion?

jorisvandenbossche: yep, indeed, we can group all uint enhancement/bug fixes together

gfyoung: Okay, so we can leave that for #14937 then. 😄

gfyoung · 2017-01-02T19:40:02Z

@jreback : Any other comments about this PR? Otherwise, seems ready to merge.

jreback · 2017-01-02T19:41:15Z

@gfyoung haven't looked yet. soon.

jreback · 2017-01-02T19:47:55Z

pandas/parser.pyx

@@ -1750,6 +1772,78 @@ cdef inline int _try_double_nogil(parser_t *parser, int col,

    return 0

+cdef _try_uint64(parser_t *parser, int col, int line_start, int line_end,
+                 bint na_filter, kh_str_t *na_hashset):


jreback: prob for another PR, but we should generate these try_* routines with tempita (if possible)

gfyoung: It's possible, I imagine, but not as nice, since the implementations are definitely customized to the dtype (e.g. note how I did not pass in any NA value for uint64, but the int64 implementation does).

jreback: ok, if it seems worth it

jreback · 2017-01-02T19:49:22Z

pandas/src/parser/tokenizer.c

@@ -1876,3 +1886,88 @@ int64_t str_to_int64(const char *p_item, int64_t int_min, int64_t int_max,
    *error = 0;
    return number;
 }
+


jreback: maybe generate some code here too (for another PR)

gfyoung: Again, possible but not as nice, because the implementations are definitely customized to the dtype (e.g. compare the handling of negative numbers for int64 and uint64).
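To make the point about dtype-specific handling concrete, here is a small Python sketch of how an unsigned parser might treat a single token. It is an illustration under assumptions, not a port of the C code in tokenizer.c: `parse_uint64` is a hypothetical name, and raising `ValueError` or `OverflowError` stands in for whatever error codes the real routine uses. The overflow check is done before each multiplication, which is the usual way to stay within a fixed-width integer.

```python
# Illustrative sketch only; not the tokenizer.c implementation.
UINT64_MAX = 2**64 - 1


def parse_uint64(token: str) -> int:
    """Hypothetical helper: parse one token as an unsigned 64-bit integer."""
    s = token.strip()
    # A signed parser would record the sign and carry on; an unsigned
    # parser has nothing valid to do with a leading minus.
    if s.startswith("-"):
        raise ValueError("negative value cannot be parsed as uint64")
    if s.startswith("+"):
        s = s[1:]
    if not s:
        raise ValueError("no digits")

    # Largest prefix that can still accept one more digit, and the largest
    # final digit allowed when the prefix is exactly at that limit.
    pre_max, last_digit_max = divmod(UINT64_MAX, 10)

    number = 0
    for ch in s:
        if not ("0" <= ch <= "9"):
            raise ValueError("invalid character in integer token")
        digit = ord(ch) - ord("0")
        # Check before multiplying so the running value never exceeds the range.
        if number > pre_max or (number == pre_max and digit > last_digit_max):
            raise OverflowError("value does not fit in uint64")
        number = number * 10 + digit
    return number
```

The sign check at the top is the part a shared template would have to special-case: it has no counterpart in a signed parser, which is one reason generating both routines from a single template is less clean than it first sounds.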

jreback · 2017-01-02T19:50:22Z

thanks @gfyoung

my comments are for future code generation.

gfyoung · 2017-01-02T19:53:09Z

@jreback : Yep, got it. Thanks!

gfyoung force-pushed the read-csv-uint64 branch from e4c204f to e4e3c10 Compare December 31, 2016 10:56

gfyoung force-pushed the read-csv-uint64 branch from e4e3c10 to 0e11b64 Compare December 31, 2016 11:29

jreback added the Bug, Dtype Conversions, and IO CSV labels Dec 31, 2016

BUG: Parse uint64 in read_csv

d018e38

gfyoung force-pushed the read-csv-uint64 branch from 0e11b64 to d018e38 Compare December 31, 2016 21:11

jorisvandenbossche approved these changes Jan 2, 2017

jreback reviewed Jan 2, 2017

jreback added this to the 0.20.0 milestone Jan 2, 2017

jreback merged commit 74e20a0 into pandas-dev:master Jan 2, 2017

gfyoung deleted the read-csv-uint64 branch January 2, 2017 19:53
