Skip to content

BUG: Parse uint64 in read_csv #15020

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 2, 2017

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Dec 31, 2016

Adds behavior to allow for parsing of uint64 data in read_csv. Also ensures
that they are properly handled along with NaN and negative values.

Closes #14983.

@codecov-io
Copy link

codecov-io commented Dec 31, 2016

Current coverage is 84.77% (diff: 100%)

Merging #15020 into master will not change coverage

@@             master     #15020   diff @@
==========================================
  Files           145        145          
  Lines         51131      51131          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43345      43345          
  Misses         7786       7786          
  Partials          0          0          

Powered by Codecov. Last update 3662413...d018e38

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv labels Dec 31, 2016
@jreback
Copy link
Contributor

jreback commented Dec 31, 2016

pls run a perf check (maybe need some additonal asvs)

Adds behavior to allow for parsing of
uint64 data in read_csv. Also ensures
that they are properly handled along
with NaN and negative values.

Closes pandas-devgh-14983.
@gfyoung
Copy link
Member Author

gfyoung commented Dec 31, 2016

@jreback : Added one benchmark, and performance does not seem impacted significantly (if at all).

@@ -288,6 +288,7 @@ Bug Fixes
- Bug in ``Index`` power operations with reversed operands (:issue:`14973`)
- Bug in ``TimedeltaIndex`` addition where overflow was being allowed without error (:issue:`14816`)
- Bug in ``DataFrame`` construction in which unsigned 64-bit integer elements were being converted to objects (:issue:`14881`)
- Bug in ``pd.read_csv()`` in which unsigned 64-bit integer elements were being improperly converted to the wrong data types (:issue:`14983`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also call this an enhancement that read_csv now supports reading uint64 values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is a bug because read_csv should be able to handle all data types equally. In addition, both the issue and this PR have already been tagged as a bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really matter whether it is a bug fix or not (or how we tag the PR), what I just wanted to say is that we can highlight it more by putting it in the enhancement section, if we think it is worth it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair enough. I just realized though: #14937 (see my changes to thewhatsnew) might actually resolve this discussion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, indeed, we can group all uint enhancement/bug fixes together

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so we can leave that for #14937 then. 😄

@gfyoung
Copy link
Member Author

gfyoung commented Jan 2, 2017

@jreback : Any other comments about this PR? Otherwise, seems ready to merge.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2017

@gfyoung haven't looked yet. soon.

@@ -1750,6 +1772,78 @@ cdef inline int _try_double_nogil(parser_t *parser, int col,

return 0

cdef _try_uint64(parser_t *parser, int col, int line_start, int line_end,
bint na_filter, kh_str_t *na_hashset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob for another PR, but we should generate these try_* routines with tempita (if possible)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible I imagine but not as nice since the implementations are definitely customized to the dtype (e.g. note how I did not pass in any NA value for uint64, but the int64 implementation does)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, if it seems worth it

@@ -1876,3 +1886,88 @@ int64_t str_to_int64(const char *p_item, int64_t int_min, int64_t int_max,
*error = 0;
return number;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe generate some code here too (for another PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, possible but not as nice because the implementations are definitely customized to the dtype (e.g. compare the difference in handling for "negative numbers" for int64 and uint64).

@jreback jreback added this to the 0.20.0 milestone Jan 2, 2017
@jreback jreback merged commit 74e20a0 into pandas-dev:master Jan 2, 2017
@jreback
Copy link
Contributor

jreback commented Jan 2, 2017

thanks @gfyoung

my comments are for future code generation.

@gfyoung
Copy link
Member Author

gfyoung commented Jan 2, 2017

@jreback : Yep, got it. Thanks!

@gfyoung gfyoung deleted the read-csv-uint64 branch January 2, 2017 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants