BUG: read_csv fails with uint64 #14983

Closed
gfyoung opened this issue Dec 25, 2016 · 2 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv

Comments

@gfyoung
Member

gfyoung commented Dec 25, 2016

master at aba7d2:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a\n' + str(2**63)
>>>
>>> read_csv(StringIO(data), engine='c').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes
>>>
>>> read_csv(StringIO(data), engine='python').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes

We should be able to handle uint64, and tests like this one here should not be enforcing buggy behavior.

The buggy behavior for the C engine traces to here, where we attempt to cast according to the order defined here. Note for starters that uint64 is not in that list. The try-except exists because of the OverflowError raised for int64, after which we immediately convert to an object array of strings. At first, I thought adding uint64 to the list would be enough, but that can cause bad casting in the other direction, i.e. negative numbers get converted to their uint64 equivalents.
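To make the "bad casting in the other direction" concrete, here is a minimal sketch using only the standard library. The helper name `as_uint64` is hypothetical and is not part of pandas; it only mimics what a C-style cast to an unsigned 64-bit integer does to a negative value.

```python
import ctypes


def as_uint64(value):
    """Hypothetical helper: reinterpret a Python int as an unsigned
    64-bit integer, wrapping around the way a C cast would."""
    return ctypes.c_uint64(value).value


print(as_uint64(2**63))  # 9223372036854775808, fits in uint64 as expected
print(as_uint64(-1))     # 18446744073709551615, i.e. 2**64 - 1
```

This is why simply adding uint64 to the cast order is not enough: a column containing a negative number would be silently converted instead of falling through to another dtype.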

The buggy behavior for the Python engine traces to here, where we attempt to infer the dtype. However, as I pointed out in #14982, the inference function fails with uint64 because of a similar (and nonsensical) try-except for OverflowError in int64.
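As a rough illustration of the inference decision being discussed, the following sketch picks a dtype name from the range of the values rather than from a try-except around an int64 conversion. The function `pick_integer_dtype` and its return values are assumptions made for illustration; this is not the pandas implementation.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MAX = 2**64 - 1


def pick_integer_dtype(values):
    """Hypothetical helper: choose a dtype name for a list of Python ints."""
    lo, hi = min(values), max(values)
    if INT64_MIN <= lo and hi <= INT64_MAX:
        return "int64"
    # uint64 is only safe when nothing is negative.
    if lo >= 0 and hi <= UINT64_MAX:
        return "uint64"
    # Mixed negatives and large values, or values beyond 2**64 - 1.
    return "object"


print(pick_integer_dtype([1, 2, 3]))    # int64
print(pick_integer_dtype([2**63]))      # uint64
print(pick_integer_dtype([-1, 2**63]))  # object
```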

The questions that I posed in #14982 are also relevant here, since the behavior should be consistent across both engines and also performant. Patching the Python engine probably requires fixing #14982 first, and patching the C engine probably requires adding new functions to parser.pyx and tokenizer.c to parse uint64. However, in light of those questions, I'm not really sure what is best.

@jreback
Contributor

jreback commented Dec 26, 2016

I think the response in #14982 answers this. The key idea is to make sure this is performant, though.

@jreback jreback added Bug Difficulty Intermediate Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv labels Dec 26, 2016
@jreback jreback added this to the Next Major Release milestone Dec 26, 2016
@jreback
Contributor

jreback commented Dec 26, 2016

Also, in this issue we can make sure that passing dtype='uint64' works properly (e.g. explicit user casting).
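A check along these lines might look like the sketch below. It assumes a pandas version that includes the fix (the issue was milestoned for 0.20.0) and uses `io.StringIO` from the standard library instead of the `pandas.compat.StringIO` import shown in the original report.

```python
import io

import pandas as pd

# Same data as in the report: a single value just above the int64 maximum.
data = "a\n" + str(2**63)

# Explicit user casting: ask for uint64 instead of relying on inference.
df = pd.read_csv(io.StringIO(data), dtype="uint64")
print(df["a"].dtype)
```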

gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 28, 2016
Add handling for uint64 elements in an array
with the following behavior specifications:

1) If uint64 and NaN are both detected, the
original input will be returned if coerce_numeric
is False. Otherwise, an Exception is raised.

2) If uint64 and negative numbers are both
detected, the original input will be returned if
coerce_numeric is False. Otherwise, an
Exception is raised.

Closes pandas-devgh-14982.
Partial fix for pandas-devgh-14983.
jreback pushed a commit that referenced this issue Dec 30, 2016
Add handling for `uint64` elements in an array with the following
behavior specifications:

1) If `uint64` and `NaN` are both detected, the original input will be
returned if `coerce_numeric` is `False`. Otherwise, an `Exception` is raised.

2) If `uint64` and negative numbers are both detected, the original input
will be returned if `coerce_numeric` is `False`. Otherwise, an `Exception` is raised.

Closes #14982. Partial fix for #14983.

Author: gfyoung <[email protected]>

Closes #15005 from gfyoung/maybe-convert-numeric-uint64 and squashes the following commits:

c3bd28a [gfyoung] BUG: Convert uint64 in maybe_convert_numeric
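
The two rules in the commit message above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `convert_sketch` is a hypothetical name, `ValueError` stands in for the unspecified `Exception`, and the final `return` is a placeholder for the normal conversion path, not the actual `maybe_convert_numeric` code.

```python
import math

INT64_MAX = 2**63 - 1


def convert_sketch(values, coerce_numeric=False):
    """Hypothetical sketch of the uint64 rules from the commit message."""
    seen_uint64 = any(isinstance(v, int) and v > INT64_MAX for v in values)
    seen_nan = any(isinstance(v, float) and math.isnan(v) for v in values)
    seen_negative = any(isinstance(v, int) and v < 0 for v in values)

    if seen_uint64 and (seen_nan or seen_negative):
        if coerce_numeric:
            # Assumption: the commit only says "an Exception is raised".
            raise ValueError("uint64 cannot be combined with NaN or negatives")
        return values  # hand back the original input untouched

    return list(values)  # placeholder for the regular numeric conversion


mixed = [2**63, -1]
print(convert_sketch(mixed) is mixed)  # True: the original input is returned
```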
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 31, 2016
Adds behavior to allow for parsing of
uint64 data in read_csv. Also ensures
that they are properly handled along
with NaN and negative values.

Closes pandas-devgh-14983.
@jreback jreback modified the milestones: 0.20.0, Next Major Release Jan 2, 2017
jreback pushed a commit that referenced this issue Jan 2, 2017
Adds behavior to allow for parsing of
uint64 data in read_csv. Also ensures
that they are properly handled along
with NaN and negative values.

Closes gh-14983.
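
To confirm the behavior after this fix, the original reproduction can be rerun against both engines. The sketch below assumes a pandas release that contains the commit above and swaps the `pandas.compat.StringIO` import from the report for `io.StringIO`; both engines should then report uint64 rather than object.

```python
import io

import pandas as pd

data = "a\n" + str(2**63)

# Collect the inferred dtype of column "a" for each parser engine.
dtypes = {}
for engine in ("c", "python"):
    frame = pd.read_csv(io.StringIO(data), engine=engine)
    dtypes[engine] = str(frame["a"].dtype)

print(dtypes)
```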