-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
when a column contains alpha numeric ending with 'e', pandas converts these to float64 #8272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hi, I'm interested in working on this one. Would it be possible to have it assigned to me? Thanks. |
@staple your comment is enough to for it! |
Hi, here’s a rough description of what I think is going on: It looks like the excel parser attempts to convert strings to floats - if all strings in a column can be fully parsed as floats, that column becomes numeric. Determination of whether a string can be converted to a float is done using ‘floatify’ in parse_helper.h. This is in turn implemented using an implementation of strtod (renamed xstrtod) in the same file. From a cursory search it looks like this implementation of strtod comes from Sanos (http://www.jbox.dk/sanos/). The issue trigging this failure is that this strtod implementation accepts a complete numeric string with a trailing ‘e’ as the representation of a floating point number. That is, ‘12e’ is parsed as the number 12.0, with no trailing characters. (Because there are no trailing characters, floatify converts the string to a float.) I compared this with the behavior of strtod in my vanilla ubuntu 12 system, and there ’12e’ is parsed as the number 12.0, but with a trailing character ‘e’. Looking at the c-99 spec it appears that in this case treating ‘e’ as a trailing character and not part of the numeric representation is the conforming behavior. Would it make sense to consider changing the strtod implementation to treat the ‘e’ as a trailing character in this case rather than part of the numerical representation? I haven’t yet investigated how far reaching the consequences of doing this would be, but when I tested the change with a simple patch all of the tests passed. |
@jreback Digging a little more, it looks like there are two separate copies of xstrtod in the code (in parse_helper.h and tokenizer.c). They seem to have been copied from the same original implementation in Sanos, but in at least one case the may have been patched differently in the pandas code over time. Would it make sense to look into unifying these two xstrtods? I don't have a strong opinion on whether the functionality requested in this issue warrants a change to xstrtod. But, overall does it make sense to change xstrtod to be "conforming" in this dangling 'e' case? (If so, maybe a separate low priority ticket could be filed for this.) From a quick look, currently the version of xstrtod in parse_helper.h is used by parser internals and ObjectBlock.convert. And the version in tokenizer.c is used for a parser.converter. |
@jreback, thanks not sure which changes you are referring to, but let me know if there are any further todos wrt organizing xstrtod. |
Those changes basically revolved around adding an improved |
closing, but pls reopen if the issue persists. |
I'm reading some excel spreadsheets (xlsx format) into pandas using read_excel, which generally works great. The problem I have is that when a column contains alpha numeric ending with 'e', pandas converts these to float64 type without 'e' in its value, and I would like them to be treated as strings. After reading them in, I can convert the column to str:

dataset = read_excel(excel_file,sheet_name,na_values=[" "])
while reading this excel file to dataset it shows loan id as a number without 'e'
The text was updated successfully, but these errors were encountered: