Skip to content

when a column contains alpha numeric ending with 'e', pandas converts these to float64 #8272

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Siva1018 opened this issue Sep 15, 2014 · 9 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO Excel read_excel, to_excel

Comments

@Siva1018
Copy link

I'm reading some excel spreadsheets (xlsx format) into pandas using read_excel, which generally works great. The problem I have is that when a column contains alpha numeric ending with 'e', pandas converts these to float64 type without 'e' in its value, and I would like them to be treated as strings. After reading them in, I can convert the column to str:
screenshot from 2014-09-15 15 26 42

dataset = read_excel(excel_file,sheet_name,na_values=[" "])
while reading this excel file to dataset it shows loan id as a number without 'e'

screenshot from 2014-09-15 15 32 01

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions IO Excel read_excel, to_excel Good as first PR labels Sep 21, 2014
@staple
Copy link
Contributor

staple commented Sep 27, 2014

Hi, I'm interested in working on this one. Would it be possible to have it assigned to me? Thanks.

@jreback
Copy link
Contributor

jreback commented Sep 27, 2014

@staple your comment is enough

to for it!

@staple
Copy link
Contributor

staple commented Sep 28, 2014

Hi, here’s a rough description of what I think is going on:

It looks like the excel parser attempts to convert strings to floats - if all strings in a column can be fully parsed as floats, that column becomes numeric. Determination of whether a string can be converted to a float is done using ‘floatify’ in parse_helper.h. This is in turn implemented using an implementation of strtod (renamed xstrtod) in the same file. From a cursory search it looks like this implementation of strtod comes from Sanos (http://www.jbox.dk/sanos/).

The issue trigging this failure is that this strtod implementation accepts a complete numeric string with a trailing ‘e’ as the representation of a floating point number. That is, ‘12e’ is parsed as the number 12.0, with no trailing characters. (Because there are no trailing characters, floatify converts the string to a float.) I compared this with the behavior of strtod in my vanilla ubuntu 12 system, and there ’12e’ is parsed as the number 12.0, but with a trailing character ‘e’. Looking at the c-99 spec it appears that in this case treating ‘e’ as a trailing character and not part of the numeric representation is the conforming behavior.

Would it make sense to consider changing the strtod implementation to treat the ‘e’ as a trailing character in this case rather than part of the numerical representation? I haven’t yet investigated how far reaching the consequences of doing this would be, but when I tested the change with a simple patch all of the tests passed.

@jreback
Copy link
Contributor

jreback commented Sep 29, 2014

@Siva1018 I think that #8212 would solve this, e.g. make the user specify a dtype for this case.

The soln of fixing strtod I suppose might work too, but seems pretty complicated for small usecase. Feel free to investigate though.

@staple
Copy link
Contributor

staple commented Oct 1, 2014

@jreback Digging a little more, it looks like there are two separate copies of xstrtod in the code (in parse_helper.h and tokenizer.c). They seem to have been copied from the same original implementation in Sanos, but in at least one case the may have been patched differently in the pandas code over time. Would it make sense to look into unifying these two xstrtods?

I don't have a strong opinion on whether the functionality requested in this issue warrants a change to xstrtod. But, overall does it make sense to change xstrtod to be "conforming" in this dangling 'e' case? (If so, maybe a separate low priority ticket could be filed for this.) From a quick look, currently the version of xstrtod in parse_helper.h is used by parser internals and ObjectBlock.convert. And the version in tokenizer.c is used for a parser.converter.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2014

cc @AmrAS1 maybe have a better idea

@staple have been some recent changes in master w.r.t strtod/xstrtod.

@staple
Copy link
Contributor

staple commented Oct 1, 2014

@jreback, thanks not sure which changes you are referring to, but let me know if there are any further todos wrt organizing xstrtod.

@mdmueller
Copy link
Contributor

Those changes basically revolved around adding an improved xstrtod called fast_xstrtod, which shouldn't have anything to do with this issue. If xstrtod's behavior is changed at all, those changes will have to be added to fast_xstrtod as well.

@jreback
Copy link
Contributor

jreback commented Oct 20, 2015

closing, but pls reopen if the issue persists.

@jreback jreback closed this as completed Oct 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests

4 participants