-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: read_excel respect Excel text type for numbers #20828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
No comment on why the initial As for the rest, it's three problems, any of which would fix things
In [15]: df.fillna("999999")
Out[15]:
A
0 20099
1 999999
In [16]: df.values
Out[16]:
array([[ 20099.],
[ nan]]) We have issues for the first two. Not sure about the third. I'm not sure if |
Same issue as #11331, but that got closed so we can use this one. In parsing CSV unless told otherwise we always try to parse as numeric - that behavior carried over to Excel parser. I agree that by default it would make sense to respect the Excel metadata and interpret, e.g., |
@chris-b1 I think that'd be a very helpful and reliably sensible addition when reading excel files. I'd love to contribute (if I can manage to do that)... |
Is this being worked on or being tracked somewhere else? Still seeing this on 1.2.3. This is affecting an application that I have running in production and will contribute if I can. specific pain points stemming from this issue:
|
@rwspielman pandas is completely a volunteer project - you are welcome to contribute |
Is this an issue with Pandas? Or is this an issue with |
@iamyojimbo Its a bug in pandas. I confirmed openpyxl will correctly bring in 'numbers stored as text' as strings. When it converts to DataFrame, it runs through Line 2005 in 822db7a
I confirmed this by checking the datatype in pandas/pandas/io/excel/_openpyxl.py Line 527 in 822db7a
(Also I checked openpyxl directly using your s/o link) That function Solutions that I see:
|
Problem description
I am trying to read an excel file that has a column (called "raster") of numbers with a leading apostrophe (so that they may be interpreted as text by Excel) since this is one common way to maintain leading zeros for numbers. The numbers need to be always 6 digits long. Additionally some of the values in this column are missing.
The file I am using for this example can be found here.
Code Sample, a copy-pastable example if possible
This returns:
When I read it without any explicit datatype declaration, the column is read with object type float64 as can be seen above and as a result leading zeros disappear. Next, when I use the fillna function to replace the NaN values and use a string, the column becomes object datatype to take this into account (as far as I understood).
This returns:
Assuming that the column is now of type object (i.e. string), I go on to do the padding to make them back to 6 digits:
This returns:
This is the unexpected result for me.
I have intentionally not made the changes permanent (hence the print in the same line as pad).
This makes me realize that the numbers had really not been converted to strings when I replaced the NaNs with "999999" since when I try this:
This returns another representation of the column when explicitly converted to string (and I have tested this works reliably as string later on too i.e. with padding etc.) :
Bottomline: I know I could have avoided this trouble by explicitly defining datatypes at the start but since I forgot to do that and then ran into this strange behavior, I thought it is worth mentioning here. Whatever makes pandas better makes me happy since I personally like working with pandas a lottt.
The text was updated successfully, but these errors were encountered: