ENH: read_excel respect Excel text type for numbers #20828

shabie · 2018-04-26T08:28:10Z

Problem description

I am trying to read an excel file that has a column (called "raster") of numbers with a leading apostrophe (so that they may be interpreted as text by Excel) since this is one common way to maintain leading zeros for numbers. The numbers need to be always 6 digits long. Additionally some of the values in this column are missing.

The file I am using for this example can be found here.

Code Sample, a copy-pastable example if possible

df = pd.read_excel("test.xlsx", 
                   names=["raster", "benennung"],
                   sheet_name="Tabelle1",
                  )
print(df)
print(df.dtypes)

This returns:

 raster benennung
0  20099.0      Test
1  20099.0    Test 2
2      NaN    Test 3

raster       float64
benennung     object
dtype: object

When I read it without any explicit datatype declaration, the column is read with object type float64 as can be seen above and as a result leading zeros disappear. Next, when I use the fillna function to replace the NaN values and use a string, the column becomes object datatype to take this into account (as far as I understood).

df.raster = df.raster.fillna("999999")
print(df.raster)

This returns:

0     20099
1     20099
2    999999
Name: raster, dtype: object

Assuming that the column is now of type object (i.e. string), I go on to do the padding to make them back to 6 digits:

print(df.raster.str.pad(6, side="left", fillchar="0"))

This returns:

0       NaN
1       NaN
2    999999
Name: raster, dtype: object

This is the unexpected result for me.

I have intentionally not made the changes permanent (hence the print in the same line as pad).

This makes me realize that the numbers had really not been converted to strings when I replaced the NaNs with "999999" since when I try this:

print(df.raster.astype(str))

This returns another representation of the column when explicitly converted to string (and I have tested this works reliably as string later on too i.e. with padding etc.) :

0    20099.0
1    20099.0
2     999999
Name: raster, dtype: object

Bottomline: I know I could have avoided this trouble by explicitly defining datatypes at the start but since I forgot to do that and then ran into this strange behavior, I thought it is worth mentioning here. Whatever makes pandas better makes me happy since I personally like working with pandas a lottt.

TomAugspurger · 2018-04-26T11:19:03Z

No comment on why the initial read_excel read in floats instead of strings; I don't know if that's behaving correctly or not (maybe @chris-b1 knows).

As for the rest, it's three problems, any of which would fix things

Lack of integer NA, which forces the float in the first place
Lack of a dedicated string type, so df.fillna('99999') fills the missing values. But we use object dtype, which can contain a mix of strings and floats
(Maybe) some strangeness in the printing.

In [15]: df.fillna("999999")
Out[15]:
        A
0   20099
1  999999

In [16]: df.values
Out[16]:
array([[ 20099.],
       [    nan]])

We have issues for the first two. Not sure about the third. I'm not sure if Out[15] should print that like an integer or not.

chris-b1 · 2018-04-26T13:47:56Z

Same issue as #11331, but that got closed so we can use this one.

In parsing CSV unless told otherwise we always try to parse as numeric - that behavior carried over to Excel parser. I agree that by default it would make sense to respect the Excel metadata and interpret, e.g., '020099 as text.

shabie · 2018-04-26T13:55:27Z

@chris-b1 I think that'd be a very helpful and reliably sensible addition when reading excel files.

I'd love to contribute (if I can manage to do that)...

rwspielman · 2021-04-06T18:06:05Z

Is this being worked on or being tracked somewhere else? Still seeing this on 1.2.3. This is affecting an application that I have running in production and will contribute if I can.

specific pain points stemming from this issue:

integers longer than 15 digits that are stored as text in excel get converted to integer in pandas. If I immediately convert back to excel, I lose this number because excel doesn't accept integers with > 15 digits
zip codes with leading zero stored as text in excel get converted to integer and lose the leading zeroes.

jreback · 2021-04-06T18:08:27Z

@rwspielman pandas is completely a volunteer project - you are welcome to contribute

iamyojimbo · 2021-04-09T06:55:16Z

Is this an issue with Pandas? Or is this an issue with openpyxl?
To test, tried to convert the .xlsx file to a CSV using the example here: https://stackoverflow.com/a/64099529/1772238
But it also trimmed off my leading zeros.

rwspielman · 2021-04-09T15:15:59Z

@iamyojimbo Its a bug in pandas. I confirmed openpyxl will correctly bring in 'numbers stored as text' as strings. When it converts to DataFrame, it runs through maybe_convert_numeric in _libs.lib.pyx. It stays as a string datatype up until this point. By running through that function, it ignores the datatype that openpyxl brings it in as.

pandas/pandas/_libs/lib.pyx

Line 2005 in 822db7a

def maybe_convert_numeric(

I confirmed this by checking the datatype in io.excel._openpyxl in the method _convert_cell of the class OpenpyxlReader

pandas/pandas/io/excel/_openpyxl.py

Line 527 in 822db7a

def _convert_cell(self, cell, convert_float: bool) -> Scalar:

(Also I checked openpyxl directly using your s/o link)

That function maybe_convert_numeric is being run deeply embedded in pandas and I'm not sure what would be the right technique to resolve the issue as it relates to read_excel.

Solutions that I see:

Change the maybe_convert_numeric function to pass the test that I wrote here
Add a flag that will pass through to where maybe_convert_numeric is called and do not call it when reading from excel so that it will respect excel datatypes read in through openpyxl. This is probably the right solution to avoid changing too much. (Though I do think that maybe_convert_numeric losing some data with the leading zeros is a bug)

chris-b1 changed the title ~~Pandas object datatype strange behavior~~ ENH: read_excel respect Excel text type for numbers Apr 26, 2018

chris-b1 added the IO Excel read_excel, to_excel label Apr 26, 2018

mroeschke added the Bug label May 7, 2020

Polandia94 mentioned this issue Nov 23, 2022

ENH: read_excel: Add a callable to access the Cell before getting the value #49872

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: read_excel respect Excel text type for numbers #20828

ENH: read_excel respect Excel text type for numbers #20828

shabie commented Apr 26, 2018 •

edited

Loading

TomAugspurger commented Apr 26, 2018

chris-b1 commented Apr 26, 2018

shabie commented Apr 26, 2018

rwspielman commented Apr 6, 2021

jreback commented Apr 6, 2021

iamyojimbo commented Apr 9, 2021

rwspielman commented Apr 9, 2021 •

edited

Loading

ENH: read_excel respect Excel text type for numbers #20828

ENH: read_excel respect Excel text type for numbers #20828

Comments

shabie commented Apr 26, 2018 • edited Loading

Problem description

Code Sample, a copy-pastable example if possible

TomAugspurger commented Apr 26, 2018

chris-b1 commented Apr 26, 2018

shabie commented Apr 26, 2018

rwspielman commented Apr 6, 2021

jreback commented Apr 6, 2021

iamyojimbo commented Apr 9, 2021

rwspielman commented Apr 9, 2021 • edited Loading

shabie commented Apr 26, 2018 •

edited

Loading

rwspielman commented Apr 9, 2021 •

edited

Loading