BUG: read_excel does not convert integral floats to ints when backed by openpyxl #46988

mttr · 2022-05-10T17:24:14Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
print(pd.read_excel("./Numbers.xlsx"))

sample file: Numbers.xlsx

Issue Description

Per the documentation (see: convert_float), by default read_excel should convert any integral float into an integer. Here is the output from the above code using a simple xlsx file with a single column of integers:

As you can see, the values from that column are represented by floats instead of ints.

This appears to be a regression (this worked as expected in 1.2.4) introduced by this PR, though it's worth noting that the change was made under the belief that openpyxl did this conversion already (perhaps it did at the time; I haven't looked into it yet).

Other excel engines look to be unaffected.

Expected Behavior

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 4bfe3d07b4858144c219b9346329027024102ab6
python           : 3.9.1.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.4.0
Version          : Darwin Kernel Version 21.4.0: Fri Mar 18 00:45:05 PDT 2022; root:xnu-8020.101.4~15/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.4.2
numpy            : 1.22.3
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 22.0.4
setuptools       : 49.2.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.1
IPython          : 7.24.1
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : 2.0.1
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : 3.0.9
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : 1.0.9
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : None
zstandard        : None


ahawryluk · 2022-05-13T04:02:01Z

I can also reproduce this bug with your input file. The .xlsx loads as floats but if I save it as .xls or .ods it loads as ints. Even more strange, pandas/tests/io/data/excel/test_types.* includes a column of ints, and it loads correctly and consistently for all file extensions / backends. Something funny going on here.

mttr · 2022-05-13T22:29:16Z

@ahawryluk I opened up my file and compared it with test_types.xlsx in my editor, and it appears that in mine the values are stored as floats:

    <row r="1">
      <c r="A1" s="1">
        <v>1.0</v>
      </c>
    </row>
    <row r="2">
      <c r="A2" s="1">
        <v>2.0</v>
      </c>
    </row>
    <row r="3">
      <c r="A3" s="1">
        <v>3.0</v>
      </c>
    </row>

...whereas test_types.xlsx has columns stored as ints:

      <c r="A1" t="s">
        <v>0</v>
      </c>
      <c r="B1" t="s">
        <v>1</v>
      </c>
      <c r="C1" t="s">
        <v>2</v>
      </c>
      <c r="D1" t="s">
        <v>3</v>
      </c>

So, for the sake of full disclosure, my sample file was exported out of Google Sheets. I don't think that's meaningful, since the files that are failing our unit tests after upgrading pandas are almost certainly not exported from Sheets, but it is a distinction.

...Actually, the more I think about it, I'm not sure I can rule out that the files tripping up our tests were exported out of Sheets. It's quite possible an XLSX file was dropped into Drive and redownloaded before making it into our codebase. 🤷
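For anyone who wants to check how a particular .xlsx file spells its numbers without opening it in an editor, here is a minimal sketch using only the standard library. The function name numeric_cell_texts and the default part name xl/worksheets/sheet1.xml are assumptions for illustration; other writers may name the worksheet part differently. Cells with t="s" are skipped because their <v> holds a shared-string index rather than a number.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by worksheet parts in .xlsx files.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def numeric_cell_texts(xlsx, sheet_part="xl/worksheets/sheet1.xml"):
    """Return (cell reference, raw <v> text) pairs for numeric cells.

    `xlsx` is a path or a binary file object. `sheet_part` is an assumed
    location for the first worksheet, not something guaranteed for every file.
    """
    with zipfile.ZipFile(xlsx) as archive:
        root = ET.fromstring(archive.read(sheet_part))
    cells = []
    for cell in root.iter(f"{{{MAIN_NS}}}c"):
        value = cell.find(f"{{{MAIN_NS}}}v")
        # A missing t attribute defaults to "n" (number); t="s" is a shared-string index.
        if cell.get("t", "n") == "n" and value is not None:
            cells.append((cell.get("r"), value.text))
    return cells


# Demo on a tiny in-memory archive that contains only a worksheet part.
# It is not a complete workbook; it just mirrors the XML shown above.
buffer = io.BytesIO()
sheet_xml = (
    f'<worksheet xmlns="{MAIN_NS}"><sheetData>'
    '<row r="1"><c r="A1" s="1"><v>1.0</v></c><c r="B1" t="s"><v>0</v></c></row>'
    "</sheetData></worksheet>"
)
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/worksheets/sheet1.xml", sheet_xml)

print(numeric_cell_texts(buffer))
```

Running it against a real file, for example numeric_cell_texts("./Numbers.xlsx"), should show whether the writer emitted 1 or 1.0 for integral values.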

ahawryluk · 2022-05-15T23:09:08Z

@mttr Thanks for figuring that out. It makes a lot more sense now.

The Microsoft standards MS-OE376 and MS-OI29500 each merely state that "if the cell contains a number, the value shall be a textual representation of a double-precision floating point number." So if Google Sheets writes a "1.0" where Excel writes a "1", I think they both meet the Microsoft XLSX standard.

The openpyxl _cast_number function decides if a value should be returned as int or float:

def _cast_number(value):
    "Convert numbers as string to an int or float"
    if "." in value or "E" in value or "e" in value:
        return float(value)
    return int(value)

And as you correctly point out, pandas had a very similar, but not identical, conversion step prior to #39782:

val = int(cell.value)
if val == cell.value:
    return val

which we removed because it seemed redundant.
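To make the difference between the two checks concrete, here is a small standalone sketch. The helper names cast_by_spelling and cast_by_value are made up for illustration; they only mimic the two snippets quoted above and are not the actual openpyxl or pandas code paths.

```python
def cast_by_spelling(value: str):
    """Mimic the spelling-based rule: a '.' or an exponent marker means float."""
    if "." in value or "E" in value or "e" in value:
        return float(value)
    return int(value)


def cast_by_value(number):
    """Mimic the value-based check: return an int whenever the float is integral."""
    if isinstance(number, float) and number.is_integer():
        return int(number)
    return number


for text in ["1", "1.0", "2.5"]:
    by_spelling = cast_by_spelling(text)
    by_value = cast_by_value(by_spelling)
    print(f"{text!r}: spelling -> {by_spelling!r}, then value check -> {by_value!r}")
```

With only the spelling-based rule, "1" becomes an int while "1.0" stays a float; adding the value-based check afterwards turns both into ints and leaves 2.5 alone.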

Although we could hide the issue by reinstating the xlsx integer check in read_excel, I think this is a bug in openpyxl because

openpyxl already has the integer conversion in its scope
both 1 and 1.0 are valid "textual representation[s] of a double-precision floating point number" and thus
the decision to convert to integer should be based on the value, not the spelling

Of course, I've been wrong about many things before, so I'll wait and see what you and others think before I assume I'm right about this one.

mttr · 2022-05-16T17:46:52Z

Here's my take on it: as far as I can tell, openpyxl does not seem to specify how integral floats should be represented. When I look at that _cast_number function, it appears to be a perfectly reasonable implementation if my goal is to pull a number from a cell value. (If they do have an explicit intention specified somewhere, I haven't been able to find it).

On the other hand, pandas, in this context, takes a stance that integral floats should be treated like integers (which I happen to like, though I fully admit I'm biased here 🙂 ), and is currently behaving inconsistently across different engines. So I lean slightly in the direction of this being a pandas bug, but I can be convinced otherwise.

ahawryluk · 2022-05-17T23:09:37Z

@rhshadrach I'm curious what you think about this bug. My practical side says we should reintroduce an integer check in pandas for xlsx files, especially since it was our PR #39782 that revealed the behaviour described above. My purist side says we should first ask the good folks at openpyxl if this behavior surprises them.

rhshadrach · 2022-05-19T03:17:36Z

This was an unintended change in behavior in a minor version; I agree it is a regression and I believe the integer check should be reintroduced for 1.x. However, for 2.0, I agree with the deprecation of convert_float.

To me, having the convert_float argument causes read_excel to perform two separate operations: (a) read data and (b) infer dtypes. Instead, I think the pandas API should allow users to accomplish each of these separately. This allows functions to be not only smaller and more maintainable, but also reusable.

I wonder if pd.read_excel(...).convert_dtypes() suffices here; it will downcast floats to ints if doing so doesn't change the values.
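A minimal sketch of that suggestion, using a hand-built DataFrame as a stand-in for what read_excel returned in this report (the column names are made up). One thing to keep in mind is that convert_dtypes produces the nullable Int64 extension dtype rather than NumPy int64, so it is not an exact replacement for the old convert_float behaviour.

```python
import pandas as pd

# Stand-in for the frame read_excel returned here: integral values stored as floats.
df = pd.DataFrame({"numbers": [1.0, 2.0, 3.0], "ratios": [0.5, 1.5, 2.0]})

converted = df.convert_dtypes()

# "numbers" holds only integral values, so it is converted to an integer dtype;
# "ratios" has fractional values and stays floating point.
print(converted.dtypes)
```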

mttr added the Bug and Needs Triage labels May 10, 2022

simonjayhawkins added the Regression and IO Excel labels May 12, 2022

simonjayhawkins removed the Needs Triage label May 16, 2022

simonjayhawkins added this to the 1.4.3 milestone May 19, 2022

ahawryluk mentioned this issue May 25, 2022

BUG: read_excel loading some xlsx ints as floats #47121

Merged

4 tasks

mroeschke closed this as completed in #47121 Jun 6, 2022

ahawryluk mentioned this issue Jun 6, 2022

BUG: pd.read_excel writing numbers with .0 in string column when using an Excel file downloaded from Google Sheets #46810

Closed

3 tasks

davereinhart mentioned this issue Jul 26, 2022

Unpin pandas and upgrade to latest version seattleflu/id3c#312

Merged

ng-henry mentioned this issue Nov 17, 2022

BUG: Fix large floats in Excel losing precision when converted to integer #49635

Closed

3 tasks
