Skip to content

BUG: read_excel does not convert integral floats to ints when backed by openpyxl #46988

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
mttr opened this issue May 10, 2022 · 6 comments · Fixed by #47121
Closed
3 tasks done

BUG: read_excel does not convert integral floats to ints when backed by openpyxl #46988

mttr opened this issue May 10, 2022 · 6 comments · Fixed by #47121
Labels
Bug IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@mttr
Copy link

mttr commented May 10, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
print(pd.read_excel("./Numbers.xlsx"))

sample file: Numbers.xlsx

Issue Description

Per the documentation (see: convert_float), by default read_excel should convert any integral float into an integer. Here is the output from the above code using a simple xslx file with a single column of integers:

    1.0
0   2.0
1   3.0
2   4.0
3   5.0
4   6.0
5   7.0
6   8.0
7   9.0
8  10.0
9  11.0

As you can see, the values from that column are represented by floats instead of ints.

This appears to be a regression (this worked as expected in 1.2.4) introduced by this PR, though it's worth noting that this was made under the belief that openpyxl did this conversion already (perhaps it did at the time- I haven't looked into it yet).

Other excel engines look to be unaffected.

Expected Behavior

    1
0   2
1   3
2   4
3   5
4   6
5   7
6   8
7   9
8  10
9  11

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 4bfe3d07b4858144c219b9346329027024102ab6
python           : 3.9.1.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.4.0
Version          : Darwin Kernel Version 21.4.0: Fri Mar 18 00:45:05 PDT 2022; root:xnu-8020.101.4~15/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.4.2
numpy            : 1.22.3
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 22.0.4
setuptools       : 49.2.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.1
IPython          : 7.24.1
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : 2.0.1
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : 3.0.9
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : 1.0.9
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : None
zstandard        : None
@mttr mttr added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 10, 2022
@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version IO Excel read_excel, to_excel labels May 12, 2022
@ahawryluk
Copy link
Contributor

I can also reproduce this bug with your input file. The .xlsx loads as floats but if I save it as .xls or .ods it loads as ints. Even more strange, pandas/tests/io/data/excel/test_types.* includes a column of ints, and it loads correctly and consistently for all file extensions / backends. Something funny going on here.

@mttr
Copy link
Author

mttr commented May 13, 2022

@ahawryluk I opened up and compared my file with test_types.xmlx in my editor, and it appears in mine the values are stored as floats:

    <row r="1">
      <c r="A1" s="1">
        <v>1.0</v>
      </c>
    </row>
    <row r="2">
      <c r="A2" s="1">
        <v>2.0</v>
      </c>
    </row>
    <row r="3">
      <c r="A3" s="1">
        <v>3.0</v>
      </c>
    </row>

...whereas test_types.xslx has columns stored as ints:

      <c r="A1" t="s">
        <v>0</v>
      </c>
      <c r="B1" t="s">
        <v>1</v>
      </c>
      <c r="C1" t="s">
        <v>2</v>
      </c>
      <c r="D1" t="s">
        <v>3</v>
      </c>

So, for the sake of full disclosure, my sample file was exported out of Google Sheets. I don't think that's meaningful, since the files that are failing our unit tests after upgrading pandas are almost certainly not exported from Sheets, but it is a distinction.

...Actually, the more I think about it, I'm not sure I can rule out that the files that are tripping up our tests weren't exported out of Sheets. It's quite possible an XSLX file was dropped into Drive and redownloaded before making it into our codebase. 🤷

@ahawryluk
Copy link
Contributor

@mttr Thanks for figuring that out. It makes a lot more sense now.

The Microsoft standards MS-OE376 and MS-OI29500 each merely state that "if the cell contains a number, the value shall be a textual representation of a double-precision floating point number." So if Google Sheets writes a "1.0" where Excel writes a "1", I think they both meet the Microsoft XLSX standard.

The openpyxl _cast_number function decides if a value should be returned as int or float:

def _cast_number(value):
    "Convert numbers as string to an int or float"
    if "." in value or "E" in value or "e" in value:
        return float(value)
    return int(value)

And as you correctly point out, pandas had a very similar, but not indentical, conversion step proir to #39782:

val = int(cell.value)
if val == cell.value:
    return val

which we removed because it seemed redundant.

Although we could hide the issue by reintstating the xlsx integer check in read_excel, I think this is a bug in openpyxl because

  • openpyxl is already has the integer conversion in its scope
  • both 1 and 1.0 are valid "textual representation[s] of a double-precision floating point number" and thus
  • the decision to convert to integer should be based on the value, not the spelling

Of course, I've been wrong about many things before, so I'll wait and see what you and others think before I assume I'm right about this one.

@simonjayhawkins simonjayhawkins removed the Needs Triage Issue that has not been reviewed by a pandas team member label May 16, 2022
@mttr
Copy link
Author

mttr commented May 16, 2022

Here's my take on it: as far as I can tell, openpyxl does not seem to specify how integral floats should be represented. When I look at that _cast_number function, it appears to be a perfectly reasonable implementation if my goal is to pull a number from a cell value. (If they do have an explicit intention specified somewhere, I haven't been able to find it).

On the other hand, pandas, in this context, takes a stance that integral floats should be treated like integers (which I happen to like, though I fully admit I'm biased here 🙂 ), and is currently behaving inconsistently across different engines. So I lean slightly in the direction of this being a pandas bug, but I can be convinced otherwise.

@ahawryluk
Copy link
Contributor

@rhshadrach I'm curious what you think about this bug. My practical side says we should reintroduce an integer check in pandas for xlsx files, especially since it was our PR #39782 that revealed the behaviour described above. My purist side says we should first ask the good folks at openpyxl if this behavior suprises them.

@rhshadrach
Copy link
Member

rhshadrach commented May 19, 2022

This was an unintended change in behavior in a minor verison; I agree it is a regression and I believe the integer check should be reintroduced for 1.x. However, for 2.0, I agree with the deprecation of convert_float.

To me, the having the convert_float argument is causing read_excel to perform two separate operations: (a) read data and (b) infer dtypes. Instead, I think the pandas API should allow users accomplish each of these separately - this allows functions to not only be smaller and more maintainable, but also allows reuse.

I wonder if pd.read_excel(...).convert_dtypes() suffices here; it will downcast floats to ints if it doesn't result a change in value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants