-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: read_excel does not convert integral floats to ints when backed by openpyxl #46988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I can also reproduce this bug with your input file. The .xlsx loads as floats but if I save it as .xls or .ods it loads as ints. Even more strange, pandas/tests/io/data/excel/test_types.* includes a column of ints, and it loads correctly and consistently for all file extensions / backends. Something funny going on here. |
@ahawryluk I opened up and compared my file with <row r="1">
<c r="A1" s="1">
<v>1.0</v>
</c>
</row>
<row r="2">
<c r="A2" s="1">
<v>2.0</v>
</c>
</row>
<row r="3">
<c r="A3" s="1">
<v>3.0</v>
</c>
</row> ...whereas <c r="A1" t="s">
<v>0</v>
</c>
<c r="B1" t="s">
<v>1</v>
</c>
<c r="C1" t="s">
<v>2</v>
</c>
<c r="D1" t="s">
<v>3</v>
</c> So, for the sake of full disclosure, my sample file was exported out of Google Sheets. I don't think that's meaningful, since the files that are failing our unit tests after upgrading pandas are almost certainly not exported from Sheets, but it is a distinction. ...Actually, the more I think about it, I'm not sure I can rule out that the files that are tripping up our tests weren't exported out of Sheets. It's quite possible an XSLX file was dropped into Drive and redownloaded before making it into our codebase. 🤷 |
@mttr Thanks for figuring that out. It makes a lot more sense now. The Microsoft standards MS-OE376 and MS-OI29500 each merely state that "if the cell contains a number, the value shall be a textual representation of a double-precision floating point number." So if Google Sheets writes a "1.0" where Excel writes a "1", I think they both meet the Microsoft XLSX standard. The openpyxl _cast_number function decides if a value should be returned as int or float: def _cast_number(value):
"Convert numbers as string to an int or float"
if "." in value or "E" in value or "e" in value:
return float(value)
return int(value) And as you correctly point out, pandas had a very similar, but not indentical, conversion step proir to #39782: val = int(cell.value)
if val == cell.value:
return val which we removed because it seemed redundant. Although we could hide the issue by reintstating the xlsx integer check in read_excel, I think this is a bug in openpyxl because
Of course, I've been wrong about many things before, so I'll wait and see what you and others think before I assume I'm right about this one. |
Here's my take on it: as far as I can tell, openpyxl does not seem to specify how integral floats should be represented. When I look at that On the other hand, pandas, in this context, takes a stance that integral floats should be treated like integers (which I happen to like, though I fully admit I'm biased here 🙂 ), and is currently behaving inconsistently across different engines. So I lean slightly in the direction of this being a pandas bug, but I can be convinced otherwise. |
@rhshadrach I'm curious what you think about this bug. My practical side says we should reintroduce an integer check in pandas for xlsx files, especially since it was our PR #39782 that revealed the behaviour described above. My purist side says we should first ask the good folks at openpyxl if this behavior suprises them. |
This was an unintended change in behavior in a minor verison; I agree it is a regression and I believe the integer check should be reintroduced for 1.x. However, for 2.0, I agree with the deprecation of convert_float. To me, the having the convert_float argument is causing read_excel to perform two separate operations: (a) read data and (b) infer dtypes. Instead, I think the pandas API should allow users accomplish each of these separately - this allows functions to not only be smaller and more maintainable, but also allows reuse. I wonder if |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
sample file: Numbers.xlsx
Issue Description
Per the documentation (see: convert_float), by default
read_excel
should convert any integral float into an integer. Here is the output from the above code using a simple xslx file with a single column of integers:As you can see, the values from that column are represented by floats instead of ints.
This appears to be a regression (this worked as expected in 1.2.4) introduced by this PR, though it's worth noting that this was made under the belief that openpyxl did this conversion already (perhaps it did at the time- I haven't looked into it yet).
Other excel engines look to be unaffected.
Expected Behavior
Installed Versions
The text was updated successfully, but these errors were encountered: