Skip to content

BUG: read_excel with openpyxl and missing dimension #39486

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 5, 2021

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Jan 30, 2021

Targeted for 1.2.2 because this resolves an issue that occurs from the change of default engine from xlrd to openpyxl in 1.2.

@rhshadrach rhshadrach added Bug IO Excel read_excel, to_excel labels Jan 30, 2021
@rhshadrach rhshadrach added this to the 1.2.2 milestone Jan 30, 2021
@rhshadrach rhshadrach marked this pull request as draft January 30, 2021 19:04
@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 30, 2021

Didn't read the discussion from #39001; I'm thinking now this is probably the wrong thing to do.

This approach can be used along with calling reset_dimension to resolve #39001 without the performance hit, see #39001 (comment).

@rhshadrach rhshadrach marked this pull request as ready for review January 30, 2021 21:06
@asishm
Copy link
Contributor

asishm commented Jan 30, 2021

This can probably be a follow up issue, but running this patch on the sample.xlsx file of #38956 results in different shapes of dataframes with engine=xlrd vs engine=openpyxl. This is NOT caused by this patch however.

openpyxl generates 39 rows vs 9 rows for xlrd. The first 9 rows seem equivalent for both. The latter rows with openpyxl are all NaNs.

@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 31, 2021

Agreed @asishm - I think that is the cause for #39181 as well, and my guess is that adding a call to reset_dimension will fix as well. However, I need to test this. If it is, I think it makes sense to add them all here as they are all the same core issue (openpyxl being given no/incorrect dimension information).

@rhshadrach rhshadrach marked this pull request as draft January 31, 2021 14:30
@rhshadrach rhshadrach marked this pull request as ready for review January 31, 2021 15:16
@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 31, 2021

My previous comment was not correct - if there are empty cells then we still get null values coming through and need to trim, e.g. the following would result in a DataFrame with rows 7 through 17 np.nan. I've modified the added dimension_large.xlsx to include this phenomenon as well.

<row r="6" customFormat="false" ht="13.8" hidden="false" customHeight="false" outlineLevel="0" collapsed="false">
  <c r="A6" s="3" t="n">
    <v>3</v>
  </c>
  <c r="B6" s="4" t="n">
    <v>6</v>
  </c>
  <c r="C6" s="4" t="n">
    <v>9</v>
  </c>
</row>
<row r="16">
  <c r="B16" s="7"/>
  <c r="C16" s="7"/>
</row>
<row r="17">
  <c r="B17" s="7"/>
  <c r="C17" s="7"/>
</row>

@rhshadrach
Copy link
Member Author

I've reverted the fixes for #39181 as they are of a different nature than just dimension information and opened #39547

@rhshadrach
Copy link
Member Author

rhshadrach commented Feb 2, 2021

Current minimum version for openpyxl is 2.6.0, this patch would need 2.6.1 (reset_dimensions doesn't exist before this, accesses protected members). Is increasing the minimum version here okay to do for 1.2.2? cc @jreback @simonjayhawkins

@jreback jreback added this to the 1.2.2 milestone Feb 2, 2021
@jreback
Copy link
Contributor

jreback commented Feb 2, 2021

Current minimum version for openpyxl is 2.6.0, this patch would need 2.6.1 (reset_dimensions doesn't exist before this, accesses protected members). Is increasing the minimum version here okay to do for 1.2.2? cc @jreback @simonjayhawkins

yep that's fine (just update all the locations, install, show_versions, and put it in the whatsnew note)

@jreback
Copy link
Contributor

jreback commented Feb 3, 2021

can you rebase

@rhshadrach
Copy link
Member Author

rhshadrach commented Feb 3, 2021

@jreback I've merged master and increased the minimum version for openpyxl.

@jreback
Copy link
Contributor

jreback commented Feb 4, 2021

pandas/io/excel/_openpyxl.py:543: error: Unsupported operand types for + ("List[Union[Union[str, int, float, bool], Union[Any, Any, Any, Any]]]" and "List[str]") [operator]

:-<

@rhshadrach
Copy link
Member Author

Thanks @jreback, turns out that this needs openpyxl 3.0.0. Prior versions still suffer from not reading the full excel file even after calling reset_dimension.

@jreback
Copy link
Contributor

jreback commented Feb 4, 2021

can we easily fallback to behavior if older version of openpyxl is installed? (just because this is a minor release)

@rhshadrach
Copy link
Member Author

Yes - good idea and I think it should be easy enough. Because I don't understand what happened between openpyxl 2.6.4 and 3.0.0 (there are no release notes for the 3.0.0 that I can find), my thought is to not modify behavior if openpyxl < 3.0.0. If that is the case, then what should be done with the minimum version here? Keep at 2.6.0 until pandas 1.3 (or 2.0)?

@jreback
Copy link
Contributor

jreback commented Feb 4, 2021

Yes - good idea and I think it should be easy enough. Because I don't understand what happened between openpyxl 2.6.4 and 3.0.0 (there are no release notes for the 3.0.0 that I can find), my thought is to not modify behavior if openpyxl < 3.0.0. If that is the case, then what should be done with the minimum version here? Keep at 2.6.0 until pandas 1.3 (or 2.0)?

yes exactly so we don't actually bump the min in 1.2.2 (or just bump to 2.6.1) and just fallback (even if slow / whatever), and make the bump for 1.3

@rhshadrach
Copy link
Member Author

@jreback - changes made. One thing I noticed is that the minimum version in _optional.py was incorrect - I've updated it to 2.6.0 - uncertain if this needs a line in the whatsnew.

@jreback jreback merged commit e1a9b78 into pandas-dev:master Feb 5, 2021
@jreback
Copy link
Contributor

jreback commented Feb 5, 2021

thanks @rhshadrach

happy to have the min version bump for 1.3

@jreback
Copy link
Contributor

jreback commented Feb 5, 2021

@meeseeksdev backport 1.2.x

@lumberbot-app

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
4 participants