-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix for BUG: multi-index excel header fails if all numeric #11328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ok, pls add a whatsnew note in bug fixes (reference this PR number) |
ideally I'd like a round-trip test, e.g. take a csv write to excel, read it back and compare |
@@ -557,6 +557,12 @@ def test_read_excel_multiindex(self): | |||
actual = read_excel(mi_file, 'mi_column_name', header=[0,1], index_col=0) | |||
tm.assert_frame_equal(actual, expected) | |||
|
|||
# Issue #11317 | |||
expected.columns = mi.set_levels(['1','2'],level=1).set_names(['c1', 'c2']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
show the construction of expected here (I know it is probably above), but nice to have it actually here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the style in that particular function. It keeps modifying the stuff above, as multiple sheets are read in and tested for different variations.
The check failed for Python 2.7, so I have committed a change for that to parsers.py and also updated what's new for v0.17.1. |
pls squash. ping when green. |
sorry, meant to have you add a round-trip test as well. |
I've added a round trip test. Discovered an issue where the Excel file isn't properly formatted when you have a MultiIndex column and merge_cells=false. Not sure how to test for that. Submitted code and assuming the tests will kick in |
reader = ExcelFile(path) | ||
df = read_excel(reader, 'test1', header=[0,1], | ||
index_col=[0, 1], | ||
parse_dates=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you fix formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. See next commit
Seems fine to me. It's not introduced by this PR, but one inconsistency I just noticed is that |
@chris-b1 can u file an issue about your last |
frame = self.frame | ||
arrays = np.arange(len(frame.index) * 2).reshape(2, -1) | ||
new_index = MultiIndex.from_arrays(arrays, | ||
names=['first', 'second']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the provies xls* files still used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes they are. I also added some to illustrate the bug.
needs a squash |
@jreback I think I have done everything that is needed. (Note - this is my first time doing any bug fixing with pandas or any open source project for that matter!) There is one issue and I don't know how I can debug it. In the test test_to_excel_multiindex_cols, if merge_cells is False, it fails on Python 2.7 and Python 3.5 according to the Travis CI testing. What is happening is that the multiindex is made from tuples [(40,1),(40,2),(50,1),(50,2)]. When merge_cells is False, the created column names are [ "40.1", "40.2", "50.1", "50.2" ]. On Python 2.7 and 3.5, it is reading those as floats rather than strings. I did set up a Python 2.7 environment for debugging, but could not replicate. I could not set up a Python 3.5 environment for debugging (This is all on Windows). So I have put something in there that makes the tests pass, but someone who knows how to do cross-platform debugging should look at this. Thanks for your patience in letting me figure this out. |
http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-development-environment for creating an environment on windows is quite easy (for any python). As fair as the issue, maybe @chris-b1 has a thought? |
That is along the same lines as #11357. Basically the type inference would need to happen after the splitting, http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-development-environment As fair as the issue, maybe @chris-b1 https://github.com/chris-b1 has a — |
@jreback I'm unsure how to figure this out. I built a Python 2.7 environment on Linux, and can't replicate the bug. But the travis-ci build failed. Here is the part that is confusing. One of the Python 2.7 runs succeeded and the other one failed. See https://travis-ci.org/Dr-Irv/pandas/builds/86225791 . Job 5.2 is Python 2.7 and succeeded. Job 5.3 is Python 2.7 and failed. How can I replicate the failure that corresponds to Job 5.3? When I run the nosetests on test_excel.py on Python 2.7, it passes (with my next-to-last version of test_excel.py). |
Do you have all the dependencies (xlrd, xlsxwriter, openpyxl, xlst) installed locally? Might be skipping the failing test. |
@chris-b1 Yes. I even explicitly ran the test that failed (according to the log) and it passes (no skip message). I turned on the "-v" flag and see that it is run. Also, it doesn't explain why in the Travis-CI tests, one 2.7 run succeeded and the other 2.7 run failed. |
It looks like the passing 2.7 is just the "slow' tests. Try installing the specific version of openpyxl in travis https://github.com/pydata/pandas/blob/master/ci/requirements-2.7_LOCALE.run |
@chris-b1 That was what I needed. I can reproduce so I will now work on it. |
@Dr-Irv yep this is why conda works well for this, I just create a separate environment with the contents of a particular travis run - then easy to repro |
@chris-b1 @jreback So with Chris's hint, I was able to get it to fail on Linux. But on Windows, ci/requirements-2.7_LOCALE.run is not working with conda. I'm trying to hack it to get it to work. Question for either of you - how do you tell from the Travis-CI runs which ci/requirements* file corresponds to which run? |
in LOCALE should work (though setting the LOCALE itself is tricky), see |
@jreback I have figured out the issue, but am unsure what the right solution is. Here is a table that shows the combinations that succeeded and failed, which shows the Python version, whether OpenPyxl version was specified in the requirements file, which version of OpenPyxl was used, and whether the Travis-CI run succeeded or failed
What is happening is that Openpyxl versions < 2.0 convert the string '40.1' to a float before writing it to the Excel file, while Openpyxl versions > 2.0 leave the string '40.1' as '40.1' . The writer based on xlwt writes it out as a string. These strings are created when merge_cells=false, where a '.' is used as a separator to join the values of the hierarchical index. Possible solutions are Let me know which direction to go to. |
@Dr-Irv - it looks like you could add in a workaround for |
@@ -115,4 +115,11 @@ Bug Fixes | |||
- Bugs in ``to_excel`` with duplicate columns (:issue:`11007`, :issue:`10982`, :issue:`10970`) | |||
- Fixed a bug that prevented the construction of an empty series of dtype | |||
``datetime64[ns, tz]`` (:issue:`11245`). | |||
|
|||
- Bug in ``read_excel`` with multi-index containing integers (:issue:`11317`, :issue:`11328`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only list 11328
pls squash when you have make corrections |
I gave this another try, looks good to me! |
@jreback I believe I've done everything you have asked, and the tests passed. |
lgtm just need a squash
then s everything (or use f) then
|
This includes updates to 3 Excel files, plus a test in test_excel.py, plus the fix in parsers.py issue when read_html with previous fix With read_html, the fix didn't work on Python 2.7. Handle the string conversion correctly Add bug fixed to what's new Revert "Add bug fixed to what's new" This reverts commit 05b2344. Revert "issue when read_html with previous fix" This reverts commit d1bc296. Add what's new to describe bug. fix issue with original fix Added text to describe the bug. Fixed issue so that it works correctly in Python 2.7 Add round trip test Added round trip test and fixed error in writing sheets when merge_cells=false and columns have multi index DEPR: deprecate pandas.io.ga, pandas-dev#11308 DEPR: deprecate engine keyword from to_csv pandas-dev#11274 remove warnings from the tests for deprecation of engine in to_csv PERF: Checking monotonic-ness before sorting on an index pandas-dev#11080 BUG: Bug in list-like indexing with a mixed-integer Index, pandas-dev#11320 Add hex color strings test CLN: GH11271 move _get_handle, UTF encoders to io.common TST: tests for list skiprows in read_excel BUG: Fix to_dict() problem when using only datetime pandas-dev#11247 Fix a bug where to_dict() does not return Timestamp when there is only datetime dtype present. Undo change for when columns are multiindex There is still something wrong here in the format of the file when there are multiindex columns, but that's for another day Fix formatting in test_excel and remove spurious test See title BUG: bug in comparisons vs tuples, pandas-dev#11339 bug#10442 : fix, adding note and test BUG pandas-dev#10442(test) : Convert datetimelike index to strings with astype(str) BUG#10422: note added bug#10442 : tests added bug#10442 : note udated BUG pandas-dev#10442(test) : Convert datetimelike index to strings with astype(str) bug#10442: fix, adding note and test bug#10442: fix, adding note and test Adjust test so that merge_cells=False works correctly Adjust the test so that if merge_cells=false, it does a proper formatting of the columns in the single row header, and puts the row header in the first row Fix test for Python 2.7 and 3.5 The test is failing on Python 2.7 and 3.5, which appears to read in the values as floats, and I cannot replicate. So force the tests to pass by just making the column names equal when merge_cells=False Fix for openpyxl < 2, and for issue pandas-dev#11408 If using openpyxl < 2, and value is a string that could be a number, force a string to be written out. If using openpyxl >= 2.2, then fix issue pandas-dev#11408 to do with merging cells Use set_value_explicit instead of set_explicit_value set_value_explicit is in openpyxl 1.6, changed in openpyxl 1.8, but there is code in 1.8 to set set_value_explicit to set_explicit_value for compatibility Add line in whatsnew for issue 11408 ENH: added capability to handle Path/LocalPath objects, pandas-dev#11033 DOC: typo in whatsnew/0.17.1.txt PERF: Release GIL on some datetime ops BUG: Bug in DataFrame.replace with a datetime64[ns, tz] and a non-compat to_replace pandas-dev#11326 CLN: clean up internal impl of fillna/replace, xref pandas-dev#11153 PERF: fast inf checking in to_excel PERF: Series.dropna with non-nan dtypes fixed pathlib tests on windows DEPR: remove some SparsePanel deprecation warnings in testing DEPR: avoid numpy comparison to None warnings API: indexing with a null key will raise a TypeError rather than a ValueError, pandas-dev#11356 WARN: elementwise comparisons with index names, xref pandas-dev#11162 DEPR warning in io/data.py w.r.t. order->sort_values WARN: more elementwise comparisons to object WARN: more uncomparables of numeric array vs object BUG: quick fix for pandas-dev#10989 TST: add test case from Issue pandas-dev#10989 API: add _to_safe_for_reshape to allow safe insert/append with embedded CategoricalIndexes Signed-off-by: Jeff Reback <[email protected]> BLD: conda Revert "BLD: conda" This reverts commit 0c8a8e1. TST: remove invalid symbol warnings TST: move some tests to slow TST: fix some warnings filters TST: import pandas_datareader, use for tests TST: remove some deprecation warnings from imports DEPR: fix VisibleDeprecationWarnings in sparse TST: remove some warnings in test_nanops ENH: Improve the error message in to_gbq when the DataFrame schema does not match pandas-dev#11359 add libgfortran to 1.8.1 build binstar -> anaconda remove link to issue 11328 in whatsnew Fixes to document issue in code, small efficiency fix Try to resolve rebase conflict in whats new
86136fa
to
4f62b99
Compare
merged via d5a04c1 thanks! |
replaces #11317
closes #11408
This includes updates to 3 Excel files, plus a test in test_excel.py,
plus the fix in parsers.py