Fix for BUG: multi-index excel header fails if all numeric #11328

Dr-Irv · 2015-10-14T16:15:06Z

replaces #11317
closes #11408

This includes updates to 3 Excel files, plus a test in test_excel.py,
plus the fix in parsers.py

jreback · 2015-10-14T18:52:44Z

ok, pls add a whatsnew note in bug fixes (reference this PR number)

jreback · 2015-10-14T18:53:30Z

ideally I'd like a round-trip test, e.g. take a csv write to excel, read it back and compare

jreback · 2015-10-14T18:54:17Z

pandas/io/tests/test_excel.py

@@ -557,6 +557,12 @@ def test_read_excel_multiindex(self):
        actual = read_excel(mi_file, 'mi_column_name', header=[0,1], index_col=0)
        tm.assert_frame_equal(actual, expected)

+        # Issue #11317
+        expected.columns = mi.set_levels(['1','2'],level=1).set_names(['c1', 'c2'])


show the construction of expected here (I know it is probably above), but nice to have it actually here

I followed the style in that particular function. It keeps modifying the stuff above, as multiple sheets are read in and tested for different variations.

Dr-Irv · 2015-10-14T20:40:11Z

The check failed for Python 2.7, so I have committed a change for that to parsers.py and also updated what's new for v0.17.1.

jreback · 2015-10-15T22:22:28Z

pls squash. ping when green.

jreback · 2015-10-15T22:23:12Z

sorry, meant to have you add a round-trip test as well.

Dr-Irv · 2015-10-16T19:56:08Z

I've added a round trip test. Discovered an issue where the Excel file isn't properly formatted when you have a MultiIndex column and merge_cells=false. Not sure how to test for that. Submitted code and assuming the tests will kick in

jreback · 2015-10-17T15:12:59Z

pandas/io/tests/test_excel.py

+            reader = ExcelFile(path)
+            df = read_excel(reader, 'test1', header=[0,1],
+                             index_col=[0, 1],
+                              parse_dates=False)


can you fix formatting

Done. See next commit

jreback · 2015-10-17T15:14:45Z

@Dr-Irv minor fix squash

@chris-b1 can you have a look

chris-b1 · 2015-10-17T16:00:57Z

Seems fine to me. It's not introduced by this PR, but one inconsistency I just noticed is that read_csv seems to always converts numeric columns to strings (or maybe never converts to numeric in the first place), where read_excel will leave as numeric types.

jreback · 2015-10-17T16:55:09Z

@chris-b1 can u file an issue about your last
and show an example

jreback · 2015-10-18T14:01:06Z

pandas/io/tests/test_excel.py

+        frame = self.frame
+        arrays = np.arange(len(frame.index) * 2).reshape(2, -1)
+        new_index = MultiIndex.from_arrays(arrays,
+                                           names=['first', 'second'])


are the provies xls* files still used?

Yes they are. I also added some to illustrate the bug.

jreback · 2015-10-18T14:01:14Z

needs a squash

Dr-Irv · 2015-10-19T21:29:38Z

@jreback I think I have done everything that is needed. (Note - this is my first time doing any bug fixing with pandas or any open source project for that matter!) There is one issue and I don't know how I can debug it. In the test test_to_excel_multiindex_cols, if merge_cells is False, it fails on Python 2.7 and Python 3.5 according to the Travis CI testing. What is happening is that the multiindex is made from tuples [(40,1),(40,2),(50,1),(50,2)]. When merge_cells is False, the created column names are [ "40.1", "40.2", "50.1", "50.2" ]. On Python 2.7 and 3.5, it is reading those as floats rather than strings. I did set up a Python 2.7 environment for debugging, but could not replicate. I could not set up a Python 3.5 environment for debugging (This is all on Windows). So I have put something in there that makes the tests pass, but someone who knows how to do cross-platform debugging should look at this.

Thanks for your patience in letting me figure this out.

jreback · 2015-10-19T21:34:04Z

http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-development-environment for creating an environment on windows is quite easy (for any python).

As fair as the issue, maybe @chris-b1 has a thought?

chris-b1 · 2015-10-19T21:59:57Z

That is along the same lines as #11357.

Basically the type inference would need to happen after the splitting,
probably worth checking how csv handles it?

http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-development-environment
for creating an environment on windows is quite easy (for any python).

As fair as the issue, maybe @chris-b1 https://github.com/chris-b1 has a
thought?

—
Reply to this email directly or view it on GitHub
#11328 (comment).

Dr-Irv · 2015-10-20T17:53:05Z

@jreback I'm unsure how to figure this out. I built a Python 2.7 environment on Linux, and can't replicate the bug. But the travis-ci build failed. Here is the part that is confusing. One of the Python 2.7 runs succeeded and the other one failed. See https://travis-ci.org/Dr-Irv/pandas/builds/86225791 . Job 5.2 is Python 2.7 and succeeded. Job 5.3 is Python 2.7 and failed. How can I replicate the failure that corresponds to Job 5.3? When I run the nosetests on test_excel.py on Python 2.7, it passes (with my next-to-last version of test_excel.py).

chris-b1 · 2015-10-20T18:37:32Z

Do you have all the dependencies (xlrd, xlsxwriter, openpyxl, xlst) installed locally? Might be skipping the failing test.

Dr-Irv · 2015-10-20T18:54:10Z

@chris-b1 Yes. I even explicitly ran the test that failed (according to the log) and it passes (no skip message). I turned on the "-v" flag and see that it is run. Also, it doesn't explain why in the Travis-CI tests, one 2.7 run succeeded and the other 2.7 run failed.

chris-b1 · 2015-10-20T19:03:07Z

It looks like the passing 2.7 is just the "slow' tests. Try installing the specific version of openpyxl in travis openpyxl=1.6.2

https://github.com/pydata/pandas/blob/master/ci/requirements-2.7_LOCALE.run

Dr-Irv · 2015-10-20T19:25:09Z

@chris-b1 That was what I needed. I can reproduce so I will now work on it.

jreback · 2015-10-20T19:42:02Z

@Dr-Irv yep this is why conda works well for this, I just create a separate environment with the contents of a particular travis run - then easy to repro

Dr-Irv · 2015-10-20T19:59:03Z

@chris-b1 @jreback So with Chris's hint, I was able to get it to fail on Linux. But on Windows, ci/requirements-2.7_LOCALE.run is not working with conda. I'm trying to hack it to get it to work. Question for either of you - how do you tell from the Travis-CI runs which ci/requirements* file corresponds to which run?

jreback · 2015-10-20T20:02:07Z

in .travis.yml, their is a JOB_NAME tag which corresponds to the run

LOCALE should work (though setting the LOCALE itself is tricky), see ci/script.py

Dr-Irv · 2015-10-20T21:08:26Z

@jreback @chris-b1 I finally have it on Windows where it fails under a 2.7 configuration and succeeds under a 3.4 configuration. Something to do with openpyxl writer (I think). At least now I can compare things side by side to see what is going on. Thanks for the pointers.

Dr-Irv · 2015-10-21T13:42:50Z

@jreback I have figured out the issue, but am unsure what the right solution is. Here is a table that shows the combinations that succeeded and failed, which shows the Python version, whether OpenPyxl version was specified in the requirements file, which version of OpenPyxl was used, and whether the Travis-CI run succeeded or failed

Python Version	Openpyxl version specified	Openpyxl version used	Succeeded?
2.6	None	2.2.6	YES
2.7	None	2.2.6	YES
2.7	1.6.2	1.6.2	NO
3.4	None	2.0.3	YES
3.5	None	1.8.3	NO

What is happening is that Openpyxl versions < 2.0 convert the string '40.1' to a float before writing it to the Excel file, while Openpyxl versions > 2.0 leave the string '40.1' as '40.1' . The writer based on xlwt writes it out as a string.

These strings are created when merge_cells=false, where a '.' is used as a separator to join the values of the hierarchical index.

Possible solutions are
a) change the openpyxl version used in the tests, (but that is a call someone else should make)
b) change the separator used when merge_cells=False to '/' or '_'
c) change the test so that the MultiIndex used are the tuples [(40,'A'),(40,'B'),(50,'A'),(50,'B')] rather than [(40,1),(40,2),(50,1),(50,2)]
d) Skip the test if the writer is openpyxl

Let me know which direction to go to.

chris-b1 · 2015-10-22T17:56:29Z

@Dr-Irv - it looks like you could add in a workaround for openpyxl < 2 by using set_explicit_value - basically would call that instead of cell.value = .. when writing the un-merged columns.

Dr-Irv · 2015-10-23T16:43:10Z

@jreback I think this is ready to be merged into the main branch. Also, I included a fix for #11408 so once you do the merge, you can close that issue.

jreback · 2015-10-23T16:45:21Z

doc/source/whatsnew/v0.17.1.txt

@@ -115,4 +115,11 @@ Bug Fixes
 - Bugs in ``to_excel`` with duplicate columns (:issue:`11007`, :issue:`10982`, :issue:`10970`)
 - Fixed a bug that prevented the construction of an empty series of dtype
  ``datetime64[ns, tz]`` (:issue:`11245`).
+
+- Bug in ``read_excel`` with multi-index containing integers (:issue:`11317`, :issue:`11328`)


only list 11328

jreback · 2015-10-23T16:49:17Z

pls squash when you have make corrections

chris-b1 · 2015-10-24T00:09:01Z

I gave this another try, looks good to me!

Dr-Irv · 2015-10-24T01:50:30Z

@jreback I believe I've done everything you have asked, and the tests passed.

jreback · 2015-10-24T01:56:09Z

lgtm just need a squash

git rebase -i master

then s everything (or use f)

then

git push forkname branchname -f

This includes updates to 3 Excel files, plus a test in test_excel.py, plus the fix in parsers.py issue when read_html with previous fix With read_html, the fix didn't work on Python 2.7. Handle the string conversion correctly Add bug fixed to what's new Revert "Add bug fixed to what's new" This reverts commit 05b2344. Revert "issue when read_html with previous fix" This reverts commit d1bc296. Add what's new to describe bug. fix issue with original fix Added text to describe the bug. Fixed issue so that it works correctly in Python 2.7 Add round trip test Added round trip test and fixed error in writing sheets when merge_cells=false and columns have multi index DEPR: deprecate pandas.io.ga, pandas-dev#11308 DEPR: deprecate engine keyword from to_csv pandas-dev#11274 remove warnings from the tests for deprecation of engine in to_csv PERF: Checking monotonic-ness before sorting on an index pandas-dev#11080 BUG: Bug in list-like indexing with a mixed-integer Index, pandas-dev#11320 Add hex color strings test CLN: GH11271 move _get_handle, UTF encoders to io.common TST: tests for list skiprows in read_excel BUG: Fix to_dict() problem when using only datetime pandas-dev#11247 Fix a bug where to_dict() does not return Timestamp when there is only datetime dtype present. Undo change for when columns are multiindex There is still something wrong here in the format of the file when there are multiindex columns, but that's for another day Fix formatting in test_excel and remove spurious test See title BUG: bug in comparisons vs tuples, pandas-dev#11339 bug#10442 : fix, adding note and test BUG pandas-dev#10442(test) : Convert datetimelike index to strings with astype(str) BUG#10422: note added bug#10442 : tests added bug#10442 : note udated BUG pandas-dev#10442(test) : Convert datetimelike index to strings with astype(str) bug#10442: fix, adding note and test bug#10442: fix, adding note and test Adjust test so that merge_cells=False works correctly Adjust the test so that if merge_cells=false, it does a proper formatting of the columns in the single row header, and puts the row header in the first row Fix test for Python 2.7 and 3.5 The test is failing on Python 2.7 and 3.5, which appears to read in the values as floats, and I cannot replicate. So force the tests to pass by just making the column names equal when merge_cells=False Fix for openpyxl < 2, and for issue pandas-dev#11408 If using openpyxl < 2, and value is a string that could be a number, force a string to be written out. If using openpyxl >= 2.2, then fix issue pandas-dev#11408 to do with merging cells Use set_value_explicit instead of set_explicit_value set_value_explicit is in openpyxl 1.6, changed in openpyxl 1.8, but there is code in 1.8 to set set_value_explicit to set_explicit_value for compatibility Add line in whatsnew for issue 11408 ENH: added capability to handle Path/LocalPath objects, pandas-dev#11033 DOC: typo in whatsnew/0.17.1.txt PERF: Release GIL on some datetime ops BUG: Bug in DataFrame.replace with a datetime64[ns, tz] and a non-compat to_replace pandas-dev#11326 CLN: clean up internal impl of fillna/replace, xref pandas-dev#11153 PERF: fast inf checking in to_excel PERF: Series.dropna with non-nan dtypes fixed pathlib tests on windows DEPR: remove some SparsePanel deprecation warnings in testing DEPR: avoid numpy comparison to None warnings API: indexing with a null key will raise a TypeError rather than a ValueError, pandas-dev#11356 WARN: elementwise comparisons with index names, xref pandas-dev#11162 DEPR warning in io/data.py w.r.t. order->sort_values WARN: more elementwise comparisons to object WARN: more uncomparables of numeric array vs object BUG: quick fix for pandas-dev#10989 TST: add test case from Issue pandas-dev#10989 API: add _to_safe_for_reshape to allow safe insert/append with embedded CategoricalIndexes Signed-off-by: Jeff Reback <[email protected]> BLD: conda Revert "BLD: conda" This reverts commit 0c8a8e1. TST: remove invalid symbol warnings TST: move some tests to slow TST: fix some warnings filters TST: import pandas_datareader, use for tests TST: remove some deprecation warnings from imports DEPR: fix VisibleDeprecationWarnings in sparse TST: remove some warnings in test_nanops ENH: Improve the error message in to_gbq when the DataFrame schema does not match pandas-dev#11359 add libgfortran to 1.8.1 build binstar -> anaconda remove link to issue 11328 in whatsnew Fixes to document issue in code, small efficiency fix Try to resolve rebase conflict in whats new

Dr-Irv · 2015-10-24T21:24:35Z

@jreback OK, I think I did the squash correctly. If everything is OK, you should also close issue #11408

jreback · 2015-10-25T14:02:58Z

merged via d5a04c1

thanks!

Dr-Irv mentioned this pull request Oct 14, 2015

BUG: multi-index excel header fails if all numeric #11317

Closed

jreback added Bug IO CSV read_csv, to_csv IO Excel read_excel, to_excel MultiIndex labels Oct 14, 2015

jreback added this to the 0.17.1 milestone Oct 14, 2015

jreback reviewed Oct 14, 2015
View reviewed changes

Dr-Irv changed the title ~~Fix for issue #11317~~ Fix for BUG: multi-index excel header fails if all numeric Oct 15, 2015

jreback reviewed Oct 17, 2015
View reviewed changes

chris-b1 mentioned this pull request Oct 17, 2015

BUG? Consistency of column parsing #11357

Open

jreback reviewed Oct 18, 2015
View reviewed changes

Dr-Irv mentioned this pull request Oct 22, 2015

Pandas exporting to Excel (xls, xlsx) with multilevel columns #11408

Closed

jreback reviewed Oct 23, 2015
View reviewed changes

Dr-Irv force-pushed the Fix-for-#11317 branch from 86136fa to 4f62b99 Compare October 24, 2015 21:21

jreback closed this Oct 25, 2015

Dr-Irv deleted the Fix-for-#11317 branch October 26, 2015 14:00

Fix for BUG: multi-index excel header fails if all numeric #11328

Fix for BUG: multi-index excel header fails if all numeric #11328

Conversation

Dr-Irv commented Oct 14, 2015

jreback commented Oct 14, 2015

jreback commented Oct 14, 2015

jreback Oct 14, 2015

Choose a reason for hiding this comment

Dr-Irv Oct 14, 2015

Choose a reason for hiding this comment

Dr-Irv commented Oct 14, 2015

jreback commented Oct 15, 2015

jreback commented Oct 15, 2015

Dr-Irv commented Oct 16, 2015

jreback Oct 17, 2015

Choose a reason for hiding this comment

Dr-Irv Oct 17, 2015

Choose a reason for hiding this comment

jreback commented Oct 17, 2015

chris-b1 commented Oct 17, 2015

jreback commented Oct 17, 2015

jreback Oct 18, 2015

Choose a reason for hiding this comment

Dr-Irv Oct 19, 2015

Choose a reason for hiding this comment

jreback commented Oct 18, 2015

Dr-Irv commented Oct 19, 2015

jreback commented Oct 19, 2015

chris-b1 commented Oct 19, 2015

Dr-Irv commented Oct 20, 2015

chris-b1 commented Oct 20, 2015

Dr-Irv commented Oct 20, 2015

chris-b1 commented Oct 20, 2015

Dr-Irv commented Oct 20, 2015

jreback commented Oct 20, 2015

Dr-Irv commented Oct 20, 2015

jreback commented Oct 20, 2015

Dr-Irv commented Oct 20, 2015

Dr-Irv commented Oct 21, 2015

chris-b1 commented Oct 22, 2015

Dr-Irv commented Oct 23, 2015

jreback Oct 23, 2015

Choose a reason for hiding this comment

jreback commented Oct 23, 2015

chris-b1 commented Oct 24, 2015

Dr-Irv commented Oct 24, 2015

jreback commented Oct 24, 2015

Dr-Irv commented Oct 24, 2015

jreback commented Oct 25, 2015