Skip to content

Fix for BUG: multi-index excel header fails if all numeric #11328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Oct 14, 2015

replaces #11317
closes #11408

This includes updates to 3 Excel files, plus a test in test_excel.py,
plus the fix in parsers.py

@jreback
Copy link
Contributor

jreback commented Oct 14, 2015

ok, pls add a whatsnew note in bug fixes (reference this PR number)

@jreback jreback added this to the 0.17.1 milestone Oct 14, 2015
@jreback
Copy link
Contributor

jreback commented Oct 14, 2015

ideally I'd like a round-trip test, e.g. take a csv write to excel, read it back and compare

@@ -557,6 +557,12 @@ def test_read_excel_multiindex(self):
actual = read_excel(mi_file, 'mi_column_name', header=[0,1], index_col=0)
tm.assert_frame_equal(actual, expected)

# Issue #11317
expected.columns = mi.set_levels(['1','2'],level=1).set_names(['c1', 'c2'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

show the construction of expected here (I know it is probably above), but nice to have it actually here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the style in that particular function. It keeps modifying the stuff above, as multiple sheets are read in and tested for different variations.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 14, 2015

The check failed for Python 2.7, so I have committed a change for that to parsers.py and also updated what's new for v0.17.1.

@Dr-Irv Dr-Irv changed the title Fix for issue #11317 Fix for BUG: multi-index excel header fails if all numeric Oct 15, 2015
@jreback
Copy link
Contributor

jreback commented Oct 15, 2015

pls squash. ping when green.

@jreback
Copy link
Contributor

jreback commented Oct 15, 2015

sorry, meant to have you add a round-trip test as well.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 16, 2015

I've added a round trip test. Discovered an issue where the Excel file isn't properly formatted when you have a MultiIndex column and merge_cells=false. Not sure how to test for that. Submitted code and assuming the tests will kick in

reader = ExcelFile(path)
df = read_excel(reader, 'test1', header=[0,1],
index_col=[0, 1],
parse_dates=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you fix formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. See next commit

@jreback
Copy link
Contributor

jreback commented Oct 17, 2015

@Dr-Irv minor fix squash

@chris-b1 can you have a look

@chris-b1
Copy link
Contributor

Seems fine to me. It's not introduced by this PR, but one inconsistency I just noticed is that read_csv seems to always converts numeric columns to strings (or maybe never converts to numeric in the first place), where read_excel will leave as numeric types.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2015

@chris-b1 can u file an issue about your last
and show an example

frame = self.frame
arrays = np.arange(len(frame.index) * 2).reshape(2, -1)
new_index = MultiIndex.from_arrays(arrays,
names=['first', 'second'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the provies xls* files still used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they are. I also added some to illustrate the bug.

@jreback
Copy link
Contributor

jreback commented Oct 18, 2015

needs a squash

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 19, 2015

@jreback I think I have done everything that is needed. (Note - this is my first time doing any bug fixing with pandas or any open source project for that matter!) There is one issue and I don't know how I can debug it. In the test test_to_excel_multiindex_cols, if merge_cells is False, it fails on Python 2.7 and Python 3.5 according to the Travis CI testing. What is happening is that the multiindex is made from tuples [(40,1),(40,2),(50,1),(50,2)]. When merge_cells is False, the created column names are [ "40.1", "40.2", "50.1", "50.2" ]. On Python 2.7 and 3.5, it is reading those as floats rather than strings. I did set up a Python 2.7 environment for debugging, but could not replicate. I could not set up a Python 3.5 environment for debugging (This is all on Windows). So I have put something in there that makes the tests pass, but someone who knows how to do cross-platform debugging should look at this.

Thanks for your patience in letting me figure this out.

@jreback
Copy link
Contributor

jreback commented Oct 19, 2015

http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-development-environment for creating an environment on windows is quite easy (for any python).

As fair as the issue, maybe @chris-b1 has a thought?

@chris-b1
Copy link
Contributor

That is along the same lines as #11357.

Basically the type inference would need to happen after the splitting,
probably worth checking how csv handles it?

http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-development-environment
for creating an environment on windows is quite easy (for any python).

As fair as the issue, maybe @chris-b1 https://github.com/chris-b1 has a
thought?


Reply to this email directly or view it on GitHub
#11328 (comment).

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 20, 2015

@jreback I'm unsure how to figure this out. I built a Python 2.7 environment on Linux, and can't replicate the bug. But the travis-ci build failed. Here is the part that is confusing. One of the Python 2.7 runs succeeded and the other one failed. See https://travis-ci.org/Dr-Irv/pandas/builds/86225791 . Job 5.2 is Python 2.7 and succeeded. Job 5.3 is Python 2.7 and failed. How can I replicate the failure that corresponds to Job 5.3? When I run the nosetests on test_excel.py on Python 2.7, it passes (with my next-to-last version of test_excel.py).

@chris-b1
Copy link
Contributor

Do you have all the dependencies (xlrd, xlsxwriter, openpyxl, xlst) installed locally? Might be skipping the failing test.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 20, 2015

@chris-b1 Yes. I even explicitly ran the test that failed (according to the log) and it passes (no skip message). I turned on the "-v" flag and see that it is run. Also, it doesn't explain why in the Travis-CI tests, one 2.7 run succeeded and the other 2.7 run failed.

@chris-b1
Copy link
Contributor

It looks like the passing 2.7 is just the "slow' tests. Try installing the specific version of openpyxl in travis openpyxl=1.6.2

https://github.com/pydata/pandas/blob/master/ci/requirements-2.7_LOCALE.run

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 20, 2015

@chris-b1 That was what I needed. I can reproduce so I will now work on it.

@jreback
Copy link
Contributor

jreback commented Oct 20, 2015

@Dr-Irv yep this is why conda works well for this, I just create a separate environment with the contents of a particular travis run - then easy to repro

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 20, 2015

@chris-b1 @jreback So with Chris's hint, I was able to get it to fail on Linux. But on Windows, ci/requirements-2.7_LOCALE.run is not working with conda. I'm trying to hack it to get it to work. Question for either of you - how do you tell from the Travis-CI runs which ci/requirements* file corresponds to which run?

@jreback
Copy link
Contributor

jreback commented Oct 20, 2015

in .travis.yml, their is a JOB_NAME tag which corresponds to the run

LOCALE should work (though setting the LOCALE itself is tricky), see ci/script.py

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 20, 2015

@jreback @chris-b1 I finally have it on Windows where it fails under a 2.7 configuration and succeeds under a 3.4 configuration. Something to do with openpyxl writer (I think). At least now I can compare things side by side to see what is going on. Thanks for the pointers.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 21, 2015

@jreback I have figured out the issue, but am unsure what the right solution is. Here is a table that shows the combinations that succeeded and failed, which shows the Python version, whether OpenPyxl version was specified in the requirements file, which version of OpenPyxl was used, and whether the Travis-CI run succeeded or failed

Python Version Openpyxl version specified Openpyxl version used Succeeded?
2.6 None 2.2.6 YES
2.7 None 2.2.6 YES
2.7 1.6.2 1.6.2 NO
3.4 None 2.0.3 YES
3.5 None 1.8.3 NO

What is happening is that Openpyxl versions < 2.0 convert the string '40.1' to a float before writing it to the Excel file, while Openpyxl versions > 2.0 leave the string '40.1' as '40.1' . The writer based on xlwt writes it out as a string.

These strings are created when merge_cells=false, where a '.' is used as a separator to join the values of the hierarchical index.

Possible solutions are
a) change the openpyxl version used in the tests, (but that is a call someone else should make)
b) change the separator used when merge_cells=False to '/' or '_'
c) change the test so that the MultiIndex used are the tuples [(40,'A'),(40,'B'),(50,'A'),(50,'B')] rather than [(40,1),(40,2),(50,1),(50,2)]
d) Skip the test if the writer is openpyxl

Let me know which direction to go to.

@chris-b1
Copy link
Contributor

@Dr-Irv - it looks like you could add in a workaround for openpyxl < 2 by using set_explicit_value - basically would call that instead of cell.value = .. when writing the un-merged columns.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 23, 2015

@jreback I think this is ready to be merged into the main branch. Also, I included a fix for #11408 so once you do the merge, you can close that issue.

@@ -115,4 +115,11 @@ Bug Fixes
- Bugs in ``to_excel`` with duplicate columns (:issue:`11007`, :issue:`10982`, :issue:`10970`)
- Fixed a bug that prevented the construction of an empty series of dtype
``datetime64[ns, tz]`` (:issue:`11245`).

- Bug in ``read_excel`` with multi-index containing integers (:issue:`11317`, :issue:`11328`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only list 11328

@jreback
Copy link
Contributor

jreback commented Oct 23, 2015

pls squash when you have make corrections

@chris-b1
Copy link
Contributor

I gave this another try, looks good to me!

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 24, 2015

@jreback I believe I've done everything you have asked, and the tests passed.

@jreback
Copy link
Contributor

jreback commented Oct 24, 2015

lgtm just need a squash

git rebase -i master

then s everything (or use f)

then

git push forkname branchname -f

This includes updates to 3 Excel files, plus a test in test_excel.py,
plus the fix in parsers.py

issue when read_html with previous fix

With read_html, the fix didn't work on Python 2.7.  Handle the string
conversion correctly

Add bug fixed to what's new

Revert "Add bug fixed to what's new"

This reverts commit 05b2344.

Revert "issue when read_html with previous fix"

This reverts commit d1bc296.

Add what's new to describe bug.  fix issue with original fix

Added text to describe the bug.
Fixed issue so that it works correctly in Python 2.7

Add round trip test

Added round trip test and fixed error in writing sheets when
merge_cells=false and columns have multi index

DEPR: deprecate pandas.io.ga, pandas-dev#11308

DEPR: deprecate engine keyword from to_csv pandas-dev#11274

remove warnings from the tests for deprecation of engine in to_csv

PERF: Checking monotonic-ness before sorting on an index pandas-dev#11080

BUG: Bug in list-like indexing with a mixed-integer Index, pandas-dev#11320

Add hex color strings test

CLN: GH11271 move _get_handle, UTF encoders to io.common

TST: tests for list skiprows in read_excel

BUG: Fix to_dict() problem when using only datetime pandas-dev#11247

Fix a bug where to_dict() does not return Timestamp when there is only
datetime dtype present.

Undo change for when columns are multiindex

There is still something wrong here in the format of the file when there
are multiindex columns, but that's for another day

Fix formatting in test_excel and remove spurious test

See title

BUG: bug in comparisons vs tuples, pandas-dev#11339

bug#10442 : fix, adding note and test

BUG pandas-dev#10442(test) : Convert datetimelike index to strings with astype(str)

BUG#10422: note added

bug#10442 : tests added

bug#10442 : note udated

BUG pandas-dev#10442(test) : Convert datetimelike index to strings with astype(str)

bug#10442: fix, adding note and test

bug#10442: fix, adding note and test

Adjust test so that merge_cells=False works correctly

Adjust the test so that if merge_cells=false, it does a proper
formatting of the columns in the single row header, and puts the row
header in the first row

Fix test for Python 2.7 and 3.5

The test is failing on Python 2.7 and 3.5, which appears to read in the
values as floats, and I cannot replicate.  So force the tests to pass by
just making the column names equal when merge_cells=False

Fix for openpyxl < 2, and for issue pandas-dev#11408

If using openpyxl < 2, and value is a string that could be a number,
force a string to be written out.  If using openpyxl >= 2.2, then fix
issue pandas-dev#11408 to do with merging cells

Use set_value_explicit instead of set_explicit_value

set_value_explicit is in openpyxl 1.6, changed in openpyxl 1.8, but
there is code in 1.8 to set set_value_explicit to set_explicit_value for
compatibility

Add line in whatsnew for issue 11408

ENH: added capability to handle Path/LocalPath objects, pandas-dev#11033

DOC: typo in whatsnew/0.17.1.txt

PERF: Release GIL on some datetime ops

BUG: Bug in DataFrame.replace with a datetime64[ns, tz] and a non-compat to_replace pandas-dev#11326

CLN: clean up internal impl of fillna/replace, xref pandas-dev#11153

PERF: fast inf checking in to_excel

PERF: Series.dropna with non-nan dtypes

fixed pathlib tests on windows

DEPR: remove some SparsePanel deprecation warnings in testing

DEPR: avoid numpy comparison to None warnings

API: indexing with a null key will raise a TypeError rather than a ValueError, pandas-dev#11356

WARN: elementwise comparisons with index names, xref pandas-dev#11162

DEPR warning in io/data.py w.r.t. order->sort_values

WARN: more elementwise comparisons to object

WARN: more uncomparables of numeric array vs object

BUG: quick fix for pandas-dev#10989

TST: add test case from Issue pandas-dev#10989

API: add _to_safe_for_reshape to allow safe insert/append with embedded CategoricalIndexes

Signed-off-by: Jeff Reback <[email protected]>

BLD: conda

Revert "BLD: conda"

This reverts commit 0c8a8e1.

TST: remove invalid symbol warnings

TST: move some tests to slow

TST: fix some warnings filters

TST: import pandas_datareader, use for tests

TST: remove some deprecation warnings from imports

DEPR: fix VisibleDeprecationWarnings in sparse

TST: remove some warnings in test_nanops

ENH: Improve the error message in to_gbq when the DataFrame schema does not match pandas-dev#11359

add libgfortran to 1.8.1 build

binstar -> anaconda

remove link to issue 11328 in whatsnew

Fixes to document issue in code, small efficiency fix

Try to resolve rebase conflict in whats new
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 24, 2015

@jreback OK, I think I did the squash correctly. If everything is OK, you should also close issue #11408

@jreback
Copy link
Contributor

jreback commented Oct 25, 2015

merged via d5a04c1

thanks!

@jreback jreback closed this Oct 25, 2015
@Dr-Irv Dr-Irv deleted the Fix-for-#11317 branch October 26, 2015 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv IO Excel read_excel, to_excel MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas exporting to Excel (xls, xlsx) with multilevel columns
3 participants