-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New Excel changes cause an extra line to be generated in the Excel file #2396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
There's an ambiguous case: hard to distinguish an index name row and an empty first row. @locojay added an argument |
In the absence of |
Ugh, I don't have the stomach for this. Please, someone help |
I'll look into it. |
I suggest the index names always appear on the first row, even if columns.name exists. Excel is not a full-fledged serialization format anyway, It's should be allowed to be lossy Does that sound right to you, @wesm ? |
I think it's possible but just really messy.
|
I took the approach of not adding the indexnames in the first row as the HTML dumper uses the same. And when there is a MultiIndex in Column Names and Indexnames in the index its the way to go. The easiest would be to always dump on the same row as columns if the columns are not a Multindex |
I"m really not sure what the best thing to do is, I would go with:
|
then let's do 1) as I would really like to have the ability to dump MulitiIndex colums (pivot tables...) 1 woud be:
|
If you want the kitchen sink, to be really sure there is no ambiguity introduced:
egad. |
The situation is worse than it seems at first: now some files throw an exception when being saved: 1 for item in results:
----> 2 print item; results[item].to_excel(excel, sheet_name=item, na_rep="NA")
3
/usr/lib64/python2.7/site-packages/pandas-0.10.0.dev_100fb8f-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in to_excel(self, excel_writer, sheet_name, na_rep, float_format, cols, header, index, index_label, startrow, startcol)
1443 formatted_cells = formatter.get_formatted_cells()
1444 excel_writer.write_cells(formatted_cells, sheet_name,
-> 1445 startrow=startrow, startcol=startcol)
1446 if need_save:
1447 excel_writer.save()
/usr/lib64/python2.7/site-packages/pandas-0.10.0.dev_100fb8f-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in write_cells(self, cells, sheet_name, startrow, startcol)
2066 self._writecells_xlsx(cells, sheet_name, startrow, startcol)
2067 else:
-> 2068 self._writecells_xls(cells, sheet_name, startrow, startcol)
2069
2070 def _writecells_xlsx(self, cells, sheet_name, startrow, startcol):
/usr/lib64/python2.7/site-packages/pandas-0.10.0.dev_100fb8f-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in _writecells_xls(self, cells, sheet_name, startrow, startcol)
2129 wks.write(startrow + cell.row,
2130 startcol + cell.col,
-> 2131 val, style)
/usr/lib/python2.7/site-packages/xlwt/Worksheet.pyc in write(self, r, c, label, style)
1030
1031 def write(self, r, c, label="", style=Style.default_style):
-> 1032 self.row(r).write(c, label, style)
1033
1034 def write_rich_text(self, r, c, rich_text_list, style=Style.default_style):
/usr/lib/python2.7/site-packages/xlwt/Row.pyc in write(self, col, label, style)
234 self.__adjust_height(style)
235 self.__adjust_bound_col_idx(col)
--> 236 style_index = self.__parent_wb.add_style(style)
237 if isinstance(label, basestring):
238 if len(label) > 0:
/usr/lib/python2.7/site-packages/xlwt/Workbook.pyc in add_style(self, style)
301
302 def add_style(self, style):
--> 303 return self.__styles.add(style)
304
305 def add_font(self, font):
/usr/lib/python2.7/site-packages/xlwt/Style.pyc in add(self, style)
88 if style == None:
89 return 0x10
---> 90 return self._add_style(style)[1]
91
92 def _add_style(self, style):
/usr/lib/python2.7/site-packages/xlwt/Style.pyc in _add_style(self, style)
147 if xf_index >= 0xFFF:
148 # 12 bits allowed, 0xFFF is a sentinel value
--> 149 raise ValueError("More than 4094 XFs (styles)")
150
151 return xf, xf_index
ValueError: More than 4094 XFs (styles) Needless to say, they worked OK before this change. |
@cswegger : seems like a problem with xlwt. (https://groups.google.com/forum/?fromgroups=#!topic/python-excel/WtkTgE08OFk) . Do you mind uploading your data? what xlwt version are you running (0.7.4) ? |
@y-p https://github.com/locojay/pandas/tree/2396. does not require to add has_index_names as it outputs indexlabels on the same rows as columns as long as the columns are not a multiindex |
@locojay , sorry, I don't follow. is that a revised version of the feature? if so, which of the above @cswegger, if you can provide steps to reproduce we'll add a testcase for it. |
@yp: the empty first line problem is fixed in master it just requires you to do: The branch i created just does not require this extra argument as it dumps the indexlabels in the same row as the column names. So its really a matter of taste:
I do find 1 more pleasing but this would require many users to add the extra argument... |
In data giovedì 06 dicembre 2012 10:30:40, y-p ha scritto:
Would you mind if I sent it privately? It's not meant to be public. If you are Luca Beltrame - KDE Forums team |
It'd be easiest (well, for me) if you could find a small synthetic example that If that's not possible: [email protected], be warned that @wesm might like to look |
In data giovedì 06 dicembre 2012 13:28:18, y-p ha scritto:
I'll check xlwt first. In any case, I observed this behavior only for Luca Beltrame - KDE Forums team |
I was able to reproduce your problem and will open a new issue and send a pull request later |
Status on this issue? |
@cswegger problem about the large dump is solved by 2445 So its really a matter of taste:
I do find 1 more pleasing but this would require many users to add the extra argument... |
This has produced a number of regression and suggests some parsing ambiguity Probably won't be ready for 0.10.0 though, so I still think this needs to be reverted |
I'm fine with pushing it off to 0.11. Let's hack it out with the least pain possible so I can get the release out soon-- as soon as this is dealt with I can make a beta release and maybe 1 week later 0.10 final. |
hack it out as in "excise" or as in "hack, hack,hack, commit"? |
excise |
on it. |
https://github.com/y-p/pandas/tree/revert_excel rolls excel support back to before #2370. The following were reverted: 55b3911 also 428803a |
Hmm. I'll take a look-- some of that refactoring I want, e.g. moving everything excel-related to |
Even though i suggested it, I realize now that the revert is too big, and rebasing back all that work will be horrible. |
Are you saying to just pull his commits and leave things as they are now (+ fixes) for 0.10? |
if the final is a week away, yes. merge them in, and i'll kick the tires during the week. |
Are there any commits of @locojay's outside #2445? |
cp a30c847 is a fix for this extra line issue. |
that's it |
Still riddled with issues: In [1]: def roundtrip(df,header=True):
...: path = '%s.xls' % "a"
...: df.to_excel(path,header=header)
...: xf = pd.ExcelFile(path)
...: return xf.parse(xf.sheet_names[0])
In [2]:
...: #still produces empty row when column is multindex
...: df = mkdf(5,3,r_idx_nlevels=1,c_idx_nlevels=2)
...: roundtrip(df)
...:
Out[2]:
C0 C_l0_g0 C_l0_g1 C_l0_g2
0 C1 C_l1_g0 C_l1_g1 C_l1_g2
1 R0 NaN NaN NaN
2 R_l0_g0 R0C0 R0C1 R0C2
3 R_l0_g1 R1C0 R1C1 R1C2
4 R_l0_g2 R2C0 R2C1 R2C2
5 R_l0_g3 R3C0 R3C1 R3C2
6 R_l0_g4 R4C0 R4C1 R4C2
In [3]:
...: #header=False is not respected when columns is a multindex
...: #and it also adds an addition nCols-1 nans to the index
...: df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=2)
...: roundtrip(df,False)
...:
Out[3]:
C0 C_l0_g0 C_l0_g1 C_l0_g2
nan C1 C_l1_g0 C_l1_g1 C_l1_g2
nan NaN NaN NaN NaN
R0 R1 NaN NaN NaN
R_l0_g0 R_l1_g0 R0C0 R0C1 R0C2
R_l0_g1 R_l1_g1 R1C0 R1C1 R1C2
R_l0_g2 R_l1_g2 R2C0 R2C1 R2C2
R_l0_g3 R_l1_g3 R3C0 R3C1 R3C2
R_l0_g4 R_l1_g4 R4C0 R4C1 R4C2
Still riddled with issues:
In [4]:
...: # for singleton dataframe with header=False,
...: # parse misreads the data as a column label
...: # even though there is an empty row
...: # (more a difficult corner cases then a bug)
...: #and single index columns if col_index=0
...: df= pd.DataFrame([0])
...: roundtrip(df,False)
...:
Out[4]:
Empty DataFrame
Columns: []
Index: [(0.0, 0.0)]
In [6]:
...: #parse dies on singleton dataframe with header=False,
...: #and single index columns if col_index=0
...: def roundtrip(df,header=True):
...: path = '%s.xls' % "a"
...: df.to_excel(path,header=header)
...: xf = pd.ExcelFile(path)
...: return xf.parse(xf.sheet_names[0],index_col=0)
...: df= pd.DataFrame([0])
...: roundtrip(df,False)
---------------------------------------------------------------------------
/home/user1/src/pandas/pandas/io/parsers.pyc in read(self, rows)
1230 content = content[1:]
1231
-> 1232 alldata = self._rows_to_cols(content)
1233 data = self._exclude_implicit_index(alldata)
1234
/home/user1/src/pandas/pandas/io/parsers.pyc in _rows_to_cols(self, content)
1443 msg = ('Expected %d fields in line %d, saw %d' %
1444 (col_len, row_num + 1, zip_len))
-> 1445 raise ValueError(msg)
1446
1447 return zipped_content
ValueError: Expected 1 fields in line 2, saw 2 |
damn eager GH with no delete mention button. ignore 11933f1. 0e25816 is my hack and slash attempt at making things useable enough for a 0.10 release, what this does: squashes multindex columns into single row (dot-join level labels), always place row index added some roundtrip tests for combinations of nlevels on row/column. Probably sigh. |
OK, cherry picked 0e25816 |
moved to #2478 for further work in the 0.11 cycle. |
thanks |
An example:
To be more clear,
ID
is the original index name, but it is shifted down one row when saving to Excel, and then it is picked up as a regular row by the Excel parser.This kinds of complicates parsing if I'm expecting to read something generated by to_excel()...
The text was updated successfully, but these errors were encountered: