Skip to content

New Excel changes cause an extra line to be generated in the Excel file #2396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lbeltrame opened this issue Nov 30, 2012 · 39 comments
Closed
Labels
Milestone

Comments

@lbeltrame
Copy link
Contributor

An example:

In [9]: data = pandas.ExcelFile("sample.xls") # generated with df.to_excel()

In [10]: data.parse(data.sheet_names[0], index_col=0)
Out[10]: 
         Value
ID         NaN
ID1  38.625700
ID2  44.691054

To be more clear, ID is the original index name, but it is shifted down one row when saving to Excel, and then it is picked up as a regular row by the Excel parser.

This kinds of complicates parsing if I'm expecting to read something generated by to_excel()...

@ghost ghost assigned changhiskhan Nov 30, 2012
@wesm
Copy link
Member

wesm commented Dec 2, 2012

There's an ambiguous case: hard to distinguish an index name row and an empty first row. @locojay added an argument has_index_names to ExcelFile.parse that makes it explicit. A somewhat tricky situation, hmm

@wesm
Copy link
Member

wesm commented Dec 2, 2012

In the absence of columns.name, I think it would be better to put the index name on the first line. Anyone disagree?

@wesm
Copy link
Member

wesm commented Dec 2, 2012

Ugh, I don't have the stomach for this. Please, someone help

@ghost ghost self-assigned this Dec 2, 2012
@ghost
Copy link

ghost commented Dec 2, 2012

I'll look into it.

@ghost
Copy link

ghost commented Dec 3, 2012

I suggest the index names always appear on the first row, even if columns.name exists.
Being able to unambigously reconsturct the data, columns labels and index labels is the
vital bit. Other things are nice, but should not be allowed to violate that, certainly not as the
default.

Excel is not a full-fledged serialization format anyway, It's should be allowed to be lossy
with regards to some things, but not with the essential bits.
(FWIW, I don't think parsing out a multiindex from an excel sheet is a good idea either).

Does that sound right to you, @wesm ?
I can code that up if you don't want to fiddle with this. If not, it may be better to revert, and
make the decision either way in 0.11.

@changhiskhan
Copy link
Contributor

I think it's possible but just really messy.
@y-p what do you think the following:

  1. If there are explicit index_col then it's easy to check the first row
  2. If the first row has data where the implicit indices are, then also use those names as index names
  3. Otherwise treat as empty first row.

@jassinm
Copy link

jassinm commented Dec 3, 2012

I took the approach of not adding the indexnames in the first row as the HTML dumper uses the same. And when there is a MultiIndex in Column Names and Indexnames in the index its the way to go.
We could dump it on the same row as the column names if the column is not a Multindex (not supported really by the old version --> no merge ... dumps tuple).
I don't see any other way then adding an argument if one want's to have indexnames the same way as the Html dumper and beeing able to parse an empty first row with no names. Maybe using range names in excel or adding the indexnnames in the metada when dumping would not require the argument...

The easiest would be to always dump on the same row as columns if the columns are not a Multindex

@ghost
Copy link

ghost commented Dec 3, 2012

to_html() is strictly for presentation, while to_excel() needs to be read back in.
I'd take the niceties of the former only if they don't hurt the latter. Complexity has
it's costs as well. I think we're all struggling to convince ourselves we're not missing
any corner cases, I know I am.

I"m really not sure what the best thing to do is, I would go with:

  1. Do the simplest thing possible.
  2. Or, if not sure, push the change in default behaviour back a release.

@jassinm
Copy link

jassinm commented Dec 3, 2012

then let's do 1) as I would really like to have the ability to dump MulitiIndex colums (pivot tables...)

1 woud be:

  • check if columns is a MultiIndex. If not dump indexnames in same row as columnames , remove the has_idnexnames argument from the parser.

@ghost
Copy link

ghost commented Dec 5, 2012

If you want the kitchen sink, to be really sure there is no ambiguity introduced:

  1. you need to test to_exce/read_excel gets the data back without losing anything.
    and
  2. you need to make sure it looks ok
    For all combinations of
  • multindex on rows/columns/both/none.
  • name on rows index/columns/both/none.
  • Int64Index, DateTimeIndex, dtype=object index.
  • presence of NaNs in one or or more locations.
  • empty first line or not

egad.

@lbeltrame
Copy link
Contributor Author

The situation is worse than it seems at first: now some files throw an exception when being saved:

      1 for item in results:
----> 2     print item; results[item].to_excel(excel, sheet_name=item, na_rep="NA")
      3 

/usr/lib64/python2.7/site-packages/pandas-0.10.0.dev_100fb8f-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in to_excel(self, excel_writer, sheet_name, na_rep, float_format, cols, header, index, index_label, startrow, startcol)
   1443         formatted_cells = formatter.get_formatted_cells()
   1444         excel_writer.write_cells(formatted_cells, sheet_name,
-> 1445                                  startrow=startrow, startcol=startcol)
   1446         if need_save:
   1447             excel_writer.save()

/usr/lib64/python2.7/site-packages/pandas-0.10.0.dev_100fb8f-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in write_cells(self, cells, sheet_name, startrow, startcol)
   2066             self._writecells_xlsx(cells, sheet_name, startrow, startcol)
   2067         else:
-> 2068             self._writecells_xls(cells, sheet_name, startrow, startcol)
   2069 
   2070     def _writecells_xlsx(self, cells, sheet_name, startrow, startcol):

/usr/lib64/python2.7/site-packages/pandas-0.10.0.dev_100fb8f-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in _writecells_xls(self, cells, sheet_name, startrow, startcol)
   2129                 wks.write(startrow + cell.row,
   2130                           startcol + cell.col,
-> 2131                           val, style)

/usr/lib/python2.7/site-packages/xlwt/Worksheet.pyc in write(self, r, c, label, style)
   1030 
   1031     def write(self, r, c, label="", style=Style.default_style):
-> 1032         self.row(r).write(c, label, style)
   1033 
   1034     def write_rich_text(self, r, c, rich_text_list, style=Style.default_style):

/usr/lib/python2.7/site-packages/xlwt/Row.pyc in write(self, col, label, style)
    234         self.__adjust_height(style)
    235         self.__adjust_bound_col_idx(col)
--> 236         style_index = self.__parent_wb.add_style(style)
    237         if isinstance(label, basestring):
    238             if len(label) > 0:

/usr/lib/python2.7/site-packages/xlwt/Workbook.pyc in add_style(self, style)
    301 
    302     def add_style(self, style):
--> 303         return self.__styles.add(style)
    304 
    305     def add_font(self, font):

/usr/lib/python2.7/site-packages/xlwt/Style.pyc in add(self, style)
     88         if style == None:
     89             return 0x10
---> 90         return self._add_style(style)[1]
     91 
     92     def _add_style(self, style):

/usr/lib/python2.7/site-packages/xlwt/Style.pyc in _add_style(self, style)
    147         if xf_index >= 0xFFF:
    148             # 12 bits allowed, 0xFFF is a sentinel value
--> 149             raise ValueError("More than 4094 XFs (styles)")
    150 
    151         return xf, xf_index

ValueError: More than 4094 XFs (styles)

Needless to say, they worked OK before this change. excel is an ExcelWriter instance.

@jassinm
Copy link

jassinm commented Dec 6, 2012

@cswegger : seems like a problem with xlwt. (https://groups.google.com/forum/?fromgroups=#!topic/python-excel/WtkTgE08OFk) . Do you mind uploading your data? what xlwt version are you running (0.7.4) ?

@jassinm
Copy link

jassinm commented Dec 6, 2012

@y-p https://github.com/locojay/pandas/tree/2396. does not require to add has_index_names as it outputs indexlabels on the same rows as columns as long as the columns are not a multiindex

@ghost
Copy link

ghost commented Dec 6, 2012

@locojay , sorry, I don't follow. is that a revised version of the feature? if so, which of the above
scenarios are covered. specifically, does it solve the "empty first" line problem in all cases?

@cswegger, if you can provide steps to reproduce we'll add a testcase for it.

@jassinm
Copy link

jassinm commented Dec 6, 2012

@yp: the empty first line problem is fixed in master it just requires you to do:
data.parse(data.sheet_names[0], index_col=0, has_index_names=True)

The branch i created just does not require this extra argument as it dumps the indexlabels in the same row as the column names.

So its really a matter of taste:

  • index labels offseted by one row (same as Htmldumper) + require argument has_index_names
    vs
  • index labels in same row as columns + does not requires extra argument.

I do find 1 more pleasing but this would require many users to add the extra argument...

@lbeltrame
Copy link
Contributor Author

In data giovedì 06 dicembre 2012 10:30:40, y-p ha scritto:

@cswegger, if you can provide steps to reproduce we'll add a testcase for

Would you mind if I sent it privately? It's not meant to be public. If you are
OK, I'll send it by Monday.

Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79

@ghost
Copy link

ghost commented Dec 6, 2012

It'd be easiest (well, for me) if you could find a small synthetic example that
reproduces the error .I'm assuming you verified that you are using the latest version of xlwt,
as locojay suggested.

If that's not possible: [email protected], be warned that @wesm might like to look
at it as well. But I'll ask first in any case.

@lbeltrame
Copy link
Contributor Author

In data giovedì 06 dicembre 2012 13:28:18, y-p ha scritto:

It'd be easiest (well, for me) if you could find a small synthetic example
that reproduces the error .I'm assuming you verified that you are using the
latest version of xlwt, as locojay suggested.

I'll check xlwt first. In any case, I observed this behavior only for
DataFrames at least longer than 1500 rows or so.

Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79

@jassinm
Copy link

jassinm commented Dec 6, 2012

I was able to reproduce your problem and will open a new issue and send a pull request later
can you give https://github.com/locojay/pandas/tree/xlwtlarge a try it should work

@wesm
Copy link
Member

wesm commented Dec 7, 2012

Status on this issue?

@jassinm
Copy link

jassinm commented Dec 7, 2012

@cswegger problem about the large dump is solved by 2445
as for the other (first post):

So its really a matter of taste:

  • index labels offseted by one row (same as Htmldumper) + require argument has_index_names vs
  • index labels in same row as columns + does not requires extra argument.

I do find 1 more pleasing but this would require many users to add the extra argument...
for 2 i have this branch https://github.com/locojay/pandas/tree/2396.

@ghost
Copy link

ghost commented Dec 7, 2012

This has produced a number of regression and suggests some parsing ambiguity
was introduced,
I think it's best to go TDD on this issue and create tests to cover all the cases
mentioned in my comment above.
Using #2446 I'm going to write the tests that touch all the corners.

Probably won't be ready for 0.10.0 though, so I still think this needs to be reverted
and reintroduced in 0.11.0 with more confidence.

@wesm
Copy link
Member

wesm commented Dec 8, 2012

I'm fine with pushing it off to 0.11. Let's hack it out with the least pain possible so I can get the release out soon-- as soon as this is dealt with I can make a beta release and maybe 1 week later 0.10 final.

@ghost
Copy link

ghost commented Dec 8, 2012

hack it out as in "excise" or as in "hack, hack,hack, commit"?

@wesm
Copy link
Member

wesm commented Dec 8, 2012

excise

@ghost
Copy link

ghost commented Dec 8, 2012

on it.

@ghost
Copy link

ghost commented Dec 8, 2012

https://github.com/y-p/pandas/tree/revert_excel rolls excel support back to before #2370.
I tried not to break anything , but there were ~100 manual conflicts to clear, too many to
be sure. Tests pass though.

The following were reverted:

55b3911
a0bed85
ea75b4b
749318f
13080c8 - the original PR merge

also 428803a

@wesm
Copy link
Member

wesm commented Dec 8, 2012

Hmm. I'll take a look-- some of that refactoring I want, e.g. moving everything excel-related to pandas/io/tests/test_excel.py. I may just manually revert the excel reading/writing code to right before the merge and skip the new tests until they can be re-enabled at a later time

@ghost
Copy link

ghost commented Dec 8, 2012

Even though i suggested it, I realize now that the revert is too big, and rebasing back all that work will be horrible.
@locojay says he's fixed both known issues and I believe the new commits have sane defaults.
Perhaps it is best to go ahead and fix things in 0.11 along with adding MultiIndex
parsing from excel and support for names on both axis. chang had a good suggestion and I have ideas as well,
I'm sure there will be more of this in 0.11 anyway.

@wesm
Copy link
Member

wesm commented Dec 8, 2012

Are you saying to just pull his commits and leave things as they are now (+ fixes) for 0.10?

@ghost
Copy link

ghost commented Dec 8, 2012

if the final is a week away, yes. merge them in, and i'll kick the tires during the week.
barring more surprises, it'll be less painful then reverting and then rewriting everything.

@wesm
Copy link
Member

wesm commented Dec 8, 2012

Are there any commits of @locojay's outside #2445?

@ghost
Copy link

ghost commented Dec 8, 2012

cp a30c847 is a fix for this extra line issue.

@jassinm
Copy link

jassinm commented Dec 8, 2012

that's it

wesm added a commit that referenced this issue Dec 9, 2012
@wesm wesm closed this as completed Dec 9, 2012
@ghost
Copy link

ghost commented Dec 10, 2012

Still riddled with issues:

In [1]: def roundtrip(df,header=True):
   ...:     path = '%s.xls' % "a"
   ...:     df.to_excel(path,header=header)
   ...:     xf = pd.ExcelFile(path)
   ...:     return xf.parse(xf.sheet_names[0])

In [2]: 
   ...: #still produces empty row when column is multindex
   ...: df = mkdf(5,3,r_idx_nlevels=1,c_idx_nlevels=2)
   ...: roundtrip(df)
   ...: 
Out[2]: 
        C0  C_l0_g0  C_l0_g1  C_l0_g2
0       C1  C_l1_g0  C_l1_g1  C_l1_g2
1       R0      NaN      NaN      NaN
2  R_l0_g0     R0C0     R0C1     R0C2
3  R_l0_g1     R1C0     R1C1     R1C2
4  R_l0_g2     R2C0     R2C1     R2C2
5  R_l0_g3     R3C0     R3C1     R3C2
6  R_l0_g4     R4C0     R4C1     R4C2

In [3]: 
   ...: #header=False is not respected when columns is a multindex
   ...: #and it also adds an addition nCols-1 nans to the index
   ...: df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=2)
   ...: roundtrip(df,False)
   ...: 
Out[3]: 
              C0  C_l0_g0  C_l0_g1  C_l0_g2
nan           C1  C_l1_g0  C_l1_g1  C_l1_g2
nan          NaN      NaN      NaN      NaN
R0            R1      NaN      NaN      NaN
R_l0_g0  R_l1_g0     R0C0     R0C1     R0C2
R_l0_g1  R_l1_g1     R1C0     R1C1     R1C2
R_l0_g2  R_l1_g2     R2C0     R2C1     R2C2
R_l0_g3  R_l1_g3     R3C0     R3C1     R3C2
R_l0_g4  R_l1_g4     R4C0     R4C1     R4C2
Still riddled with issues:

In [4]: 
   ...: # for singleton dataframe with header=False,
   ...: # parse misreads the data as a column label
   ...: # even though there is an empty row
   ...: # (more a difficult corner cases  then a bug)
   ...: #and single index columns if col_index=0
   ...: df= pd.DataFrame([0])
   ...: roundtrip(df,False)
   ...: 
Out[4]: 
Empty DataFrame
Columns: []
Index: [(0.0, 0.0)]

In [6]: 
   ...: #parse dies on singleton dataframe with header=False,
   ...: #and single index columns if col_index=0
   ...: def roundtrip(df,header=True):
   ...:     path = '%s.xls' % "a"
   ...:     df.to_excel(path,header=header)
   ...:     xf = pd.ExcelFile(path)
   ...:     return xf.parse(xf.sheet_names[0],index_col=0)
   ...: df= pd.DataFrame([0])
   ...: roundtrip(df,False)
---------------------------------------------------------------------------
/home/user1/src/pandas/pandas/io/parsers.pyc in read(self, rows)
   1230             content = content[1:]
   1231 
-> 1232         alldata = self._rows_to_cols(content)
   1233         data = self._exclude_implicit_index(alldata)
   1234 

/home/user1/src/pandas/pandas/io/parsers.pyc in _rows_to_cols(self, content)
   1443             msg = ('Expected %d fields in line %d, saw %d' %
   1444                    (col_len, row_num + 1, zip_len))
-> 1445             raise ValueError(msg)
   1446 
   1447         return zipped_content

ValueError: Expected 1 fields in line 2, saw 2

@ghost ghost reopened this Dec 10, 2012
@ghost
Copy link

ghost commented Dec 10, 2012

damn eager GH with no delete mention button. ignore 11933f1.

0e25816 is my hack and slash attempt at making things useable enough for a 0.10 release,
but there are too many operation modes to cover what with all the arguments, and
I'm getting pretty tired of this issue.

what this does: squashes multindex columns into single row (dot-join level labels), always place row index
names in the same row of column names, respect header=False no matter the
columns index nlevels, and also supress row index names if header=False.
also supress empty first line whean header=False.

added some roundtrip tests for combinations of nlevels on row/column. Probably
haven't covered everything, so If anyone would like to do better , jump
right in there.

sigh.

@wesm
Copy link
Member

wesm commented Dec 10, 2012

OK, cherry picked 0e25816

@ghost
Copy link

ghost commented Dec 10, 2012

moved to #2478 for further work in the 0.11 cycle.

@ghost ghost closed this as completed Dec 10, 2012
@jassinm
Copy link

jassinm commented Dec 11, 2012

thanks

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants