Q: correct behavior for read_html with rowspan/colspan for DataFrames? #17073

Closed
jowens opened this issue Jul 25, 2017 · 7 comments
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@jowens

jowens commented Jul 25, 2017

No code, just a question about the proper behavior of read_html when an HTML table that uses rowspan/colspan is read into a DataFrame. (I'm not asking what read_html currently does. I'm asking what should happen.)

Below is a simple HTML table that uses both colspan (a,e,i,o,u) and rowspan (Fruit, schwa, honk, and the rightmost 0 in the table). It renders identically on each of {Chrome, Firefox, Safari}. With these renderers, both rowspans and colspans are basically rendered midway through the span, either vertically (rowspan) or horizontally (colspan).

[Image: fruit_html, the table as rendered by a browser]

Now, let's say we wanted to import this into pandas with read_html. It seems to me the behavior should be different for a pandas DataFrame than for a renderer:

  • The header should have a MultiIndex, where the first column is Fruit and the second column is a combination of a and Long, etc. We don't "fill" a rowspan (the first column shouldn't be two Fruits), but we do "fill" a colspan (a would appear in the 2nd and 3rd columns).
  • The body should "fill" a rowspan or colspan with the provided values. So the rightmost column, instead of having one zero and two blanks on the three rows, should have a zero for each of the three rows. One would think a span in a DataFrame context within a body would mean "fill in the value for each cell in the span".

If this were the case, it would imply that we treat rowspans differently in the header and in the body.

  • If we "fill" a rowspan in a header, then we just repeat the header value in the MultiIndex output, which doesn't seem like what we want.
  • If we don't "fill" a rowspan in the body, we leave some cells in the DataFrame blank, which also seems misguided.

I put the DataFrame that I think we want below. It incorporates different behavior for rowspan for header and body. One thing I don't know, though: If I don't "fill" the rowspan name for rowspan > 1, what do I put instead? None? empty string? False? What does the input to TextParser look like when some column names are "taller" than others?

Thoughts? @chris-b1? (relevant to #17054)

             a          e          i          o          u               
    Fruit Long Short Long Short Long Short Long Short Long Short schwa honk
0   Apple    0     1    0     0    0     0    0     0    0     0     1    0
1  Banana    0     3    0     0    0     0    0     0    0     0     0    0
2    Kiwi    0     0    2     0    0     0    0     0    0     0     0    0
<table>
  <thead>
    <tr>
      <th rowspan=2>Fruit</th>
      <th colspan=2>a</th>
      <th colspan=2>e</th>
      <th colspan=2>i</th>
      <th colspan=2>o</th>
      <th colspan=2>u</th>
      <th rowspan=2>schwa</th>
      <th rowspan=2>honk</th>
    </tr>
    <tr>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Apple</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td rowspan=3>0</td>
    </tr>
    <tr>
      <td>Banana</td>
      <td>0</td>
      <td>3</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Kiwi</td>
      <td>0</td>
      <td>0</td>
      <td>2</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>
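
The header/body policy proposed in this comment can be sketched in plain Python, with no pandas or HTML parser involved. This is only an illustration under stated assumptions: the `(text, rowspan, colspan)` cell representation, the function name `expand_spans`, and the `fill_rowspan` flag are hypothetical and are not read_html internals. The table is a cut-down version of the fruit example.

```python
from collections import deque


def expand_spans(rows, fill_rowspan, blank=""):
    """Expand (text, rowspan, colspan) cells into a rectangular grid of strings.

    colspan is always filled with copies of the text. rowspan is filled with
    copies only when fill_rowspan is True; otherwise the rows below get `blank`.
    """
    grid = []
    pending = {}  # column index -> (text, number of further rows still covered)
    for row in rows:
        out = []
        queue = deque(row)
        # Continue while real cells remain or a cell from above covers the next column.
        while queue or len(out) in pending:
            col = len(out)
            if col in pending:
                text, remaining = pending.pop(col)
                out.append(text if fill_rowspan else blank)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
                continue
            text, rowspan, colspan = queue.popleft()
            for _ in range(colspan):
                if rowspan > 1:
                    pending[len(out)] = (text, rowspan - 1)
                out.append(text)
        grid.append(out)
    return grid


# Hypothetical cut-down header: "Fruit" and "schwa" span two rows, "a"/"e" span two columns.
header_cells = [
    [("Fruit", 2, 1), ("a", 1, 2), ("e", 1, 2), ("schwa", 2, 1)],
    [("Long", 1, 1), ("Short", 1, 1), ("Long", 1, 1), ("Short", 1, 1)],
]
# Hypothetical cut-down body: the last "0" spans all three rows.
body_cells = [
    [("Apple", 1, 1), ("0", 1, 1), ("0", 3, 1)],
    [("Banana", 1, 1), ("3", 1, 1)],
    [("Kiwi", 1, 1), ("0", 1, 1)],
]

header = expand_spans(header_cells, fill_rowspan=False)  # leave rowspan blank in the header
body = expand_spans(body_cells, fill_rowspan=True)       # copy the value down in the body
```

With these inputs the second header row starts and ends with an empty string, while every body row ends with "0". Whether the blank should be an empty string, None, or something else is exactly the open question raised above.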
@gfyoung gfyoung added IO HTML read_html, to_html, Styler.apply, Styler.applymap Usage Question labels Jul 25, 2017
@chris-b1
Contributor

I think the philosophy with read_html is that it is a "good enough for a first pass" parser, not necessarily that it will handle every messy real life table.

That said, the behavior you proposed seems reasonable. In other parsers, "Fruit" would be placed at the top level, with an unnamed second level. The TextParser logic should already handle this for you.

In [18]: pd.read_csv(StringIO("""
    ...: Fruit,a,a,e,e
    ...: ,Long,Short,Long,Short
    ...: Apple,0,1,0
    ...: Banana,0,3,0"""), header=[0,1])
Out[18]: 
               Fruit    a          e      
  Unnamed: 0_level_1 Long Short Long Short
0              Apple    0     1    0   NaN
1             Banana    0     3    0   NaN

@jowens
Author

jowens commented Jul 25, 2017

OK. I can match the behavior of the other parsers. "Unnamed second level" evidently means "empty string".
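
A small sketch of what that looks like from the read_csv side, assuming pandas is installed. An empty cell in the second header row comes back with an "Unnamed: ..." label, and the caller can map it back to an empty string afterwards. The sample data and the renaming step are illustrative assumptions, not something read_html does for you.

```python
from io import StringIO

import pandas as pd

# Two header rows; the cell under "Fruit" is deliberately left empty.
csv_text = (
    "Fruit,a,a,e,e\n"
    ",Long,Short,Long,Short\n"
    "Apple,0,1,0,0\n"
    "Banana,0,3,0,0\n"
)

df = pd.read_csv(StringIO(csv_text), header=[0, 1])
parsed_columns = list(df.columns)  # the empty cell shows up as an "Unnamed: ..." label

# Optional cleanup: turn the placeholder label back into an empty string.
df.columns = pd.MultiIndex.from_tuples(
    [
        (top, "" if str(sub).startswith("Unnamed:") else sub)
        for top, sub in parsed_columns
    ]
)
```

After the cleanup, the first column is addressed as `("Fruit", "")` and the rest as `("a", "Long")`, `("a", "Short")`, and so on.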

@gfyoung
Member

gfyoung commented Jul 25, 2017

Judging from the previous conversation, I am changing the labeling. PR is welcome!

@jowens
Author

jowens commented Jul 25, 2017

If you'd like, @gfyoung, just close this in favor of #17054.

@gfyoung
Member

gfyoung commented Jul 25, 2017

> If you'd like, @gfyoung, just close this in favor of #17054.

@jowens: Are you planning to address this discussion in that issue, then?

@jowens
Author

jowens commented Jul 25, 2017

It'll be implemented as discussed in this issue when I submit a pull request for #17054.

@gfyoung
Member

gfyoung commented Jul 25, 2017

Okay, sounds good. Closing in favor of that issue then.

@gfyoung gfyoung closed this as completed Jul 25, 2017
@gfyoung gfyoung added this to the No action milestone Jul 25, 2017
adamhooper added a commit to adamhooper/pandas that referenced this issue Jun 14, 2018
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
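
To illustrate the second bullet: when every virtual cell is a copy of the real cell's text, each column ends up with a real label at every header level, so there is no empty cell left to be named "Unnamed: ...". A minimal sketch with hypothetical, already-expanded header rows (plain Python, not the parser's actual code):

```python
# Hypothetical header rows after rowspan/colspan expansion with "always copy":
# "Fruit" and "schwa" are repeated down their rowspan, "a" across its colspan.
header_rows = [
    ["Fruit", "a", "a", "schwa"],
    ["Fruit", "Long", "Short", "schwa"],
]

# One tuple per column, one entry per header level.
column_tuples = list(zip(*header_rows))

# No level is blank, so nothing would need a placeholder name.
has_blank_label = any(label == "" for tup in column_tuples for label in tup)
```

The trade-off is that a rowspan header such as "Fruit" appears at both levels, i.e. as `("Fruit", "Fruit")`, rather than with an empty second level as proposed earlier in the thread.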
adamhooper added a commit to adamhooper/pandas that referenced this issue Jun 26, 2018
adamhooper added a commit to adamhooper/pandas that referenced this issue Jun 27, 2018