Q: correct behavior for read_html with rowspan/colspan for DataFrames?

No code here, just a question about the proper behavior of `read_html` when it reads an HTML table with rowspan/colspan into a DataFrame. (I'm not asking what `read_html` currently does. I'm asking what it _should_ do.)

Below is a simple HTML table that uses both colspan (a, e, i, o, u) and rowspan (Fruit, schwa, honk, and the rightmost 0 in the table). It renders identically in Chrome, Firefox, and Safari. These renderers center a spanning cell's content within its span, vertically for a rowspan and horizontally for a colspan.

<img width="446" alt="fruit_html" src="https://user-images.githubusercontent.com/976970/28595080-25038b14-7147-11e7-9e27-bb57ca9fb043.png">

Now, let's say we wanted to import this into pandas with `read_html`. It seems to me the behavior should be different for a pandas DataFrame than for a renderer:
- The header should have a MultiIndex, where the first column is `Fruit` and the second column is a combination of `a` and `Long`, etc. We don't "fill" a rowspan (the first column shouldn't be two `Fruit`s), but we do "fill" a colspan (`a` would appear in the 2nd and 3rd columns).
- The body should "fill" a rowspan or colspan with the provided values. So the rightmost column, instead of having one zero and two blanks across the three rows, should have a zero in each of the three rows. In a DataFrame context, a span within the body would presumably mean "fill in the value for each cell in the span".

If this were the case, it would imply that we treat rowspans differently in the header and in the body:

- If we "fill" a rowspan in a header, then we just repeat the header value in the MultiIndex output, which doesn't seem like what we want.
- If we don't "fill" a rowspan in the body, we leave some cells in the DataFrame blank, which also seems misguided.
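To make the proposed asymmetry concrete, here is a minimal sketch of the rule in plain Python, not how `read_html` is implemented. The function name `expand_spans`, its `fill_rowspan` and `placeholder` parameters, and the `(text, rowspan, colspan)` cell format are all hypothetical. It also assumes the label of an unfilled rowspan stays in the top row, which is itself an open choice.

```python
def expand_spans(rows, fill_rowspan=True, placeholder=""):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    A colspan is always filled. A rowspan is filled with the cell text when
    fill_rowspan is True, otherwise with `placeholder` in the rows below.
    """
    pending = {}  # (row, col) -> value reserved by a rowspan from an earlier row
    grid = []
    for r, row in enumerate(rows):
        line, col, i = [], 0, 0
        while i < len(row) or (r, col) in pending:
            if (r, col) in pending:
                # This slot is covered by a cell that started in a row above.
                line.append(pending.pop((r, col)))
                col += 1
                continue
            text, rowspan, colspan = row[i]
            i += 1
            for c in range(col, col + colspan):
                line.append(text)
                for below in range(r + 1, r + rowspan):
                    pending[(below, c)] = text if fill_rowspan else placeholder
            col += colspan
        grid.append(line)
    return grid


# A cut-down version of the fruit table (hypothetical input format).
header = [
    [("Fruit", 2, 1), ("a", 1, 2), ("honk", 2, 1)],
    [("Long", 1, 1), ("Short", 1, 1)],
]
body = [
    [("Apple", 1, 1), ("0", 1, 1), ("1", 1, 1), ("0", 3, 1)],
    [("Banana", 1, 1), ("0", 1, 1), ("3", 1, 1)],
    [("Kiwi", 1, 1), ("0", 1, 1), ("0", 1, 1)],
]

header_grid = expand_spans(header, fill_rowspan=False)  # header: colspan only
body_grid = expand_spans(body, fill_rowspan=True)       # body: fill everything
print(header_grid)
print(body_grid)
```

With `fill_rowspan=False` the header keeps a single `Fruit` and a single `honk`, while `a` is repeated over both of its columns. With `fill_rowspan=True` the trailing `0` appears in every body row.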

I put the DataFrame that I _think_ we want below. It incorporates different behavior for rowspan for header and body. One thing I don't know, though: If I don't "fill" the rowspan name for rowspan > 1, what do I put instead? `None`? empty string? `False`? What does the input to `TextParser` look like when some column names are "taller" than others?
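For illustration only, this is what the column names might look like as per-column tuples if the placeholder were an empty string. The `PLACEHOLDER` constant and the shortened header are assumptions, not a settled answer to the question above. A list of tuples like this is the shape that `pandas.MultiIndex.from_tuples` accepts.

```python
# Header rows after span expansion: colspan filled, rowspan NOT filled.
# The empty string is an assumed placeholder; None or another value
# could be swapped in here to compare the alternatives.
PLACEHOLDER = ""

header_rows = [
    ["Fruit", "a", "a", "e", "e", "schwa"],
    [PLACEHOLDER, "Long", "Short", "Long", "Short", PLACEHOLDER],
]

# Transpose the rows so that each column gets one tuple of its level names.
column_tuples = list(zip(*header_rows))
print(column_tuples)
```

The "taller" names (`Fruit`, `schwa`) end up as tuples whose second level is the placeholder, while each colspan name is paired with its `Long`/`Short` sub-column.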

Thoughts? @chris-b1?   (relevant to https://github.com/pandas-dev/pandas/issues/17054)

```
             a          e          i          o          u               
    Fruit Long Short Long Short Long Short Long Short Long Short schwa honk
0   Apple    0     1    0     0    0     0    0     0    0     0     1    0
1  Banana    0     3    0     0    0     0    0     0    0     0     0    0
2    Kiwi    0     0    2     0    0     0    0     0    0     0     0    0
```

```html
<table>
  <thead>
    <tr>
      <th rowspan=2>Fruit</th>
      <th colspan=2>a</th>
      <th colspan=2>e</th>
      <th colspan=2>i</th>
      <th colspan=2>o</th>
      <th colspan=2>u</th>
      <th rowspan=2>schwa</th>
      <th rowspan=2>honk</th>
    </tr>
    <tr>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Apple</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td rowspan=3>0</td>
    </tr>
    <tr>
      <td>Banana</td>
      <td>0</td>
      <td>3</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Kiwi</td>
      <td>0</td>
      <td>0</td>
      <td>2</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>
```
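For anyone who wants to experiment with the table above outside pandas, here is a rough sketch that pulls the span attributes out with the standard-library `html.parser`. The class name `SpanTableParser` and its `sections` attribute are hypothetical, and the sketch assumes a single, well-formed table with explicit `<thead>` and `<tbody>` sections.

```python
from html.parser import HTMLParser


class SpanTableParser(HTMLParser):
    """Collect (text, rowspan, colspan) triples per row, split by thead/tbody."""

    def __init__(self):
        super().__init__()
        self.sections = {"thead": [], "tbody": []}
        self._section = None  # name of the section currently open
        self._row = None      # cells of the row currently open
        self._cell = None     # [text_parts, rowspan, colspan] of the open cell

    def handle_starttag(self, tag, attrs):
        if tag in self.sections:
            self._section = tag
        elif tag == "tr" and self._section:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            a = dict(attrs)
            # A missing span attribute means a span of 1.
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self._row.append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.sections[self._section].append(self._row)
            self._row = None
        elif tag in self.sections:
            self._section = None


# A shortened version of the fruit table, with unquoted span attributes.
sample = """<table><thead>
<tr><th rowspan=2>Fruit</th><th colspan=2>a</th></tr>
<tr><th>Long</th><th>Short</th></tr></thead>
<tbody><tr><td>Apple</td><td>0</td><td rowspan=3>1</td></tr></tbody></table>"""

parser = SpanTableParser()
parser.feed(sample)
print(parser.sections["thead"])
print(parser.sections["tbody"])
```

Keeping the header and body rows separate is what would let a later step apply a different rowspan rule to each.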

Q: correct behavior for read_html with rowspan/colspan for DataFrames? #17073
