read_html() doesn't handle tables with multiple header rows #13434

Closed
tdszyman opened this issue Jun 13, 2016 · 2 comments
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Milestone

Comments

tdszyman commented Jun 13, 2016

The read_html() function seems to treat every <th> in a table as a column, even if they occur in separate <tr>s. This means that it breaks even on simple tables generated by pandas' to_html() function.

Code Sample, a copy-pastable example if possible

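import pandas as pd

df = pd.DataFrame(
    columns=["Name", "Age", "Party"],
    data=[("Hillary", 68, "D"), ("Bernie", 74, "D"), ("Donald", 69, "R")])
df = df.set_index("Name")
html = df.to_html()  # writes two header rows: column names, then the index name
df2 = pd.read_html(html)[0]  # parse the generated HTML back into a DataFrame
print(df2)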

This is the value of html, generated by the to_html() function on the original data frame:

         Age  Party
Name
Hillary   68  D
Bernie    74  D
Donald    69  R
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Age</th>
      <th>Party</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Hillary</th>
      <td>68</td>
      <td>D</td>
    </tr>
...
  </tbody>
</table>

And this is the printed output of the newly-parsed dataframe df2:

  Unnamed: 0  Age Party  Name  Unnamed: 4  Unnamed: 5
0    Hillary   68     D   NaN         NaN         NaN
1     Bernie   74     D   NaN         NaN         NaN
2     Donald   69     R   NaN         NaN         NaN

What happens is that to_html() produces an HTML table with two header rows: one for the column names and one for the index name. However, the read_html() parser interprets each individual <th> cell in the header as an expected column, resulting in twice the number of columns. Even worse, this produces a column with the same name as the original index but without any data.
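The miscount described above can be reproduced without pandas. The following is only a sketch, not pandas' actual parser: it uses the standard-library XML parser, which works here only because this particular to_html() output happens to be well-formed XML. It compares counting every header <th> as a column with counting the cells inside each header row.

```python
import xml.etree.ElementTree as ET

# Header and first body row of the table from the issue. Real-world HTML is
# often not well-formed XML and would need an HTML parser instead.
HTML = """
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th><th>Age</th><th>Party</th>
    </tr>
    <tr>
      <th>Name</th><th></th><th></th>
    </tr>
  </thead>
  <tbody>
    <tr><th>Hillary</th><td>68</td><td>D</td></tr>
  </tbody>
</table>
"""

table = ET.fromstring(HTML)
thead = table.find("thead")

# Treating every <th> in the header as its own column, regardless of row.
flat_count = len(list(thead.iter("th")))

# Counting the cells within each header row separately.
per_row_counts = [len(tr.findall("th")) for tr in thead.findall("tr")]

print(flat_count)      # 6
print(per_row_counts)  # [3, 3]
```

The flat count of 6 matches the six columns in the df2 output above, while each header row on its own has the expected 3 cells.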

Expected Output

The read_html parser could either treat the multi-row header fully correctly:

         Age Party
Name              
Hillary   68     D
Bernie    74     D
Donald    69     R

Or it could just ignore any rows after the first one:

  Unnamed: 0  Age Party
0    Hillary   68     D
1     Bernie   74     D
2     Donald   69     R
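The two expected outputs above differ only in how the header rows are combined into one label per column. As a rough illustration, the hypothetical helper below (merge_header_rows is a name invented for this sketch, not a pandas function) takes the first non-empty label found in each column position, while the second option simply keeps the first header row.

```python
def merge_header_rows(header_rows):
    """Collapse several header rows into one label per column.

    For each column position, use the first non-empty label found when
    scanning the header rows from top to bottom; fall back to "".
    """
    merged = []
    for column_cells in zip(*header_rows):
        label = next((cell for cell in column_cells if cell), "")
        merged.append(label)
    return merged


# Cell text of the two <thead> rows from the issue's HTML.
header_rows = [["", "Age", "Party"], ["Name", "", ""]]

# Option 1: combine both rows, so the first column is labelled "Name".
merged = merge_header_rows(header_rows)

# Option 2: ignore every header row after the first one.
first_only = header_rows[0]

print(merged)      # ['Name', 'Age', 'Party']
print(first_only)  # ['', 'Age', 'Party']
```

With the first option, the "Name" column could then be used as the index; with the second, the empty label corresponds to the "Unnamed: 0" column shown above.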

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.1.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@tdszyman tdszyman changed the title from_html() doesn't handle tables with multiple header rows read_html() doesn't handle tables with multiple header rows Jun 13, 2016
@TomAugspurger
Contributor

Correct me if I'm wrong here, but would you be able to differentiate between HTML where the first row really is two blank strings and a table with a header spanning multiple rows? My view of read_html is that the user should expect to have a bit of cleanup work to do. But if the change to handle this case doesn't break anything and isn't too complicated, I'd say it'd be a good addition.

@TomAugspurger TomAugspurger added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Jun 14, 2016
@TomAugspurger TomAugspurger added this to the 0.19.0 milestone Jun 14, 2016
@jreback jreback modified the milestones: Next Major Release, 0.19.0 Jun 14, 2016
@tdszyman
Author

@TomAugspurger the case I'm thinking of is where the first two rows are in the <thead> part of the <table>, and the other rows are in the <tbody> part. So yes, they can clearly be distinguished from a row that is simply empty. Also, in the example I gave, every single <tr> element contains the same number of cells/columns (whether they are <th> or <td>), so there is no reason to generate a data frame with a different number of columns.
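Both points can be checked directly on the example table. This is a sketch under the same assumption as before: the standard-library XML parser is used only because this to_html() output is well-formed, and rows_in is a hypothetical helper, not part of pandas. The three body rows are taken from the data in the original code sample.

```python
import xml.etree.ElementTree as ET

HTML = """
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"><th></th><th>Age</th><th>Party</th></tr>
    <tr><th>Name</th><th></th><th></th></tr>
  </thead>
  <tbody>
    <tr><th>Hillary</th><td>68</td><td>D</td></tr>
    <tr><th>Bernie</th><td>74</td><td>D</td></tr>
    <tr><th>Donald</th><td>69</td><td>R</td></tr>
  </tbody>
</table>
"""


def rows_in(table, section):
    """Return the cell text of every row inside <thead> or <tbody>."""
    return [
        [(cell.text or "") for cell in tr]  # <th> and <td> cells alike
        for tr in table.find(section).findall("tr")
    ]


table = ET.fromstring(HTML)
header_rows = rows_in(table, "thead")  # rows that belong to the header
body_rows = rows_in(table, "tbody")    # rows that hold the data

# Every row, header or body, has the same number of cells.
widths = {len(row) for row in header_rows + body_rows}

print(header_rows)  # [['', 'Age', 'Party'], ['Name', '', '']]
print(len(body_rows), widths)  # 3 {3}
```

Because the header rows are identified by their enclosing <thead>, a mostly empty header row is never confused with an empty data row.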

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 29, 2017
mattip pushed a commit to mattip/pandas that referenced this issue Apr 3, 2017
…13434

closes pandas-dev#13434

Author: Brian
Author: S. Brian Huey

Closes pandas-dev#15242 from brianhuey/thead-improvement and squashes the following commits:

fc1c80e [S. Brian Huey] Merge branch 'master' into thead-improvement
b54aa0c [Brian] removed duplicate test case
6ae2860 [Brian] updated docstring and io.rst
41fe8cd [Brian] review changes
873ea58 [Brian] switched from range to lrange
cd70225 [Brian] ENH:read_html() handles tables with multiple header rows pandas-dev#13434