read_html() doesn't handle tables with multiple header rows #13434
Comments
Correct me if I'm wrong here... would you be able to differentiate between HTML where the first row really is two blank strings, and a table with a header spanning multiple rows? My thoughts are |
@TomAugspurger the case I'm thinking of is where the first two rows are in the |
…13434
closes pandas-dev#13434
Author: Brian <[email protected]>
Author: S. Brian Huey <[email protected]>
Closes pandas-dev#15242 from brianhuey/thead-improvement and squashes the following commits:
fc1c80e [S. Brian Huey] Merge branch 'master' into thead-improvement
b54aa0c [Brian] removed duplicate test case
6ae2860 [Brian] updated docstring and io.rst
41fe8cd [Brian] review changes
873ea58 [Brian] switched from range to lrange
cd70225 [Brian] ENH:read_html() handles tables with multiple header rows pandas-dev#13434
The read_html() function seems to treat every <th> in a table as a column, even if they occur in separate <tr>s. This means that it breaks even on simple tables generated by pandas' to_html() function.
Code Sample, a copy-pastable example if possible
This is the value of html, generated by the to_html() function on the original data frame:
And this is the printed output of the newly-parsed dataframe df2:
What happens is that the to_html() function produces an HTML table with two header rows, one for the column names and one with the index name. However, the read_html() parser interprets each individual th cell as an expected column, resulting in twice the number of columns. Even worse, this produces a column with the same name as the original index but without any data.
Expected Output
The read_html parser could either treat the multi-row header fully correctly:
Or it could just ignore any rows after the first one:
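As a rough sketch of the two options, here is a standard-library-only illustration. It assumes a hypothetical table shaped like the two-row header described above: the HTML string, the HeaderRows class, and the variable names are all made up for illustration and are not pandas code or the data from this report. It groups th cells by their tr, then contrasts that with flattening every th into its own column.

```python
from html.parser import HTMLParser

# Hypothetical table (not the data from this report): one header row for
# the column names and a second header row carrying the index name.
HTML = """
<table>
  <thead>
    <tr><th></th><th>a</th><th>b</th></tr>
    <tr><th>idx</th><th></th><th></th></tr>
  </thead>
  <tbody>
    <tr><th>x</th><td>1</td><td>2</td></tr>
    <tr><th>y</th><td>3</td><td>4</td></tr>
  </tbody>
</table>
"""


class HeaderRows(HTMLParser):
    """Collect the text of <th> cells inside <thead>, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_thead = False
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "tr" and self._in_thead:
            self.rows.append([])
        elif tag == "th" and self._in_thead:
            self._in_th = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.rows[-1][-1] += data.strip()


parser = HeaderRows()
parser.feed(HTML)
header_rows = parser.rows  # [['', 'a', 'b'], ['idx', '', '']]

# Treating every <th> as its own column doubles the column count (6, not 3).
flattened = [cell for row in header_rows for cell in row]

# Option 1: keep both header rows, one tuple of labels per column.
multi_row = list(zip(*header_rows))  # [('', 'idx'), ('a', ''), ('b', '')]

# Option 2: ignore any header rows after the first one.
first_only = header_rows[0]  # ['', 'a', 'b']
```

Either option yields three columns for this table, whereas the flattened list has six entries, which matches the doubling described above.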
output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.1.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None