header argument in read_html() ignores empty trs, only within thead #21641

Closed
adamhooper opened this issue Jun 26, 2018 · 1 comment · Fixed by #21487
Labels
IO HTML read_html, to_html, Styler.apply, Styler.applymap

@adamhooper
Contributor

Code Sample, a copy-pastable example if possible

With a table that has no <thead>, read_html()'s header argument looks at the correct row:

>>> pandas.read_html('''<table><tbody><tr><th></th><th></th></tr><tr><th>A</th><th>B</th></tr><tr><td>a</td><td>b</td></tr></tbody></table>''', header=0)
[  Unnamed: 0 Unnamed: 1
0          A          B
1          a          b]

... but when the empty row is inside a <thead>, pandas skips it, so header=0 selects the next row instead:

>>> pandas.read_html('''<table><thead><tr><th></th><th></th></tr><tr><th>A</th><th>B</th></tr></thead><tbody><tr><td>a</td><td>b</td></tr></tbody></table>''', header=0)
[   A  B
0  a  b]

Problem description

Structurally, within HTML, <thead> and <tbody> serve the same purpose. It's unnatural to specify different <tr> indices depending on which element contains them. XPath and CSS selectors (e.g., //table//tr) index rows the same way in both cases; read_html() doesn't.
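
To illustrate the consistency argument, here is a minimal sketch that lists every <tr> in document order for both tables from the code sample. It uses the standard library's `xml.etree.ElementTree` rather than pandas, which only works because these two snippets happen to be well-formed XML; the helper name `rows_in_document_order` is made up for this example.

```python
import xml.etree.ElementTree as ET

NO_THEAD = (
    "<table><tbody><tr><th></th><th></th></tr>"
    "<tr><th>A</th><th>B</th></tr>"
    "<tr><td>a</td><td>b</td></tr></tbody></table>"
)
WITH_THEAD = (
    "<table><thead><tr><th></th><th></th></tr>"
    "<tr><th>A</th><th>B</th></tr></thead>"
    "<tbody><tr><td>a</td><td>b</td></tr></tbody></table>"
)


def rows_in_document_order(html):
    """Return each <tr> as a list of cell texts, ignoring thead/tbody nesting."""
    # Parsing as XML is an assumption that holds for these snippets only;
    # real-world HTML generally needs an HTML parser.
    table = ET.fromstring(html)
    # ".//tr" is the ElementTree equivalent of the //table//tr selector.
    return [[cell.text or "" for cell in tr] for tr in table.findall(".//tr")]


rows_a = rows_in_document_order(NO_THEAD)
rows_b = rows_in_document_order(WITH_THEAD)
print(rows_a)
print(rows_b)
```

Both calls yield the same three rows with the empty row at index 0, which is the indexing one would expect header=0 to follow.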

In code, the error is that Pandas deletes empty <tr>s within <thead> before passing them to pandas.io.html._data_to_frame(), which is where the header argument is parsed.
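
The effect of that ordering can be shown with a toy model. This is not pandas' code: `split_header` and its `drop_empty_head_rows` flag are hypothetical names, and the lists stand in for the parsed rows from the second example above.

```python
def split_header(head_rows, body_rows, header, drop_empty_head_rows):
    """Toy model: pick the header row by index from the combined rows."""
    if drop_empty_head_rows:
        # Mimics the reported behaviour: all-empty <thead> rows are
        # removed before the header index is applied.
        head_rows = [row for row in head_rows if any(cell.strip() for cell in row)]
    rows = head_rows + body_rows
    return rows[header], rows[header + 1:]


head = [["", ""], ["A", "B"]]  # rows inside <thead>
body = [["a", "b"]]            # rows inside <tbody>

buggy = split_header(head, body, header=0, drop_empty_head_rows=True)
consistent = split_header(head, body, header=0, drop_empty_head_rows=False)
print(buggy)       # (['A', 'B'], [['a', 'b']])
print(consistent)  # (['', ''], [['A', 'B'], ['a', 'b']])
```

In the first call, index 0 no longer refers to the first <tr> in the markup, which matches the discrepancy shown in the code sample.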

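Expected Output

Both calls should return the same frame: with header=0, the empty first <tr> supplies the column names (Unnamed: 0, Unnamed: 1) and the A/B row becomes the first data row, whether the rows sit in <thead> or <tbody>.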

Output of pd.show_versions()

pandas.show_versions()
INSTALLED VERSIONS


commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.11.3
scipy: None
pyarrow: 0.8.0
xarray: None
IPython: None
sphinx: 1.7.5
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.4
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@adamhooper
Contributor Author

#21487 fixes this issue; I'm writing it here for posterity.

@WillAyd WillAyd added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Jun 26, 2018
@jreback jreback added this to the 0.24.0 milestone Jul 3, 2018