read_html omits a column when reading a wikipedia table #51629

manzikki · 2023-02-25T09:50:40Z

https://en.wikipedia.org/wiki/List_of_countries_by_road_network_size
# Copy/paste the table HTML into a file
df = pd.read_html(table)
df = pd.DataFrame(df[0])
# The result is shown in the image. The Density column values are missing.

phofl · 2023-02-25T18:03:36Z

Hi, thanks for your report.

Can you please provide a minimal and reproducible example? You can define your html table as

import pandas as pd
from io import StringIO

data = """Put your html table here
"""
pd.read_html(StringIO(data))

Also, please provide your pandas version and dependencies, e.g. by filling out the issue template.

manzikki · 2023-02-26T09:29:53Z

# Here are the first few lines of the table in question. pandas 1.3.5, running on Google Colab.
import pandas as pd
from io import StringIO
roadtable = """

Country	Total <style data-mw-deduplicate="TemplateStyles:r886047488">.mw-parser-output .nobold{font-weight:normal}</style>(km)	Density (km/100 km²)	Paved (km)		Unpaved (km)		Controlled-access (km)		Source & Year

World	64,285,009	47	—		—		—		2021
United States *	6,803,479	69	4,304,715	63%	2,581,895	38%	95,932	1.4%	^[3] 2019

"""
df = pd.read_html(StringIO(roadtable))[0]

phofl · 2023-02-26T11:08:54Z

Can you try on the newest pandas version? 2.0.0rc0

manzikki · 2023-02-27T04:45:27Z

Unfortunately, the issue is still there with 2.0.0rc0:
pd.__version__
'2.0.0rc0'

print(df.columns)
Index(['Country', 'Total .mw-parser-output .nobold{font-weight:normal}(km)',
'Density (km/100 km2)', 'Paved (km)', 'Paved (km).1', 'Unpaved (km)',
'Unpaved (km).1', 'Controlled-access (km)', 'Controlled-access (km).1',
'Source & Year'],
dtype='object')
print(df['Density (km/100 km2)'])
0 NaN
1 NaN
2 NaN
Name: Density (km/100 km2), dtype: float64

phofl · 2023-02-27T11:55:48Z

Good. Now please remove everything from your HTML table that is not necessary to reproduce the problem. The example should be minimal.

m-ganko · 2023-03-20T10:43:29Z

Hi, it seems the problem is related to style="display:none". When an element inside a table cell has this style attribute, pandas returns NaN for the whole cell, whereas it should skip only that element. A reproducible example is below:

import pandas as pd
from io import StringIO

html_table = """
<table>
  <tr>
    <th>Col 1</th>
    <th>Col 2</th>
    <th>Col 3</th>
  </tr>
  <tr>
    <td>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <!-- the hidden, empty span is what triggers the NaN -->
    <td><span style="display:none"></span>4</td>
    <td>5</td>
    <td>6</td>
  </tr>
</table>
"""

pd.read_html(StringIO(html_table))[0]

Out[1]:
   Col 1  Col 2  Col 3
0    1.0      2      3
1    NaN      5      6

I can try to fix this problem.
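Until a fix is available, one possible workaround is to strip hidden elements from the markup before handing it to read_html. The sketch below uses only the standard library; HiddenElementStripper and strip_hidden are hypothetical names made up for this illustration, not part of pandas, and the approach assumes reasonably well-formed HTML (unclosed tags inside a hidden element would confuse the depth counter).

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
HIDDEN = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenElementStripper(HTMLParser):
    """Re-emit HTML, skipping every element styled with display:none."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []       # pieces of the cleaned document
        self.skip_depth = 0   # > 0 while inside a hidden element

    def _is_hidden(self, attrs):
        return bool(HIDDEN.search(dict(attrs).get("style") or ""))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._is_hidden(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never open a nesting level.
        if not self.skip_depth and not self._is_hidden(attrs):
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.parts.append(f"&#{name};")


def strip_hidden(html):
    """Return html with display:none elements (and their contents) removed."""
    parser = HiddenElementStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_hidden('<td><span style="display:none"></span>4</td>'))
```

The cleaned string can then be wrapped in StringIO and passed to pd.read_html. Text that directly follows the hidden element is kept, because only the element itself is dropped. A simpler option that may also help is read_html's displayed_only=False argument, which keeps hidden elements instead of removing them; that has not been verified against this table.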

m-ganko · 2023-03-20T10:45:46Z

take

m-ganko · 2023-03-23T12:54:53Z

I have created a PR which should resolve the main issue.

However, the Wikipedia example shows a second problem: read_html also reads the text of <style> elements. I would suggest adding a new argument to the read_html function, e.g. skip_style_elements. I could work on this as well; just let me know if it makes sense to you and whether I should create a new issue.
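As a stopgap for the <style> text leaking into column labels, the markup can be cleaned before parsing. This is a minimal sketch using a regular expression from the standard library; strip_style_elements is a hypothetical helper written for this illustration (it is not the proposed skip_style_elements argument, which does not exist in pandas), and a regex is only a rough tool for HTML.

```python
import re

# Matches a whole <style ...>...</style> block, across lines and in any case.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)


def strip_style_elements(html):
    """Return html with every <style> element and its CSS text removed."""
    return STYLE_BLOCK.sub("", html)


# Header cell taken from the Wikipedia table quoted earlier in this thread.
header_cell = (
    '<th>Total <style data-mw-deduplicate="TemplateStyles:r886047488">'
    ".mw-parser-output .nobold{font-weight:normal}</style>(km)</th>"
)
print(strip_style_elements(header_cell))  # <th>Total (km)</th>
```

Passing the cleaned string to pd.read_html (wrapped in StringIO) should then give a plain "Total (km)" column label.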

phofl added the "Needs Info" label (Clarification about behavior needed to assess issue) Feb 25, 2023

github-actions bot assigned m-ganko Mar 20, 2023

m-ganko added a commit to m-ganko/pandas that referenced this issue Mar 23, 2023

BUG: change lxml remove to drop_tree (pandas-dev#51629) (ddd102c)

* for removing elements when display:none in read_html
* test added

m-ganko mentioned this issue Mar 23, 2023

BUG: change lxml remove to drop_tree (#51629) #52135 (merged)

m-ganko added a commit to m-ganko/pandas that referenced this issue Mar 23, 2023

DOC: pandas-dev#51629 added to whatsnew

0755f22

mroeschke closed this as completed in #52135 Mar 24, 2023

mroeschke pushed a commit that referenced this issue Mar 24, 2023

BUG: change lxml remove to drop_tree (#51629) (#52135)

525f1ef

* BUG: change lxml remove to drop_tree (#51629)
* for removing elements when display:none in read_html
* test added
* DOC: #51629 added to whatsnew
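The commit title hints at why the cell became NaN: in an element tree, the text that follows a child element is stored as that child's "tail", so removing the child the plain way discards the tail too. The sketch below illustrates the idea with the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the example runs without extra dependencies); both function names are hypothetical. lxml's drop_tree() on HTML elements performs the tail-preserving variant itself.

```python
import xml.etree.ElementTree as ET

CELL = '<td><span style="display:none"></span>4</td>'


def remove_with_tail(cell_xml):
    """Remove the hidden span the plain way; its tail text '4' is lost."""
    td = ET.fromstring(cell_xml)
    td.remove(td.find("span"))
    return "".join(td.itertext())


def drop_keeping_tail(cell_xml):
    """Remove the hidden span but first hand its tail text to a neighbour."""
    td = ET.fromstring(cell_xml)
    children = list(td)
    span = td.find("span")
    index = children.index(span)
    tail = span.tail or ""
    if index == 0:
        # No previous sibling: the tail becomes part of the parent's text.
        td.text = (td.text or "") + tail
    else:
        # Otherwise it is appended to the previous sibling's tail.
        previous = children[index - 1]
        previous.tail = (previous.tail or "") + tail
    td.remove(span)
    return "".join(td.itertext())


print(repr(remove_with_tail(CELL)))   # ''
print(repr(drop_keeping_tail(CELL)))  # '4'
```

An empty cell is what read_html would turn into NaN, which matches the output reported in the minimal example.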
