read_html omits a column when reading a wikipedia table #51629

Closed
manzikki opened this issue Feb 25, 2023 · 8 comments · Fixed by #52135
Labels
Needs Info Clarification about behavior needed to assess issue

Comments

@manzikki

https://en.wikipedia.org/wiki/List_of_countries_by_road_network_size
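# Copy/paste the table HTML into a file (assumed here to be what `table` points to)
import pandas as pd
tables = pd.read_html(table)  # read_html returns a list of DataFrames
df = tables[0]
# The result is shown in the image. The Density column values are missing.
[Image: Screen Shot 2566-02-25 at 16 49 16]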

@phofl
Member

phofl commented Feb 25, 2023

Hi, thanks for your report.

Can you please provide a minimal and reproducible example? You can define your html table as

import pandas as pd
from io import StringIO

data = """Put your html table here
"""
pd.read_html(StringIO(data))

Also, please provide your pandas version and dependencies, e.g. by filling out the issue template.
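
For the version details requested here, pandas has a built-in helper that prints the installed pandas version together with its dependencies. A minimal sketch, assuming pandas is importable in the environment where the problem occurs:

```python
import pandas as pd

# Version string of the installed pandas, handy to quote in the report
print(pd.__version__)

# Prints the versions of pandas, Python, the OS and the dependencies pandas knows about
pd.show_versions()
```

The output of `pd.show_versions()` can be pasted into the issue as-is.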

phofl added the Needs Info label Feb 25, 2023
@manzikki
Author

# Here are the first few lines of the table in question. pandas 1.3.5, running on Google Colab.
import pandas as pd
from io import StringIO
roadtable = """

Country Total
<style data-mw-deduplicate="TemplateStyles:r886047488">.mw-parser-output .nobold{font-weight:normal}</style>(km)
Density
(km/100 km2)
Paved
(km)
Unpaved
(km)
Controlled-access
(km)
Source
& Year
World 64,285,009 47 2021
United States * 6,803,479 69 4,304,715 63% 2,581,895 38% 95,932 1.4% [3] 2019
"""
df = pd.read_html(StringIO(roadtable))[0]

@phofl
Member

phofl commented Feb 26, 2023

Can you try the newest pandas version, 2.0.0rc0?

@manzikki
Author

Unfortunately, the issue is still present in 2.0.0rc0:
pd.__version__
'2.0.0rc0'

print(df.columns)
Index(['Country', 'Total .mw-parser-output .nobold{font-weight:normal}(km)',
'Density (km/100 km2)', 'Paved (km)', 'Paved (km).1', 'Unpaved (km)',
'Unpaved (km).1', 'Controlled-access (km)', 'Controlled-access (km).1',
'Source & Year'],
dtype='object')
print(df['Density (km/100 km2)'])
0 NaN
1 NaN
2 NaN
Name: Density (km/100 km2), dtype: float64

@phofl
Member

phofl commented Feb 27, 2023

Good. Now please remove everything from your HTML table that is not needed to reproduce the issue. The example should be minimal.

@m-ganko
Contributor

m-ganko commented Mar 20, 2023

Hi, it seems the problem is related to style="display:none". When one of the elements in a table cell has this style attribute, pandas returns NaN for the whole cell, whereas it should skip only that element. A reproducible example is below:

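import pandas as pd
from io import StringIO

html_table = """
<table>
  <tr>
    <th>Col 1</th>
    <th>Col 2</th>
    <th>Col 3</th>
  </tr>
  <tr>
    <td>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td><span style="display:none"></span>4</td>
    <td>5</td>
    <td>6</td>
  </tr>
</table>
"""

# Wrap the literal HTML in StringIO, as suggested earlier in the thread
pd.read_html(StringIO(html_table))[0]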

Out[1]:
   Col 1  Col 2  Col 3
0    1.0      2      3
1    NaN      5      6

I can try to fix this problem.

@m-ganko
Contributor

m-ganko commented Mar 20, 2023

take

m-ganko added a commit to m-ganko/pandas that referenced this issue Mar 23, 2023
* for removing elements when display:none in read_html
* test added
@m-ganko
Contributor

m-ganko commented Mar 23, 2023

I have created a PR that should resolve the main issue.

However, the Wikipedia example reveals a second problem: read_html also reads the text of <style> elements. I would suggest adding a new argument to the read_html function, e.g. skip_style_elements. I could work on this as well; just let me know if it makes sense to you and whether I should create a new issue.
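
Until something like that exists, one possible workaround is to drop the `<style>` elements before the HTML reaches read_html. The sketch below is only an illustration: `strip_style_elements` is a hypothetical helper name, the sample markup is a shortened stand-in for the Wikipedia header cell, and it assumes lxml is installed.

```python
from lxml import html


def strip_style_elements(html_text: str) -> str:
    """Return html_text with every <style> element dropped (hypothetical helper)."""
    root = html.fromstring(html_text)
    for style in root.xpath("//style"):
        # drop_tree() removes the element and its CSS text but keeps the tail text, here "(km)"
        style.drop_tree()
    return html.tostring(root, encoding="unicode")


# Shortened stand-in for the header cell from the Wikipedia table
sample = (
    "<table><tr><th>Total<style>.nobold{font-weight:normal}</style>(km)</th></tr>"
    "<tr><td>1</td></tr></table>"
)
cleaned = strip_style_elements(sample)
print(cleaned)
```

The cleaned string could then be passed on as `pd.read_html(StringIO(cleaned))`. Note that `skip_style_elements` is only a proposed argument in this thread, not an existing read_html parameter.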

m-ganko added a commit to m-ganko/pandas that referenced this issue Mar 23, 2023
mroeschke pushed a commit that referenced this issue Mar 24, 2023
* BUG: change lxml remove to drop_tree (#51629)

* for removing elements when display:none in read_html
* test added

* DOC: #51629 added to whatsnew
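
The commit title refers to a difference in how lxml removes elements. In lxml, text that follows an element (its "tail") belongs to that element, so a plain `remove` takes the trailing `4` from the reproducer with it, while `drop_tree` keeps it. The snippet below is a small illustration of that behaviour, assuming lxml is installed; it is not the code from the PR, and `cell_text_after` is a name made up for this example.

```python
from lxml import html

# The problematic cell from the reproducer above
CELL = '<table><tr><td><span style="display:none"></span>4</td></tr></table>'


def cell_text_after(removal: str) -> str:
    """Parse CELL, remove the hidden <span> in the requested way, return the cell text."""
    td = html.fromstring(CELL).xpath("//td")[0]
    span = td.xpath(".//span")[0]
    if removal == "remove":
        # Plain removal: the tail text "4" is attached to the span and goes with it
        span.getparent().remove(span)
    else:
        # drop_tree() drops the element but merges its tail into the surrounding text
        span.drop_tree()
    return td.text_content()


print(repr(cell_text_after("remove")))     # ''
print(repr(cell_text_after("drop_tree")))  # '4'
```

An empty cell text is consistent with the NaN seen in Col 1 of the reproducer.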