read_html providing title from a attribute as well as the text - in effect duplicating output #20027

Closed
MikeWoodward opened this issue Mar 7, 2018 · 5 comments · Fixed by #20047
Labels: IO HTML (read_html, to_html, Styler.apply, Styler.applymap); Output-Formatting (__repr__ of pandas objects, to_string)

@MikeWoodward

Code Sample

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon"
# Parse every table on the page, using the first row as the header
tables = pd.read_html(url, header=0)
print(tables[0].head())

Problem description

The above code should extract only the displayed text in the HTML table; what's in the dataframe should be what's displayed on screen. This isn't what happens. If the HTML contains a hyperlink with a title attribute, this is picked up and added to the dataframe, duplicating the data.

Expected Output

   Year                   Athlete  \
0  1897          John J. McDermott   
1  1898         Ronald J. MacDonald   
2  1899          Lawrence Brignolia   
3  1900         John "Jack" Caffery   
4  1901         John "Jack" Caffery   

                      Country/State     Time        Notes  
0                United States (NY)  2:55:10          NaN  
1                            Canada  2:42:00          NaN  
2                United States (MA)  2:54:38          NaN  
3                            Canada  2:39:44          NaN  
4                            Canada  2:29:23  2nd victory 

Output

Here's the actual output; the duplication is in the Athlete and Country/State columns.

   Year                                  Athlete  
0  1897      McDermott, John J.John J. McDermott   
1  1898  MacDonald, Ronald J.Ronald J. MacDonald   
2  1899    Brignolia, LawrenceLawrence Brignolia   
3  1900         Caffery, JohnJohn "Jack" Caffery   
4  1901         Caffery, JohnJohn "Jack" Caffery   

                      Country/State     Time        Notes  
0  United States United States (NY)  2:55:10          NaN  
1                     Canada Canada  2:42:00          NaN  
2  United States United States (MA)  2:54:38          NaN  
3                     Canada Canada  2:39:44          NaN  
4                     Canada Canada  2:29:23  2nd victory 
@WillAyd
Member

WillAyd commented Mar 7, 2018

Are you seeing the issue on any other sites? Just glancing at the source, I would think this has more to do with the span elements with display:none on the Wikipedia page than with the links (see screenshot below), but I'm curious whether another source led you to think the link is responsible.

[Screenshot: 2018-03-07, 11:35:18 AM]

@jreback added the Output-Formatting and IO HTML labels Mar 8, 2018
@MikeWoodward
Author

I think you're right: this is an error on my part, and it's the display:none setting that's doing it. I need a bit more time to investigate, and I'll post again when I have some results.

@WillAyd
Member

WillAyd commented Mar 9, 2018

OK, thanks Mike. For what it's worth, I already put what I believe to be the fix in #20047; you might want to give that a look.

@MikeWoodward
Author

Confirmed, this is the display:none doing this. Here's some example HTML that shows the issue.

<title>Example</title>

This is a H1

This is a paragraph

Column1 | Column2 | Column3
Span text display attribute:noneJohn J. McDermottText not in span or a | Plain text - no elements | Plain text - no elements
Span text no display attributeJohn J. McDermottText not in span or a | Plain text - no elements | Plain text - no elements

Some text
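
To make the mechanism concrete, here is a minimal sketch that uses only the Python standard library. It is not how pandas parses tables; the `CellText` class, the `cell_texts` helper, and the sample markup are all hypothetical, written only to show that joining every text node in a cell picks up the hidden span, while skipping elements styled `display:none` yields what is shown on screen. It assumes well-formed nesting and no void elements such as `<br>` inside cells.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of each <td>, optionally skipping display:none elements."""

    def __init__(self, skip_hidden):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.cells = []      # finished cell strings
        self._parts = None   # text pieces of the cell being read, or None outside a cell
        self._stack = []     # one flag per open element inside the cell: is it hidden?

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts, self._stack = [], []
        elif self._parts is not None:
            style = (dict(attrs).get("style") or "").replace(" ", "").lower()
            self._stack.append("display:none" in style)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            self.cells.append("".join(self._parts))
            self._parts = None
        elif self._parts is not None and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._parts is None:
            return  # text outside any cell
        if self.skip_hidden and any(self._stack):
            return  # text inside a hidden element
        self._parts.append(data)


# Hypothetical markup modelled on the Athlete column: a hidden sort key
# followed by a visible link.
HTML = (
    "<table><tr>"
    '<td><span style="display:none">McDermott, John J.</span>'
    '<a href="#" title="John J. McDermott">John J. McDermott</a></td>'
    "<td>2:55:10</td>"
    "</tr></table>"
)


def cell_texts(html, skip_hidden):
    parser = CellText(skip_hidden)
    parser.feed(html)
    parser.close()
    return parser.cells


print(cell_texts(HTML, skip_hidden=False))
# ['McDermott, John J.John J. McDermott', '2:55:10']
print(cell_texts(HTML, skip_hidden=True))
# ['John J. McDermott', '2:55:10']
```

Note that the link's title attribute never appears in either result; the duplicated text comes from the hidden span. As far as I know, pandas releases that include the fix expose this choice through the `displayed_only` argument of `read_html`.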




@WillAyd
Member

WillAyd commented Mar 9, 2018

Great thanks. Feel free to mess around with the fix I put in the above PR. Targeting v0.23 if all works out

@TomAugspurger reopened this Mar 9, 2018
@jreback added this to the 0.23.0 milestone Mar 10, 2018
4 participants