pd.read_html() convert <br> to space #29528

Closed
abitrolly opened this issue Nov 10, 2019 · 5 comments · Fixed by #45972
Labels
Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap
Milestone

Comments

@abitrolly

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.read_html('http://vybary2019.by/regions/49.html',header=0)[0]

Problem description

In the original web page, the first cell of the table contains three separate words.

image

In the pandas table, the first cell contains only two words: the first two original words are joined into one.

image

This is a problem because pandas loses data during parsing, and the loss is non-trivial to fix afterwards.

The reason for this behaviour is the presence of a <br> tag between the first two words in the original content.

image

<br> should be converted to a space, or at least to a newline, when converting to the text representation.

I found a similar reference in #14608 (comment), but no separate GitHub issue.
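The effect of a <br> between two words can be illustrated without pandas. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for an HTML parser. It is an assumption that pandas' parsers behave the same way, and the cell content is a hypothetical fragment modeled on the table above. It shows that the text after <br/> is stored as the tail of the br element, so plainly joining the text pieces drops the word boundary, while prefixing the tail with a space keeps it.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell modeled on the report: surname, <br/>, then the other names.
cell = ET.fromstring("<td>ГОСТЯЕВА<br/>Жанна Михайловна</td>")

# Joining all text pieces as-is loses the boundary that <br/> marked.
joined = "".join(cell.itertext())

# The text after <br/> is stored as the "tail" of the br element.
# Prefixing it with a space makes the boundary survive the join.
br = cell.find("br")
br.tail = " " + (br.tail or "")
separated = "".join(cell.itertext())

print(joined)
print(separated)
```

A newline could be used instead of the space if line breaks inside a cell should stay visible.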

Expected Output

image

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.137+
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.3
pytz : 2018.9
dateutil : 2.6.1
pip : 19.3.1
setuptools : 41.4.0
Cython : 0.29.13
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : 0.4.0
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 5.5.0
pandas_datareader: 0.7.4
bs4 : 4.6.3
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.2.6
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.5.9
pandas_gbq : 0.11.0
pyarrow : 0.14.1
pytables : None
s3fs : 0.3.5
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.4.4
xarray : 0.11.3
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : None

@jbrockmendel jbrockmendel added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Nov 12, 2019
@Franklinluo17

This could potentially be an issue with BeautifulSoup, as it appears that a lot of the HTML parsing is deferred to this external library.

@Franklinluo17 Franklinluo17 removed their assignment Dec 8, 2019
@abitrolly
Author

@Franklinluo17 is it possible to draft a minimal call that performs the conversion, to confirm that this is indeed an external problem?

@mroeschke mroeschke added the Bug label May 7, 2020
@mroeschke
Member

Thanks for the report, but I can no longer reproduce this behavior with the example site.

In [10]: import pandas as pd
    ...: pd.read_html('http://vybary2019.by/regions/49.html',header=0)[0]
ValueError: No tables found

Going to close as needing a stable, reproducible example (that hopefully doesn't depend on an external site), but happy to reopen if someone can provide a new example.

@abitrolly
Author

I can't reopen the issue, but here is the saved file, which is also attached: https://raw.githubusercontent.com/opendataby/vybary2019/master/cache/region49.html

region49.zip

In [1]: import pandas as pd                                                                                                       

In [2]: pd.read_html('https://raw.githubusercontent.com/opendataby/vybary2019/master/cache/region49.html', header=0)[0]           
Out[2]: 
  Фамилия, имя, отчество (в алфавитном порядке по округу)  ...  Место жительства
0                           ГОСТЯЕВАЖанна Михайловна       ...         г. Гродно
1                         ЖИГАРЬЛариса Александровна       ...         г. Гродно
2                          ЛУКАНСКАЯИрина Эдуардовна       ...         г. Гродно
3                       МАЛЬЦЕВАнатолий Владимирович       ...         г. Гродно
4                             РИМШААртур Анатольевич       ...         г. Гродно

[5 rows x 6 columns]
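Until the parser handles this, one possible workaround is to replace the <br> tags in the raw HTML before it is parsed. This is only a sketch: br_to_space is a hypothetical helper, not part of pandas, and the sample table is a shortened stand-in for the saved page. A regular expression is a blunt tool here, since it would also rewrite any <br> that appears inside scripts or comments.

```python
import re

# Matches <br>, <br/>, <br /> and <BR>, including any attributes.
BR_TAG = re.compile(r"<br\b[^>]*>", flags=re.IGNORECASE)


def br_to_space(html: str) -> str:
    """Replace every <br> tag with a single space (hypothetical helper)."""
    return BR_TAG.sub(" ", html)


# Shortened stand-in for one row of the saved page.
sample = (
    "<table><tr><th>Name</th></tr>"
    "<tr><td>ГОСТЯЕВА<br>Жанна Михайловна</td></tr></table>"
)
cleaned = br_to_space(sample)
print(cleaned)
```

The cleaned string could then be wrapped in io.StringIO and passed to pd.read_html in place of the URL.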

@mroeschke mroeschke reopened this Jul 23, 2021
@jreback jreback added this to the 1.5 milestone Apr 9, 2022

5 participants