BUG: Convert <br> to space in pd.read_html #45972

Merged: 6 commits, Apr 10, 2022
Changes from 5 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -353,6 +353,7 @@ I/O
 - Bug in :func:`read_csv` not recognizing line break for ``on_bad_lines="warn"`` for ``engine="c"`` (:issue:`41710`)
 - Bug in :func:`read_parquet` when ``engine="pyarrow"`` which caused partial write to disk when column of unsupported datatype was passed (:issue:`44914`)
 - Bug in :func:`DataFrame.to_excel` and :class:`ExcelWriter` would raise when writing an empty DataFrame to a ``.ods`` file (:issue:`45793`)
+- Bug in :func:`read_html` where elements surrounding ``<br>`` were joined without a space between them (:issue:`29528`)

 Period
 ^^^^^^
12 changes: 11 additions & 1 deletion pandas/io/html.py
@@ -622,7 +622,13 @@ def _build_doc(self):
         else:
             udoc = bdoc
             from_encoding = self.encoding
-        return BeautifulSoup(udoc, features="html5lib", from_encoding=from_encoding)
+
+        soup = BeautifulSoup(udoc, features="html5lib", from_encoding=from_encoding)
+
+        for br in soup.find_all("br"):
+            br.replace_with("\n" + br.text)
+
+        return soup


 def _build_xpath_expr(attrs) -> str:
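
Reviewer sketch, not part of the diff: the hunk above swaps every `<br>` node for a `"\n"` text node, while the new test expects a single space. That implies some later whitespace cleanup of the cell text, which is assumed here and modeled with a plain regex. To stay dependency-free, this sketch uses the standard library's `html.parser` instead of BeautifulSoup, so it only mirrors the idea. `CellTextParser` is a hypothetical name.

```python
import re
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Hypothetical helper: collect the text of each <td>, treating <br> as a newline."""

    def __init__(self):
        super().__init__()
        self.cells = []      # finished cell strings
        self._parts = None   # fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts = []
        elif tag == "br" and self._parts is not None:
            # Analogous to replacing the <br> node with a "\n" text node.
            self._parts.append("\n")

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            raw = "".join(self._parts)
            # Assumed cleanup step: collapse whitespace runs (newline included) to one space.
            self.cells.append(re.sub(r"\s+", " ", raw.strip()))
            self._parts = None


parser = CellTextParser()
parser.feed("<table><tr><td>word1<br>word2</td></tr></table>")
parser.close()
print(parser.cells)
```

Without the `br` branch, the two fragments would be joined directly, which is the behaviour reported in the issue.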
@@ -759,6 +765,10 @@ def _build_doc(self):
         else:
             if not hasattr(r, "text_content"):
                 raise XMLSyntaxError("no text parsed from document", 0, 0, 0)
+
+        for br in r.xpath("*//br"):
+            br.tail = "\n" + (br.tail or "")
+
         return r

     def _parse_thead_tr(self, table):
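
Reviewer sketch, not part of the diff: in lxml, the text that follows an element is stored on that element's `.tail`, and `.tail` is `None` when nothing follows, which is why the hunk writes `(br.tail or "")`. The standard library's `xml.etree.ElementTree` uses the same text/tail model, so it stands in for lxml here. The markup is a well-formed variant of the test table (`<br/>` instead of `<br>`), and `MARKUP` and `raw_text` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the table used in the PR's test.
MARKUP = "<table><tr><th>A</th></tr><tr><td>word1<br/>word2</td></tr></table>"


def raw_text(element):
    """Concatenate every text and tail fragment inside the element."""
    return "".join(element.itertext())


root = ET.fromstring(MARKUP)
td = root.find(".//td")
before = raw_text(td)

# Same idea as the lxml hunk: prefix whatever follows each <br> with a newline.
for br in root.iter("br"):
    br.tail = "\n" + (br.tail or "")

after = raw_text(td)
print(repr(before))
print(repr(after))
```

A trailing `<br/>` with nothing after it has `tail` set to `None`, so the `or ""` guard avoids a `TypeError` on the string concatenation.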
39 changes: 29 additions & 10 deletions pandas/tests/io/test_html.py
@@ -611,17 +611,17 @@ def try_remove_ws(x):
         )
         assert df.shape == ground_truth.shape
         old = [
-            "First Vietnamese American BankIn Vietnamese",
-            "Westernbank Puerto RicoEn Espanol",
-            "R-G Premier Bank of Puerto RicoEn Espanol",
-            "EurobankEn Espanol",
-            "Sanderson State BankEn Espanol",
-            "Washington Mutual Bank(Including its subsidiary Washington "
+            "First Vietnamese American Bank In Vietnamese",
+            "Westernbank Puerto Rico En Espanol",
+            "R-G Premier Bank of Puerto Rico En Espanol",
+            "Eurobank En Espanol",
+            "Sanderson State Bank En Espanol",
+            "Washington Mutual Bank (Including its subsidiary Washington "
             "Mutual Bank FSB)",
-            "Silver State BankEn Espanol",
-            "AmTrade International BankEn Espanol",
-            "Hamilton Bank, NAEn Espanol",
-            "The Citizens Savings BankPioneer Community Bank, Inc.",
+            "Silver State Bank En Espanol",
+            "AmTrade International Bank En Espanol",
+            "Hamilton Bank, NA En Espanol",
+            "The Citizens Savings Bank Pioneer Community Bank, Inc.",
         ]
         new = [
             "First Vietnamese American Bank",
@@ -1286,3 +1286,22 @@ def test_parse_path_object(self, datapath):
         df1 = self.read_html(file_path_string)[0]
         df2 = self.read_html(file_path)[0]
         tm.assert_frame_equal(df1, df2)
+
+    def test_parse_br_as_space(self):
+        # GH 29528: pd.read_html() convert <br> to space
+        result = self.read_html(
+            """
+            <table>
+                <tr>
+                    <th>A</th>
+                </tr>
+                <tr>
+                    <td>word1<br>word2</td>
+                </tr>
+            </table>
+        """
+        )[0]
+
+        expected = DataFrame(data=[["word1 word2"]], columns=["A"])
+
+        tm.assert_frame_equal(result, expected)