Commit c0bde88

PERF: read_html: Reading one HTML file with multiple tables is much slower than loading each table separately (#49942)
* change xpath expression - perf improvement
* gh ref
* add whatsnew
1 parent dc5e00d commit c0bde88

2 files changed: +3 -2 lines changed


doc/source/whatsnew/v2.0.0.rst

+1
@@ -613,6 +613,7 @@ Performance improvements
 - Performance improvement in :func:`read_stata` with parameter ``index_col`` set to ``None`` (the default). Now the index will be a :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49745`)
 - Performance improvement in :func:`merge` when not merging on the index - the new index will now be :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49478`)
 - Performance improvement in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when using any non-object dtypes (:issue:`46470`)
+- Performance improvement in :func:`read_html` when there are multiple tables (:issue:`49929`)
 
 .. ---------------------------------------------------------------------------
 .. _whatsnew_200.bug_fixes:

pandas/io/html.py

+2 -2
@@ -744,8 +744,8 @@ def _parse_tables(self, doc, match, kwargs):
         pattern = match.pattern
 
         # 1. check all descendants for the given pattern and only search tables
-        # 2. go up the tree until we find a table
-        xpath_expr = f"//table//*[re:test(text(), {repr(pattern)})]/ancestor::table"
+        # GH 49929
+        xpath_expr = f"//table[.//text()[re:test(., {repr(pattern)})]]"
 
         # if any table attributes were given build an xpath expression to
         # search for them
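
The two XPath expressions in this hunk differ mainly in traversal direction: the old one finds matching descendants and then climbs to every `ancestor::table`, while the new one visits each table once and keeps it if any descendant text node matches. The following is a rough plain-Python analogue of both strategies using only the standard library's `xml.etree.ElementTree`, not lxml or the code pandas actually runs. The sample document, the function names `tables_via_ancestors` and `tables_via_predicate`, and the `\d+` pattern are all hypothetical and chosen purely for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not taken from the commit).
DOC = """
<html><body>
  <table id="a"><tr><td>apples 10</td><td>pears 20</td></tr></table>
  <table id="b"><tr><td>no numbers here</td></tr></table>
  <table id="c"><tr><td>plums</td><td>30</td></tr></table>
</body></html>
"""


def tables_via_ancestors(root, pattern):
    """Analogue of the old expression: match descendants, then climb upward."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    hits = set()
    for elem in root.iter():
        # elem.text is only the text before the first child element.
        if elem.text and pattern.search(elem.text):
            node = parent.get(elem)
            while node is not None:
                if node.tag == "table":
                    hits.add(node)
                node = parent.get(node)
    # Return the collected tables in document order, without duplicates.
    return [table for table in root.iter("table") if table in hits]


def tables_via_predicate(root, pattern):
    """Analogue of the new expression: test each table's descendant text once."""
    return [
        table
        for table in root.iter("table")
        if any(pattern.search(text) for text in table.itertext())
    ]


root = ET.fromstring(DOC)
pattern = re.compile(r"\d+")
old_ids = [t.get("id") for t in tables_via_ancestors(root, pattern)]
new_ids = [t.get("id") for t in tables_via_predicate(root, pattern)]
print(old_ids, new_ids)
```

On this sample both strategies select tables `a` and `c`. In the first function every matching cell triggers its own walk up the tree and the results must then be de-duplicated, whereas the second function produces each table at most once, which is the shape the new expression has.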
