PERF: read_html: Reading one HTML file with multiple tables is much slower than loading each table separatly (#49942)

asishm · web-flow · commit c0bde88e4dc5 · 2022-11-29T09:30:36.000-08:00
* change xpath expression - perf improvement

* gh ref

* add whatsnew
diff --git a/doc/source/whatsnew/v2.0.0.rst b/doc/source/whatsnew/v2.0.0.rst
@@ -613,6 +613,7 @@ Performance improvements
 - Performance improvement in :func:`read_stata` with parameter ``index_col`` set to ``None`` (the default). Now the index will be a :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49745`)
 - Performance improvement in :func:`merge` when not merging on the index - the new index will now be :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49478`)
 - Performance improvement in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when using any non-object dtypes (:issue:`46470`)
+- Performance improvement in :func:`read_html` when there are multiple tables (:issue:`49929`)
 
 .. ---------------------------------------------------------------------------
 .. _whatsnew_200.bug_fixes:
diff --git a/pandas/io/html.py b/pandas/io/html.py
@@ -744,8 +744,8 @@ def _parse_tables(self, doc, match, kwargs):
         pattern = match.pattern
 
         # 1. check all descendants for the given pattern and only search tables
-        # 2. go up the tree until we find a table
-        xpath_expr = f"//table//*[re:test(text(), {repr(pattern)})]/ancestor::table"
+        # GH 49929
+        xpath_expr = f"//table[.//text()[re:test(., {repr(pattern)})]]"
 
         # if any table attributes were given build an xpath expression to
         # search for them