Commit e72b7e5

adamhooper authored and jreback committed
read_html: Handle colspan and rowspan (#21487)
1 parent 7733530 commit e72b7e5

3 files changed, +660 -301 lines changed

doc/source/whatsnew/v0.24.0.txt

+43 -2
@@ -10,7 +10,7 @@ New features

 - ``ExcelWriter`` now accepts ``mode`` as a keyword argument, enabling append to existing workbooks when using the ``openpyxl`` engine (:issue:`3441`)

-.. _whatsnew_0240.enhancements.extension_array_operators
+.. _whatsnew_0240.enhancements.extension_array_operators:

 ``ExtensionArray`` operator support
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -26,6 +26,46 @@ See the :ref:`ExtensionArray Operator Support
 <extending.extension.operator>` documentation section for details on both
 ways of adding operator support.

+.. _whatsnew_0240.enhancements.read_html:
+
+``read_html`` Enhancements
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:func:`read_html` previously ignored ``colspan`` and ``rowspan`` attributes.
+Now it understands them, treating them as sequences of cells with the same
+value. (:issue:`17054`)
+
+.. ipython:: python
+
+    result = pd.read_html("""
+      <table>
+        <thead>
+          <tr>
+            <th>A</th><th>B</th><th>C</th>
+          </tr>
+        </thead>
+        <tbody>
+          <tr>
+            <td colspan="2">1</td><td>2</td>
+          </tr>
+        </tbody>
+      </table>""")
+
+Previous Behavior:
+
+.. code-block:: ipython
+
+    In [13]: result
+    Out [13]:
+    [   A  B   C
+     0  1  2 NaN]
+
+Current Behavior:
+
+.. ipython:: python
+
+    result
+
 .. _whatsnew_0240.enhancements.other:

 Other Enhancements
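
The phrase "treating them as sequences of cells with the same value" can be pictured with a small standard-library sketch. This is not the code from this commit and does not use pandas; the names ``SpanTableParser`` and ``expand_spans`` are hypothetical, and the sketch assumes well-formed, non-nested tables with integer span attributes.

```python
from html.parser import HTMLParser


class SpanTableParser(HTMLParser):
    """Collect table rows as lists of (text, colspan, rowspan) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read
        self._cell = None  # [text_parts, colspan, rowspan] of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._cell = [[], int(a.get("colspan", 1)), int(a.get("rowspan", 1))]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, colspan, rowspan = self._cell
            self._row.append(("".join(parts).strip(), colspan, rowspan))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def expand_spans(rows):
    """Repeat each cell's text across every column and row it spans."""
    grid = []
    pending = {}  # column index -> (text, number of later rows still to fill)
    for row in rows:
        out, col = [], 0
        cells = iter(row)
        cell = next(cells, None)
        while cell is not None or col in pending:
            if col in pending:
                # A rowspan from an earlier row occupies this column.
                text, left = pending.pop(col)
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                col += 1
                continue
            text, colspan, rowspan = cell
            for _ in range(colspan):
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
                col += 1
            cell = next(cells, None)
        grid.append(out)
    return grid


parser = SpanTableParser()
parser.feed(
    "<table><tr><th>A</th><th>B</th><th>C</th></tr>"
    '<tr><td colspan="2">1</td><td>2</td></tr></table>'
)
print(expand_spans(parser.rows))  # [['A', 'B', 'C'], ['1', '1', '2']]
```

The spanned cell's text is copied into both columns, so the data row lines up with all three headers instead of leaving the last column empty.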
@@ -40,6 +80,7 @@ Other Enhancements
   <https://pandas-gbq.readthedocs.io/en/latest/changelog.html#changelog-0-5-0>`__.
   (:issue:`21627`)
 - New method :meth:`HDFStore.walk` will recursively walk the group hierarchy of an HDF5 file (:issue:`10932`)
+- :func:`read_html` copies cell data across ``colspan``s and ``rowspan``s, and it treats all-``th`` table rows as headers if ``header`` kwarg is not given and there is no ``thead`` (:issue:`17054`)
 - :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
 - :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
 -
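
The header rule in the added ``read_html`` bullet (all-``th`` rows become headers when there is no ``thead`` and no ``header`` argument) might be sketched as follows. The function ``split_header_rows`` and its row representation are hypothetical, and the sketch assumes that only leading all-``th`` rows count, which the changelog entry does not spell out.

```python
def split_header_rows(rows, has_thead=False, header=None):
    """Split rows into (header_rows, body_rows).

    rows is a list of rows, each a list of (tag, text) pairs,
    where tag is "th" or "td". Hypothetical helper, not a pandas API.
    """
    # An explicit header argument or a real <thead> disables the inference.
    if has_thead or header is not None:
        return [], rows
    n = 0
    # Assumption: only consecutive all-<th> rows at the top are promoted.
    while n < len(rows) and rows[n] and all(tag == "th" for tag, _ in rows[n]):
        n += 1
    return rows[:n], rows[n:]


rows = [
    [("th", "A"), ("th", "B")],
    [("td", "1"), ("td", "2")],
]
header_rows, body_rows = split_header_rows(rows)
print(header_rows)  # [[('th', 'A'), ('th', 'B')]]
print(body_rows)    # [[('td', '1'), ('td', '2')]]
```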
@@ -329,7 +370,7 @@ MultiIndex
 I/O
 ^^^

--
+- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`)
 -
 -
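
The index shift described in the I/O entry can be illustrated with a toy model. ``select_header`` and its ``drop_blank_rows`` flag are hypothetical stand-ins for the old and new row counting, not pandas parameters; the point is only that the same ``header`` value selects a different row depending on whether all-whitespace rows are counted.

```python
def select_header(thead_rows, header, drop_blank_rows):
    """Return the row that a given header index selects.

    thead_rows is a list of rows, each a list of cell strings.
    drop_blank_rows=True models the old behavior (blank rows ignored);
    drop_blank_rows=False models the new behavior (every row counted).
    """
    rows = thead_rows
    if drop_blank_rows:
        rows = [row for row in rows if any(cell.strip() for cell in row)]
    return rows[header]


thead = [[" ", " "], ["A", "B"]]

# New counting: the blank row is index 0, so the labels are at index 1.
print(select_header(thead, header=1, drop_blank_rows=False))  # ['A', 'B']

# Old counting: the blank row vanished, so callers had to pass index 0.
print(select_header(thead, header=0, drop_blank_rows=True))   # ['A', 'B']
```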
