read_html: Handle colspan and rowspan #21487

Merged
merged 14 commits into from
Jul 5, 2018
117 changes: 116 additions & 1 deletion doc/source/whatsnew/v0.24.0.txt
@@ -40,6 +40,7 @@ Other Enhancements
<https://pandas-gbq.readthedocs.io/en/latest/changelog.html#changelog-0-5-0>`__.
(:issue:`21627`)
- New method :meth:`HDFStore.walk` will recursively walk the group hierarchy of an HDF5 file (:issue:`10932`)
- :func:`read_html` copies cell data across ``colspan`` and ``rowspan`` spans, and it treats all-``th`` table rows as headers if the ``header`` kwarg is not given and there is no ``thead`` (:issue:`17054`)
- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
-
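
The all-``th`` header rule in the :func:`read_html` entry above can be illustrated with a small standard-library sketch. This is not the pandas implementation: the function name ``split_leading_th_rows`` and the ``(tag, text)`` row representation are hypothetical, and the choice to consider only *leading* all-``th`` rows is an assumption made for the illustration.

```python
def split_leading_th_rows(rows):
    """Split rows into (header_rows, body_rows).

    Each row is a list of (tag, text) pairs, where tag is "th" or "td".
    Assumption for this sketch: only the leading run of rows whose cells
    are all "th" is treated as the header.
    """
    n = 0
    while n < len(rows) and rows[n] and all(tag == "th" for tag, _ in rows[n]):
        n += 1
    return rows[:n], rows[n:]


# A table with no <thead>: the first row is all <th>, the second is all <td>.
rows = [
    [("th", "A"), ("th", "B")],
    [("td", "1"), ("td", "2")],
]
header_rows, body_rows = split_leading_th_rows(rows)
```

Here ``header_rows`` holds the single all-``th`` row and ``body_rows`` holds the data row.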

@@ -167,6 +168,120 @@ Current Behavior:
...
OverflowError: Trying to coerce negative values to unsigned integers

read_html Incompatibilities
Contributor

how about read_html enhancements!

---------------------------

:func:`read_html` previously ignored ``colspan`` and ``rowspan`` attributes.
Now it understands them, treating them as a sequence of cells with the same
value.

Previous Behavior:

.. code-block:: ipython

In [1]: pd.read_html("""
Contributor

To avoid repeating the code, make an ipython block above where you define the HTML itself. Then the previous behavior uses a code-block to show the output, and the new behavior uses an ipython block.

Contributor Author

Ah, I get it now. (I actually read the docs to preview what it looks like.) Thanks -- looks like a super improvement.

<table>
<thead>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="2">1</td><td>2</td>
</tr>
</tbody>
</table>
""")
Out [1]:
[ A B C
0 1 2 NaN]

Current Behavior:

.. code-block:: ipython
Contributor

you can make this an ipython block (e.g. actually run it for current)


In [1]: pd.read_html("""
<table>
<thead>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="2">1</td><td>2</td>
</tr>
</tbody>
</table>
""")
Out [1]:
[ A B C
0 1 1 2]

Calls that relied on the previous behavior will need to be changed.
Contributor

this sentence is not needed
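
The span handling described in this section (each ``colspan``/``rowspan`` cell becomes a sequence of cells with the same value) can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: ``SpanCollector`` and ``expand_spans`` are hypothetical names, and nested tables and malformed spans are ignored.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Hypothetical helper: collects rows of [text, colspan, rowspan] cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attrs = dict(attrs)
            self._cell = ["", int(attrs.get("colspan") or 1),
                          int(attrs.get("rowspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell[0] = self._cell[0].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def expand_spans(html):
    """Return the table as a grid, copying cell text across spans."""
    parser = SpanCollector()
    parser.feed(html)
    grid, carried = [], {}  # carried: column -> (text, rows still to fill)
    for row in parser.rows:
        out, next_carried, col = [], {}, 0
        pending = list(row)
        while pending or col in carried:
            if col in carried:
                # A rowspan from an earlier row occupies this column.
                text, remaining = carried.pop(col)
                out.append(text)
                if remaining > 1:
                    next_carried[col] = (text, remaining - 1)
                col += 1
                continue
            text, colspan, rowspan = pending.pop(0)
            for _ in range(colspan):
                out.append(text)
                if rowspan > 1:
                    next_carried[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
        carried = next_carried
    return grid


grid = expand_spans("""
<table>
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td colspan="2">1</td><td>2</td></tr>
</table>
""")
# grid == [["A", "B", "C"], ["1", "1", "2"]]
```

The same function handles ``rowspan`` by carrying a cell's text down into the rows it covers.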


Also, :func:`read_html` previously ignored some ``<tr>`` elements when called
Member

See what @jreback thinks but I don't consider this second half necessary as it addresses a minor bug. Otherwise lgtm

Contributor

Yeah, listing this as a one-liner in the I/O section is OK; no example is needed for the latter half.

with ``header=`` or ``skiprows=`` on some unusual HTML tables.
(:issue:`21641`)

Previous Behavior:

.. code-block:: ipython

In [1]: pd.read_html("""
<table>
<thead>
<tr>
<!-- empty header row, was ignored -->
<th></th><th></th><th></th>
</tr>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td><td>2</td><td>3</td>
</tr>
</tbody>
</table>
""", header=1)
Out [1]:
[Empty DataFrame
Columns: [1, 2, 3]
Index: []]

Current Behavior:

.. code-block:: ipython

In [1]: pd.read_html("""
<table>
<thead>
<tr>
<!-- empty header row, was ignored -->
<th></th><th></th><th></th>
</tr>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td><td>2</td><td>3</td>
</tr>
</tbody>
</table>
""", header=1)
Out [1]:
[ A B C
0 1 2 3]

Previously, the workaround was to write ``header=0`` instead of ``header=1``
for this example table. Now, that workaround must be removed. This should not
affect many users, since most HTML tables do not have empty header rows.
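
The change in how ``header`` is counted can be modelled in a few lines of plain Python. This is only a sketch of the indexing rule described above, not pandas code: ``split_header`` and its ``drop_blank_rows`` flag are hypothetical, with the flag standing in for the old behavior of ignoring empty header rows.

```python
def split_header(rows, header, drop_blank_rows=False):
    """Return (column_names, data_rows) for a table given as a list of rows.

    drop_blank_rows=True imitates the previous behavior, where
    all-whitespace rows were discarded before ``header`` was applied.
    """
    if drop_blank_rows:
        rows = [row for row in rows if any(cell.strip() for cell in row)]
    return rows[header], rows[header + 1:]


# The example table: an empty header row, then A/B/C, then one data row.
table = [["", "", ""], ["A", "B", "C"], ["1", "2", "3"]]

old = split_header(table, header=1, drop_blank_rows=True)
workaround = split_header(table, header=0, drop_blank_rows=True)
new = split_header(table, header=1)
```

In this model ``old`` picks the data row as the column names and leaves no data, while ``workaround`` and ``new`` both yield ``A``, ``B``, ``C`` as columns with one data row.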

- :class:`DatetimeIndex` now accepts :class:`Int64Index` arguments as epoch timestamps (:issue:`20997`)
-
-
@@ -297,7 +412,7 @@ MultiIndex
I/O
^^^

-
- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`)
Contributor

This is all OK, but would a subsection with a mini-example be instructive to the user about the revised functionality? Not saying you 100% need to, but a simple enough example (or even an image) would be helpful here.

Contributor Author

I'm going to display my ignorance of the whatsnew format: ... where would this go?

Contributor

oh, you make a new subsection in the top somewhere, e.g. model after this

Series and Index Data-Dtype Incompatibilities

just add a new one below and you can make an extended example

-
-
