read_html: Handle colspan and rowspan #21487

Merged
merged 14 commits into from
Jul 5, 2018
114 changes: 114 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -168,6 +168,120 @@ Current Behavior:
...
OverflowError: Trying to coerce negative values to unsigned integers

read_html Incompatibilities
Contributor

how about read_html enhancements!

---------------------------

:func:`read_html` previously ignored ``colspan`` and ``rowspan`` attributes.
Now it understands them, treating each spanning cell as a sequence of cells
with the same value.

Previous Behavior:

.. code-block:: ipython

In [1]: pd.read_html("""
Contributor

to avoid repeating the code, make an ipython block above where you define the html itself. then the previous behavior uses a code-block to show the output, and the new behavior uses an ipython block

Contributor Author

Ah, I get it now. (I actually read the docs to preview what it looks like.) Thanks -- looks like a super improvement.

<table>
<thead>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="2">1</td><td>2</td>
</tr>
</tbody>
</table>
""")
Out[1]:
[   A  B   C
 0  1  2 NaN]

Current Behavior:

.. code-block:: ipython
Contributor

you can make this an ipython block (e.g. actually run it for current)


In [1]: pd.read_html("""
<table>
<thead>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="2">1</td><td>2</td>
</tr>
</tbody>
</table>
""")
Out[1]:
[   A  B  C
 0  1  1  2]

Calls that relied on the previous behavior will need to be changed.
Contributor

this sentence is not needed
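
As a quick check of the behavior described in this entry, here is a minimal sketch that also exercises ``rowspan``, which the example in the diff does not cover. The table contents are made up for illustration. It assumes a pandas version that includes this change and an installed HTML parser (lxml, or bs4 with html5lib). The HTML is wrapped in ``StringIO`` because newer pandas releases deprecate passing literal HTML strings to ``read_html``.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: the "1" cell spans two rows,
# and the "x" cell spans two columns.
html = """
<table>
  <thead>
    <tr><th>A</th><th>B</th><th>C</th></tr>
  </thead>
  <tbody>
    <tr><td rowspan="2">1</td><td colspan="2">x</td></tr>
    <tr><td>y</td><td>z</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(html))[0]
print(df)

# With span handling, each spanning cell is repeated:
#   row 0 -> A=1, B="x", C="x"
#   row 1 -> A=1, B="y", C="z"
```

Without span handling, the second row would have only two cells and the last column would be filled with ``NaN``, which is the kind of misalignment this change removes.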


Also, :func:`read_html` previously ignored some ``<tr>`` elements when called
Member

See what @jreback thinks but I don't consider this second half necessary as it addresses a minor bug. Otherwise lgtm

Contributor

yeah, just listing it as a 1-liner in the io section is ok; no example is needed for this (the latter half)

with ``header=`` or ``skiprows=`` on some unusual HTML tables.
(:issue:`21641`)

Previous Behavior:

.. code-block:: ipython

In [1]: pd.read_html("""
<table>
<thead>
<tr>
<!-- empty header row, was ignored -->
<th></th><th></th><th></th>
</tr>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td><td>2</td><td>3</td>
</tr>
</tbody>
</table>
""", header=1)
Out[1]:
[Empty DataFrame
 Columns: [1, 2, 3]
 Index: []]

Current Behavior:

.. code-block:: ipython

In [1]: pd.read_html("""
<table>
<thead>
<tr>
<!-- empty header row, was ignored -->
<th></th><th></th><th></th>
</tr>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td><td>2</td><td>3</td>
</tr>
</tbody>
</table>
""", header=1)
Out[1]:
[   A  B  C
 0  1  2  3]

Previously, the workaround was to write ``header=0`` instead of ``header=1``
for this example table. Now, that workaround must be removed. This should not
affect many users, since most HTML tables do not have empty header rows.
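
The ``header=1`` case can be checked with a short script. This is a sketch, not the test added by the PR: it assumes a pandas version that includes the fix, an installed HTML parser (lxml, or bs4 with html5lib), and the same example table as in the entry, with the HTML comment left out.

```python
from io import StringIO

import pandas as pd

# The first header row is empty; the second holds the real column names.
html = """
<table>
  <thead>
    <tr><th></th><th></th><th></th></tr>
    <tr><th>A</th><th>B</th><th>C</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>2</td><td>3</td></tr>
  </tbody>
</table>
"""

# header is 0-indexed, so header=1 selects the second row.
# The empty first row is now counted rather than silently dropped.
df = pd.read_html(StringIO(html), header=1)[0]
print(df)
```

If the empty row were still ignored, ``header=1`` would land on the data row and produce an empty frame with columns ``[1, 2, 3]``, which is the previous behavior shown in the entry.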

- :class:`DatetimeIndex` now accepts :class:`Int64Index` arguments as epoch timestamps (:issue:`20997`)
-
-