read_html: Handle colspan and rowspan #21487
Changes from 11 commits
3e58794
f89b32a
34f87cb
582c86b
d2f0b83
74c2384
ad6e869
6fa0489
86b2dea
d4f4bb1
e296bd1
f0f91c3
95ce993
5fd863b
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -40,6 +40,7 @@ Other Enhancements | |
<https://pandas-gbq.readthedocs.io/en/latest/changelog.html#changelog-0-5-0>`__. | ||
(:issue:`21627`) | ||
- New method :meth:`HDFStore.walk` will recursively walk the group hierarchy of an HDF5 file (:issue:`10932`) | ||
- :func:`read_html` copies cell data across ``colspan`` and ``rowspan`` spans, and it treats all-``th`` table rows as headers if the ``header`` kwarg is not given and there is no ``thead`` (:issue:`17054`) | ||
- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`) | ||
- | ||
|
||
|
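As a rough illustration of the all-``th`` header rule described in the :func:`read_html` entry above, here is a minimal sketch. It is not pandas' implementation: `split_header` is a hypothetical helper, rows are modeled as lists of `(tag, text)` pairs, and it assumes that only leading all-``th`` rows are promoted to the header.

```python
def split_header(rows):
    """Split rows into (header_rows, body_rows).

    Hypothetical helper, not part of pandas. Each row is a list of
    (tag, text) pairs. Assumption: only the leading run of rows whose
    cells are all <th> is treated as the header.
    """
    n = 0
    while n < len(rows) and rows[n] and all(tag == "th" for tag, _ in rows[n]):
        n += 1
    return rows[:n], rows[n:]


# A table without <thead>: the first row is all <th>, the second is data.
table = [
    [("th", "A"), ("th", "B")],
    [("td", "1"), ("td", "2")],
]
header_rows, body_rows = split_header(table)
```

In this sketch a row that mixes ``th`` and ``td`` cells stays in the body, which matches the "all-``th``" wording of the release note.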
@@ -167,6 +168,120 @@ Current Behavior: | |
... | ||
OverflowError: Trying to coerce negative values to unsigned integers | ||
|
||
read_html Incompatibilities | ||
--------------------------- | ||
|
||
:func:`read_html` previously ignored ``colspan`` and ``rowspan`` attributes. | ||
Now it understands them, treating each spanned cell as a sequence of cells | ||
with the same value. | ||
|
||
Previous Behavior: | ||
|
||
.. code-block:: ipython | ||
|
||
In [1]: pd.read_html(""" | ||
Review comment: To avoid repeating the code, make an ipython block above where you define the HTML itself. Then the previous behavior uses a code-block to show the output, and the new behavior uses an ipython block.
Reply: Ah, I get it now. (I actually read the docs to preview what it looks like.) Thanks, looks like a super improvement. |
||
<table> | ||
<thead> | ||
<tr> | ||
<th>A</th><th>B</th><th>C</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td colspan="2">1</td><td>2</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
""") | ||
Out [1]: | ||
[ A B C | ||
0 1 2 NaN] | ||
|
||
Current Behavior: | ||
|
||
.. code-block:: ipython | ||
Review comment: You can make this an ipython block (e.g. actually run it for current). |
||
|
||
In [1]: pd.read_html(""" | ||
<table> | ||
<thead> | ||
<tr> | ||
<th>A</th><th>B</th><th>C</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td colspan="2">1</td><td>2</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
""") | ||
Out [1]: | ||
[ A B C | ||
0 1 1 2] | ||
|
||
Calls that relied on the previous behavior will need to be changed. | ||
Review comment: This sentence is not needed. |
||
|
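The span handling described in this section can be sketched in plain Python. This is only an illustration of the idea (repeat a cell's value across the columns and rows it spans), not the code from this PR: `expand_spans` is a hypothetical helper, cells are modeled as `(text, colspan, rowspan)` tuples, and overlapping spans in malformed HTML are not handled.

```python
def expand_spans(rows):
    """Expand colspan/rowspan into a rectangular grid of repeated values.

    Hypothetical helper, not part of pandas. Each row is a list of
    (text, colspan, rowspan) tuples.
    """
    grid = []
    pending = {}  # column index -> (text, rows still to fill) from a rowspan
    for row in rows:
        out = []
        cells = iter(row)
        col = 0
        while True:
            if col in pending:
                # A cell from an earlier row reaches down into this column.
                text, left = pending.pop(col)
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text, colspan, rowspan = cell
            for _ in range(colspan):
                # Repeat the value once per spanned column.
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
    return grid


# The example table from this section: "1" spans two columns.
grid = expand_spans([
    [("A", 1, 1), ("B", 1, 1), ("C", 1, 1)],
    [("1", 2, 1), ("2", 1, 1)],
])
```

Here the second row of `grid` holds the value "1" twice followed by "2", so every column has a value and nothing is left as ``NaN``.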
||
Also, :func:`read_html` previously ignored some ``<tr>`` elements when called | ||
Review comment: See what @jreback thinks, but I don't consider this second half necessary as it addresses a minor bug. Otherwise LGTM.
Reply: Yeah, you can just list it as a one-liner in the I/O section; no example is needed for this (the latter half). |
||
with ``header=`` or ``skiprows=`` on some unusual HTML tables. | ||
(:issue:`21641`) | ||
|
||
Previous Behavior: | ||
|
||
.. code-block:: ipython | ||
|
||
In [1]: pd.read_html(""" | ||
<table> | ||
<thead> | ||
<tr> | ||
<!-- empty header row, was ignored --> | ||
<th></th><th></th><th></th> | ||
</tr> | ||
<tr> | ||
<th>A</th><th>B</th><th>C</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td>1</td><td>2</td><td>3</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
""", header=1) | ||
Out [1]: | ||
[Empty DataFrame | ||
Columns: [1, 2, 3] | ||
Index: []] | ||
|
||
Current Behavior: | ||
|
||
.. code-block:: ipython | ||
|
||
In [1]: pd.read_html(""" | ||
<table> | ||
<thead> | ||
<tr> | ||
<!-- empty header row, was ignored --> | ||
<th></th><th></th><th></th> | ||
</tr> | ||
<tr> | ||
<th>A</th><th>B</th><th>C</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td>1</td><td>2</td><td>3</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
""", header=1) | ||
Out [1]: | ||
[ A B C | ||
0 1 2 3] | ||
|
||
Previously, the workaround was to write ``header=0`` instead of ``header=1`` | ||
for this example table. Now, that workaround must be removed. This should not | ||
affect many users, since most HTML tables do not have empty header rows. | ||
|
||
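To make the row-counting change concrete, the following sketch models a table as a list of rows of cell text and compares the two ways of counting. It is an assumption-laden illustration rather than pandas code: `pick_header` and its `skip_blank` flag are hypothetical names, with `skip_blank=True` standing in for the previous behavior of ignoring empty rows.

```python
def pick_header(rows, header, skip_blank):
    """Return (header_row, data_rows) for a 0-based header index.

    Hypothetical helper, not part of pandas. skip_blank=True mimics the
    previous behavior, where all-empty rows were dropped before counting.
    """
    if skip_blank:
        rows = [r for r in rows if any(cell.strip() for cell in r)]
    return rows[header], rows[header + 1:]


rows = [
    ["", "", ""],     # empty header row
    ["A", "B", "C"],
    ["1", "2", "3"],
]

old = pick_header(rows, header=1, skip_blank=True)         # empty row ignored
new = pick_header(rows, header=1, skip_blank=False)        # empty row counted
workaround = pick_header(rows, header=0, skip_blank=True)  # the old workaround
```

Under the old counting, index 1 lands on the data row and no data is left; under the new counting it lands on the ``A B C`` row, which is why the ``header=0`` workaround is no longer needed.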
- :class:`DatetimeIndex` now accepts :class:`Int64Index` arguments as epoch timestamps (:issue:`20997`) | ||
- | ||
- | ||
|
@@ -297,7 +412,7 @@ MultiIndex | |
I/O | ||
^^^ | ||
|
||
- | ||
- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`) | ||
Review comment: This is all OK, but would a subsection with a mini-example be instructive to the user about the revised functionality? Not saying you 100% need to, but a simple enough example (or even an image) would be helpful here.
Reply: I'm going to display my ignorance of the whatsnew format: ... where would this go?
Reply: Oh, you make a new subsection somewhere at the top, e.g. model it after this;
just add a new one below and you can make an extended example. |
||
- | ||
- | ||
|
||
|
Review comment: How about "read_html Enhancements"?