-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
read_html: Handle colspan and rowspan #21487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @adamhooper! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 05, 2018 at 17:45 Hours UTC |
Er, I mean, a rebased and squashed #17089. |
You're a hero, @adamhooper. |
pandas/io/html.py
Outdated
while body_rows and all(self._equals_tag(t, 'th') for t in | ||
self._parse_td(body_rows[0])): | ||
# this row should be a header row, move it from body to header | ||
header_rows.append(body_rows.pop(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not so sure about this. What happens if there's no table header but the user explicitly provides a header
argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great question, but I don't think this pull request affects that one way or another.
From my reading of _data_to_frame(**kwargs)
: _data_to_frame()
creates one big table, head + body + foot
. If the user does not supply a header
kwarg, then the function sets header = lrange(len(head))
That makes this more-reliable header inference backwards-incompatible: after this pull request, users who didn't supply header
on ugly HTML tables that had multi-row headers will get a MultiIndex instead of an Index. Those users will be negatively-affected, and they may want to add an explicit header
argument to mimic previous behavior.
Let's put that in perspective. Correct handling of rowspan/colspan is also backwards-incompatible. It could break existing code for a giant fraction of read_html()
users who have coded workarounds for headers pandas misplaced.
I suspect most tables in the wild with multi-row headers also use rowspan/colspan. (Otherwise, why would they have multiple rows?)
Header inference doesn't change pandas' behavior for either of its sample HTML files (spam.html and computer_sales_page.html); the rowspan/colspan fix makes pandas change behavior with both. That's a random sample, but it matches my suspicion: rowspan/colspan is going to change pandas' behavior to an enormous degree, eclipsing the change in header inference.
pandas/io/html.py
Outdated
raw_data.extend(self._parse_tr(table)) | ||
if not header_rows: | ||
# The table has no <thead>. Treat first all-<th> rows as headers. | ||
while body_rows and all(self._equals_tag(t, 'th') for t in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be overlooking something but why are you using while
here - are you not just testing the truthiness of body_rows
? If so, if
would be preferable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
body_rows
changes length each iteration. I just made the conditional shorter to make that clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK that makes more sense. Is that a safe operation though? Can't say I've been personally bitten by this but SO seems to advise against modifying the container that you loop over:
https://stackoverflow.com/questions/1637807/modifying-list-while-iterating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK that makes more sense. Is that a safe operation though? Can't say I've been personally bitten by this but SO seems to advise against modifying the container that you loop over:
https://stackoverflow.com/questions/1637807/modifying-list-while-iterating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue that this loop doesn't iterate over the list. It just iterates over a condition.
pandas/io/html.py
Outdated
while body_rows and all(self._equals_tag(t, 'th') for t in | ||
self._parse_td(body_rows[-1])): | ||
# this row should be a footer row, move it from body to footer | ||
footer_rows.insert(0, body_rows.pop()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistically would prefer append, since that's also what you do above with header_rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment on why not to do that
pandas/io/html.py
Outdated
for col in extracted_row] | ||
# expand cols using col_colspans | ||
# maybe this can be done with a list comprehension, dunno | ||
cols = list(zip( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how to read this on initial glance, though it seems like you are conscious of that from the comment. If you can't think of a better way to do via code any way you can speak to what you are trying to accomplish?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote this logic as procedural.
pandas/io/html.py
Outdated
|
||
def _parse_tr(self, table): | ||
return table.xpath('.//tr') | ||
# Look for direct descendents only: the "row" element here may be a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you looking for children or descendants? Conflict between comment and code here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the word I was looking for! d'oh, thanks :)
pandas/io/html.py
Outdated
def _parse_thead(self, table): | ||
return table.xpath('.//thead') | ||
for thead in table.xpath('.//thead'): | ||
rows.extend(thead.xpath('./tr')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work if the thead
has multiple rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. (Previous code returned a list of , treating it as a list of lists. New code effectively flattens the return value.)
pandas/tests/io/test_html.py
Outdated
@@ -359,7 +349,7 @@ def test_thousands_macau_stats(self): | |||
attrs={'class': 'style1'}) | |||
df = dfs[all_non_nan_table_index] | |||
|
|||
assert not any(s.isna().any() for _, s in df.iteritems()) | |||
assert not any(s.isnull().any() for _, s in df.iteritems()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was this changed for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad merge. Reverted.
pandas/io/html.py
Outdated
column_finder : callable | ||
A callable that takes a row node as input and returns a list of the | ||
column node in that row. This must be defined by subclasses. | ||
"""Parse and return all tables from the DOM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstrings should always start on the line after the triple quotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed throughout the file.
pandas/io/html.py
Outdated
|
||
Returns | ||
------- | ||
data : list of list of strings | ||
tables : list of parsed (header, body, footer) tuples from tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically for Returns unless you return more than one object you don't need to provide name(s), just types on the first line and a description on the subsequent line (also add period at end of description)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed throughout the file
pandas/io/html.py
Outdated
@@ -257,7 +237,7 @@ def _parse_td(self, obj): | |||
|
|||
Parameters | |||
---------- | |||
obj : node-like | |||
obj : an HTML row element |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be a type or a description? Not sure the exactly terminology but I feel like the type of node-like
was preferable with this as the description (on subsequent line)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
Codecov Report
@@ Coverage Diff @@
## master #21487 +/- ##
==========================================
+ Coverage 91.92% 91.93% +<.01%
==========================================
Files 158 158
Lines 49705 49716 +11
==========================================
+ Hits 45693 45704 +11
Misses 4012 4012
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. can you add a whatsnew note (0.24.0). if you can do it in 1 line great (e.g. what a user would care about the changes, what can it do now that it couldn't before). also a sub-section would be ok too.
pandas/io/html.py
Outdated
Header and body are lists-of-lists. Top level list is a list of | ||
rows. Each row is a list of str text. | ||
|
||
Logic: Use <thead>, <tbody>, <tfoot> elements to identify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this in Notes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -590,6 +697,12 @@ def _parse_tables(self, doc, match, kwargs): | |||
.format(patt=pattern)) | |||
return tables | |||
|
|||
def _equals_tag(self, obj, tag): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't these the same as in the super class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this too on initial glance but the attribute of obj
used to determine equality differs across implementations. An alternate approach could be to make that attribute a property that the subclasses need to implement, though not sure that is that much cleaner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My answer is either "no" or "I don't understand the question" :).
But I did notice and nix _contains_tag()
, which did not belong. So thanks for focusing here :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bizarre inheritance used in this module makes "node-like" values different across implementations. It isn't duck-typing: they're completely different beasts. That's what it was like before I started this pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok happy to take follow-ups to make this simpler / easier for implementation.
@jreback Thanks for the review! There was already an entry in whatsnew, but I rephrased it such that users will better be able to predict how the change will affect their data. Do you think that's enough, or should I add a subsection? |
pandas/io/html.py
Outdated
tfoot : node-like | ||
A <tfoot>...</tfoot> element. | ||
boolean | ||
Whether the object is equal to tag 'tag' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parameter references are typically done in backticks on docstrings, so you can change "tag 'tag'" to simply "`tag`"
pandas/io/html.py
Outdated
raw_data.extend(self._parse_tr(table)) | ||
def _expand_colspan_rowspan(self, rows): | ||
""" | ||
Given a list of <tr>s, return a list of text rows that copy cell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically first line should fit on one sentence
pandas/io/html.py
Outdated
|
||
Parameters | ||
---------- | ||
rows : list of <tr>s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just say list after hyphen on the first line to denote type. I suppose you can add a comment on the subsequent line for a description
for td in tds: | ||
# Append texts from previous rows with rowspan>1 that come | ||
# before this <td> | ||
while remainder and remainder[0][0] <= index: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment before about iteration. Can you help point me to where remainder
first gets values populated for it? I'm still somewhat concerned about the number of mutations done to that variable.
I have the feeling some of this can be better and more safely expressed using enumerate but direction on where to begin with this would be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remainder
is initialized on line 447 and flipped with next_remainder
on line 485.
It's valid to worry about an evil web page forcing lots of remainder
mutations ... to a point. If I were generating a website to foil Pandas, here's what I'd do:
<table>
<tr>
<td colspan="2147483647">This will crash your puny parser.</td>
</tr>
</table>
I don't expect this to spin forever because of mutations: I expect it to crash with out-of-memory as it grows an enormous list.
Real-world tables will be fine. Usually colspan/rowspan are only in header cells, and there are rarely many items here.
Big-O analysis: for a table with N
output cells (N = WxH), each .pop(0)
generates a cell. That means there are at most N
list-removals, meaning at most N
list-insertions. Each insertion is O(1)
but each removal is O(W)
. (According to https://wiki.python.org/moin/TimeComplexity, list pop is O(k).) That removal is the slowest operation, so running time is O(NW)
.
Shall I switch to a collection.deque
to make it O(N)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incidentally: I, too, started by writing enumerate()
. But that fell apart. The index of a cell in the output depends on two inputs: both td
's index in row
and the number of items in remainder
that pre-empt it.
An alternate strategy is be to make remainder
a list of mostly-None
, sometimes-tuple values and modify them in-place. That, too, would be O(N)
... but it wouldn't involve as many allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have to think through that in more detail. I'm less concerned about the choice of container; I just feel like there's a more readable and concise way to approach this that would be easier for future contributors / maintainers to use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the iterations - getting closer
pandas/io/html.py
Outdated
header_rows.append(body_rows.pop(0)) | ||
|
||
if not footer_rows: | ||
# The table has no <tfoot>. Treat last all-<th> rows as footers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm what is the point in doing this? In the previous implementation it would just return an empty list for the footer right? Did that cause some kind of issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! It's cruft -- just nixed it. _data_to_frame
simply does body += foot
anyway.
pandas/io/html.py
Outdated
|
||
if not footer_rows: | ||
# The table has no <tfoot>. Treat last all-<th> rows as footers. | ||
while body_rows and row_is_all_th(body_rows[-1]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this doing? I'm under the impression this would always be false on account of the row_is_all_th
condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Footer rows are often all-<th>
: in a typical table, there may be an all-<th>
"totals" row.
But this code cannot impact the user (because _data_to_frame
simply does body += foot
anyway), so I've nixed it.
for td in tds: | ||
# Append texts from previous rows with rowspan>1 that come | ||
# before this <td> | ||
while remainder and remainder[0][0] <= index: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have to think through that in more detail. I'm less concerned about the choice of container; I just feel like there's a more readable and concise way to approach this that would be easier for future contributors / maintainers to use
|
||
# Append the text from this <td>, colspan times | ||
text = _remove_whitespace(self._text_getter(td)) | ||
rowspan = int(self._attr_getter(td, 'rowspan') or 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified in one of two ways:
- Simply use
td.get('rowspan', 1)
if td allows for that in both implementations - Move the implementation of
_attr_getter
into the base class
For point 2 above it looks like the subclasses both have the same implementation so it would be easier to just move to the base class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay -- pulled up to the base class (though that seems strange to me)
pandas/tests/io/test_html.py
Outdated
</tr> | ||
</tbody> | ||
</table>''') | ||
data2 = StringIO('''<table> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally can add spacing between variables to make the tests more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in lots and lots of tests
pandas/tests/io/test_html.py
Outdated
</tbody> | ||
</table>''') | ||
res1 = self.read_html(data1) | ||
res2 = self.read_html(data2, header=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for header=0
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the comment right there -- it's a regression test for #5048
pandas/tests/io/test_html.py
Outdated
</table>''') | ||
res1 = self.read_html(data1) | ||
res2 = self.read_html(data2, header=0) | ||
assert_framelist_equal(res1, res2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit but there's a pseudo-standard to use result
and expected
when evaluating tests. Would be preferable to say result1
and result2
here (and in other tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've renamed variables (and fiddled with whitespace) in many, many tests.
pandas/tests/io/test_html.py
Outdated
</table>""" | ||
expected = self.read_html(expected)[0] | ||
res = self.read_html(out)[0] | ||
tm.assert_frame_equal(expected, res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically we use result
as the first argument to this function and expected
as the second. Would be nice to do that in these tests so failure output follows that pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've renamed variables (and fiddled with whitespace) in many, many tests.
@jreback, @WillAyd I've implemented all the changes you requested. Could you please take a look? @WillAyd I tested out an alternative algorithm, as you had concerns about the one in this pull request. The alternative is at adamhooper@67dea69. The diff is +53, -38. Maybe you find it more readable, though? |
pandas/tests/io/test_html.py
Outdated
""" | ||
Ensure parser adds <tr> within <thead> on malformed HTML. | ||
""" | ||
expected = self.read_html('''<table> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally the expected
value should be constructed directly and not via read_html - can you update any of the tests you've added to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Looks much nicer now.
@@ -765,11 +1054,11 @@ def test_keep_default_na(self): | |||
</table>""" | |||
|
|||
expected_df = DataFrame({'a': ['N/A', 'NA']}) | |||
html_df = read_html(html_data, keep_default_na=False)[0] | |||
html_df = self.read_html(html_data, keep_default_na=False)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the point of changing some of these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are all run twice: once with lxml
and once with bs4
. But read_html()
's logic is, "try lxml
, and if it fails use bs4
." So it overcomes the entire purpose of the unit tests -- if the lxml backend crashes while we're testing it, we won't know because read_html()
will fall back to bs4
.
From my reading of the code, self.read_html()
is always correct, and read_html()
is always a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks for the clarification. I suppose these are fine to leave as is though if there are more better to do in a separate change to keep that cleanup orthogonal to the original issue
pandas/tests/io/test_html.py
Outdated
self.read_html(data, header=[0, 1]) | ||
# This table is unique because it has no <thead>, and its <th>-only | ||
# rows are underneath an initial <td>-only row that has no content. | ||
# After #kipping the empty row, header=[0,1] picks the two <th>-only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor typo of skipping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/tests/io/test_html.py
Outdated
r"too many rows for this " | ||
r"multi_index of columns"): | ||
self.read_html(data, header=[0, 1]) | ||
# This table is unique because it has no <thead>, and its <th>-only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure about this. Certainly a crazy example but this changes the behavior of the arguments passed to header
now, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request doesn't change the behavior of the header
argument. It changes the behavior one gets when the header
argument is omitted. That change has no effect on this specific test.
When it comes to this specific unit test ("Test some old table in the wild"):
- Prior to this pull request, pandas erroneously crashed on this table. The unit test was bad -- it tested for a bug.
- This pull request includes a hack involving th-only rows. That hack would seem to infer the correct
header=[0,1]
for this specific table, but it doesn't because this specific table is weird
This unit test now tests that pandas works, instead of testing that it is broken. The comment explains why the header=[0,1]
is needed -- since the HTML makes it look like it shouldn't be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it broke previously for header=[0, 1]
but worked for header=[1, 2]
. Now it is working for header=[0,1]
and that returns the same value that previously was returned with header=[1,2]
, right?
Not saying that is necessarily a bad thing but looking for confirmation as in the off-chance something like this does occur in the wild that is a pretty subtle change for users that may have been expecting the former behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's not what changed. It broke before because it didn't support rowspan/colspan. The deletion of empty rows is existing behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
urgh, looking at the diff with master instead of my own. I see that header=[1,2]
used to not crash. I'll investigate more to figure out exactly why that worked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, you're absolutely right -- the behavior changed.
Previously, pandas removed all-empty rows from the <thead>
but not from the <tbody>
.
Now, pandas removes all-empty rows from anywhere in the table.
So header
changed on this table because its header rows are inside the <tbody>
.
Should I revert that change? It was unintentional. But the original behavior appears to be unintentional to me, too -- why would <thead>
and <tbody>
behave differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
References:
- read_html() doesn't handle tables with multiple header rows #13434 suggests read_html() should infer multiple table rows
- ENH:read_html() handles tables with multiple header rows #13434 #15242 implements by deleting all-empty rows from
<thead>
From these discussions, I gather that pandas shouldn't auto-delete data, but auto-deleting in the <thead>
is kinda okay because otherwise we'd get a ParserError by default.
I'll auto-delete only empty header rows. It seems to me _data_to_frame()
is the correct place to do that. @WillAyd does that sound appropriate? Then I can revert the change to this unit test -- I was misunderstanding the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in commit ed120d8. I added new tests for behavior implied in #15242.
Unfortunately, test_spam_header
tested for a bug -- it specified header=1
instead of the intent, header=2
. The test passed because of a bug in the previous implementation: indexing happened after deleting empty header rows.
Let's track the bugs through history to see why header=2
is correct:
- Original implementation, commit b31c033 (2013) -- uses
header=0
to select<td>
elements within<thead>
that are not part of a<tr>
. The bug that made this work: pandas erroneously treated<thead>
as equivalent to<tr>
, merging all<th>
and<td>
within the<thead>
into one big row. - commit a8723a4 (2013): changed the test to
header=1
and changed the value it's looking for to be the value from the first<tr>
in the<tbody>
. Seems like a silly change. The author probably tested a body cell because lxml and bs4 began treating the (malformed) header differently as a result of some completely-unrelated change. - commit 0ab0813 (ENH:read_html() handles tables with multiple header rows #13434 #15242, 2017): kept
header=1
unchanged, because two new behaviors cancelled out. First,<thead>
handling was corrected to look for<tr>
s inside the<thead>
(meaning one would need to setheader=2
). Second,_parse_raw_thead
was incorrectly changed to delete empty rows (deleting one row from the header brings us back toheader=1
). That change made the unit test pass, but the unit test was testing the wrong behavior all along. - Today, I fixed the bug introduced in 2017, meaning
header=2
finds the cell that's been referenced since a8723a4 in 2013.
@WillAyd what do you think of this change? I think anybody would agree the previous behavior was a bug. And it seems silly to re-implement the bug in this pull request in the name of avoiding feature creep.
I imagine this applies to only a tiny number of tables in the wild.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. I do agree that header=2
appears to be more correct. Do you think there's an issue with other index-based keywords (like skiprows
) in the current implementation?
OK to address in a separate issue but it would be nice to have mention of this in a whatsnew in some fashion (again separate issue is fine)
pandas/io/html.py
Outdated
if any(col != '' for col in cols): | ||
res.append(cols) | ||
return res | ||
# HACK: lxml does not clean up the clearly-erroneous |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment here mentions a situation we are missing tr
elements but test_header_inferred_from_th_elements
is constructed without a thead
element instead - is that by design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That test is unrelated to this hack. This hack is tested in test_thead_without_tr
.
|
||
def _parse_tr(self, table): | ||
return table.xpath('.//tr') | ||
# Look for direct children only: the "row" element here may be a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually makes sense but is this comment necessary? We aren't enforcing that row be of a certain tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is nonsensical that _parse_td()
be called with a thead
element, because it is nonsensical that a thead
element contain td
or th
children. The _parse_td
comment says parameter is a DOM <tr>
node.
It's an ugly hack. I think that nonsense deserves a comment.
Need to clean up merge conflicts and maybe add whatsnew reference for proper indexing of tables with NA data. Otherwise lgtm |
ed120d8
to
1773983
Compare
Urgh. I did a rebase on upstream/master and pushed with |
I don't think the AppVeyor issue is a result of your rebase. That said, you'd be better off fetching master and merging it into your branch rather than rebase (maintains better comment history). Looks like you have some file conflicts still anyway, so might was well resolve them locally and try merging from master into your branch this time before next push |
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks: * test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected. * I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..." * I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass. * That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
Mostly involved reformatting test_html.py
... but _ignore_ empty rows when inferring columns. This changes the behavior of test_spam_header, which previously ignored an empty row when the user explicitly stated the row number to use as header.
1773983
to
6fa0489
Compare
Indeed -- I rebased over adamhooper/master instead of pandas/master. I've learned my lesson: merge, don't rebase. Fixed by ... er ... rebasing. Correctly, this time. |
^^ I've noticed a few other pull requests cause ResourceWarning failures on Travis. Does that error relate to this pull request? |
Unrelated - that intermittently seems to pop up on Travis |
@adamhooper can you fix merge conflict? @jreback thoughts on your end? |
Merged :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some doc comments, otherwise lgtm.
@@ -297,7 +298,7 @@ MultiIndex | |||
I/O | |||
^^^ | |||
|
|||
- | |||
- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is all ok, but would a subsection with a mini-example be instructive to the user about the revised functinaility? not saying 100% need to, but if a simple enough example (or even an image) would be helpful here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to display my ignorance of the whatsnew format: ... where would this go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, you make a new subsection in the top somewhere, e.g. model after this
Series and Index Data-Dtype Incompatibilities
just add a new one below and you can make an extended example
pandas/io/html.py
Outdated
""" | ||
Given a table, return parsed header, body, and foot. | ||
|
||
Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put Notes at the end (after Parameters / Returns)
pandas/io/html.py
Outdated
|
||
return self._parse_raw_data(raw_data) | ||
Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same generally
@@ -590,6 +697,12 @@ def _parse_tables(self, doc, match, kwargs): | |||
.format(patt=pattern)) | |||
return tables | |||
|
|||
def _equals_tag(self, obj, tag): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok happy to take follow-ups to make this simpler / easier for implementation.
@jreback Thanks -- added whatsnew entry. Excited to see: is this last commit the final one? |
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
Calls that relied on the previous behavior will need to be changed. | ||
|
||
Also, :func:`read_html` previously ignored some ``<tr>`` elements when called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See what @jreback thinks but I don't consider this second half necessary as it addresses a minor bug. Otherwise lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah you can just list as a 1-liner in the io section is ok, no example for this needed (the latter half)
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -168,6 +168,120 @@ Current Behavior: | |||
... | |||
OverflowError: Trying to coerce negative values to unsigned integers | |||
|
|||
read_html Incompatibilities |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about read_html enhancements!
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
Current Behavior: | ||
|
||
.. code-block:: ipython |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can make this an ipython block (e.g. actually run it for current)
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [1]: pd.read_html(""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to avoid repeating the code, make an ipython block above where you define the html itself). then the preivious uses a code-block to show the output, and the new uses an ipython block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I get it now. (I actually read the docs to preview what it looks like.) Thanks -- looks like a super improvement.
doc/source/whatsnew/v0.24.0.txt
Outdated
[ A B C | ||
0 1 2 2] | ||
|
||
Calls that relied on the previous behavior will need to be changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sentence is not needed
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
Calls that relied on the previous behavior will need to be changed. | ||
|
||
Also, :func:`read_html` previously ignored some ``<tr>`` elements when called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah you can just list as a 1-liner in the io section is ok, no example for this needed (the latter half)
thanks @adamhooper very nice! |
Woo-hoo! Thank you all for your guidance. What a thrill :). |
Hello, when will this be included into a release? I see it's being included in v0.24, but when will that happen? I'm having this exact problem and would love to benefit from this fix. Cheers! |
This is essentially a rebased and squashed #17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:
because the ParserError was a bug caused by missing colspan support.
Now, test that MultiIndex works as expected.
the virtual cells created by rowspan/colspan are always copies of the
real cells' text. This prevents _infer_columns() from naming virtual
cells as "Unnamed: ..."
tbody support), which was implemented after @jowens' pull request.
Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
_parse_tfoot_trs, each returning a list of trs. That let me remove
_parse_tr, Making All The Tests Pass.
tested by spam.html. The previous hacky workaround was in
_parse_raw_thead, but the new _parse_thead_trs signature returns nodes
instead of text. The new hacky solution: return the thead itself,
pretending it's a tr. This works in all the tests. A better solution
is to use html5lib with lxml; but that might belong in a separate pull
request.
header
argument inread_html()
ignores empty trs, only within thead #21641git diff upstream/master -u -- "*.py" | flake8 --diff