read_html: Handle colspan and rowspan #21487

Merged: 14 commits, Jul 5, 2018

Conversation

@adamhooper (Contributor) commented Jun 14, 2018

This is essentially a rebased and squashed #17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

  • test_computer_sales_page (see TestReadHtml.test_computer_sales_page - what's it doing? #17074) no longer tests for ParserError,
    because the ParserError was a bug caused by missing colspan support.
    Now, test that MultiIndex works as expected.
  • I respectfully removed the fill_rowspan argument from Q: correct behavior for read_html with rowspan/colspan for DataFrames? #17073. Instead,
    the virtual cells created by rowspan/colspan are always copies of the
    real cells' text. This prevents _infer_columns() from naming virtual
    cells as "Unnamed: ..."
  • I removed a small layer of abstraction to respect Read from multiple <tbody> within a <table> #20891 (multiple
    tbody support), which was implemented after @jowens' pull request.
    Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
    _parse_tfoot_trs, each returning a list of trs. That let me remove
    _parse_tr, Making All The Tests Pass.
  • That caused a snowball effect. lxml does not fix malformed thead, as
    tested by spam.html. The previous hacky workaround was in
    _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
    instead of text. The new hacky solution: return the thead itself,
    pretending it's a tr. This works in all the tests. A better solution
    is to use html5lib with lxml; but that might belong in a separate pull
    request.
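The virtual-cell behavior described in the second bullet can be sketched in isolation. This is a simplified stand-alone illustration, not the pandas implementation; `expand_colspans` is a hypothetical helper that models one `<tr>` as (text, colspan) pairs:

```python
def expand_colspans(cells):
    """Expand (text, colspan) pairs for one <tr> into a flat list of
    texts, duplicating each cell's text colspan times so downstream
    code sees one entry per rendered column."""
    out = []
    for text, colspan in cells:
        out.extend([text] * colspan)
    return out


# A header row like <th>Name</th><th colspan="2">Score</th>:
print(expand_colspans([("Name", 1), ("Score", 2)]))
# ['Name', 'Score', 'Score']
```

Repeating the real cell's text (rather than inserting blank placeholders) is what keeps _infer_columns() from labeling the spanned positions "Unnamed: ...".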

@pep8speaks commented Jun 14, 2018

Hello @adamhooper! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 05, 2018 at 17:45 Hours UTC

@adamhooper (Contributor Author)

Er, I mean, a rebased and squashed #17089.

@jowens commented Jun 14, 2018

You're a hero, @adamhooper.

while body_rows and all(self._equals_tag(t, 'th') for t in
self._parse_td(body_rows[0])):
# this row should be a header row, move it from body to header
header_rows.append(body_rows.pop(0))
Member

Hmm not so sure about this. What happens if there's no table header but the user explicitly provides a header argument?

Contributor Author

That's a great question, but I don't think this pull request affects that one way or another.

From my reading of _data_to_frame(**kwargs): _data_to_frame() creates one big table, head + body + foot. If the user does not supply a header kwarg, then the function sets header = lrange(len(head))

That makes this more-reliable header inference backwards-incompatible: after this pull request, users who didn't supply header on ugly HTML tables that had multi-row headers will get a MultiIndex instead of an Index. Those users will be negatively affected, and they may want to add an explicit header argument to mimic the previous behavior.

Let's put that in perspective. Correct handling of rowspan/colspan is also backwards-incompatible. It could break existing code for a giant fraction of read_html() users who have coded workarounds for headers pandas misplaced.

I suspect most tables in the wild with multi-row headers also use rowspan/colspan. (Otherwise, why would they have multiple rows?)

Header inference doesn't change pandas' behavior for either of its sample HTML files (spam.html and computer_sales_page.html); the rowspan/colspan fix makes pandas change behavior with both. That's a random sample, but it matches my suspicion: rowspan/colspan is going to change pandas' behavior to an enormous degree, eclipsing the change in header inference.
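To illustrate why multi-row headers naturally imply a MultiIndex: zipping expanded header rows column-wise yields one tuple per column, which is the shape a MultiIndex is built from. A pure-Python sketch with hypothetical data, not taken from the pandas test suite:

```python
# Two header rows after colspan expansion; zipping them column-wise
# gives one tuple per output column.
header_rows = [
    ["Group", "Metrics", "Metrics"],   # first <tr> of the header
    ["Name",  "Min",     "Max"],       # second <tr> of the header
]
columns = list(zip(*header_rows))
print(columns)
# [('Group', 'Name'), ('Metrics', 'Min'), ('Metrics', 'Max')]
```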

raw_data.extend(self._parse_tr(table))
if not header_rows:
# The table has no <thead>. Treat first all-<th> rows as headers.
while body_rows and all(self._equals_tag(t, 'th') for t in
Member

Could be overlooking something but why are you using while here - are you not just testing the truthiness of body_rows? If so, if would be preferable

Contributor Author

body_rows changes length each iteration. I just made the conditional shorter to make that clearer.

Member

Ah OK that makes more sense. Is that a safe operation though? Can't say I've been personally bitten by this but SO seems to advise against modifying the container that you loop over:

https://stackoverflow.com/questions/1637807/modifying-list-while-iterating


Contributor Author

I'd argue that this loop doesn't iterate over the list. It just iterates over a condition.
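The pattern under discussion, as a minimal stand-alone sketch (with `row_is_all_th` stubbed out on plain lists; the real check inspects DOM nodes):

```python
def row_is_all_th(row):
    # stand-in for the real check that every cell in the row is a <th>
    return all(tag == "th" for tag in row)

body_rows = [["th", "th"], ["td", "td"], ["td", "td"]]
header_rows = []

# body_rows shrinks on every iteration, so the loop terminates;
# each pass re-tests the condition against the new first row.
while body_rows and row_is_all_th(body_rows[0]):
    header_rows.append(body_rows.pop(0))

print(header_rows)  # [['th', 'th']]
print(body_rows)    # [['td', 'td'], ['td', 'td']]
```

The loop never holds an iterator over `body_rows`, which is why the usual modify-while-iterating hazard does not apply.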

while body_rows and all(self._equals_tag(t, 'th') for t in
self._parse_td(body_rows[-1])):
# this row should be a footer row, move it from body to footer
footer_rows.insert(0, body_rows.pop())
Member

Stylistically would prefer append, since that's also what you do above with header_rows

Contributor Author

Added a comment on why not to do that

for col in extracted_row]
# expand cols using col_colspans
# maybe this can be done with a list comprehension, dunno
cols = list(zip(
Member

Not sure how to read this on initial glance, though it seems like you are conscious of that from the comment. If you can't think of a better way to do via code any way you can speak to what you are trying to accomplish?

Contributor Author

I rewrote this logic as procedural.


def _parse_tr(self, table):
return table.xpath('.//tr')
# Look for direct descendents only: the "row" element here may be a
Member

Are you looking for children or descendants? Conflict between comment and code here?

Contributor Author

That's the word I was looking for! d'oh, thanks :)

def _parse_thead(self, table):
return table.xpath('.//thead')
for thead in table.xpath('.//thead'):
rows.extend(thead.xpath('./tr'))
Member

Does this work if the thead has multiple rows?

Contributor Author

yes. (Previous code returned a list of <thead> elements, treating it as a list of lists. New code effectively flattens the return value.)

@@ -359,7 +349,7 @@ def test_thousands_macau_stats(self):
attrs={'class': 'style1'})
df = dfs[all_non_nan_table_index]

assert not any(s.isna().any() for _, s in df.iteritems())
assert not any(s.isnull().any() for _, s in df.iteritems())
Member

What was this changed for?

Contributor Author

Bad merge. Reverted.

column_finder : callable
A callable that takes a row node as input and returns a list of the
column node in that row. This must be defined by subclasses.
"""Parse and return all tables from the DOM.
Member

Docstrings should always start on the line after the triple quotes

Contributor Author

Changed throughout the file.


Returns
-------
data : list of list of strings
tables : list of parsed (header, body, footer) tuples from tables
Member

Typically for Returns, unless you return more than one object, you don't need to provide name(s) -- just types on the first line and a description on the subsequent line (also add a period at the end of the description)

Contributor Author

Changed throughout the file

@@ -257,7 +237,7 @@ def _parse_td(self, obj):

Parameters
----------
obj : node-like
obj : an HTML row element
Member

Is this supposed to be a type or a description? Not sure of the exact terminology but I feel like the type of node-like was preferable with this as the description (on a subsequent line)

Contributor Author

changed

@gfyoung gfyoung added Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jun 15, 2018
@codecov bot commented Jun 15, 2018

Codecov Report

Merging #21487 into master will increase coverage by <.01%.
The diff coverage is 97.89%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21487      +/-   ##
==========================================
+ Coverage   91.92%   91.93%   +<.01%     
==========================================
  Files         158      158              
  Lines       49705    49716      +11     
==========================================
+ Hits        45693    45704      +11     
  Misses       4012     4012
Flag Coverage Δ
#multiple 90.3% <97.89%> (ø) ⬆️
#single 41.97% <18.94%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/io/html.py 89.17% <97.89%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 62a0ebc...5fd863b. Read the comment docs.

@jreback (Contributor) left a comment

looks pretty good. can you add a whatsnew note (0.24.0). if you can do it in 1 line great (e.g. what a user would care about the changes, what can it do now that it couldn't before). also a sub-section would be ok too.

Header and body are lists-of-lists. Top level list is a list of
rows. Each row is a list of str text.

Logic: Use <thead>, <tbody>, <tfoot> elements to identify
Contributor

can you put this in Notes

Contributor Author

done

@@ -590,6 +697,12 @@ def _parse_tables(self, doc, match, kwargs):
.format(patt=pattern))
return tables

def _equals_tag(self, obj, tag):
Contributor

aren't these the same as in the super class?

Member

I thought this too on initial glance but the attribute of obj used to determine equality differs across implementations. An alternate approach could be to make that attribute a property that the subclasses need to implement, though not sure that is that much cleaner

Contributor Author

My answer is either "no" or "I don't understand the question" :).

But I did notice and nix _contains_tag(), which did not belong. So thanks for focusing here :).

Contributor Author

The bizarre inheritance used in this module makes "node-like" values different across implementations. It isn't duck-typing: they're completely different beasts. That's what it was like before I started this pull request.

Contributor

ok happy to take follow-ups to make this simpler / easier for implementation.

@adamhooper (Contributor Author)

@jreback Thanks for the review! There was already an entry in whatsnew, but I rephrased it such that users will better be able to predict how the change will affect their data. Do you think that's enough, or should I add a subsection?

tfoot : node-like
A <tfoot>...</tfoot> element.
boolean
Whether the object is equal to tag 'tag'
Member

Parameter references are typically done in backticks on docstrings, so you can change "tag 'tag'" to simply "`tag`"

raw_data.extend(self._parse_tr(table))
def _expand_colspan_rowspan(self, rows):
"""
Given a list of <tr>s, return a list of text rows that copy cell
Member

Technically the first line should fit on one line


Parameters
----------
rows : list of <tr>s
Member

Just say list after hyphen on the first line to denote type. I suppose you can add a comment on the subsequent line for a description

for td in tds:
# Append texts from previous rows with rowspan>1 that come
# before this <td>
while remainder and remainder[0][0] <= index:
Member

Thanks for the comment before about iteration. Can you help point me to where remainder first gets values populated for it? I'm still somewhat concerned about the number of mutations done to that variable.

I have the feeling some of this can be better and more safely expressed using enumerate, but direction on where to begin with this would be helpful

Contributor Author

remainder is initialized on line 447 and flipped with next_remainder on line 485.

It's valid to worry about an evil web page forcing lots of remainder mutations ... to a point. If I were generating a website to foil Pandas, here's what I'd do:

<table>
  <tr>
    <td colspan="2147483647">This will crash your puny parser.</td>
  </tr>
</table>

I don't expect this to spin forever because of mutations: I expect it to crash with out-of-memory as it grows an enormous list.

Real-world tables will be fine. Usually colspan/rowspan are only in header cells, and there are rarely many items here.

Big-O analysis: for a table with N output cells (N = WxH), each .pop(0) generates a cell. That means there are at most N list-removals, meaning at most N list-insertions. Each insertion is O(1) but each removal is O(W). (According to https://wiki.python.org/moin/TimeComplexity, list pop is O(k).) That removal is the slowest operation, so running time is O(NW).

Shall I switch to a collections.deque to make it O(N)?
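For reference, the remainder-based expansion being discussed can be modeled as a stand-alone function. This is a simplified sketch, not the pandas implementation: cells are plain (text, colspan, rowspan) tuples instead of DOM nodes, and `expand_colspan_rowspan` is a hypothetical name:

```python
def expand_colspan_rowspan(rows):
    """Expand a grid of (text, colspan, rowspan) cells into a grid of
    plain texts, copying each cell's text into every spanned position."""
    all_texts = []
    remainder = []  # (column_index, text, rows_left) carried down by rowspan
    for row in rows:
        texts = []
        next_remainder = []
        index = 0
        for text, colspan, rowspan in row:
            # Emit cells carried down from earlier rows that sit
            # at or before the current column.
            while remainder and remainder[0][0] <= index:
                prev_i, prev_text, prev_left = remainder.pop(0)
                texts.append(prev_text)
                if prev_left > 1:
                    next_remainder.append((prev_i, prev_text, prev_left - 1))
                index += 1
            # Emit this cell's text, colspan times.
            for _ in range(colspan):
                texts.append(text)
                if rowspan > 1:
                    next_remainder.append((index, text, rowspan - 1))
                index += 1
        # Emit any carried cells that sit past the row's last <td>.
        for prev_i, prev_text, prev_left in remainder:
            texts.append(prev_text)
            if prev_left > 1:
                next_remainder.append((prev_i, prev_text, prev_left - 1))
        all_texts.append(texts)
        remainder = next_remainder
    return all_texts

# A cell spanning two rows is copied into the row below it:
print(expand_colspan_rowspan([[("A", 1, 2), ("B", 1, 1)],
                              [("C", 1, 1)]]))
# [['A', 'B'], ['A', 'C']]
```

The `remainder.pop(0)` on a plain list is the O(W) removal mentioned in the analysis above; a collections.deque would make each removal O(1).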

Contributor Author

Incidentally: I, too, started by writing enumerate(). But that fell apart. The index of a cell in the output depends on two inputs: both td's index in row and the number of items in remainder that pre-empt it.

An alternate strategy is to make remainder a list of mostly-None, sometimes-tuple values and modify them in place. That, too, would be O(N) ... but it wouldn't involve as many allocations.

Member

Have to think through that in more detail. I'm less concerned about the choice of container; I just feel like there's a more readable and concise way to approach this that would be easier for future contributors / maintainers to use

@WillAyd (Member) left a comment

Thanks for the iterations - getting closer

header_rows.append(body_rows.pop(0))

if not footer_rows:
# The table has no <tfoot>. Treat last all-<th> rows as footers.
Member

Hmm what is the point in doing this? In the previous implementation it would just return an empty list for the footer right? Did that cause some kind of issue?

Contributor Author

Great catch! It's cruft -- just nixed it. _data_to_frame simply does body += foot anyway.


if not footer_rows:
# The table has no <tfoot>. Treat last all-<th> rows as footers.
while body_rows and row_is_all_th(body_rows[-1]):
Member

What is this doing? I'm under the impression this would always be false on account of the row_is_all_th condition?

Contributor Author

Footer rows are often all-<th>: in a typical table, there may be an all-<th> "totals" row.

But this code cannot impact the user (because _data_to_frame simply does body += foot anyway), so I've nixed it.


# Append the text from this <td>, colspan times
text = _remove_whitespace(self._text_getter(td))
rowspan = int(self._attr_getter(td, 'rowspan') or 1)
Member

I think this can be simplified in one of two ways:

  1. Simply use td.get('rowspan', 1) if td allows for that in both implementations
  2. Move the implementation of _attr_getter into the base class

For point 2 above it looks like the subclasses both have the same implementation so it would be easier to just move to the base class

Contributor Author

Okay -- pulled up to the base class (though that seems strange to me)
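The hoisting works because both backends' node objects happen to expose a dict-style .get(). A rough sketch (hypothetical simplified class; the real _HtmlFrameParser in pandas/io/html.py carries much more state):

```python
class _BaseParserSketch:
    # Hypothetical simplified base class, not the real _HtmlFrameParser.
    def _attr_getter(self, obj, attr):
        # Both lxml elements and BeautifulSoup tags support .get(),
        # returning None when the attribute is absent.
        return obj.get(attr)


# Duck-typed stand-in for an lxml element / bs4 tag in this sketch.
class FakeTd(dict):
    pass


td = FakeTd(rowspan="3")
parser = _BaseParserSketch()
# The `or 1` also covers rowspan="" in malformed HTML.
rowspan = int(parser._attr_getter(td, "rowspan") or 1)
print(rowspan)  # 3
```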

</tr>
</tbody>
</table>''')
data2 = StringIO('''<table>
Member

Generally can add spacing between variables to make the tests more readable

Contributor Author

done in lots and lots of tests

</tbody>
</table>''')
res1 = self.read_html(data1)
res2 = self.read_html(data2, header=0)
Member

What's the reason for header=0 here?

Contributor Author

Moved the comment right there -- it's a regression test for #5048

</table>''')
res1 = self.read_html(data1)
res2 = self.read_html(data2, header=0)
assert_framelist_equal(res1, res2)
Member

Minor nit but there's a pseudo-standard to use result and expected when evaluating tests. Would be preferable to say result1 and result2 here (and in other tests)

Contributor Author

Okay, I've renamed variables (and fiddled with whitespace) in many, many tests.

</table>"""
expected = self.read_html(expected)[0]
res = self.read_html(out)[0]
tm.assert_frame_equal(expected, res)
Member

Typically we use result as the first argument to this function and expected as the second. Would be nice to do that in these tests so failure output follows that pattern

Contributor Author

Okay, I've renamed variables (and fiddled with whitespace) in many, many tests.

@adamhooper (Contributor Author)

@jreback, @WillAyd I've implemented all the changes you requested. Could you please take a look?

@WillAyd I tested out an alternative algorithm, as you had concerns about the one in this pull request. The alternative is at adamhooper@67dea69. The diff is +53, -38. Maybe you find it more readable, though?

"""
Ensure parser adds <tr> within <thead> on malformed HTML.
"""
expected = self.read_html('''<table>
Member

Generally the expected value should be constructed directly and not via read_html - can you update any of the tests you've added to do that?

Contributor Author

Done! Looks much nicer now.

@@ -765,11 +1054,11 @@ def test_keep_default_na(self):
</table>"""

expected_df = DataFrame({'a': ['N/A', 'NA']})
html_df = read_html(html_data, keep_default_na=False)[0]
html_df = self.read_html(html_data, keep_default_na=False)[0]
Member

What was the point of changing some of these?

Contributor Author

These tests are all run twice: once with lxml and once with bs4. But read_html()'s logic is, "try lxml, and if it fails use bs4." That defeats the entire purpose of the unit tests -- if the lxml backend crashes while we're testing it, we won't know, because read_html() will fall back to bs4.

From my reading of the code, self.read_html() is always correct, and read_html() is always a bug.

Member

OK thanks for the clarification. I suppose these are fine to leave as is, though if there are more, it's better to do them in a separate change to keep that cleanup orthogonal to the original issue

self.read_html(data, header=[0, 1])
# This table is unique because it has no <thead>, and its <th>-only
# rows are underneath an initial <td>-only row that has no content.
# After #kipping the empty row, header=[0,1] picks the two <th>-only
Member

minor typo of skipping

Contributor Author

fixed

r"too many rows for this "
r"multi_index of columns"):
self.read_html(data, header=[0, 1])
# This table is unique because it has no <thead>, and its <th>-only
Member

Hmm not sure about this. Certainly a crazy example but this changes the behavior of the arguments passed to header now, no?

Contributor Author

This pull request doesn't change the behavior of the header argument. It changes the behavior one gets when the header argument is omitted. That change has no effect on this specific test.

When it comes to this specific unit test ("Test some old table in the wild"):

  • Prior to this pull request, pandas erroneously crashed on this table. The unit test was bad -- it tested for a bug.
  • This pull request includes a hack involving th-only rows. That hack would seem to infer the correct header=[0,1] for this specific table, but it doesn't because this specific table is weird

This unit test now tests that pandas works, instead of testing that it is broken. The comment explains why the header=[0,1] is needed -- since the HTML makes it look like it shouldn't be.

Member

It looks like it broke previously for header=[0, 1] but worked for header=[1, 2]. Now it is working for header=[0,1] and that returns the same value that previously was returned with header=[1,2], right?

Not saying that is necessarily a bad thing but looking for confirmation as in the off-chance something like this does occur in the wild that is a pretty subtle change for users that may have been expecting the former behavior

Contributor Author

No, that's not what changed. It broke before because it didn't support rowspan/colspan. The deletion of empty rows is existing behavior.

Contributor Author

urgh, looking at the diff with master instead of my own. I see that header=[1,2] used to not crash. I'll investigate more to figure out exactly why that worked.

Contributor Author

Okay, you're absolutely right -- the behavior changed.

Previously, pandas removed all-empty rows from the <thead> but not from the <tbody>.

Now, pandas removes all-empty rows from anywhere in the table.

So header changed on this table because its header rows are inside the <tbody>.

Should I revert that change? It was unintentional. But the original behavior appears to be unintentional to me, too -- why would <thead> and <tbody> behave differently?
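For concreteness, the empty-row filtering being debated boils down to where a filter of this shape runs (a stand-alone sketch; `drop_empty_rows` is a hypothetical name):

```python
def drop_empty_rows(rows):
    """Keep only rows that have at least one non-empty cell."""
    return [row for row in rows if any(cell != '' for cell in row)]


# An all-empty first row is dropped; rows with any content survive.
table = [['', ''], ['Name', 'Score'], ['alice', '1']]
print(drop_empty_rows(table))
# [['Name', 'Score'], ['alice', '1']]
```

Running it over the whole table versus only over the header rows is exactly the behavioral difference under discussion.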

@adamhooper (Contributor Author) commented Jun 21, 2018

References:

From these discussions, I gather that pandas shouldn't auto-delete data, but auto-deleting in the <thead> is kinda okay because otherwise we'd get a ParserError by default.

I'll auto-delete only empty header rows. It seems to me _data_to_frame() is the correct place to do that. @WillAyd does that sound appropriate? Then I can revert the change to this unit test -- I was misunderstanding the error.

Contributor Author

Fixed in commit ed120d8. I added new tests for behavior implied in #15242.

Unfortunately, test_spam_header tested for a bug -- it specified header=1 instead of the intent, header=2. The test passed because of a bug in the previous implementation: indexing happened after deleting empty header rows.

Let's track the bugs through history to see why header=2 is correct:

  1. Original implementation, commit b31c033 (2013) -- uses header=0 to select <td> elements within <thead> that are not part of a <tr>. The bug that made this work: pandas erroneously treated <thead> as equivalent to <tr>, merging all <th> and <td> within the <thead> into one big row.
  2. commit a8723a4 (2013): changed the test to header=1 and changed the value it's looking for to be the value from the first <tr> in the <tbody>. Seems like a silly change. The author probably tested a body cell because lxml and bs4 began treating the (malformed) header differently as a result of some completely-unrelated change.
  3. commit 0ab0813 (ENH:read_html() handles tables with multiple header rows #13434 #15242, 2017): kept header=1 unchanged, because two new behaviors cancelled out. First, <thead> handling was corrected to look for <tr>s inside the <thead> (meaning one would need to set header=2). Second, _parse_raw_thead was incorrectly changed to delete empty rows (deleting one row from the header brings us back to header=1). That change made the unit test pass, but the unit test was testing the wrong behavior all along.
  4. Today, I fixed the bug introduced in 2017, meaning header=2 finds the cell that's been referenced since a8723a4 in 2013.

@WillAyd what do you think of this change? I think anybody would agree the previous behavior was a bug. And it seems silly to re-implement the bug in this pull request in the name of avoiding feature creep.

I imagine this applies to only a tiny number of tables in the wild.
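The off-by-one described above can be sketched in a few lines (hypothetical row data, not the real spam.html fixture):

```python
# A toy illustration of the header-index fix. The <thead> here contains
# three <tr>s, one of which is all-empty.
thead_rows = [
    ["Name", "Value"],  # row 0
    ["", ""],           # row 1: an all-whitespace <tr>
    ["Foo", "Bar"],     # row 2
]

def pick_header_old(rows, header):
    # Buggy 2017 behavior: empty rows were deleted *before* indexing,
    # so header=1 silently pointed at what is really row 2.
    non_empty = [r for r in rows if any(cell != "" for cell in r)]
    return non_empty[header]

def pick_header_new(rows, header):
    # Fixed behavior: the index counts every <tr>, empty or not.
    return rows[header]

assert pick_header_old(thead_rows, 1) == ["Foo", "Bar"]  # the old off-by-one
assert pick_header_new(thead_rows, 2) == ["Foo", "Bar"]  # explicit header=2
```

With the fix, a user who wants the third header row says header=2, and an empty row is a row like any other.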

Member

Thanks for the clarification. I do agree that header=2 appears to be more correct. Do you think there's an issue with other index-based keywords (like skiprows) in the current implementation?

OK to address in a separate issue but it would be nice to have mention of this in a whatsnew in some fashion (again separate issue is fine)

if any(col != '' for col in cols):
    res.append(cols)
return res
# HACK: lxml does not clean up the clearly-erroneous
Member

The comment here mentions a situation where we are missing tr elements, but test_header_inferred_from_th_elements is constructed without a thead element instead -- is that by design?

Contributor Author

That test is unrelated to this hack. This hack is tested in test_thead_without_tr.


def _parse_tr(self, table):
    return table.xpath('.//tr')
# Look for direct children only: the "row" element here may be a
Member

Conceptually makes sense but is this comment necessary? We aren't enforcing that row be of a certain tag

Contributor Author

It is nonsensical for _parse_td() to be called with a thead element, because it is nonsensical for a thead element to contain td or th children. The _parse_td comment says the parameter is a DOM <tr> node.

It's an ugly hack. I think that nonsense deserves a comment.

@WillAyd
Member

WillAyd commented Jun 26, 2018

Need to clean up merge conflicts and maybe add whatsnew reference for proper indexing of tables with NA data. Otherwise lgtm

@adamhooper
Contributor Author

Urgh. I did a rebase on upstream/master and pushed with --force, and now AppVeyor is complaining. Was that a bad idea on my part? How should I fix it?

@WillAyd
Member

WillAyd commented Jun 26, 2018

I don't think the AppVeyor issue is a result of your rebase. That said, you'd be better off fetching master and merging it into your branch rather than rebase (maintains better comment history).

Looks like you have some file conflicts still anyway, so you might as well resolve them locally and try merging from master into your branch this time before the next push

This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError,
  because the ParserError was a bug caused by missing colspan support.
  Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead,
  the virtual cells created by rowspan/colspan are always copies of the
  real cells' text. This prevents _infer_columns() from naming virtual
  cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple
  <tbody> support), which was implemented after @jowens' pull request.
  Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
  _parse_tfoot_trs, each returning a list of <tr>s. That let me remove
  _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as
  tested by spam.html. The previous hacky workaround was in
  _parse_raw_thead, but the new _parse_thead_trs signature returns nodes
  instead of text. The new hacky solution: return the <thead> itself,
  pretending it's a <tr>. This works in all the tests. A better solution
  is to use html5lib with lxml; but that might belong in a separate pull
  request.
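The rowspan/colspan expansion described in the second bullet can be sketched as follows. This is a hypothetical standalone helper for illustration, not the actual pandas implementation: each input cell is a (text, colspan, rowspan) tuple, and every virtual cell created by a span is a copy of the real cell's text.

```python
def expand_table(rows):
    """Expand (text, colspan, rowspan) cells into a grid of strings."""
    grid = []
    carry = {}  # column index -> (text, rows still to fill below)
    for row in rows:
        out = []
        cells = iter(row)
        col = 0
        while True:
            if col in carry:
                # a cell from an earlier row spans down into this one
                text, remaining = carry.pop(col)
                out.append(text)
                if remaining > 1:
                    carry[col] = (text, remaining - 1)
                col += 1
                continue
            try:
                text, colspan, rowspan = next(cells)
            except StopIteration:
                break
            for _ in range(colspan):
                # colspan > 1 duplicates the text across columns;
                # rowspan > 1 schedules copies for the rows below
                out.append(text)
                if rowspan > 1:
                    carry[col] = (text, rowspan - 1)
                col += 1
        # flush carried cells sitting to the right of the last real cell
        while col in carry:
            text, remaining = carry.pop(col)
            out.append(text)
            if remaining > 1:
                carry[col] = (text, remaining - 1)
            col += 1
        grid.append(out)
    return grid

rows = [
    [("A", 1, 2), ("B", 2, 1)],  # "A" spans two rows, "B" spans two columns
    [("C", 1, 1), ("D", 1, 1)],
]
assert expand_table(rows) == [["A", "B", "B"], ["A", "C", "D"]]
```

Because the virtual cells carry real text rather than empty strings, _infer_columns never sees them as "Unnamed: ..." placeholders.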
Mostly involved reformatting test_html.py
... but _ignore_ empty rows when inferring columns. This changes the
behavior of test_spam_header, which previously ignored an empty row when
the user explicitly stated the row number to use as header.
@adamhooper
Contributor Author

Indeed -- I rebased over adamhooper/master instead of pandas/master. I've learned my lesson: merge, don't rebase.

Fixed by ... er ... rebasing. Correctly, this time.

@adamhooper
Contributor Author

^^ I've noticed a few other pull requests cause ResourceWarning failures on Travis. Does that error relate to this pull request?

@WillAyd
Member

WillAyd commented Jun 27, 2018

Unrelated - that intermittently seems to pop up on Travis

@WillAyd
Member

WillAyd commented Jul 2, 2018

@adamhooper can you fix merge conflict?

@jreback thoughts on your end?

@adamhooper
Contributor Author

Merged :)

@jreback jreback added this to the 0.24.0 milestone Jul 3, 2018
Contributor

@jreback left a comment

just some doc comments, otherwise lgtm.

@@ -297,7 +298,7 @@ MultiIndex
I/O
^^^

-
- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`)
Contributor

this is all ok, but would a subsection with a mini-example be instructive to the user about the revised functionality? not saying 100% need to, but if a simple enough example (or even an image) would be helpful here

Contributor Author

I'm going to display my ignorance of the whatsnew format: ... where would this go?

Contributor

oh, you make a new subsection in the top somewhere, e.g. model after this

Series and Index Data-Dtype Incompatibilities

just add a new one below and you can make an extended example

"""
Given a table, return parsed header, body, and foot.

Notes
Contributor

put Notes at the end (after Parameters / Returns)


return self._parse_raw_data(raw_data)
Notes
Contributor

same generally

@@ -590,6 +697,12 @@ def _parse_tables(self, doc, match, kwargs):
.format(patt=pattern))
return tables

def _equals_tag(self, obj, tag):
Contributor

ok happy to take follow-ups to make this simpler / easier for implementation.

@adamhooper
Contributor Author

@jreback Thanks -- added whatsnew entry.

Excited to see: is this last commit the final one?


Calls that relied on the previous behavior will need to be changed.

Also, :func:`read_html` previously ignored some ``<tr>`` elements when called
Member

See what @jreback thinks but I don't consider this second half necessary as it addresses a minor bug. Otherwise lgtm

Contributor

yeah you can just list as a 1-liner in the io section is ok, no example for this needed (the latter half)

@@ -168,6 +168,120 @@ Current Behavior:
...
OverflowError: Trying to coerce negative values to unsigned integers

read_html Incompatibilities
Contributor

how about read_html enhancements!


Current Behavior:

.. code-block:: ipython
Contributor

you can make this an ipython block (e.g. actually run it for current)


.. code-block:: ipython

In [1]: pd.read_html("""
Contributor

to avoid repeating the code, make an ipython block above where you define the html itself. then the previous behavior uses a code-block to show the output, and the new behavior uses an ipython block

Contributor Author

Ah, I get it now. (I actually read the docs to preview what it looks like.) Thanks -- looks like a super improvement.
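A sketch of the whatsnew layout being described, with an illustrative section title and placeholder html (not the final text of the entry):

```rst
read_html Enhancements
^^^^^^^^^^^^^^^^^^^^^^

.. ipython:: python

   html = "<table>...</table>"  # the shared example table, defined once

Previous behavior, pasted in as a static block because it can no longer run:

.. code-block:: ipython

   In [1]: pd.read_html(html)
   Out[1]: ...

New behavior, executed live when the docs are built:

.. ipython:: python

   pd.read_html(html)
```

The ``ipython`` directive actually runs its body during the docs build, so only the unreproducible old output needs a hand-pasted ``code-block``.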

[ A B C
0 1 2 2]

Calls that relied on the previous behavior will need to be changed.
Contributor

this sentence is not needed



@jreback jreback merged commit e72b7e5 into pandas-dev:master Jul 5, 2018
@jreback
Contributor

jreback commented Jul 5, 2018

thanks @adamhooper very nice!

@adamhooper
Contributor Author

Woo-hoo! Thank you all for your guidance. What a thrill :).

@clytemnestra

clytemnestra commented Jul 16, 2018

Hello, when will this be included in a release? I see it's being included in v0.24, but when will that happen? I'm having this exact problem and would love to benefit from this fix. Cheers!

Labels: Enhancement, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)
Projects: None yet
7 participants