read_html: Handle colspan and rowspan #21487

adamhooper · 2018-06-14T20:38:24Z

This is essentially a rebased and squashed #17054 (mad props to @jowens
for doing all the hard thinking). My tweaks:

test_computer_sales_page (see TestReadHtml.test_computer_sales_page - what's it doing? #17074) no longer tests for ParserError,
because the ParserError was a bug caused by missing colspan support.
Now, test that MultiIndex works as expected.
I respectfully removed the fill_rowspan argument from Q: correct behavior for read_html with rowspan/colspan for DataFrames? #17073. Instead,
the virtual cells created by rowspan/colspan are always copies of the
real cells' text. This prevents _infer_columns() from naming virtual
cells as "Unnamed: ..."
I removed a small layer of abstraction to respect Read from multiple <tbody> within a <table> #20891 (multiple
tbody support), which was implemented after @jowens' pull request.
Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and
_parse_tfoot_trs, each returning a list of trs. That let me remove
_parse_tr, Making All The Tests Pass.
That caused a snowball effect. lxml does not fix malformed thead, as
tested by spam.html. The previous hacky workaround was in
_parse_raw_thead, but the new _parse_thead_trs signature returns nodes
instead of text. The new hacky solution: return the thead itself,
pretending it's a tr. This works in all the tests. A better solution
is to use html5lib with lxml; but that might belong in a separate pull
request.

closes ENH: read_html to handle rowspan, colspan #17054
closes header argument in read_html() ignores empty trs, only within thead #21641
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2018-06-14T20:38:27Z

Hello @adamhooper! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 05, 2018 at 17:45 Hours UTC

adamhooper · 2018-06-14T20:43:47Z

Er, I mean, a rebased and squashed #17089.

jowens · 2018-06-14T20:46:50Z

You're a hero, @adamhooper.

WillAyd · 2018-06-14T22:15:15Z

pandas/io/html.py

+            while body_rows and all(self._equals_tag(t, 'th') for t in
+                                    self._parse_td(body_rows[0])):
+                # this row should be a header row, move it from body to header
+                header_rows.append(body_rows.pop(0))


Hmm not so sure about this. What happens if there's no table header but the user explicitly provides a header argument?

That's a great question, but I don't think this pull request affects that one way or another.

From my reading of _data_to_frame(**kwargs): _data_to_frame() creates one big table, head + body + foot. If the user does not supply a header kwarg, then the function sets header = lrange(len(head))

That makes this more-reliable header inference backwards-incompatible: after this pull request, users who didn't supply header on ugly HTML tables that had multi-row headers will get a MultiIndex instead of an Index. Those users will be negatively-affected, and they may want to add an explicit header argument to mimic previous behavior.

Let's put that in perspective. Correct handling of rowspan/colspan is also backwards-incompatible. It could break existing code for a giant fraction of read_html() users who have coded workarounds for headers pandas misplaced.

I suspect most tables in the wild with multi-row headers also use rowspan/colspan. (Otherwise, why would they have multiple rows?)

Header inference doesn't change pandas' behavior for either of its sample HTML files (spam.html and computer_sales_page.html); the rowspan/colspan fix makes pandas change behavior with both. That's a random sample, but it matches my suspicion: rowspan/colspan is going to change pandas' behavior to an enormous degree, eclipsing the change in header inference.

WillAyd · 2018-06-14T22:17:33Z

pandas/io/html.py

-            raw_data.extend(self._parse_tr(table))
+        if not header_rows:
+            # The table has no <thead>. Treat first all-<th> rows as headers.
+            while body_rows and all(self._equals_tag(t, 'th') for t in


Could be overlooking something but why are you using while here - are you not just testing the truthiness of body_rows? If so, if would be preferable

body_rows changes length each iteration. I just made the conditional shorter to make that clearer.

Ah OK that makes more sense. Is that a safe operation though? Can't say I've been personally bitten by this but SO seems to advise against modifying the container that you loop over:

https://stackoverflow.com/questions/1637807/modifying-list-while-iterating

I'd argue that this loop doesn't iterate over the list. It just iterates over a condition.

WillAyd · 2018-06-14T22:18:10Z

pandas/io/html.py

+            while body_rows and all(self._equals_tag(t, 'th') for t in
+                                    self._parse_td(body_rows[-1])):
+                # this row should be a footer row, move it from body to footer
+                footer_rows.insert(0, body_rows.pop())


Stylistically would prefer append, since that's also what you do above with header_rows

Added a comment on why not to do that

WillAyd · 2018-06-14T22:20:06Z

pandas/io/html.py

+                            for col in extracted_row]
+            # expand cols using col_colspans
+            # maybe this can be done with a list comprehension, dunno
+            cols = list(zip(


Not sure how to read this on initial glance, though it seems like you are conscious of that from the comment. If you can't think of a better way to do via code any way you can speak to what you are trying to accomplish?

I rewrote this logic as procedural.

WillAyd · 2018-06-14T22:23:33Z

pandas/io/html.py

-
-    def _parse_tr(self, table):
-        return table.xpath('.//tr')
+        # Look for direct descendents only: the "row" element here may be a


Are you looking for children or descendants? Conflict between comment and code here?

That's the word I was looking for! d'oh, thanks :)

WillAyd · 2018-06-14T22:25:42Z

pandas/io/html.py

-    def _parse_thead(self, table):
-        return table.xpath('.//thead')
+        for thead in table.xpath('.//thead'):
+            rows.extend(thead.xpath('./tr'))


Does this work if the thead has multiple rows?

yes. (Previous code returned a list of , treating it as a list of lists. New code effectively flattens the return value.)

WillAyd · 2018-06-14T22:26:12Z

pandas/tests/io/test_html.py

@@ -359,7 +349,7 @@ def test_thousands_macau_stats(self):
                             attrs={'class': 'style1'})
        df = dfs[all_non_nan_table_index]

-        assert not any(s.isna().any() for _, s in df.iteritems())
+        assert not any(s.isnull().any() for _, s in df.iteritems())


What was this changed for?

Bad merge. Reverted.

WillAyd · 2018-06-14T22:28:07Z

pandas/io/html.py

-        column_finder : callable
-            A callable that takes a row node as input and returns a list of the
-            column node in that row. This must be defined by subclasses.
+        """Parse and return all tables from the DOM.


Docstrings should always start on the line after the triple quotes

Changed throughout the file.

WillAyd · 2018-06-14T22:28:50Z

pandas/io/html.py


        Returns
        -------
-        data : list of list of strings
+        tables : list of parsed (header, body, footer) tuples from tables


Typically for Returns unless you return more than one object you don't need to provide name(s), just types on the first line and a description on the subsequent line (also add period at end of description)

Changed throughout the file

WillAyd · 2018-06-14T22:29:41Z

pandas/io/html.py

@@ -257,7 +237,7 @@ def _parse_td(self, obj):

        Parameters
        ----------
-        obj : node-like
+        obj : an HTML row element


Is this supposed to be a type or a description? Not sure the exactly terminology but I feel like the type of node-like was preferable with this as the description (on subsequent line)

codecov · 2018-06-15T15:31:18Z

Codecov Report

Merging #21487 into master will increase coverage by <.01%.
The diff coverage is 97.89%.

@@            Coverage Diff             @@
##           master   #21487      +/-   ##
==========================================
+ Coverage   91.92%   91.93%   +<.01%     
==========================================
  Files         158      158              
  Lines       49705    49716      +11     
==========================================
+ Hits        45693    45704      +11     
  Misses       4012     4012

Flag	Coverage Δ
#multiple	`90.3% <97.89%> (ø)`	⬆️
#single	`41.97% <18.94%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/html.py	`89.17% <97.89%> (+0.35%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 62a0ebc...5fd863b. Read the comment docs.

jreback

looks pretty good. can you add a whatsnew note (0.24.0). if you can do it in 1 line great (e.g. what a user would care about the changes, what can it do now that it couldn't before). also a sub-section would be ok too.

jreback · 2018-06-15T16:34:43Z

pandas/io/html.py

+        Header and body are lists-of-lists. Top level list is a list of
+        rows. Each row is a list of str text.
+
+        Logic: Use <thead>, <tbody>, <tfoot> elements to identify


can you put this in Notes

jreback · 2018-06-15T16:36:21Z

pandas/io/html.py

@@ -590,6 +697,12 @@ def _parse_tables(self, doc, match, kwargs):
                             .format(patt=pattern))
        return tables

+    def _equals_tag(self, obj, tag):


aren't these the same as in the super class?

I thought this too on initial glance but the attribute of obj used to determine equality differs across implementations. An alternate approach could be to make that attribute a property that the subclasses need to implement, though not sure that is that much cleaner

My answer is either "no" or "I don't understand the question" :).

But I did notice and nix _contains_tag(), which did not belong. So thanks for focusing here :).

The bizarre inheritance used in this module makes "node-like" values different across implementations. It isn't duck-typing: they're completely different beasts. That's what it was like before I started this pull request.

ok happy to take follow-ups to make this simpler / easier for implementation.

adamhooper · 2018-06-15T16:48:40Z

@jreback Thanks for the review! There was already an entry in whatsnew, but I rephrased it such that users will better be able to predict how the change will affect their data. Do you think that's enough, or should I add a subsection?

WillAyd · 2018-06-15T16:54:16Z

pandas/io/html.py

-        tfoot : node-like
-            A <tfoot>...</tfoot> element.
+        boolean
+            Whether the object is equal to tag 'tag'


Parameter references are typically done in backticks on docstrings, so you can change "tag 'tag'" to simply "`tag`"

WillAyd · 2018-06-15T16:54:57Z

pandas/io/html.py

-            raw_data.extend(self._parse_tr(table))
+    def _expand_colspan_rowspan(self, rows):
+        """
+        Given a list of <tr>s, return a list of text rows that copy cell


Technically first line should fit on one sentence

WillAyd · 2018-06-15T17:05:53Z

pandas/io/html.py

+
+        Parameters
+        ----------
+        rows : list of <tr>s


Just say list after hyphen on the first line to denote type. I suppose you can add a comment on the subsequent line for a description

WillAyd · 2018-06-15T17:36:22Z

pandas/io/html.py

+            for td in tds:
+                # Append texts from previous rows with rowspan>1 that come
+                # before this <td>
+                while remainder and remainder[0][0] <= index:


Thanks for the comment before about iteration. Can you help point me to where remainder first gets values populated for it? I'm still somewhat concerned about the number of mutations done to that variable.

I have the feeling some of this can be better and more safely expressed using enumerate but direction on where to begin with this would be helpful

remainder is initialized on line 447 and flipped with next_remainder on line 485.

It's valid to worry about an evil web page forcing lots of remainder mutations ... to a point. If I were generating a website to foil Pandas, here's what I'd do:

<table> <tr> <td colspan="2147483647">This will crash your puny parser.</td> </tr> </table>

I don't expect this to spin forever because of mutations: I expect it to crash with out-of-memory as it grows an enormous list.

Real-world tables will be fine. Usually colspan/rowspan are only in header cells, and there are rarely many items here.

Big-O analysis: for a table with N output cells (N = WxH), each .pop(0) generates a cell. That means there are at most N list-removals, meaning at most N list-insertions. Each insertion is O(1) but each removal is O(W). (According to https://wiki.python.org/moin/TimeComplexity, list pop is O(k).) That removal is the slowest operation, so running time is O(NW).

Shall I switch to a collection.deque to make it O(N)?

Incidentally: I, too, started by writing enumerate(). But that fell apart. The index of a cell in the output depends on two inputs: both td's index in row and the number of items in remainder that pre-empt it.

An alternate strategy is be to make remainder a list of mostly-None, sometimes-tuple values and modify them in-place. That, too, would be O(N) ... but it wouldn't involve as many allocations.

Have to think through that in more detail. I'm less concerned about the choice of container; I just feel like there's a more readable and concise way to approach this that would be easier for future contributors / maintainers to use

WillAyd

Thanks for the iterations - getting closer

WillAyd · 2018-06-15T21:24:05Z

pandas/io/html.py

+                header_rows.append(body_rows.pop(0))
+
+        if not footer_rows:
+            # The table has no <tfoot>. Treat last all-<th> rows as footers.


Hmm what is the point in doing this? In the previous implementation it would just return an empty list for the footer right? Did that cause some kind of issue?

Great catch! It's cruft -- just nixed it. _data_to_frame simply does body += foot anyway.

WillAyd · 2018-06-15T21:24:20Z

pandas/io/html.py

+
+        if not footer_rows:
+            # The table has no <tfoot>. Treat last all-<th> rows as footers.
+            while body_rows and row_is_all_th(body_rows[-1]):


What is this doing? I'm under the impression this would always be false on account of the row_is_all_th condition?

Footer rows are often all-<th>: in a typical table, there may be an all-<th> "totals" row.

But this code cannot impact the user (because _data_to_frame simply does body += foot anyway), so I've nixed it.

WillAyd · 2018-06-15T21:26:43Z

pandas/io/html.py

+            for td in tds:
+                # Append texts from previous rows with rowspan>1 that come
+                # before this <td>
+                while remainder and remainder[0][0] <= index:


Have to think through that in more detail. I'm less concerned about the choice of container; I just feel like there's a more readable and concise way to approach this that would be easier for future contributors / maintainers to use

WillAyd · 2018-06-15T21:28:48Z

pandas/io/html.py

+
+                # Append the text from this <td>, colspan times
+                text = _remove_whitespace(self._text_getter(td))
+                rowspan = int(self._attr_getter(td, 'rowspan') or 1)


I think this can be simplified in one of two ways:

Simply use td.get('rowspan', 1) if td allows for that in both implementations

Move the implementation of _attr_getter into the base class

For point 2 above it looks like the subclasses both have the same implementation so it would be easier to just move to the base class

Okay -- pulled up to the base class (though that seems strange to me)

WillAyd · 2018-06-15T21:30:53Z

pandas/tests/io/test_html.py

+                </tr>
+            </tbody>
+        </table>''')
+        data2 = StringIO('''<table>


Generally can add spacing between variables to make the tests more readable

done in lots and lots of tests

WillAyd · 2018-06-15T21:31:09Z

pandas/tests/io/test_html.py

+            </tbody>
+        </table>''')
+        res1 = self.read_html(data1)
+        res2 = self.read_html(data2, header=0)


What's the reason for header=0 here?

Moved the comment right there -- it's a regression test for #5048

WillAyd · 2018-06-15T21:31:49Z

pandas/tests/io/test_html.py

+        </table>''')
+        res1 = self.read_html(data1)
+        res2 = self.read_html(data2, header=0)
+        assert_framelist_equal(res1, res2)


Minor nit but there's a pseudo-standard to use result and expected when evaluating tests. Would be preferable to say result1 and result2 here (and in other tests)

Okay, I've renamed variables (and fiddled with whitespace) in many, many tests.

WillAyd · 2018-06-15T21:32:57Z

pandas/tests/io/test_html.py

+                 </table>"""
+        expected = self.read_html(expected)[0]
+        res = self.read_html(out)[0]
+        tm.assert_frame_equal(expected, res)


Typically we use result as the first argument to this function and expected as the second. Would be nice to do that in these tests so failure output follows that pattern

Okay, I've renamed variables (and fiddled with whitespace) in many, many tests.

adamhooper · 2018-06-19T14:34:00Z

@jreback, @WillAyd I've implemented all the changes you requested. Could you please take a look?

@WillAyd I tested out an alternative algorithm, as you had concerns about the one in this pull request. The alternative is at adamhooper@67dea69. The diff is +53, -38. Maybe you find it more readable, though?

WillAyd · 2018-06-21T03:22:40Z

pandas/tests/io/test_html.py

+        """
+        Ensure parser adds <tr> within <thead> on malformed HTML.
+        """
+        expected = self.read_html('''<table>


Generally the expected value should be constructed directly and not via read_html - can you update any of the tests you've added to do that?

Done! Looks much nicer now.

WillAyd · 2018-06-21T03:22:56Z

pandas/tests/io/test_html.py

@@ -765,11 +1054,11 @@ def test_keep_default_na(self):
                    </table>"""

        expected_df = DataFrame({'a': ['N/A', 'NA']})
-        html_df = read_html(html_data, keep_default_na=False)[0]
+        html_df = self.read_html(html_data, keep_default_na=False)[0]


What was the point of changing some of these?

These tests are all run twice: once with lxml and once with bs4. But read_html()'s logic is, "try lxml, and if it fails use bs4." So it overcomes the entire purpose of the unit tests -- if the lxml backend crashes while we're testing it, we won't know because read_html() will fall back to bs4.

From my reading of the code, self.read_html() is always correct, and read_html() is always a bug.

OK thanks for the clarification. I suppose these are fine to leave as is though if there are more better to do in a separate change to keep that cleanup orthogonal to the original issue

WillAyd · 2018-06-21T03:23:26Z

pandas/tests/io/test_html.py

-            self.read_html(data, header=[0, 1])
+        # This table is unique because it has no <thead>, and its <th>-only
+        # rows are underneath an initial <td>-only row that has no content.
+        # After #kipping the empty row, header=[0,1] picks the two <th>-only


minor typo of skipping

WillAyd · 2018-06-21T03:24:59Z

pandas/tests/io/test_html.py

-                                    r"too many rows for this "
-                                    r"multi_index of columns"):
-            self.read_html(data, header=[0, 1])
+        # This table is unique because it has no <thead>, and its <th>-only


Hmm not sure about this. Certainly a crazy example but this changes the behavior of the arguments passed to header now, no?

This pull request doesn't change the behavior of the header argument. It changes the behavior one gets when the header argument is omitted. That change has no effect on this specific test.

When it comes to this specific unit test ("Test some old table in the wild"):

Prior to this pull request, pandas erroneously crashed on this table. The unit test was bad -- it tested for a bug.

This pull request includes a hack involving th-only rows. That hack would seem to infer the correct header=[0,1] for this specific table, but it doesn't because this specific table is weird

This unit test now tests that pandas works, instead of testing that it is broken. The comment explains why the header=[0,1] is needed -- since the HTML makes it look like it shouldn't be.

It looks like it broke previously for header=[0, 1] but worked for header=[1, 2]. Now it is working for header=[0,1] and that returns the same value that previously was returned with header=[1,2], right?

Not saying that is necessarily a bad thing but looking for confirmation as in the off-chance something like this does occur in the wild that is a pretty subtle change for users that may have been expecting the former behavior

No, that's not what changed. It broke before because it didn't support rowspan/colspan. The deletion of empty rows is existing behavior.

urgh, looking at the diff with master instead of my own. I see that header=[1,2] used to not crash. I'll investigate more to figure out exactly why that worked.

Okay, you're absolutely right -- the behavior changed.

Previously, pandas removed all-empty rows from the <thead> but not from the <tbody>.

Now, pandas removes all-empty rows from anywhere in the table.

So header changed on this table because its header rows are inside the <tbody>.

Should I revert that change? It was unintentional. But the original behavior appears to be unintentional to me, too -- why would <thead> and <tbody> behave differently?

References:

read_html() doesn't handle tables with multiple header rows #13434 suggests read_html() should infer multiple table rows

ENH:read_html() handles tables with multiple header rows #13434 #15242 implements by deleting all-empty rows from <thead>

From these discussions, I gather that pandas shouldn't auto-delete data, but auto-deleting in the <thead> is kinda okay because otherwise we'd get a ParserError by default.

I'll auto-delete only empty header rows. It seems to me _data_to_frame() is the correct place to do that. @WillAyd does that sound appropriate? Then I can revert the change to this unit test -- I was misunderstanding the error.

Fixed in commit ed120d8. I added new tests for behavior implied in #15242.

Unfortunately, test_spam_header tested for a bug -- it specified header=1 instead of the intent, header=2. The test passed because of a bug in the previous implementation: indexing happened after deleting empty header rows.

Let's track the bugs through history to see why header=2 is correct:

Original implementation, commit b31c033 (2013) -- uses header=0 to select <td> elements within <thead> that are not part of a <tr>. The bug that made this work: pandas erroneously treated <thead> as equivalent to <tr>, merging all <th> and <td> within the <thead> into one big row.

commit a8723a4 (2013): changed the test to header=1 and changed the value it's looking for to be the value from the first <tr> in the <tbody>. Seems like a silly change. The author probably tested a body cell because lxml and bs4 began treating the (malformed) header differently as a result of some completely-unrelated change.

commit 0ab0813 (ENH:read_html() handles tables with multiple header rows #13434 #15242, 2017): kept header=1 unchanged, because two new behaviors cancelled out. First, <thead> handling was corrected to look for <tr>s inside the <thead> (meaning one would need to set header=2). Second, _parse_raw_thead was incorrectly changed to delete empty rows (deleting one row from the header brings us back to header=1). That change made the unit test pass, but the unit test was testing the wrong behavior all along.

Today, I fixed the bug introduced in 2017, meaning header=2 finds the cell that's been referenced since a8723a4 in 2013.

@WillAyd what do you think of this change? I think anybody would agree the previous behavior was a bug. And it seems silly to re-implement the bug in this pull request in the name of avoiding feature creep.

I imagine this applies to only a tiny number of tables in the wild.

Thanks for the clarification. I do agree that header=2 appears to be more correct. Do you think there's an issue with other index-based keywords (like skiprows) in the current implementation?

OK to address in a separate issue but it would be nice to have mention of this in a whatsnew in some fashion (again separate issue is fine)

WillAyd · 2018-06-21T03:28:45Z

pandas/io/html.py

-                    if any(col != '' for col in cols):
-                        res.append(cols)
-        return res
+            # HACK: lxml does not clean up the clearly-erroneous


The comment here mentions a situation we are missing tr elements but test_header_inferred_from_th_elements is constructed without a thead element instead - is that by design?

That test is unrelated to this hack. This hack is tested in test_thead_without_tr.

WillAyd · 2018-06-21T03:30:58Z

pandas/io/html.py

-
-    def _parse_tr(self, table):
-        return table.xpath('.//tr')
+        # Look for direct children only: the "row" element here may be a


Conceptually makes sense but is this comment necessary? We aren't enforcing that row be of a certain tag

It is nonsensical that _parse_td() be called with a thead element, because it is nonsensical that a thead element contain td or th children. The _parse_td comment says parameter is a DOM <tr> node.

It's an ugly hack. I think that nonsense deserves a comment.

WillAyd · 2018-06-26T03:43:50Z

Need to clean up merge conflicts and maybe add whatsnew reference for proper indexing of tables with NA data. Otherwise lgtm

adamhooper · 2018-06-26T16:04:25Z

Urgh. I did a rebase on upstream/master and pushed with --force and now Appveyor is complaining. Was that a bad idea on my part? How should I fix?

WillAyd · 2018-06-26T22:46:29Z

I don't think the AppVeyor issue is a result of your rebase. That said, you'd be better off fetching master and merging it into your branch rather than rebase (maintains better comment history).

Looks like you have some file conflicts still anyway, so might was well resolve them locally and try merging from master into your branch this time before next push

@jowens

This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks: * test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected. * I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..." * I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass. * That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.

Mostly involved reformatting test_html.py

... but _ignore_ empty rows when inferring columns. This changes the behavior of test_spam_header, which previously ignored an empty row when the user explicitly stated the row number to use as header.

adamhooper · 2018-06-27T14:04:59Z

Indeed -- I rebased over adamhooper/master instead of pandas/master. I've learned my lesson: merge, don't rebase.

Fixed by ... er ... rebasing. Correctly, this time.

adamhooper · 2018-06-27T15:03:38Z

^^ I've noticed a few other pull requests cause ResourceWarning failures on Travis. Does that error relate to this pull request?

WillAyd · 2018-06-27T15:20:08Z

Unrelated - that intermittently seems to pop up on Travis

WillAyd · 2018-07-02T03:54:42Z

@adamhooper can you fix merge conflict?

@jreback thoughts on your end?

adamhooper · 2018-07-02T14:07:03Z

Merged :)

jreback

just some doc comments, otherwise lgtm.

jreback · 2018-07-03T15:42:24Z

doc/source/whatsnew/v0.24.0.txt

@@ -297,7 +298,7 @@ MultiIndex
 I/O
 ^^^

-
+- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`)


this is all ok, but would a subsection with a mini-example be instructive to the user about the revised functinaility? not saying 100% need to, but if a simple enough example (or even an image) would be helpful here

I'm going to display my ignorance of the whatsnew format: ... where would this go?

oh, you make a new subsection in the top somewhere, e.g. model after this

Series and Index Data-Dtype Incompatibilities

just add a new one below and you can make an extended example

jreback · 2018-07-03T15:44:02Z

pandas/io/html.py

+        """
+        Given a table, return parsed header, body, and foot.
+
+        Notes


put Notes at the end (after Parameters / Returns)

jreback · 2018-07-03T15:44:16Z

pandas/io/html.py


-        return self._parse_raw_data(raw_data)
+        Notes


same generally

jreback · 2018-07-03T15:45:14Z

pandas/io/html.py

@@ -590,6 +697,12 @@ def _parse_tables(self, doc, match, kwargs):
                             .format(patt=pattern))
        return tables

+    def _equals_tag(self, obj, tag):


ok happy to take follow-ups to make this simpler / easier for implementation.

adamhooper · 2018-07-04T17:39:25Z

@jreback Thanks -- added whatsnew entry.

Excited to see: is this last commit the final one?

WillAyd · 2018-07-04T17:42:38Z

doc/source/whatsnew/v0.24.0.txt

+
+Calls that relied on the previous behavior will need to be changed.
+
+Also, :func:`read_html` previously ignored some ``<tr>`` elements when called


See what @jreback thinks but I don't consider this second half necessary as it addresses a minor bug. Otherwise lgtm

yeah you can just list as a 1-liner in the io section is ok, no example for this needed (the latter half)

jreback · 2018-07-04T18:17:06Z

doc/source/whatsnew/v0.24.0.txt

@@ -168,6 +168,120 @@ Current Behavior:
    ...
    OverflowError: Trying to coerce negative values to unsigned integers

+read_html Incompatibilities


how about read_html enhancements!

jreback · 2018-07-04T18:17:39Z

doc/source/whatsnew/v0.24.0.txt

+
+Current Behavior:
+
+.. code-block:: ipython


you can make this an ipython block (e.g. actually run it for current)

jreback · 2018-07-04T18:18:13Z

doc/source/whatsnew/v0.24.0.txt

+
+.. code-block:: ipython
+
+    In [1]: pd.read_html("""


to avoid repeating the code, make an ipython block above where you define the html itself). then the preivious uses a code-block to show the output, and the new uses an ipython block

Ah, I get it now. (I actually read the docs to preview what it looks like.) Thanks -- looks like a super improvement.

jreback · 2018-07-04T18:18:31Z

doc/source/whatsnew/v0.24.0.txt

+    [   A  B  C
+    0  1  2  2]
+
+Calls that relied on the previous behavior will need to be changed.


this sentence is not needed

jreback · 2018-07-04T18:19:08Z

doc/source/whatsnew/v0.24.0.txt

+
+Calls that relied on the previous behavior will need to be changed.
+
+Also, :func:`read_html` previously ignored some ``<tr>`` elements when called


yeah you can just list as a 1-liner in the io section is ok, no example for this needed (the latter half)

jreback · 2018-07-05T17:48:30Z

thanks @adamhooper very nice!

adamhooper · 2018-07-05T20:19:24Z

Woo-hoo! Thank you all for your guidance. What a thrill :).

clytemnestra · 2018-07-16T12:47:37Z

Hello, when will this be included into a release? I see it's being included in v0.24, but when will that happen? I'm having this exact problem and would love to benefit from this fix. Cheers!

WillAyd requested changes Jun 14, 2018

View reviewed changes

gfyoung added Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jun 15, 2018

jreback requested changes Jun 15, 2018

View reviewed changes

WillAyd requested changes Jun 15, 2018

View reviewed changes

WillAyd requested changes Jun 21, 2018

View reviewed changes

adamhooper mentioned this pull request Jun 26, 2018

header argument in read_html() ignores empty trs, only within thead #21641

Closed

adamhooper force-pushed the colspan-rowspan branch from ed120d8 to 1773983 Compare June 26, 2018 16:02

adamhooper added 8 commits June 27, 2018 09:34

Fixes after code review -- thanks, @WillAyd

f89b32a

Docstring tweaks

34f87cb

Details -- thanks, @jreback

582c86b

Tweak comments

d2f0b83

Address latest review comments from @WillAyd

74c2384

Mostly involved reformatting test_html.py

Clean up html tests

ad6e869

Do not nix rows of empty

6fa0489

... but _ignore_ empty rows when inferring columns. This changes the behavior of test_spam_header, which previously ignored an empty row when the user explicitly stated the row number to use as header.

adamhooper force-pushed the colspan-rowspan branch from 1773983 to 6fa0489 Compare June 27, 2018 14:00

WillAyd approved these changes Jun 27, 2018

View reviewed changes

Merge branch 'master' into colspan-rowspan

86b2dea

jreback added this to the 0.24.0 milestone Jul 3, 2018

jreback requested changes Jul 3, 2018

View reviewed changes

adamhooper added 3 commits July 3, 2018 17:59

Comments: Notes after Returns

d4f4bb1

Document read_html changes in whatsnew

e296bd1

Merge branch 'master' into colspan-rowspan

f0f91c3

WillAyd reviewed Jul 4, 2018

View reviewed changes

jreback requested changes Jul 4, 2018

View reviewed changes

adamhooper and others added 2 commits July 4, 2018 16:29

Improve whatsnew with ipython

95ce993

fixup whatsnew

5fd863b

jreback approved these changes Jul 5, 2018

View reviewed changes

jreback merged commit e72b7e5 into pandas-dev:master Jul 5, 2018

WillAyd mentioned this pull request Aug 17, 2018

ENH: GH17054: read_html() handles rowspan/colspan and infers headers #17089

Closed

4 tasks

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

read_html: Handle colspan and rowspan (pandas-dev#21487)

692db69


		Calls that relied on the previous behavior will need to be changed.

		Also, :func:`read_html` previously ignored some ``<tr>`` elements when called

Uh oh!

read_html: Handle colspan and rowspan #21487

read_html: Handle colspan and rowspan #21487

Uh oh!

Conversation

adamhooper commented Jun 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pep8speaks commented Jun 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated on July 05, 2018 at 17:45 Hours UTC

Uh oh!

adamhooper commented Jun 14, 2018

Uh oh!

jowens commented Jun 14, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Jun 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

adamhooper commented Jun 14, 2018 •

edited

Loading

pep8speaks commented Jun 14, 2018 •

edited

Loading

codecov bot commented Jun 15, 2018 •

edited

Loading