Q: correct behavior for read_html with rowspan/colspan for DataFrames? #17073
Comments
I think the philosophy with
That said, the behavior you proposed seems reasonable - in other parsers "Fruit" would be placed at the top level, with an unnamed second level:
In [18]: pd.read_csv(StringIO("""
    ...: Fruit,a,a,e,e
    ...: ,Long,Short,Long,Short
    ...: Apple,0,1,0
    ...: Banana,0,3,0"""), header=[0,1])
Out[18]:
                Fruit     a           e
   Unnamed: 0_level_1  Long  Short  Long  Short
0               Apple     0      1     0    NaN
1              Banana     0      3     0    NaN
OK. I can match the behavior of the other parsers. "Unnamed second level" evidently means "empty string".
Judging from the previous conversation, I am changing the labeling. PR is welcome!
It'll be implemented as discussed in this issue when I submit a pull request for #17054.
Okay, sounds good. Closing in favor of that issue then.
This is essentially a rebased and squashed pandas-dev#17054 (mad props to @jowens for doing all the hard thinking). My tweaks:
* test_computer_sales_page (see pandas-dev#17074) no longer tests for ParserError, because the ParserError was a bug caused by missing colspan support. Now, test that MultiIndex works as expected.
* I respectfully removed the fill_rowspan argument from pandas-dev#17073. Instead, the virtual cells created by rowspan/colspan are always copies of the real cells' text. This prevents _infer_columns() from naming virtual cells as "Unnamed: ..."
* I removed a small layer of abstraction to respect pandas-dev#20891 (multiple <tbody> support), which was implemented after @jowens' pull request. Now _HtmlFrameParser has _parse_thead_trs, _parse_tbody_trs and _parse_tfoot_trs, each returning a list of <tr>s. That let me remove _parse_tr, Making All The Tests Pass.
* That caused a snowball effect. lxml does not fix malformed <thead>, as tested by spam.html. The previous hacky workaround was in _parse_raw_thead, but the new _parse_thead_trs signature returns nodes instead of text. The new hacky solution: return the <thead> itself, pretending it's a <tr>. This works in all the tests. A better solution is to use html5lib with lxml; but that might belong in a separate pull request.
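The rule in the commit message above, that virtual cells created by rowspan/colspan are copies of the real cell's text, can be sketched in plain Python. This is not the code from the pull request: `expand_spans`, `_carry_down`, and the `(text, rowspan, colspan)` tuple shape are hypothetical names chosen for illustration, and the sketch assumes well-formed input with no overlapping spans.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, rowspan, colspan): hypothetical input shape


def _carry_down(out: List[str],
                pending: Dict[int, Tuple[str, int]],
                next_pending: Dict[int, Tuple[str, int]]) -> None:
    """Append copies of cells that span down from earlier rows into `out`."""
    while len(out) in pending:
        text, rows_left = pending.pop(len(out))
        if rows_left > 1:
            # The cell still covers at least one more row below this one.
            next_pending[len(out)] = (text, rows_left - 1)
        out.append(text)


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Return a grid in which every spanned position repeats the cell text."""
    grid: List[List[str]] = []
    pending: Dict[int, Tuple[str, int]] = {}  # column -> (text, rows still to fill)
    for row in rows:
        out: List[str] = []
        next_pending: Dict[int, Tuple[str, int]] = {}
        for text, rowspan, colspan in row:
            _carry_down(out, pending, next_pending)
            for _ in range(colspan):
                if rowspan > 1:
                    next_pending[len(out)] = (text, rowspan - 1)
                out.append(text)
        _carry_down(out, pending, next_pending)  # spans that end the row
        grid.append(out)
        pending = next_pending
    return grid


# Header from the discussion: "Fruit" spans two rows, "a" and "e" span two columns.
header = [
    [("Fruit", 2, 1), ("a", 1, 2), ("e", 1, 2)],
    [("Long", 1, 1), ("Short", 1, 1), ("Long", 1, 1), ("Short", 1, 1)],
]
for line in expand_spans(header):
    print(line)
```

Because every position ends up with real text, a header expanded this way has no blank cells left for a downstream step to label "Unnamed: ...".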
No code, just a question for proper behavior for rowspan/colspan with read_html of an HTML table into a DataFrame. (I'm not asking what currently happens with read_html now. I'm asking what should happen.)

Below is a simple HTML table that uses both colspan (a,e,i,o,u) and rowspan (Fruit, schwa, honk, and the rightmost 0 in the table). It renders identically on each of {Chrome, Firefox, Safari}. With these renderers, both rowspans and colspans are basically rendered midway through the span, either vertically (rowspan) or horizontally (colspan).

Now, let's say we wanted to import this into pandas with read_html. It seems to me the behavior should be different for a pandas DataFrame than for a renderer:

Fruit and the second column is a combination of a and Long, etc. We don't "fill" a rowspan (the first column shouldn't be two Fruits), but we do "fill" a colspan (a would appear in the 2nd and 3rd columns).

If this was the case, it would imply that we treat rowspans differently in header and body.

I put the DataFrame that I think we want below. It incorporates different behavior for rowspan for header and body. One thing I don't know, though: If I don't "fill" the rowspan name for rowspan > 1, what do I put instead? None? empty string? False? What does the input to TextParser look like when some column names are "taller" than others?

Thoughts? @chris-b1? (relevant to #17054)
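To make the header question concrete, here is a minimal sketch of the proposed policy: copy text across a colspan, but put a placeholder under a rowspan. It assumes header cells arrive as hypothetical `(text, rowspan, colspan)` tuples; `header_tuples`, `_skip_covered`, and the `filler` parameter are made-up names, not anything in pandas, and the empty-string default is only one of the candidates listed above.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, int, int]  # (text, rowspan, colspan): hypothetical input shape


def _skip_covered(level: List[Optional[str]],
                  covered: Dict[int, int],
                  next_covered: Dict[int, int],
                  filler: Optional[str]) -> None:
    """Write `filler` into positions covered by a rowspan from a row above."""
    while len(level) in covered:
        rows_left = covered.pop(len(level))
        if rows_left > 1:
            next_covered[len(level)] = rows_left - 1
        level.append(filler)


def header_tuples(rows: List[List[Cell]],
                  filler: Optional[str] = "") -> List[Tuple[Optional[str], ...]]:
    """Build one name tuple per column from the header rows."""
    levels: List[List[Optional[str]]] = []
    covered: Dict[int, int] = {}  # column -> header rows still under a rowspan
    for row in rows:
        level: List[Optional[str]] = []
        next_covered: Dict[int, int] = {}
        for text, rowspan, colspan in row:
            _skip_covered(level, covered, next_covered, filler)
            for _ in range(colspan):
                if rowspan > 1:
                    next_covered[len(level)] = rowspan - 1
                level.append(text)  # colspan is "filled" with copies
        _skip_covered(level, covered, next_covered, filler)
        levels.append(level)
        covered = next_covered
    # Transpose: one tuple of level names per column.
    return list(zip(*levels))


header = [
    [("Fruit", 2, 1), ("a", 1, 2), ("e", 1, 2)],
    [("Long", 1, 1), ("Short", 1, 1), ("Long", 1, 1), ("Short", 1, 1)],
]
print(header_tuples(header))
print(header_tuples(header, filler=None))
```

Tuples of this shape are the kind of per-column labels a two-level column index would be built from; which placeholder a parser would accept for the "shorter" names is exactly the open question.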