Support for Merging Columns in HTML Parser #4683

cancan101 · 2013-08-26T23:29:29Z

See the table here: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSE

The trailing ")" and the leading "$" are in different columns (aka td's / cells) from the number.

There should be an option to merge all columns under a given heading (#4679).

cpcloud · 2013-08-27T22:55:48Z

What sort of API would you want for this?

This is an example of a horribly designed and written table that uses style attributes in almost every tag. read_html does not and should not try to deal with such idiosyncrasies of HTML tables. However, this type of operation might be useful in the general case.

Something like: pd.read_html(collapse={'A': [0, 1, 2], 'B': [3, 4]}). This would join columns 0, 1, and 2 and form a new column A and join 3 and 4 to form B. This would be a sort of generalization of the parse_dates keyword argument in read_csv and friends.

One issue is coming up with a way to specify the dtype. I suppose you could use a nested dict, but that seems like too much.
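For comparison, the same joining can be done after the parse without a new keyword. The following is a minimal sketch in plain Python, assuming each row is available as a list of cell strings. The collapse keyword above is only a proposal, and collapse_row and the sample row below are hypothetical names used for illustration, not part of pandas.

```python
def collapse_row(row, collapse):
    """Join the cells at the listed positions into one string per new column."""
    return {
        name: "".join(row[i] for i in positions)
        for name, positions in collapse.items()
    }


# Hypothetical row of cell strings; the mapping mirrors the proposed dict.
row = ["$", "3", "", "(3", ")"]
merged = collapse_row(row, {"A": [0, 1, 2], "B": [3, 4]})
```

The merged values stay strings here, which sidesteps the dtype question: any conversion can happen in a separate step once the pieces are joined.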

jtratner · 2013-08-27T23:14:39Z

At a certain point you can just do that sort of thing yourself and keep
read_html simple. Did we figure out whether read_html handles MultiIndex or
hierarchical columns?

cpcloud · 2013-08-27T23:15:10Z

It does. See #4679. I gave an example.

jtratner · 2013-08-28T01:20:26Z

So then I'm not clear on what the value is here. Edge cases where this
would avoid having an intermediate dtype?

cpcloud · 2013-08-28T01:24:42Z

Maybe. I think this should be left to the user.

cancan101 · 2013-08-28T03:15:23Z

I am not worried about dtype here. The number of rows is small enough that even if the initial type is string (aka object), the value can always be cast to float after appropriate cleaning is performed (removing thousands separators, changing parentheses to negatives, stripping the dollar sign).
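As a rough illustration of that cleaning step, here is a minimal sketch in plain Python. The helper name to_number is hypothetical, and it assumes the only decorations are a dollar sign, comma thousands separators, and accounting-style parentheses for negatives.

```python
def to_number(text):
    """Convert a string such as "($2,000)" or "$12" to a float."""
    # Strip the dollar sign and thousands separators first.
    cleaned = text.strip().replace("$", "").replace(",", "")
    # Accounting notation: a value wrapped in parentheses is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = float(cleaned)
    return -value if negative else value
```

Anything outside those assumptions (blank cells, dashes for zero, footnote markers) would raise ValueError and would need its own handling.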

My biggest concern is how the reader would express this data to the user:

|3 months    |6 months       |
|A    |B     |A        |B    |
|$|3  |$|(3|)|($2,000|)|$|12 |

In this case the natural result would be a MultiIndex with "3 months" and "6 months" at the top level and then "A" and "B" at the next level. This would mean four distinct columns. However, given the odd layout of the actual data columns, there would be 9 columns in the data section. As long as the anonymous columns are loaded correctly, I could certainly do the merging after the parse.
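To make the post-parse merge concrete, here is a minimal sketch in plain Python. It assumes the 9 data cells of the sketch above arrive as strings, and that the number of physical cells under each heading is known; the helper merge_by_spans and the span widths are hypothetical, read off the sketch rather than produced by read_html.

```python
def merge_by_spans(cells, spans):
    """Group consecutive cells under hierarchical keys.

    spans is a list of (key, width) pairs, where width is the number of
    physical cells that belong to that logical column.
    """
    merged = {}
    start = 0
    for key, width in spans:
        merged[key] = "".join(cells[start:start + width])
        start += width
    return merged


# The 9 data cells from the sketch, left to right.
cells = ["$", "3", "$", "(3", ")", "($2,000", ")", "$", "12"]
# Assumed widths: how many cells sit under each (period, letter) heading.
spans = [
    (("3 months", "A"), 2),
    (("3 months", "B"), 3),
    (("6 months", "A"), 2),
    (("6 months", "B"), 2),
]
merged = merge_by_spans(cells, spans)
```

The tuple keys stand in for the two header levels, so the result has the four logical columns rather than the 9 physical ones.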

@cpcloud While I agree that this is in fact a "horribly designed and written table that uses style attributes in almost every tag", most SEC filings that I have seen share similar layout characteristics (I would be happy to provide more examples). I would say that having the ability to load tables from SEC filings is worthwhile even if they make us cringe. I did find one example so far of a ragged table (ick): http://www.sec.gov/Archives/edgar/data/1108524/000119312511075314/d10k.htm#tx138159_26

So, I am all for keeping read_html as simple as possible as long as enough of the structure is maintained such that the user can clean the data.

I agree that it does seem like a generalization of the parse_dates argument.

mroeschke · 2021-04-11T03:11:54Z

It appears that this issue didn't gain a lot of traction. Thanks for the suggestion, but closing due to lack of activity. Happy to reopen if the community finds this feature useful.

cpcloud mentioned this issue Aug 27, 2013

ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

Closed

mroeschke added the Enhancement label May 7, 2020

mroeschke closed this as completed Apr 11, 2021
