Support for Merging Columns in HTML Parser #4683

Closed
cancan101 opened this issue Aug 26, 2013 · 7 comments
Labels
API Design Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@cancan101
Contributor

See the table here: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSE

The trailing ")" and the leading "$" are in different columns (i.e. separate td cells) from the number.

There should be an option to merge all columns under a given heading (#4679).

@cpcloud
Member

cpcloud commented Aug 27, 2013

What sort of API would you want for this?

This is an example of a horribly designed and written table that uses style attributes in almost every tag. read_html does not and should not try to deal with such idiosyncrasies of HTML tables. However, this type of operation might be useful in the general case.

Something like: pd.read_html(collapse={'A': [0, 1, 2], 'B': [3, 4]}). This would join columns 0, 1, and 2 and form a new column A and join 3 and 4 to form B. This would be a sort of generalization of the parse_dates keyword argument in read_csv and friends.

One issue is coming up with a way to specify the dtype. I suppose you could use a nested dict, but that seems like too much.
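As a rough illustration of the proposed mapping, here is a minimal sketch that applies the same `{'A': [0, 1, 2], 'B': [3, 4]}` idea after parsing instead of inside `read_html`. The `collapse` keyword does not exist in `read_html`; the helper `collapse_cells` and the sample rows are hypothetical, and plain lists of cell strings stand in for the parsed table.

```python
def collapse_cells(rows, collapse):
    """Join groups of cells, selected by position, into named columns.

    rows     -- list of rows, each a list of cell strings (None for empty)
    collapse -- dict mapping a new column name to the cell positions to join
    """
    merged = []
    for row in rows:
        merged.append({
            # Empty cells (None or "") contribute nothing to the joined text.
            name: "".join(row[i] or "" for i in positions)
            for name, positions in collapse.items()
        })
    return merged


# Hypothetical parsed cells: "$", the digits, and ")" each in their own <td>.
rows = [
    ["$", "3", None, "(3", ")"],
    ["$", "10", None, "(7", ")"],
]
records = collapse_cells(rows, {"A": [0, 1, 2], "B": [3, 4]})
print(records)
```

The joined values are still strings, so the dtype question raised above stays with the caller. The resulting list of dicts can be passed to `pd.DataFrame` if a frame is wanted.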

@jtratner
Contributor

At a certain point you can just do that sort of thing yourself and keep
read_html simple. Did we figure out whether read_html handles MultiIndex or
hierarchical columns?

@cpcloud
Member

cpcloud commented Aug 27, 2013

It does. See #4679. I gave an example.

@jtratner
Contributor

So then I'm not clear on what the value is here. Edge cases where this
would avoid having an intermediate dtype?

@cpcloud
Member

cpcloud commented Aug 28, 2013

Maybe. I think this should be left to the user.

@cancan101
Contributor Author

I am not worried about dtype here. The number of rows is small enough that even if the initial type is string (aka object), the value can always be cast to float after appropriate cleaning is performed (removing thousands separators, changing parentheses to negatives, stripping the dollar sign).
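The cleaning steps listed above could look roughly like the following. This is a sketch only: `parse_amount` is a hypothetical helper, and it assumes every value is already a single merged string in US accounting notation.

```python
def parse_amount(text):
    """Convert an accounting-style string such as "($2,000)" to a float."""
    # Strip the dollar sign and the thousands separators.
    cleaned = text.strip().replace("$", "").replace(",", "")
    # Accounting notation wraps negative values in parentheses.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = float(cleaned)
    return -value if negative else value


for raw in ["$3", "$(3)", "($2,000)", "$12"]:
    print(raw, "->", parse_amount(raw))
```

Once the merged values sit in a string column, a function like this can be mapped over it to produce floats.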

My biggest concern is how the reader would express this data to the user:

|3 months|6 months           |
|A    |B     |A          |B  |
|$|3  |$|(3|)|($2,000|)|$|12 |

In this case the natural result would be a MultiIndex with "3 months" and "6 months" at the top level and then "A" and "B" at the next level. This would mean four distinct columns. However, given the odd layout of the actual data columns, there would be 9 columns in the data section. As long as the anonymous columns are loaded correctly, I could certainly do the merging after the parse.
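To make the four-versus-nine mismatch concrete, here is a small sketch that works out which of the 9 data cells fall under each of the four headings and then joins them. The colspan values are assumptions read off the sketch above, not taken from the actual filing, and `positions_by_heading` is a hypothetical helper, not part of pandas.

```python
def positions_by_heading(top, sub):
    """Map each (top, sub) heading pair to the data-cell positions beneath it.

    top, sub -- lists of (label, colspan) pairs, one list per header row.
    """
    # Expand the top row so every data position knows its top-level label.
    top_labels = [label for label, span in top for _ in range(span)]
    mapping, start = {}, 0
    for label, span in sub:
        mapping[(top_labels[start], label)] = list(range(start, start + span))
        start += span
    return mapping


# Colspans are assumed from the sketch above, not read from the real filing.
top = [("3 months", 5), ("6 months", 4)]
sub = [("A", 2), ("B", 3), ("A", 2), ("B", 2)]
cells = ["$", "3", "$", "(3", ")", "($2,000", ")", "$", "12"]

groups = positions_by_heading(top, sub)
merged = {key: "".join(cells[i] for i in pos) for key, pos in groups.items()}
print(merged)
```

The tuple keys correspond to the two levels a MultiIndex would carry, so the merge after parsing only needs the header colspans to survive the parse.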

@cpcloud While I agree that this is in fact a "horribly designed and written table that uses style attributes in almost every tag", most SEC filings that I have seen share similar layout characteristics (I would be happy to provide more examples). I would say that the ability to load tables from SEC filings is worthwhile even if they make us cringe. I have found one example so far of a ragged table (ick): http://www.sec.gov/Archives/edgar/data/1108524/000119312511075314/d10k.htm#tx138159_26

So, I am all for keeping read_html as simple as possible, as long as enough of the structure is maintained that the user can clean the data.

I agree that it does seem like a generalization of the parse_dates argument.

@mroeschke
Member

It appears that this issue didn't gain a lot of traction. Thanks for the suggestion, but closing due to lack of activity. Happy to reopen if the community finds this feature useful.
