Feature Request: expose full DOM nodes to converters in html_read #14608

Amaelb opened this issue Nov 7, 2016 · 12 comments
Labels: Enhancement, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)

Comments

@Amaelb

Amaelb commented Nov 7, 2016

With #13461 bringing converters to read_html, I would be very interested in having the full DOM nodes exposed to them: the added flexibility would make it easy to perform tasks like the one described in #13141 (extracting links instead of text, or any other non-displayed information).

The old behavior could be emulated by changing the default converter (currently None) to something like

def default_converter(tag):
    return remove_whitespace(tag.text)

Would this feature be interesting or is there a reason to stick with the current parsing ?
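To make the proposal concrete, here is a minimal sketch of what node-aware converters could look like. It is not pandas code: `xml.etree.ElementTree` elements stand in for the lxml or BeautifulSoup nodes that read_html would hand over, and `remove_whitespace`, `default_converter` and `first_link` are hypothetical helpers written for this illustration.

```python
import re
import xml.etree.ElementTree as ET


def remove_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def default_converter(cell):
    """Hypothetical default: emulate the text-only behaviour from a full node."""
    return remove_whitespace("".join(cell.itertext()))


def first_link(cell):
    """Return the href of the first <a> inside the cell, or None if there is none."""
    anchor = cell.find(".//a")
    return anchor.get("href") if anchor is not None else None


# A stand-in for one table cell as a parser would expose it.
cell = ET.fromstring('<td>  <a href="/series?KEY=ABC.1">Euro   area\n HICP</a> </td>')

print(default_converter(cell))  # Euro area HICP
print(first_link(cell))         # /series?KEY=ABC.1
```

With the node available, the text-only result becomes just one converter among others, which is the point of the request.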

P.S.: I am new to pandas and Git(Hub), so please let me know if there is anything wrong with this post.

Edit:

The following code emulates the treatment on an arbitrary example. I would say that, in general, first_link, links (and whitespace) would be the most used converters and would be worth predefining, as suggested below.

Sometimes it is interesting to parse input tags like <input id="hidObj_12546" value="1" type="hidden">, so we may also want input_id and input_value, or even img_alt. But if the goal is just to cover typical use cases, I would say that links and whitespace are sufficient, with maybe another one for documentation purposes.

from __future__ import (print_function)
import pandas as pa
import re


def series_key(tag):
    # Extract the series key from the query string of the cell's first link.
    return re.search(r'KEY=(.*)', tag.a['href']).group(1)


def first_link(tag):
    return tag.a['href']


def default_conv(tag):
    return tag.text  # whitespace should be removed


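def changed_read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None,
                      attrs=None, parse_dates=False, tupleize_cols=False, thousands=',', encoding=None,
                      decimal='.', converters=None, na_values=None, keep_default_na=True):
    """Call read_html with _parse_raw_data temporarily patched to keep the DOM nodes."""
    # _HtmlFrameParser is a private pandas class; this patch is only for demonstration.
    parser_cls = pa.io.html._HtmlFrameParser
    original_parse_raw_data = parser_cls._parse_raw_data

    def new_parse_raw_data(self, rows):
        # Keep each cell node as-is instead of extracting its text.
        return [list(self._parse_td(row)) for row in rows]

    parser_cls._parse_raw_data = new_parse_raw_data
    try:
        return pa.read_html(io, match=match, flavor=flavor, header=header, index_col=index_col,
                            skiprows=skiprows, attrs=attrs, parse_dates=parse_dates,
                            tupleize_cols=tupleize_cols, thousands=thousands, encoding=encoding,
                            decimal=decimal, converters=converters, na_values=na_values,
                            keep_default_na=keep_default_na)
    finally:
        # Restore the original method even if read_html raises.
        parser_cls._parse_raw_data = original_parse_raw_data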


converters = {0: default_conv, 1: series_key, 2: default_conv, 3: default_conv}


ecb_old = pa.read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                       attrs={'class': 'tableopenpage'})
ecb = changed_read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                        attrs={'class': 'tableopenpage'}, converters=converters)

print("current read_html\n")
print(ecb_old)
print("\nproposed read_html\n")
print(ecb)

@jreback
Contributor

jreback commented Nov 7, 2016

Can you update the top post with some examples of converters that are generally useful?
And would it make sense to allow converters to take things like:

{'col 1': 'whitespace', 'col 2': 'links', 'col 3': ['whitespace', 'links']}

In other words, some predefined names mapped to functions that perform certain conversion tasks (a callable could of course still be passed).

Is the chaining useful (col 3)?
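One possible reading of this suggestion, sketched outside pandas: a registry maps predefined names to functions, and a list is treated as a left-to-right chain where each step receives the previous step's result. The registry, the `resolve` helper and the chaining semantics are all assumptions made for illustration, and `xml.etree.ElementTree` elements again stand in for the real cell nodes.

```python
import re
import xml.etree.ElementTree as ET


def whitespace(cell):
    """Return the cell text with whitespace collapsed."""
    return re.sub(r"\s+", " ", "".join(cell.itertext())).strip()


def links(cell):
    """Return every href found in the cell."""
    return [a.get("href") for a in cell.iter("a") if a.get("href")]


# Hypothetical registry of predefined converter names.
NAMED_CONVERTERS = {"whitespace": whitespace, "links": links}


def resolve(spec):
    """Turn a name, a callable, or a list of either into a single callable."""
    if callable(spec):
        return spec
    if isinstance(spec, str):
        return NAMED_CONVERTERS[spec]
    steps = [resolve(item) for item in spec]

    def chained(value):
        # Assumed semantics: each step consumes the previous step's output.
        for step in steps:
            value = step(value)
        return value

    return chained


converters = {"col 1": "whitespace", "col 2": "links", "col 3": ["links", len]}
resolved = {col: resolve(spec) for col, spec in converters.items()}

cell = ET.fromstring('<td> see <a href="/a">A</a> and <a href="/b">B</a> </td>')
print(resolved["col 1"](cell))  # see A and B
print(resolved["col 2"](cell))  # ['/a', '/b']
print(resolved["col 3"](cell))  # 2
```

Under this reading, a chain such as ['whitespace', 'links'] would fail, because 'whitespace' already reduces the node to a string. Whether chaining should compose results or collect several values per cell is exactly the kind of detail the API would have to settle.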

@Amaelb
Author

Amaelb commented Nov 8, 2016

Predefined mappings would be quite helpful, and chaining would make them much more flexible, so I would be very happy to have both.
Would chaining be easy to implement (assuming everything else is in place)?

While using converters, I ran into something strange:
If I understand correctly, converters raises an error when an integer key greater than the number of columns is provided. Since read_html returns a list of DataFrames, it is therefore not possible to reach the last columns of most of the DataFrames if one of them has fewer columns than the others.
Maybe converters should be passed to read_html as a list of dicts?
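The list-of-dicts idea can be sketched as follows. This is not how read_html behaves; plain lists of rows stand in for the parsed tables, and `apply_converters` is a hypothetical helper that pairs each table with its own converter dict so that a key valid for a wide table never touches a narrow one.

```python
def identity(value):
    return value


def apply_converters(tables, converters_per_table):
    """Apply one converter dict per table; columns without a converter are left as-is."""
    if len(tables) != len(converters_per_table):
        raise ValueError("expected one converter dict per table")
    return [
        [[converters.get(i, identity)(cell) for i, cell in enumerate(row)] for row in rows]
        for rows, converters in zip(tables, converters_per_table)
    ]


narrow = [["a", "1"], ["b", "2"]]       # two columns
wide = [["x", "1", "2", " padded "]]    # four columns

# Column 3 only exists in the second table, so only its dict mentions it.
result = apply_converters([narrow, wide], [{1: int}, {1: int, 3: str.strip}])
print(result[0])  # [['a', 1], ['b', 2]]
print(result[1])  # [['x', 1, '2', 'padded']]
```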

@Amaelb
Author

Amaelb commented Nov 9, 2016

I will open a new ticket about the aforementioned problem, as it is confirmed on other examples.

@jorisvandenbossche added the IO HTML and Enhancement labels Nov 11, 2016
@jorisvandenbossche
Member

@Amaelb I am not a heavy user of read_html, but from your examples this does indeed seem useful. If you want to work on this, we would certainly like to review/accept a PR I think.

@Amaelb
Author

Amaelb commented Nov 17, 2016

@jorisvandenbossche I have taken some time to evaluate the feasibility of this.

When read_html returns a single DataFrame, there is a fairly straightforward solution using converters.
But if several tables are selected, #14624 prevents the use of converters.

Since modifying parsing.py to fix #14624 would affect the performance of other I/O tools, it needs careful thought, and I do not have enough background in pandas to attempt it.

However, there may be a simple solution: adding a parameter to read_html that lets the user choose whether the DataFrame holds the text or the full node, leaving the user to handle the data.

But this is somewhat different from what was proposed here, and from what @jreback discussed further.

Let me know your opinion about it.

@psychemedia
Contributor

One issue I keep facing is that multiple <br>-separated lines in a cell are collapsed in a way that is hard to parse, e.g. this<br>or that becomes thisor that.

Having a converter that replaces <br> with a space, or splits blocks (<div>,<p> etc) into lists of blocked content might be handy.
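A rough sketch of such a converter, using only the standard library's `html.parser`. It assumes the converter receives the cell's markup as a string; `BlockSplitter`, `split_blocks`, `br_to_space` and the chosen set of block tags are hypothetical names and choices for this illustration, not part of pandas.

```python
from html.parser import HTMLParser

# Tags treated as logical line breaks (an assumption; extend as needed).
BLOCK_TAGS = {"br", "p", "div", "li"}


class BlockSplitter(HTMLParser):
    """Collect the text of a cell as a list of logical lines."""

    def __init__(self):
        super().__init__()
        self.lines = [""]

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.lines.append("")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.lines.append("")

    def handle_data(self, data):
        self.lines[-1] += data


def split_blocks(cell_html):
    """Return the non-empty logical lines of a cell, whitespace-normalised."""
    parser = BlockSplitter()
    parser.feed(cell_html)
    parser.close()
    return [" ".join(line.split()) for line in parser.lines if line.strip()]


def br_to_space(cell_html):
    """Join the logical lines with a single space instead of dropping the breaks."""
    return " ".join(split_blocks(cell_html))


print(br_to_space("this<br>or that"))                # this or that
print(split_blocks("<p>one</p><p>two  words</p>"))   # ['one', 'two words']
```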

@jnothman
Contributor

+1 for the need to identify logical linebreaks defined by <p>, <br> and perhaps other block-nodes.

Another way of treating this problem is to just have a converter called raw_html, which will do no cleaning and conversion, allowing the user to post-process the contents of each cell. While this does not provide predefined user-friendly converters (at least not initially), it gives users full power while exploiting read_html as far as is reasonable, and avoids issues of code not handling different flavors.
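As an illustration of what a raw_html converter might return, here is a small sketch on an `xml.etree.ElementTree` element (a stand-in for the lxml or BeautifulSoup node; the `raw_html` function below is hypothetical and not an existing pandas converter). It serialises the cell's contents without cleaning anything, leaving all post-processing to the user.

```python
import xml.etree.ElementTree as ET


def raw_html(cell):
    """Return the inner markup of a cell, with no cleaning or conversion."""
    inner = cell.text or ""
    # ET.tostring includes each child's tail text, so nothing between tags is lost.
    inner += "".join(ET.tostring(child, encoding="unicode") for child in cell)
    return inner


cell = ET.fromstring('<td>this<br/>or <a href="/x">that</a></td>')
markup = raw_html(cell)
print(markup)

# The user can then post-process the markup however they like.
print("/x" in markup)
```

The exact serialisation (for example how an empty `<br>` is written) depends on the parser flavor, which is one reason a raw converter pushes less flavor-specific logic into pandas itself.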

@jnothman
Contributor

jnothman commented May 24, 2017

The main problem I can see for any of these is that in the current code structure we need to determine whether to retain raw HTML before we've identified column names used to key conversion.

A clean solution would require the identification of column names to occur in _parse_raw_data

A more feasible but slightly hacky solution would always return raw elements, but then clean headers and index, as well as hacking converters to have a conversion function for everything (a bit like above).

@jnothman
Contributor

It seems like cleaning column names before TextParser is hard since the logic for identifying headers is not trivial. Similarly, it's hard to do TextParser before name-specified options are applied because their logic is non-trivial.

This is proving hard to implement without adding extra features to PythonParser such as header_converter and default_converter.

@psychemedia
Contributor

@jnothman Having the option to get the raw HTML would always be useful as a fallback if the current parsers break on a particular scrape?

@psychemedia
Contributor

@jnothman Just wondering if a raw_html option to converters for returning the raw HTML from a td element is still a possibility?

@jnothman
Contributor

Not sure what you want me to say about it...
