Feature Request: expose full DOM nodes to converters in html_read #14608

Amaelb · 2016-11-07T22:23:04Z

With #13461 bringing converters to read_html, I would be very interested in having the full DOM nodes exposed to them: the added flexibility would make it easy to perform tasks like the one described in #13141 (extracting links instead of text, or any other non-displayed information).

The old behavior could be emulated by changing the default converter (currently None) to something like

def default_converter(tag):
    return remove_whitespace(tag.text)

Would this feature be of interest, or is there a reason to stick with the current parsing?

P.S : I am new to pandas and git-(hub) so please let me know if there is anything wrong in this post.

edit :

The following code emulates the treatment on an arbitrary example. I would say that in general first_link, links (and whitespace) would be the most used converters and would be worth predefining as suggested below.

Sometimes it is useful to parse input tags like <input id="hidObj_12546" value="1" type="hidden">, so we may also want input_id and input_value, or even img_alt. But if the goal is just to cover typical use cases, I would say that links and whitespace are sufficient, with maybe another one for documentation purposes.
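To make the converter names above more concrete, here is a minimal sketch of what such functions might look like. It assumes the node handed to a converter exposes an ElementTree-style API (as lxml elements do); the standard library's xml.etree.ElementTree stands in for the parser here, and cell_text and the sample cell are illustrative names, not part of pandas.

```python
import xml.etree.ElementTree as ET


def cell_text(node):
    """Return all text inside the cell with runs of whitespace collapsed."""
    return " ".join("".join(node.itertext()).split())


def links(node):
    """Return the href of every <a> element inside the cell."""
    return [a.get("href") for a in node.iter("a") if a.get("href") is not None]


def first_link(node):
    """Return the first href in the cell, or None if there is no link."""
    found = links(node)
    return found[0] if found else None


def input_value(node):
    """Return the value attribute of the first <input> in the cell, if any."""
    field = node.find(".//input")
    return None if field is None else field.get("value")


# A hand-written cell standing in for what the parser would hand over.
cell = ET.fromstring(
    '<td> <a href="/browse?KEY=ABC.1">Series\n  one</a>'
    '<input id="hidObj_12546" value="1" type="hidden" /></td>'
)
```

With a BeautifulSoup flavor the bodies would differ (for example tag.a['href'] as in the code below), which is one reason predefined converters would need to be written per flavor.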

from __future__ import (print_function)
import pandas as pa
import re


def series_key(tag):
    return re.search(r'KEY=(.*)',tag.a['href']).group(1)


def first_link(tag):
    return tag.a['href']


def default_conv(tag):
    return tag.text  # whitespace should be removed here


def changed_read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None,
                        attrs=None, parse_dates=False, tupleize_cols=False, thousands=',', encoding=None,
                        decimal='.', converters=None, na_values=None, keep_default_na=True):

    # Temporarily patch the parser so it returns the raw <td> nodes instead of their text.
    original_parse_raw_data = pa.io.html._HtmlFrameParser._parse_raw_data

    def new_parse_raw_data(self, rows):
        return [list(self._parse_td(row)) for row in rows]

    pa.io.html._HtmlFrameParser._parse_raw_data = new_parse_raw_data
    try:
        df = pa.read_html(io, match, flavor, header, index_col, skiprows,
                        attrs, parse_dates, tupleize_cols, thousands, encoding,
                        decimal, converters, na_values, keep_default_na)
    finally:
        # Restore the original method even if read_html raises.
        pa.io.html._HtmlFrameParser._parse_raw_data = original_parse_raw_data
    return df


converters = {0 : default_conv, 1 : series_key, 2 : default_conv, 3 : default_conv}


ecb_old = pa.read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                        attrs = {'class': 'tableopenpage'})
ecb = changed_read_html('http://sdw.ecb.europa.eu/', match='Selected Indicators for the Euro Area',
                        attrs = {'class': 'tableopenpage'}, converters = converters)

print("actual read_html\n")
print(ecb_old)
print("\nproposed read_html\n")
print(ecb)


jreback · 2016-11-07T22:38:07Z

can u update the top with some examples of converters that are generally useful?
and would it make sense to allow converters to take things like:

{'col 1': 'whitespace', 'col 2': 'links', 'col 3': ['whitespace', 'links']}

IOW some predefined mappings to functions which perform certain conversion tasks (of course can also pass a callable)

is the chaining useful (col 3)?
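A minimal sketch of how such a spec might be resolved into callables. This is an assumption about one possible design, not existing pandas behavior: chaining is read here as function composition (each step receives the previous step's output), and the PREDEFINED registry with its "whitespace" and "lower" entries is hypothetical.

```python
import re

# Hypothetical registry of named converters; the names are illustrative only.
PREDEFINED = {
    "whitespace": lambda text: re.sub(r"\s+", " ", text).strip(),
    "lower": lambda text: text.lower(),
}


def resolve(spec):
    """Turn a name, a callable, or a list of either into a single callable."""
    if callable(spec):
        return spec
    if isinstance(spec, str):
        return PREDEFINED[spec]
    steps = [resolve(item) for item in spec]

    def chained(value):
        # Apply the steps in order, feeding each result to the next step.
        for step in steps:
            value = step(value)
        return value

    return chained


def resolve_converters(converters):
    """Resolve every entry of a converters dict to a plain callable."""
    return {column: resolve(spec) for column, spec in converters.items()}


resolved = resolve_converters(
    {"col 1": "whitespace", "col 2": str.upper, "col 3": ["whitespace", "lower"]}
)
```

Under this reading the order of a chain matters, so a chain that mixes node-level and text-level converters would have to list the node-level step first.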

Amaelb · 2016-11-08T20:26:08Z

Predefined mappings would be quite helpful and chaining would make them much more flexible. So I would be very happy to have both.
Would chaining be easy to implement (assuming everything else is in place)?

Using converters, I ran into something strange:
If I understand correctly, converters seems to throw an error if an integer key greater than the number of columns is provided. Since read_html returns a list of DataFrames, it is thus not possible to reach the last columns of most of the DataFrames if one of them has fewer columns than the others.
Maybe converters should be passed to read_html as a list of dicts?
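One way to picture a workaround is to trim the converters for each table before they are applied. The sketch below is only an illustration of that idea under the assumption that the column count of each table is known at that point; converters_for_table is a hypothetical helper, not a pandas function.

```python
def converters_for_table(converters, n_columns):
    """Keep only the converters whose integer key exists in this table."""
    return {
        key: func
        for key, func in converters.items()
        # Non-integer keys (column names) are passed through untouched.
        if not isinstance(key, int) or key < n_columns
    }


converters = {0: str.strip, 3: str.upper}
narrow = converters_for_table(converters, n_columns=2)  # table with 2 columns
wide = converters_for_table(converters, n_columns=4)    # table with 4 columns
```

A list of dicts, one per table, would be more explicit but requires knowing in advance how many tables match and in which order.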

Amaelb · 2016-11-09T12:22:07Z

I will open a new ticket about the aforementioned problem, as it is confirmed on other examples.

jorisvandenbossche · 2016-11-11T08:48:57Z

@Amaelb I am not a heavy user of read_html, but from your examples this does indeed seem useful. If you want to work on this, we would certainly like to review/accept a PR I think.

Amaelb · 2016-11-17T23:20:18Z

@jorisvandenbossche I have taken some time to evaluate the feasibility of this.

When read_html returns a single DataFrame, there is a quite straightforward solution using converters.
But if several tables are selected, #14624 prevents the use of converters.

Since modifying parsing.py to fix #14624 would have an impact on the performance of other I/O tools, it needs to be thought through carefully, and I do not have enough background in pandas to try this.

However, there may be a simple solution: adding a parameter to read_html that gives the choice of returning either the text or the full node in the DataFrame, leaving the user to handle the data.

But this is somewhat different from what was proposed here, and from what was further discussed by @jreback.

Let me know your opinion about it.

psychemedia · 2017-01-21T15:24:05Z

One issue I keep facing is multiple <br>-separated lines in a cell being collapsed in a way that is hard to parse, e.g. this<br>or that becomes thisor that.

Having a converter that replaces <br> with a space, or splits blocks (<div>, <p> etc.) into lists of blocked content, might be handy.
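A rough sketch of such a converter, assuming it receives the cell's inner HTML as a string. It uses only the standard library's html.parser; BlockAwareText, split_blocks, br_to_space and the choice of block tags are illustrative names and assumptions, not pandas API.

```python
from html.parser import HTMLParser


class BlockAwareText(HTMLParser):
    """Collect text, starting a new block at <br> and at block-level tags."""

    BLOCK_TAGS = {"br", "p", "div", "li"}  # assumed set, extend as needed

    def __init__(self):
        super().__init__()
        self.blocks = [[]]

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.blocks.append([])

    def handle_endtag(self, tag):
        # <br> has no content, so only real containers close a block.
        if tag in self.BLOCK_TAGS and tag != "br":
            self.blocks.append([])

    def handle_data(self, data):
        self.blocks[-1].append(data)


def split_blocks(cell_html):
    """Return the non-empty text blocks of a cell, whitespace collapsed."""
    parser = BlockAwareText()
    parser.feed(cell_html)
    parser.close()
    texts = (" ".join("".join(block).split()) for block in parser.blocks)
    return [text for text in texts if text]


def br_to_space(cell_html):
    """Join the blocks of a cell with single spaces."""
    return " ".join(split_blocks(cell_html))
```

For the example above, split_blocks("this<br>or that") gives the two pieces separately and br_to_space joins them with a space.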

jnothman · 2017-05-24T02:19:27Z

+1 for the need to identify logical linebreaks defined by <p>, <br> and perhaps other block-nodes.

Another way of treating this problem is to just have a converter called raw_html, which will do no cleaning and conversion, allowing the user to post-process the contents of each cell. While this does not provide predefined user-friendly converters (at least not initially), it gives users full power while exploiting read_html as far as is reasonable, and avoids issues of code not handling different flavors.
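As an illustration of that post-processing, suppose a hypothetical raw_html converter has left each cell as an HTML string. The user could then pull out whatever they need with ordinary tools; the sketch below extracts link targets using the standard library, and hrefs_from_raw and the sample cells are made-up names for the example.

```python
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(
                value for name, value in attrs if name == "href" and value
            )


def hrefs_from_raw(cell_html):
    """Return all link targets found in one raw HTML cell."""
    collector = _HrefCollector()
    collector.feed(cell_html)
    collector.close()
    return collector.hrefs


# Cells as they might look if no cleaning had been applied.
raw_cells = ['<a href="/a">A</a> and <a href="/b">B</a>', "plain text"]
extracted = [hrefs_from_raw(cell) for cell in raw_cells]
```

The same function could be applied column-wise to a DataFrame of raw cells after read_html returns.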

jnothman · 2017-05-24T02:39:31Z

The main problem I can see for any of these is that in the current code structure we need to determine whether to retain raw HTML before we've identified column names used to key conversion.

A clean solution would require the identification of column names to occur in _parse_raw_data

A more feasible but slightly hacky solution would always return raw elements, but then clean headers ~~and index~~, as well as hacking converters to have a conversion function for everything (a bit like above).
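The "conversion function for everything" part could look roughly like the following. This is only a sketch of the idea, assuming the column keys are known when the converters are completed; fill_converters is a hypothetical helper and the default shown is a stand-in for the usual text cleaning.

```python
def fill_converters(converters, columns, default):
    """Return a converter for every column, using `default` where none was given."""
    converters = converters or {}
    return {column: converters.get(column, default) for column in columns}


# User asked for a custom converter on column 1 only; the rest get the default.
filled = fill_converters({1: str.upper}, range(3), default=str.strip)
```

The difficulty noted above remains: the column names have to be identified before this mapping can be built.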

jnothman · 2017-05-24T13:05:12Z

It seems like cleaning column names before TextParser is hard since the logic for identifying headers is not trivial. Similarly, it's hard to do TextParser before name-specified options are applied because their logic is non-trivial.

This is proving hard to implement without adding extra features to PythonParser such as header_converter and default_converter.

psychemedia · 2017-05-25T08:57:36Z

@jnothman Having the option to get the raw HTML would always be useful as a fall back if the current parsers do break for a particular scrape?

psychemedia · 2019-04-29T10:18:12Z

@jnothman Just wondering if a raw_html option to converters for returning the raw HTML from a td element is still a possibility?

jnothman · 2019-04-29T11:05:51Z

Not sure what you want me to say about it...

Amaelb mentioned this issue Nov 9, 2016

IndexError using converters in read_html #14624

Open

jorisvandenbossche added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) and Enhancement labels Nov 11, 2016

TomAugspurger mentioned this issue Aug 19, 2019

[ENH] pandas.read_html argument to interpret hyperlinks as links (not merely text) #13141

Closed

abitrolly mentioned this issue Nov 10, 2019

pd.read_html() convert <br> to space #29528

Closed
