Feature Request: expose full DOM nodes to converters in read_html #14608
Comments
Can you update the top with some examples of converters that are generally useful? e.g. {'col 1': 'whitespace', 'col 2': 'links', 'col 3': ['whitespace', 'links']}. In other words, some predefined mappings to functions which perform certain conversion tasks (of course a callable could also be passed). Is the chaining useful (col 3)?
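Something along those lines could be sketched roughly as follows (hypothetical names, not an existing pandas API; it assumes the converter receives an lxml element rather than the cell text):

```python
import re

# Hypothetical predefined converters, keyed by name (assumes lxml elements).
PREDEFINED = {
    "whitespace": lambda node: re.sub(r"\s+", " ", node.text_content()).strip(),
    "links": lambda node: [a.get("href") for a in node.findall(".//a")],
}

def resolve_converter(spec):
    # spec can be a predefined name, a callable, or a list of either.
    # A list applies every converter to the cell and collects the results,
    # which is one possible reading of the chaining in 'col 3' above.
    if isinstance(spec, (list, tuple)):
        funcs = [resolve_converter(s) for s in spec]
        return lambda node: [f(node) for f in funcs]
    if callable(spec):
        return spec
    return PREDEFINED[spec]
```

With something like this, converters={'col 3': ['whitespace', 'links']} would hand back both the cleaned text and the link targets for that column.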
Predefined mappings would be quite helpful and chaining would make them much more flexible. So I would be very happy to have both. Using
I will open a new ticket about the aforementioned problem, as it is confirmed on other examples.
@Amaelb I am not a heavy user of
@jorisvandenbossche I have taken some time to evaluate the feasibility of this. Since modifying parsing.py to fix #14624 would have an impact on the performance of other I/O tools, it needs to be thought through carefully, and I do not have enough background on pandas to try this. However, there is maybe a simple solution: adding a parameter to
But this is somewhat different from what was proposed here, and from what was further discussed by @jreback. Let me know your opinion about it.
One issue I keep facing is the collapsing of multiple
Having a converter that replaces
+1 for the need to identify logical linebreaks defined by
Another way of treating this problem is to just have a converter called
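Assuming the collapsed elements are `<br>` tags and that lxml elements were exposed to converters, such a converter could look roughly like this (purely a sketch):

```python
import copy

def linebreaks(node):
    # Work on a copy so the original DOM is left untouched.
    node = copy.deepcopy(node)
    # Turn every <br> inside the cell into a real newline before
    # extracting the text, so logical linebreaks survive.
    for br in node.findall(".//br"):
        br.tail = "\n" + (br.tail or "")
    return node.text_content()
```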
The main problem I can see for any of these is that in the current code structure we need to decide whether to retain raw HTML before we have identified the column names used to key the conversion. A clean solution would require the identification of column names to occur in
A more feasible but slightly hacky solution would always return raw elements, but then clean headers
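Outside of the pandas internals, a toy illustration of that hacky route (all names hypothetical) could look like this:

```python
import re
import lxml.html

def cell_text(node):
    # Plain-text conversion: collapse the element's text content.
    return re.sub(r"\s+", " ", node.text_content()).strip()

def parse_table(table_html, converters=None):
    # Keep raw lxml elements for every cell first...
    table = lxml.html.fromstring(table_html)
    rows = [tr.findall("th") + tr.findall("td") for tr in table.findall(".//tr")]
    header, body = rows[0], rows[1:]
    # ...then clean only the header cells, so the resulting column names
    # can be used to key per-column converters.
    columns = [cell_text(th) for th in header]
    converters = converters or {}
    data = [
        [converters.get(col, cell_text)(cell) for col, cell in zip(columns, row)]
        for row in body
    ]
    return columns, data
```

Doing the same inside read_html would of course have to cope with MultiIndex headers and the existing parser classes, which is where it gets harder.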
It seems like cleaning column names before
This is proving hard to implement without adding extra features to
@jnothman Having the option to get the raw HTML would always be useful as a fallback if the current parsers break for a particular scrape?
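If full nodes were exposed to converters, such a fallback could be as small as this (hypothetical sketch, assuming lxml elements):

```python
import lxml.html

def raw_html(node):
    # Keep the cell's raw HTML so nothing is lost when the normal
    # text parsing is not good enough for a particular scrape.
    return lxml.html.tostring(node, encoding="unicode")
```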
@jnothman Just wondering if a
Not sure what you want me to say about it...
With #13461 bringing `converters` in `read_html`, I would be very interested in having the full DOM nodes exposed to them: the added flexibility would make it easy to perform tasks like the one described in #13141 (extracting links instead of text, or any other non-displayed information). The old behavior could be emulated by changing the default converter (currently `None`) to something like the text extractor sketched below.

Would this feature be interesting, or is there a reason to stick with the current parsing?
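For illustration only, a minimal sketch of such a default converter, assuming converters would receive lxml elements (not what pandas does today):

```python
import re

def default_converter(node):
    # Emulate the current behavior: keep only the cell's text content,
    # with whitespace collapsed.
    return re.sub(r"\s+", " ", node.text_content()).strip()
```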
P.S.: I am new to pandas and to Git/GitHub, so please let me know if there is anything wrong in this post.
edit: The following code emulates the treatment on a random example (a rough version is sketched after this paragraph). I would say that in general `first_link`, `links` (and `whitespace`) would be the most used converters and would be worth predefining, as suggested in the comments. Sometimes it is interesting to parse input tags like `<input id="hidObj_12546" value="1" type="hidden">`, so we may also want `input_id` and `input_value`, or even `img_alt`. But if the goal is just to cover typical use cases, I would say that `links` and `whitespace` are sufficient, with maybe another one for documentation purposes.
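A rough sketch of converters along those lines, assuming lxml elements are handed to `converters` (hypothetical helpers, not current pandas behavior):

```python
def links(node):
    # All link targets in the cell.
    return [a.get("href") for a in node.findall(".//a")]

def first_link(node):
    # The first link target, or None if the cell contains no <a>.
    hrefs = links(node)
    return hrefs[0] if hrefs else None

def input_value(node):
    # The value of an input tag such as
    # <input id="hidObj_12546" value="1" type="hidden">.
    inp = node.find(".//input")
    return inp.get("value") if inp is not None else None

def img_alt(node):
    # The alt text of the first image in the cell, if any.
    img = node.find(".//img")
    return img.get("alt") if img is not None else None
```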