Skip to content

ENH: Adding additional keywords to read_html for #13461 #13575

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gte620v
Copy link
Contributor

@gte620v gte620v commented Jul 7, 2016

@jorisvandenbossche
Copy link
Member

@gte620v The suggestion in #10534 was to add dtype argument (although the converters keyword will be able to achieve a similar goal). Can you have a look if it is easy to pass through that keyword as well?

In the same line, I think it would be useful to at once evaluate the other keywords to see which ones could be useful for html parsing.

@jorisvandenbossche jorisvandenbossche added Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jul 7, 2016
</tr>
</tbody>
</table>"""
raw_data = np.array([[u'R_l0_g0', '0.763', 0.233],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls compare with expected DataFrame.

@gte620v
Copy link
Contributor Author

gte620v commented Jul 8, 2016

Can you have a look if it is easy to pass through that keyword as well?

My initial impression is that dtype can not be easily dropped in since it is not supported by the python engine parser: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L463-L470

For reasons I haven't poked around enough to understand, the read_html function uses the TextParser function which uses the python engine: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1600-L1601

In the same line, I think it would be useful to at once evaluate the other keywords to see which ones could be useful for html parsing.

Sure, I'll take a look. It seems a bit cumbersome having to copy over args in 3 or 4 places to add them to read_html. I poke through the code to see if there is a better way to add all applicable args. Even without finding a better way to do things, it should be straightforward to use all of the TextParser args.

Please let me know if you have any suggestions.

@gte620v
Copy link
Contributor Author

gte620v commented Jul 8, 2016

It would be easiest to just pass around kwargs from TextParser in https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1546-L1601

Is there an existing better way to do that other than explicitly calling out the arguments in each of these places? If not, I can probably work about a better way to do it. Let me know if you have any thoughts.

@codecov-io
Copy link

codecov-io commented Jul 8, 2016

Current coverage is 84.53%

Merging #13575 into master will increase coverage by 0.21%

@@             master     #13575   diff @@
==========================================
  Files           138        141     +3   
  Lines         51157      51147    -10   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43132      43235   +103   
+ Misses         8025       7912   -113   
  Partials          0          0          

Powered by Codecov. Last updated by ba82b51...2abb473

@gte620v gte620v changed the title ENH: Adding converters and na_values arguments to read_html for #13461 and #10534 ENH: Adding converters, na_values, keep_default_na, squeeze, and date_parser arguments to read_html for #13461 and #10534 Jul 8, 2016
@gte620v
Copy link
Contributor Author

gte620v commented Jul 8, 2016

@jorisvandenbossche I added support for a few more arguments: keep_default_na, squeeze, and date_parser.

I also consolidated the argument definitions from being called in several places to only being enumerated in the read_html function. This should make it easier to add more arguments to that function in the future.

I could not easily add dtype support since read_html uses the python engine.

Let me know if you want me to make more changes.

@jorisvandenbossche jorisvandenbossche added this to the 0.19.0 milestone Jul 11, 2016
- The ``pd.read_html()`` has gained support for the ``converters`` option (:issue:`13461`)
- The ``pd.read_html()`` has gained support for the ``keep_default_na`` option (:issue:`13461`)
- The ``pd.read_html()`` has gained support for the ``squeeze`` option (:issue:`13461`)
- The ``pd.read_html()`` has gained support for the ``date_parser`` option (:issue:`13461`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you combine this in one entry?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jul 11, 2016

@gte620v The way of passing through looks good! Thanks

Regarding the extra keywords, maybe we should evaluate each keyword on its usefulness to add (sorry if I was a bit fast in pushing you to add more keywords):

  • keep_default_nan -> certainly good to add (needed if we also add na_values, which was the original issue)
  • squeeze / date_parser -> those are maybe less essential (the same effect can be easily obtained after parsing), so is maybe not worth the additional clutter in the long list of possible arguments

OK regarding not adding dtype for now (it's indeed not supported with the python engine), but can you then remove the note that it closed that issue? -> changed the PR title, so OK

@jorisvandenbossche jorisvandenbossche changed the title ENH: Adding converters, na_values, keep_default_na, squeeze, and date_parser arguments to read_html for #13461 and #10534 ENH: Adding converters, na_values, keep_default_na, squeeze, and date_parser arguments to read_html for #13461 Jul 11, 2016
@jorisvandenbossche jorisvandenbossche changed the title ENH: Adding converters, na_values, keep_default_na, squeeze, and date_parser arguments to read_html for #13461 ENH: Adding additional keywords to read_html for #13461 Jul 11, 2016
@gte620v
Copy link
Contributor Author

gte620v commented Jul 11, 2016

Sure, will do!

Let me know definitively about squeeze and date_parser. I can leave or remove.

@jorisvandenbossche
Copy link
Member

@jreback Opinion on adding squeeze and date_parser keywords to to_html?
("it's easy to support" (just passing through to TextParser) vs "not essential keywords so only cluttering long list of possible keywords" (in contrast to the NaN handling keywords, the effect of squeeze and date_parser are easily obtained manually after parsing))

@jreback
Copy link
Contributor

jreback commented Jul 19, 2016

squeeze should not be added; its purpose is to return a Series; while this method returns a list of dataframes (always), I would raise if its not-None.

date_parser is kind of useless. you already have parse_dates to auto parse most dates, otherwise you use .to_datetime (usually with a format). To be honest I would blow that away from the csv parsers, but that's another thing. So raise here if its passed as well (a ValueError)

@gte620v
Copy link
Contributor Author

gte620v commented Jul 19, 2016

@jreback Thanks. I removed squeeze and date_parser. I think it is good to go. Let me know if I misunderstood.

@@ -207,6 +207,8 @@ Other enhancements
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``na_filter`` option (:issue:`13321`)
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``memory_map`` option (:issue:`13381`)

- The ``pd.read_html()`` has gained support for the ``na_values``, ``converters``, ``keep_default_na`` options (:issue:`13461`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need an update in the docs themsleves? e.g. an example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: added to io.rst

@gte620v
Copy link
Contributor Author

gte620v commented Jul 19, 2016

Sure, I can do that.

@gte620v
Copy link
Contributor Author

gte620v commented Jul 19, 2016

I added examples to the docs. Let me know if you want a modification or if I need to add something someplace else.

@@ -1959,6 +1959,29 @@ Specify an HTML attribute
dfs2 = read_html(url, attrs={'class': 'sortable'})
print(np.array_equal(dfs1[0], dfs2[0])) # Should be True

Specify values that should be converted to NaN

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a version added tag for these

@jreback
Copy link
Contributor

jreback commented Jul 20, 2016

minor comment. ping when green.

@jorisvandenbossche
Copy link
Member

@gte620v Thanks a lot!

@jorisvandenbossche jorisvandenbossche merged commit 4d3b6c1 into pandas-dev:master Jul 21, 2016
@gte620v
Copy link
Contributor Author

gte620v commented Jul 21, 2016

My pleasure!

@gte620v gte620v deleted the add_functions_to_read_html branch July 21, 2016 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Expose more parsing options in html_read
5 participants