ENH: Adding additional keywords to read_html for #13461 #13575

gte620v · 2016-07-07T06:47:15Z

closes Feature Request: Expose more parsing options in html_read #13461
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jorisvandenbossche · 2016-07-07T08:00:28Z

@gte620v The suggestion in #10534 was to add a dtype argument (although the converters keyword will be able to achieve a similar goal). Can you have a look at whether it is easy to pass through that keyword as well?

Along the same lines, I think it would be useful to evaluate the other keywords at the same time to see which ones could be useful for HTML parsing.
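To illustrate the remark above that a converter can achieve a goal similar to dtype, here is a minimal sketch that uses only the Python standard library, not pandas. `CellCollector` and `parse_column` are hypothetical names introduced for this illustration. The point is that table cells arrive as text, so a per-column callable decides what type each value ends up with.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row (stdlib-only sketch)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def parse_column(html, column, converter):
    """Return one column of the table, passing each raw cell through `converter`."""
    collector = CellCollector()
    collector.feed(html)
    header, *body = collector.rows
    index = header.index(column)
    return [converter(row[index]) for row in body]


html = "<table><tr><th>zip</th></tr><tr><td>01234</td></tr></table>"
as_int = parse_column(html, "zip", int)  # numeric conversion drops the leading zero
as_str = parse_column(html, "zip", str)  # a str converter keeps the text intact
```

Here `as_int` holds the integer 1234 while `as_str` keeps the string "01234", which is the kind of control a dtype argument would otherwise provide.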

sinhrks · 2016-07-07T14:10:03Z

pandas/io/tests/test_html.py

+                            </tr>
+                        </tbody>
+                    </table>"""
+        raw_data = np.array([[u'R_l0_g0', '0.763', 0.233],


Please compare with the expected DataFrame.

gte620v · 2016-07-08T02:35:36Z

> Can you have a look if it is easy to pass through that keyword as well?

My initial impression is that dtype cannot be easily dropped in, since it is not supported by the python engine parser: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L463-L470

For reasons I haven't poked around enough to understand, the read_html function uses the TextParser function which uses the python engine: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1600-L1601

> Along the same lines, I think it would be useful to evaluate the other keywords at the same time to see which ones could be useful for HTML parsing.

Sure, I'll take a look. It seems a bit cumbersome to have to copy args over in 3 or 4 places to add them to read_html. I'll poke through the code to see if there is a better way to add all applicable args. Even without finding a better way to do things, it should be straightforward to use all of the TextParser args.

Please let me know if you have any suggestions.

gte620v · 2016-07-08T02:46:53Z

It would be easiest to just pass around kwargs from TextParser in https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1546-L1601

Is there an existing, better way to do that other than explicitly calling out the arguments in each of these places? If not, I can probably work out a better way to do it. Let me know if you have any thoughts.

codecov-io · 2016-07-08T19:04:33Z

Current coverage is 84.53%

Merging #13575 into master will increase coverage by 0.21%

@@             master     #13575   diff @@
==========================================
  Files           138        141     +3   
  Lines         51157      51147    -10   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43132      43235   +103   
+ Misses         8025       7912   -113   
  Partials          0          0

Powered by Codecov. Last updated by ba82b51...2abb473

gte620v · 2016-07-08T20:57:40Z

@jorisvandenbossche I added support for a few more arguments: keep_default_na, squeeze, and date_parser.

I also consolidated the argument definitions from being called in several places to only being enumerated in the read_html function. This should make it easier to add more arguments to that function in the future.

I could not easily add dtype support since read_html uses the python engine.

Let me know if you want me to make more changes.
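The consolidation described in this comment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual pandas code: `read_tables_sketch`, `_parse_tables`, and `_rows_to_records` are hypothetical names, and the tables are plain lists of string rows. The idea shown is that only the public function enumerates the keywords, while the intermediate layer forwards them as `**kwargs`.

```python
def _rows_to_records(rows, header=0, na_values=None, converters=None):
    """Innermost layer (hypothetical): the only place that interprets the keywords."""
    na_values = set(na_values or ())
    converters = converters or {}
    names = rows[header]
    records = []
    for row in rows[header + 1:]:
        record = {}
        for name, cell in zip(names, row):
            if cell in na_values:
                record[name] = None  # treat as missing
            elif name in converters:
                record[name] = converters[name](cell)
            else:
                record[name] = cell
        records.append(record)
    return records


def _parse_tables(tables, **kwargs):
    """Middle layer: forwards keywords untouched instead of re-listing them."""
    return [_rows_to_records(rows, **kwargs) for rows in tables]


def read_tables_sketch(tables, header=0, na_values=None, converters=None):
    """Public entry point: the single place where keywords are enumerated."""
    return _parse_tables(tables, header=header, na_values=na_values,
                         converters=converters)


tables = [[["name", "score"], ["a", "1"], ["b", "n/a"]]]
result = read_tables_sketch(tables, na_values=["n/a"], converters={"score": int})
```

With this layout, adding another keyword means touching the public signature and the innermost function only; the forwarding layer stays unchanged.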

jorisvandenbossche · 2016-07-11T14:46:24Z

doc/source/whatsnew/v0.19.0.txt

+- The ``pd.read_html()`` has gained support for the ``converters`` option (:issue:`13461`)
+- The ``pd.read_html()`` has gained support for the ``keep_default_na`` option (:issue:`13461`)
+- The ``pd.read_html()`` has gained support for the ``squeeze`` option (:issue:`13461`)
+- The ``pd.read_html()`` has gained support for the ``date_parser`` option (:issue:`13461`)


Can you combine this in one entry?

jorisvandenbossche · 2016-07-11T15:00:14Z

@gte620v The way of passing through looks good! Thanks

Regarding the extra keywords, maybe we should evaluate each keyword on its usefulness to add (sorry if I was a bit fast in pushing you to add more keywords):

keep_default_na -> certainly good to add (needed if we also add na_values, which was the original issue)
squeeze / date_parser -> those are maybe less essential (the same effect can easily be obtained after parsing), so they may not be worth the additional clutter in the long list of possible arguments
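The reason keep_default_na goes together with na_values can be sketched in plain Python. This is an illustration only, not the pandas implementation: `DEFAULT_NA_SKETCH` is a small assumed subset of the default markers, and `effective_na_values` and `mark_missing` are hypothetical helpers. The behaviour modelled is the documented one: user-supplied markers are appended to the defaults unless keep_default_na is False, in which case they replace them.

```python
# Illustrative subset only; the real default list in pandas is longer.
DEFAULT_NA_SKETCH = frozenset({"", "NA", "N/A", "NaN", "nan", "NULL"})


def effective_na_values(na_values=None, keep_default_na=True):
    """Return the set of strings that would be treated as missing."""
    extra = set(na_values or ())
    if keep_default_na:
        return set(DEFAULT_NA_SKETCH) | extra
    return extra


def mark_missing(cells, na_values=None, keep_default_na=True):
    """Replace every cell that matches a missing-value marker with None."""
    markers = effective_na_values(na_values, keep_default_na)
    return [None if cell in markers else cell for cell in cells]


cells = ["Namibia", "NA", "unknown"]
with_defaults = mark_missing(cells, na_values=["unknown"])
without_defaults = mark_missing(cells, na_values=["unknown"], keep_default_na=False)
```

In the second call the literal string "NA" survives (for example as a country code), which is only possible when the default markers can be switched off.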

OK regarding not adding dtype for now (it's indeed not supported with the python engine), ~~but can you then remove the note that it closed that issue?~~ -> changed the PR title, so OK

gte620v · 2016-07-11T15:02:02Z

Sure, will do!

Let me know definitively about squeeze and date_parser. I can leave them in or remove them.

jorisvandenbossche · 2016-07-11T15:06:05Z

@jreback Opinion on adding squeeze and date_parser keywords to read_html?
("it's easy to support" (just passing through to TextParser) vs "not essential keywords, so only cluttering the long list of possible keywords" (in contrast to the NaN handling keywords, the effect of squeeze and date_parser is easily obtained manually after parsing))

jreback · 2016-07-19T01:39:04Z

squeeze should not be added; its purpose is to return a Series, while this method always returns a list of DataFrames. I would raise if it's not None.

date_parser is kind of useless. You already have parse_dates to auto-parse most dates; otherwise you use .to_datetime (usually with a format). To be honest I would blow that away from the csv parsers, but that's another thing. So raise here if it's passed as well (a ValueError).
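The "raise if passed" suggestion could look something like the sketch below. It is a hypothetical helper, not code from this PR: `check_unsupported_keywords` is an invented name, and in this thread the two keywords were subsequently removed from the signature rather than validated.

```python
def check_unsupported_keywords(squeeze=None, date_parser=None):
    """Raise ValueError if a keyword that makes no sense for this reader is supplied.

    Hypothetical helper: a value of None means "not passed".
    """
    candidates = {"squeeze": squeeze, "date_parser": date_parser}
    passed = sorted(name for name, value in candidates.items() if value is not None)
    if passed:
        raise ValueError("unsupported keyword(s): " + ", ".join(passed))


check_unsupported_keywords()  # nothing passed, so no error

try:
    check_unsupported_keywords(squeeze=True)
except ValueError as exc:
    message = str(exc)
```

Using None as the "not passed" sentinel matches the "raise if it's not None" wording, at the cost of not being able to distinguish an explicit None from an omitted argument.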

gte620v · 2016-07-19T03:08:55Z

@jreback Thanks. I removed squeeze and date_parser. I think it is good to go. Let me know if I misunderstood.

jreback · 2016-07-19T13:16:52Z

doc/source/whatsnew/v0.19.0.txt

@@ -207,6 +207,8 @@ Other enhancements
 - The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``na_filter`` option (:issue:`13321`)
 - The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``memory_map`` option (:issue:`13381`)

+- The ``pd.read_html()`` has gained support for the ``na_values``, ``converters``, ``keep_default_na``  options (:issue:`13461`)
+


Do we need an update in the docs themselves? E.g. an example.

Done: added to io.rst

gte620v · 2016-07-19T13:30:37Z

Sure, I can do that.

gte620v · 2016-07-19T14:54:11Z

I added examples to the docs. Let me know if you want a modification or if I need to add something someplace else.

jreback · 2016-07-20T21:41:55Z

doc/source/io.rst

@@ -1959,6 +1959,29 @@ Specify an HTML attribute
   dfs2 = read_html(url, attrs={'class': 'sortable'})
   print(np.array_equal(dfs1[0], dfs2[0]))  # Should be True

+Specify values that should be converted to NaN
+


Can you add a versionadded tag for these?

jreback · 2016-07-20T21:42:56Z

minor comment. ping when green.

jorisvandenbossche · 2016-07-21T14:16:26Z

@gte620v Thanks a lot!

gte620v · 2016-07-21T14:21:03Z

My pleasure!

gte620v added 3 commits July 7, 2016 01:29

Adding converters and na_values arguments to read_html for pandas-dev#13461 and pandas-dev#10534

afc7b2e

fixing lint

b7fa8a7

adding whatsnew

cfe786c

jorisvandenbossche added Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jul 7, 2016

sinhrks reviewed Jul 7, 2016

removing old whatsnew

3f828e2

gte620v added 2 commits July 8, 2016 14:07

Merge branch 'master' into add_functions_to_read_html

8a49bad

adding support for more read_html arguments

ef371d0

gte620v changed the title ~~ENH: Adding converters and na_values arguments to read_html for #13461 and #10534~~ ENH: Adding converters, na_values, keep_default_na, squeeze, and date_parser arguments to read_html for #13461 and #10534 Jul 8, 2016

jorisvandenbossche added this to the 0.19.0 milestone Jul 11, 2016

jorisvandenbossche reviewed Jul 11, 2016

jorisvandenbossche changed the title ~~ENH: Adding converters, na_values, keep_default_na, squeeze, and date_parser arguments to read_html for #13461 and #10534~~ ENH: Adding converters, na_values, keep_default_na, squeeze, and date_parser arguments to read_html for #13461 Jul 11, 2016

jorisvandenbossche changed the title ~~ENH: Adding converters, na_values, keep_default_na, squeeze, and date_parser arguments to read_html for #13461~~ ENH: Adding additional keywords to read_html for #13461 Jul 11, 2016

consolidating args list and whatsnew list

7e6b5fe

removing squeeze and date_parser

dac660a

jreback reviewed Jul 19, 2016

adding new read_html functions to docs

2abb473

jreback reviewed Jul 20, 2016

adding versionadded to docs

5cb8243

jorisvandenbossche merged commit 4d3b6c1 into pandas-dev:master Jul 21, 2016

gte620v deleted the add_functions_to_read_html branch July 21, 2016 14:21

jorisvandenbossche mentioned this pull request Feb 11, 2017

Pandas.read_html missing converted data #15366

Closed
