fix for read_html with bs4 failing on table with header and one column #12975

Closed
wants to merge 3 commits into from

@hnykda hnykda commented Apr 24, 2016

  • closes #9178
  • The test is added and passing (while failing before the fix).
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

This is the fix proposed in PR 9194, which was closed because it lacked tests. The tests are added now.

@jreback
Contributor

jreback commented Apr 25, 2016

git diff master | flake8 --diff

@jreback jreback added the Bug and IO HTML (read_html, to_html, Styler.apply, Styler.applymap) labels Apr 25, 2016
@hnykda
Author

hnykda commented Apr 25, 2016

Should be OK now.

"""
Don't fail with bs4 when there is a header and only one column
"""
data = StringIO('''<html>
Contributor

add the issue number as a comment

@jreback
Contributor

jreback commented Apr 25, 2016

small comments. ping when green.

@jreback jreback added this to the 0.18.1 milestone Apr 25, 2016
@hnykda
Author

hnykda commented Apr 25, 2016

Done.

(I wasn't sure whether I could use (:issue 9178), so I added it as a regular comment.)

data = StringIO('''<html>
<body>
<table>
<thead>
Contributor

this doesn't seem to replicate the error message though:

In [1]:         data = StringIO('''<html>
   ...:             <body>
   ...:              <table>
   ...:                 <thead>
   ...:                     <tr>
   ...:                         <th>Header</th>
   ...:                     </tr>
   ...:                 </thead>
   ...:                 <tbody>
   ...:                     <tr>
   ...:                         <td>first</td>
   ...:                     </tr>
   ...:                 </tbody>
   ...:             </table>
   ...:             </body>
   ...:         </html>''')

In [2]: pd.read_html(data)
Out[2]: 
[  Header
 0  first]

In [3]: pd.read_html(data)[0]
Out[3]: 
  Header
0  first

In [4]: pd.__version__
Out[4]: u'0.18.0'

Contributor

shouldn't this raise a similar error?

Author

@hnykda hnykda Apr 25, 2016

You forgot to add flavor='bs4'. When I do:

In [2]: pandas.__version__
Out[2]: '0.18.0'

In [3]: s = '''<html>
            <body>
             <table>
                <thead>
                    <tr>
                        <th>Header</th>
                    </tr>
                </thead>
                <tbody>
                    <tr>
                        <td>first</td>
                    </tr>
                </tbody>
            </table>
            </body>
        </html>'''

In [4]: pandas.read_html(s, flavor="bs4")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5f8f3ea79c02> in <module>()
----> 1 pandas.read_html(s, flavor="bs4")

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding)
    868     _validate_header_arg(header)
    869     return _parse(flavor, io, match, header, index_col, skiprows,
--> 870                   parse_dates, tupleize_cols, thousands, attrs, encoding)

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _parse(flavor, io, match, header, index_col, skiprows, parse_dates, tupleize_cols, thousands, attrs, encoding)
    741                                       parse_dates=parse_dates,
    742                                       tupleize_cols=tupleize_cols,
--> 743                                       thousands=thousands))
    744         except StopIteration:  # empty table
    745             continue

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _data_to_frame(data, header, index_col, skiprows, parse_dates, tupleize_cols, thousands)
    622 
    623     # fill out elements of body that are "ragged"
--> 624     _expand_elements(body)
    625 
    626     tp = TextParser(body, header=header, index_col=index_col,

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _expand_elements(body)
    599 
    600 def _expand_elements(body):
--> 601     lens = Series(lmap(len, body))
    602     lens_max = lens.max()
    603     not_max = lens[lens != lens_max]

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/compat/__init__.py in lmap(*args, **kwargs)
    116 
    117     def lmap(*args, **kwargs):
--> 118         return list(map(*args, **kwargs))
    119 
    120     def lfilter(*args, **kwargs):

TypeError: len() of unsized object

With the patched version it works:

In [3]: import pandas

In [4]: pandas.__version__
Out[4]: '0.18.0+145.g9b6f9f2'

In [5]: s = '''<html>
            <body>
             <table>
                <thead>
                    <tr>
                        <th>Header</th>
                    </tr>
                </thead>
                <tbody>
                    <tr>
                        <td>first</td>
                    </tr>
                </tbody>
            </table>
            </body>
        </html>'''

In [6]: pandas.read_html(s, flavor="bs4")
Out[6]: 
[  Header
 0  first]
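
The traceback above ends in `_expand_elements`, which per its own comment fills out "ragged" rows and begins by taking `len` of every element of `body`. The following is a minimal, stdlib-only sketch of that kind of padding step, not pandas' actual implementation; `expand_rows` and `Unsized` are hypothetical names used only for illustration. It shows why every element, including the header of a one-column table, has to be a sized sequence.

```python
def expand_rows(body, fill=""):
    """Pad every row in ``body`` to the length of the longest row."""
    # Each element must be a sized sequence; a bare scalar has no len().
    lens = [len(row) for row in body]
    width = max(lens)
    return [list(row) + [fill] * (width - n) for row, n in zip(body, lens)]


class Unsized:
    """Hypothetical stand-in for a value that does not define __len__."""


# Header and data rows are all sequences, so a single column is fine.
print(expand_rows([["Header"], ["first"]]))

# A header that has collapsed into an unsized value breaks the len() call.
try:
    expand_rows([Unsized(), ["first"]])
except TypeError as exc:
    print("TypeError:", exc)
```

If the header of a one-column table reaches such a step as an unsized value rather than a one-element sequence, the `len` call fails before any padding happens, which matches the shape of the error reported here.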

@jreback
Contributor

jreback commented Apr 25, 2016

@hnykda ahh, I see. We are testing with multiple flavors. I think we default to lxml, which I have installed, so it works now. OK then, ping on green.
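
The exchange above turns on which parser backend actually runs: without a `flavor` argument the lxml backend was used, so the bug in the bs4 path never triggered. A rough sketch of that selection idea follows, assuming a simple "first installed backend wins" rule. `pick_flavor`, `is_importable`, and `FLAVOR_REQUIREMENTS` are hypothetical names, and this is not how pandas implements the choice internally.

```python
import importlib.util

# Assumed mapping from flavor name to the modules it needs (illustrative only).
FLAVOR_REQUIREMENTS = {
    "lxml": ("lxml",),
    "bs4": ("bs4", "html5lib"),
}


def is_importable(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_flavor(order=("lxml", "bs4"), requirements=FLAVOR_REQUIREMENTS):
    """Return the first flavor in ``order`` whose modules are all importable."""
    for flavor in order:
        if all(is_importable(name) for name in requirements[flavor]):
            return flavor
    return None


# Which flavor a default call would reach first on this machine (may be None).
print(pick_flavor())
```

Passing `flavor="bs4"` explicitly, as in the reproduction above, bypasses any such default and exercises the bs4 code path directly, which is why the regression test has to pin the flavor.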

@hnykda
Author

hnykda commented Apr 25, 2016

Exactly.

Everything is green.

@jreback
Contributor

jreback commented Apr 25, 2016

thanks @hnykda

@jreback jreback closed this in bec5272 Apr 25, 2016