IndexError using converters in read_html #14624

Open
Amaelb opened this issue Nov 9, 2016 · 3 comments
Labels
Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@Amaelb

Amaelb commented Nov 9, 2016

read_html returns a list of DataFrames. Passing a converters parameter (see #13461) applies the converters to each DataFrame. Integer converter keys cannot be greater than the number of columns of the parsed DataFrame minus 1; otherwise an IndexError is raised in io.parsers.PythonParser._convert_data.
But most of the time, the DataFrames returned by read_html have different numbers of columns. Converters are therefore unusable on any column whose index is greater than or equal to min(len(df.columns) for df in pd.read_html(url)).
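To make the constraint concrete, here is a minimal sketch in plain Python. The helper name usable_converters and the table widths [2, 3] are assumptions made up for illustration; they are not part of pandas and are not measured from the URL in the example below. The sketch only shows which integer keys stay valid once the narrowest table is taken into account.

```python
def usable_converters(converters, column_counts):
    """Keep only the converters whose integer key is valid for every table.

    Hypothetical helper for illustration; not a pandas API.
    """
    narrowest = min(column_counts)
    return {
        key: func
        for key, func in converters.items()
        # Non-integer keys (column names) are left alone in this sketch.
        if not isinstance(key, int) or key < narrowest
    }


def converter_one(c):
    return 1


d2 = {0: converter_one, 1: converter_one, 2: converter_one}
column_counts = [2, 3]  # assumed widths of two parsed tables
safe = usable_converters(d2, column_counts)
print(sorted(safe))  # [0, 1]
```

With these assumed widths, key 2 is dropped because the narrowest table has only two columns, which matches the situation where d works and d2 does not.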

Example

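import pandas


def converter_one(c):
    # Trivial converter: replaces every cell with 1.
    return 1


# d only uses column indices 0 and 1; d2 additionally uses index 2.
d = {0: converter_one, 1: converter_one}
d2 = {0: converter_one, 1: converter_one, 2: converter_one}
url = 'https://web.archive.org/web/20160419075502/http://clients.rte-france.com/lang/an/visiteurs/vie/prod/prevision_production.jsp'
# Raises IndexError, whereas the same call with converters=d succeeds.
pandas.read_html(url, converters=d2)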

Expected Output

As in

pandas.read_html(url, converters = d)

This issue may also prevent changing read_html as proposed in #14608.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: 1.0b8
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@jorisvandenbossche jorisvandenbossche added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Nov 11, 2016
@mroeschke
Member

Looks like this example is no longer reproducible. Happy to reopen this issue if we can get a reproducible example.

In [9]: import pandas
   ...:
   ...: def converter_one(c):
   ...:     return 1
   ...:
   ...: d = {0 : converter_one, 1 : converter_one}
   ...: d2 = {0 : converter_one, 1 : converter_one, 2 : converter_one}
   ...: url = 'http://clients.rte-france.com/lang/an/visiteurs/vie/prod/prevision_production.jsp'
   ...: pandas.read_html(url, converters = d2)

ValueError: No tables found

@Amaelb
Author

Amaelb commented May 8, 2020

Changing the url to

url = 'https://web.archive.org/web/20160419075502/http://clients.rte-france.com/lang/an/visiteurs/vie/prod/prevision_production.jsp'

makes it work again.

@mroeschke mroeschke reopened this May 8, 2020
@mroeschke mroeschke added the Bug label May 8, 2020
@paulcwatts

This is a minimal reproduction:

import pandas as pd

pd.read_html(
    """
<table>
<tbody>
  <tr>
    <td>Foo</td>
  </tr>
</tbody>
</table>
""",
    converters={1: lambda x: x},
)

The backtrace:

Traceback (most recent call last):
  File line 3, in <module>
    pd.read_html(
  File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pandas/io/html.py", line 1098, in read_html
    return _parse(
  File "/usr/local/lib/python3.10/site-packages/pandas/io/html.py", line 931, in _parse
    ret.append(_data_to_frame(data=table, **kwargs))
  File "/usr/local/lib/python3.10/site-packages/pandas/io/html.py", line 811, in _data_to_frame
    return tp.read()
  File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 283, in read
    data = self._convert_data(data)
  File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 321, in _convert_data
    clean_conv = _clean_mapping(self.converters)
  File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 317, in _clean_mapping
    col = self.orig_names[col]
IndexError: list index out of range
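The last frame of the backtrace suggests that integer converter keys are translated to column names by position. The following is a simplified stand-in written for illustration, not the actual pandas code: the functions clean_mapping and clean_mapping_lenient and the column name "col0" are hypothetical. It shows how a positional lookup fails on a one-column table, and one possible way a lenient variant could skip out-of-range keys instead.

```python
def clean_mapping(converters, orig_names):
    """Translate integer keys to column names by position (strict model)."""
    clean = {}
    for col, func in converters.items():
        if isinstance(col, int):
            # Raises IndexError when col >= len(orig_names).
            col = orig_names[col]
        clean[col] = func
    return clean


def clean_mapping_lenient(converters, orig_names):
    """Same translation, but silently skip integer keys that are out of range."""
    clean = {}
    for col, func in converters.items():
        if isinstance(col, int):
            if not 0 <= col < len(orig_names):
                continue  # this table has no such column
            col = orig_names[col]
        clean[col] = func
    return clean


names = ["col0"]  # assumed column list of a one-column table

try:
    clean_mapping({1: str}, names)
    error = None
except IndexError as exc:
    error = exc

print(type(error).__name__)                         # IndexError
print(clean_mapping_lenient({1: str}, names))       # {}
print(list(clean_mapping_lenient({0: str, 1: str}, names)))  # ['col0']
```

Whether skipping silently is the right behaviour is a design question for the maintainers; raising a clearer error per table would be another option.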
