IndexError using converters in read_html #14624

Amaelb · 2016-11-09T13:19:05Z

read_html returns a list of DataFrames. Passing a converters parameter (see #13461) applies the converters to each DataFrame. Integer keys in converters cannot exceed the number of columns minus 1 of the parsed DataFrame; otherwise an IndexError is raised in io.parser.PythonParser._convert_data.
But most of the time, the DataFrames returned by read_html have different numbers of columns. Converters are therefore unusable on any column whose index is greater than or equal to min([len(df.columns) for df in pd.read_html(url)]).
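To make the constraint concrete, here is a minimal sketch in plain Python (no pandas call involved). The helper `usable_converters` and the column counts are hypothetical and only illustrate which integer keys would be valid for tables of different widths; this is not something read_html does today.

```python
def usable_converters(converters, n_columns):
    """Keep only converters whose integer key is a valid column position.

    Non-integer keys (column names) are passed through untouched.
    Hypothetical helper, not part of pandas.
    """
    return {
        key: func
        for key, func in converters.items()
        if not isinstance(key, int) or key < n_columns
    }


def converter_one(c):
    return 1


d2 = {0: converter_one, 1: converter_one, 2: converter_one}

# Assumed column counts for the tables found on one page.
column_counts = [2, 5, 3]

# Largest integer key that is valid for every table on the page.
max_safe_key = min(column_counts) - 1

# Converters that would be valid for each table taken individually.
per_table = [usable_converters(d2, n) for n in column_counts]
```

With these assumed widths, only keys 0 and 1 are usable across the whole page, even though key 2 is a valid position in two of the three tables.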

Example

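import pandas

def converter_one(c):
    # Dummy converter: replaces every cell with 1.
    return 1

# d only uses positions 0 and 1; d2 also uses position 2.
d = {0: converter_one, 1: converter_one}
d2 = {0: converter_one, 1: converter_one, 2: converter_one}
url = 'https://web.archive.org/web/20160419075502/http://clients.rte-france.com/lang/an/visiteurs/vie/prod/prevision_production.jsp'
# Raises IndexError because at least one table on the page has fewer than 3 columns.
pandas.read_html(url, converters=d2)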

Expected Output

As in

pandas.read_html(url, converters = d)

This issue may also prevent changing read_html as proposed in #14608.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: 1.0b8
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None


mroeschke · 2020-05-07T21:17:17Z

Looks like this example is no longer reproducible. Happy to reopen this issue if we can get a reproducible example.

In [9]: import pandas
   ...:
   ...: def converter_one(c):
   ...:     return 1
   ...:
   ...: d = {0 : converter_one, 1 : converter_one}
   ...: d2 = {0 : converter_one, 1 : converter_one, 2 : converter_one}
   ...: url = 'http://clients.rte-france.com/lang/an/visiteurs/vie/prod/prevision_production.jsp'
   ...: pandas.read_html(url, converters = d2)

ValueError: No tables found

Amaelb · 2020-05-08T11:54:23Z

Changing the url to

url = 'https://web.archive.org/web/20160419075502/http://clients.rte-france.com/lang/an/visiteurs/vie/prod/prevision_production.jsp'

makes it work again.

paulcwatts · 2021-11-23T22:29:15Z

This is a minimal reproduction:

import pandas as pd

pd.read_html(
    """
<table>
<tbody>
  <tr>
    <td>Foo</td>
  </tr>
</tbody>
</table>
""",
    converters={1: lambda x: x},
)

The backtrace:

Traceback (most recent call last):
  File line 3, in <module>
    pd.read_html(
  File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pandas/io/html.py", line 1098, in read_html
    return _parse(
  File "/usr/local/lib/python3.10/site-packages/pandas/io/html.py", line 931, in _parse
    ret.append(_data_to_frame(data=table, **kwargs))
  File "/usr/local/lib/python3.10/site-packages/pandas/io/html.py", line 811, in _data_to_frame
    return tp.read()
  File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 283, in read
    data = self._convert_data(data)
  File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 321, in _convert_data
    clean_conv = _clean_mapping(self.converters)
  File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 317, in _clean_mapping
    col = self.orig_names[col]
IndexError: list index out of range
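The last frame of the backtrace suggests the failure comes from translating an integer converter key into a column name by indexing the list of original names. The sketch below is a simplified stand-in for that lookup, written in plain Python; it is not the pandas source, and `orig_names = [0]` is an assumption about how a single headerless column would be named.

```python
def clean_mapping(mapping, orig_names):
    """Translate integer keys to column names (simplified model)."""
    clean = {}
    for col, func in mapping.items():
        if isinstance(col, int) and col not in orig_names:
            # Fails when col is not a valid position in orig_names.
            col = orig_names[col]
        clean[col] = func
    return clean


# Assumed names for the one-column table in the reproduction above.
orig_names = [0]

try:
    clean_mapping({1: lambda x: x}, orig_names)
except IndexError as exc:
    error = str(exc)
else:
    error = None
```

Under this model, key 0 passes through unchanged, while key 1 has no matching position and triggers the same IndexError as in the backtrace.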

jorisvandenbossche added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Nov 11, 2016

Amaelb mentioned this issue Nov 17, 2016

Feature Request: expose full DOM nodes to converters in html_read #14608

Open

mroeschke closed this as completed May 7, 2020

mroeschke reopened this May 8, 2020

mroeschke added the Bug label May 8, 2020
