BUG: Converters handling inconsistent with usecols #18566

Open
fortooon opened this issue Nov 29, 2017 · 5 comments
Labels
Bug, IO CSV (read_csv, to_csv)

Comments

fortooon commented Nov 29, 2017

Use the attached file and run the script below:
csv2.txt

import re
from pandas import read_csv

kw = {'engine': 'python', 'header': 0, 'usecols': [2, 1], 'iterator': True}
kw['sep'] = "[" + re.escape("\t,") + "]" # or just "\t,"
kw["converters"] = {i: lambda(value): value for i in kw["usecols"]} # if comment this line or set 'usecols' : [0, 1] or 'sep': ","   -  work good
reader = read_csv("path to attached csv2.txt file", **kw)
print "rows", [row for row in reader.get_chunk().values]

Problem description

  print "rows", [row for row in reader.get_chunk().values]
  File "/tmp/opt/linux-CentOS_4.4-x64/P7/python-2.7.7-dbg/lib/python2.7/site-packages/pandas/io/parsers.py", line 768, in get_chunk
    return self.read(nrows=size)
  File "/tmp/opt/linux-CentOS_4.4-x64/P7/python-2.7.7-dbg/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/tmp/opt/linux-CentOS_4.4-x64/P7/python-2.7.7-dbg/lib/python2.7/site-packages/pandas/io/parsers.py", line 1610, in read
    data = self._convert_data(data)
  File "/tmp/opt/linux-CentOS_4.4-x64/P7/python-2.7.7-dbg/lib/python2.7/site-packages/pandas/io/parsers.py", line 1643, in _convert_data
    col = self.orig_names[col]
IndexError: list index out of range

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-97-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 1.1.6
Cython: None
numpy: 1.11.2
scipy: 0.18.1
xarray: None
IPython: 0.13.2
sphinx: None
patsy: None
dateutil: 2.3
pytz: 2014.10
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.5.0
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

gfyoung added the IO CSV (read_csv, to_csv) label Dec 1, 2017
gfyoung (Member) commented Dec 1, 2017

@fortooon : Thanks for reporting this! This is not a bug, as we explicitly do not allow separators with such complexity. The IndexError is to be expected, as the regex looks malformed.

(EDIT: this explanation was wrong; see below for the correct one.)

Closing this issue for now; we can always re-open if necessary.

gfyoung closed this as completed Dec 1, 2017
gfyoung added this to the No action milestone Dec 1, 2017
fortooon (Author) commented Dec 2, 2017

@gfyoung: Thanks for the response. But how can I avoid this error and still use a regexp separator that matches either a tab or a comma? I thought that using the 'sep' parameter as a regexp is legal according to the docs:

In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
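
For illustration, a minimal sketch of that documented behaviour (the inline data is invented, not the attached csv2.txt): a separator longer than one character, such as "[\t,]", is interpreted as a regular expression and forces the python parsing engine.

import io
from pandas import read_csv

# illustrative data only; not the attached csv2.txt
data = u"x1\tx2,f\n4\t5,6\n4\t5,6\n"
# a separator longer than one character (other than '\s+') is treated as a
# regex, so "[\t,]" splits on either a tab or a comma
df = read_csv(io.StringIO(data), sep="[\t,]", header=0, engine='python')
print(df)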

gfyoung (Member) commented Dec 2, 2017

@fortooon : Good point. I misspoke yesterday about that. Looking at this with a clearer mind, your regex is indeed correct. However, your converters are not correct. The indices you choose must be relative to the usecols parameter that you pass in. Here is what I mean:

When you specify usecols=[2, 1], the filtered table looks like this:

x2,f
5,6
5,6

The converters are then applied to this filtered table. You can see that there is no longer a column 2. Thus, when pandas tries to apply your converter to column 2, it raises an IndexError because there are only two columns, and columns are 0-indexed.

The reason why usecols=[0, 1] was working for you is that those indices are still valid (within bounds) on the filtered table, since you have only two columns. Thus, the two converters that are created are still within bounds.

I hope this clarifies what you're seeing. Let me know if you have any other questions.
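
For what it's worth, a minimal sketch of a workaround (the column names 'x2' and 'f' are assumptions based on the filtered table above): keying converters by column label instead of by position sidesteps the positional ambiguity entirely.

import re
from pandas import read_csv

kw = {
    'engine': 'python',
    'header': 0,
    'usecols': [2, 1],
    'iterator': True,
    'sep': "[" + re.escape("\t,") + "]",
    # label keys stay valid no matter how usecols filters or reorders the
    # positional indices; 'x2' and 'f' are assumed column names
    'converters': {'x2': str, 'f': str},
}
reader = read_csv("path to attached csv2.txt file", **kw)
print(list(reader.get_chunk().values))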

fortooon (Author) commented Dec 3, 2017

@gfyoung Thanks for the detailed answer.
But if I change my example a little:

import re
from pandas import read_csv

kw = {'header': 0, 'usecols': [2], 'iterator': True}  # don't use the python engine
kw['sep'] = ","  # use a simple separator
def _convert_cell_value(value):
    print "_convert_cell_value", value
    return value
kw["converters"] = {i: lambda(value): _convert_cell_value(value) for i in range(len(kw["usecols"]))}
reader = read_csv("path to attached csv2.txt file", **kw)
print "rows", [row for row in reader.get_chunk().values] 

I can see that _convert_cell_value isn't called:
Output: rows [array([6]), array([6])]
But with 'usecols': [2, 1] the output is:

_convert_cell_value 5
_convert_cell_value 5
rows [array(['5', 6], dtype=object), array(['5', 6], dtype=object)]

only one converter was called.
So I think that a different approach to selecting converter indices (depending on the engine type) would help me.

Please comment on this point. Also, the situation where some converter indices work with the python engine but not with the default engine looks like a bug.

gfyoung modified the milestones: No action, Next Major Release Dec 3, 2017
gfyoung added the Bug label Dec 3, 2017
gfyoung changed the title from "Like a parser bug with multiple separator and not ordered usecols and converters" to "BUG: Converters handling inconsistent with usecols" Dec 3, 2017
gfyoung (Member) commented Dec 3, 2017

@fortooon : Ah, yes, now we're onto something. This is related to #13302, as converter handling is executed at different points in time between the two engines. Patching this to make handling consistent, however, would require some major refactoring of the code.
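
Until the two engines behave consistently, one possible workaround is to skip converters in read_csv and apply the functions after the read, which does not depend on the parser engine. A minimal sketch, again assuming the column names 'x2' and 'f' from the example above:

from pandas import read_csv

reader = read_csv("path to attached csv2.txt file", header=0,
                  usecols=[2, 1], iterator=True)
chunk = reader.get_chunk()

# apply the per-column conversion after parsing instead of via 'converters';
# the column names 'x2' and 'f' are assumptions based on the filtered table
post_converters = {'x2': str, 'f': str}
for name, func in post_converters.items():
    chunk[name] = chunk[name].map(func)

print(list(chunk.values))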

gfyoung reopened this Dec 3, 2017
mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022