BUG: Converters handling inconsistent with usecols #18566

Open
fortooon opened this issue Nov 29, 2017 · 5 comments
Labels
Bug, IO CSV (read_csv, to_csv)

Comments

fortooon commented Nov 29, 2017

Use the attached file and run the script below:
csv2.txt

import re
from pandas import read_csv

kw = {'engine': 'python', 'header': 0, 'usecols': [2, 1], 'iterator': True}
kw['sep'] = "[" + re.escape("\t,") + "]" # or just "\t,"
kw["converters"] = {i: lambda(value): value for i in kw["usecols"]} # if comment this line or set 'usecols' : [0, 1] or 'sep': ","   -  work good
reader = read_csv("path to attached csv2.txt file", **kw)
print "rows", [row for row in reader.get_chunk().values]

Problem description

  print "rows", [row for row in reader.get_chunk().values]
  File "/tmp/opt/linux-CentOS_4.4-x64/P7/python-2.7.7-dbg/lib/python2.7/site-packages/pandas/io/parsers.py", line 768, in get_chunk
    return self.read(nrows=size)
  File "/tmp/opt/linux-CentOS_4.4-x64/P7/python-2.7.7-dbg/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/tmp/opt/linux-CentOS_4.4-x64/P7/python-2.7.7-dbg/lib/python2.7/site-packages/pandas/io/parsers.py", line 1610, in read
    data = self._convert_data(data)
  File "/tmp/opt/linux-CentOS_4.4-x64/P7/python-2.7.7-dbg/lib/python2.7/site-packages/pandas/io/parsers.py", line 1643, in _convert_data
    col = self.orig_names[col]
IndexError: list index out of range

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-97-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 1.1.6
Cython: None
numpy: 1.11.2
scipy: 0.18.1
xarray: None
IPython: 0.13.2
sphinx: None
patsy: None
dateutil: 2.3
pytz: 2014.10
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.5.0
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

gfyoung added the IO CSV (read_csv, to_csv) label Dec 1, 2017
gfyoung (Member) commented Dec 1, 2017

@fortooon : Thanks for reporting this! This is not a bug, as we explicitly do not allow separators with such complexity. The IndexError is to be expected, as the regex looks malformed.

(EDIT: this explanation was wrong; see below for the correct one.)

Closing this issue for now; we can always re-open if necessary.

gfyoung closed this as completed Dec 1, 2017
gfyoung added this to the No action milestone Dec 1, 2017
fortooon (Author) commented Dec 2, 2017

@gfyoung: Thanks for the response. But how can I avoid this error and still use a regexp separator that matches either a tab or a comma? I thought that using the 'sep' parameter as a regexp is legal according to the docs:

In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
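
For illustration, a minimal sketch of that documented behaviour (the inline data is invented, not the attached csv2.txt): a separator longer than one character, such as "[\t,]", is interpreted as a regular expression and forces the python parsing engine.

import io
from pandas import read_csv

# illustrative data only; not the attached csv2.txt
data = u"x1\tx2,f\n4\t5,6\n4\t5,6\n"
# a separator longer than one character (other than '\s+') is treated as a
# regex, so "[\t,]" splits on either a tab or a comma
df = read_csv(io.StringIO(data), sep="[\t,]", header=0, engine='python')
print(df)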

gfyoung (Member) commented Dec 2, 2017

@fortooon : Good point. I misspoke yesterday about that. Looking at this with a clearer mind, your regex is indeed correct. However, your converters are not correct. The indices you choose must be relative to the usecols parameter that you pass in. Here is what I mean:

When you specify usecols=[2, 1], the filtered table looks like this:

x2,f
5,6
5,6

The converters are then applied to this filtered table. You can see that there is no longer a column 2. Thus, when pandas tries to apply your converter to column 2, it raises an IndexError because there are only two columns, and columns are 0-indexed.

The reason why usecols=[0, 1] was working for you is that those indices are still valid (within bounds) on the filtered table, since you have only two columns. Thus, the two converters that are created are still within bounds.

I hope this clarifies what you're seeing. Let me know if you have any other questions.
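
For what it's worth, a minimal sketch of a workaround (the column names 'x2' and 'f' are assumptions based on the filtered table above): keying converters by column label instead of by position sidesteps the positional ambiguity entirely.

import re
from pandas import read_csv

kw = {
    'engine': 'python',
    'header': 0,
    'usecols': [2, 1],
    'iterator': True,
    'sep': "[" + re.escape("\t,") + "]",
    # label keys stay valid no matter how usecols filters or reorders the
    # positional indices; 'x2' and 'f' are assumed column names
    'converters': {'x2': str, 'f': str},
}
reader = read_csv("path to attached csv2.txt file", **kw)
print(list(reader.get_chunk().values))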

fortooon (Author) commented Dec 3, 2017

@gfyoung Thanks for the detailed answer.
But if I change my example a little:

import re
from pandas import read_csv

kw = {'header': 0, 'usecols': [2], 'iterator': True}  # don't use the python engine
kw['sep'] = ","  # use a simple separator
def _convert_cell_value(value):
    print "_convert_cell_value", value
    return value
kw["converters"] = {i: lambda(value): _convert_cell_value(value) for i in range(len(kw["usecols"]))}
reader = read_csv("path to attached csv2.txt file", **kw)
print "rows", [row for row in reader.get_chunk().values] 

I can see that _convert_cell_value isn't called:
Output: rows [array([6]), array([6])]
But with 'usecols': [2, 1] the output is:

_convert_cell_value 5
_convert_cell_value 5
rows [array(['5', 6], dtype=object), array(['5', 6], dtype=object)]

only one converter was called.
So I think that a different approach to selecting converter indices (depending on the engine type) would help me.

Please comment on this point. Also, the situation where some converter indices work with the python engine but not with the default engine looks like a bug.

gfyoung modified the milestones: No action, Next Major Release Dec 3, 2017
gfyoung added the Bug label Dec 3, 2017
gfyoung changed the title from "Like a parser bug with multiple separator and not ordered usecols and converters" to "BUG: Converters handling inconsistent with usecols" Dec 3, 2017
gfyoung (Member) commented Dec 3, 2017

@fortooon : Ah, yes, now we're onto something. This is related to #13302, as converter handling is executed at different points in time between the two engines. Patching this to make handling consistent, however, would require some major refactoring of the code.
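
Until the two engines behave consistently, one possible workaround is to skip converters in read_csv and apply the functions after the read, which does not depend on the parser engine. A minimal sketch, again assuming the column names 'x2' and 'f' from the example above:

from pandas import read_csv

reader = read_csv("path to attached csv2.txt file", header=0,
                  usecols=[2, 1], iterator=True)
chunk = reader.get_chunk()

# apply the per-column conversion after parsing instead of via 'converters';
# the column names 'x2' and 'f' are assumptions based on the filtered table
post_converters = {'x2': str, 'f': str}
for name, func in post_converters.items():
    chunk[name] = chunk[name].map(func)

print(list(chunk.values))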

gfyoung reopened this Dec 3, 2017
mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022