index_col and usecols do not work reliably together in read_csv #9098

awhan · 2014-12-17T10:24:44Z

This code shows 3 situations.

import pandas as pd
from io import StringIO
import random
import sys

def fun(s,n,u):
    random.seed(s)
    names = [str(e) for e in range(1, n)]
    data = ','.join([str(e) for e in names])

    usecols = random.sample(names, u)
    index_col = random.choice(usecols)
    print('usecols', usecols)
    print('index_col', index_col)

    try:
        df = pd.read_csv(StringIO(data), names=names, usecols=usecols, index_col=index_col, header=None)
        print(df)
    except Exception:
        print(sys.exc_info())
        df = pd.read_csv(StringIO(data), names=names, usecols=usecols, header=None)
        df.set_index(index_col, inplace=True)
        print(df)

    print('--------------------------------------------------')


fun(123, 10, 4) # exception
fun(123, 10, 5) # works
fun(123, 20, 4) # works BUT index name and value are not proper

Here are the results.

fun(123, 10, 4): an exception occurs, but when index_col is omitted and set_index is used afterwards, it works OK.

usecols ['1', '5', '9', '4']
index_col 9
(<class 'IndexError'>, IndexError('list index out of range',), <traceback object at 0x7f129001a348>)
   1  4  5
9
9  1  4  5
--------------------------------------------------

fun(123, 10, 5), this worked ok.

usecols ['1', '5', '9', '4', '3']
index_col 1
   3  4  5  9
1
1  3  4  5  9
--------------------------------------------------

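fun(123, 20, 4): this ran without an exception, but it picked up the wrong value for the index. The index is labelled 3 but holds the value from column 9, and column 3 is still present as a data column.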

usecols ['2', '9', '3', '14']
index_col 3
   2  3  14
3
9  2  3  14
--------------------------------------------------

pandas.__version__ is '0.15.2'
64 bit archlinux
$ python --version
Python 3.4.2


VelizarVESSELINOV · 2016-02-22T14:33:09Z

👍

krassowski · 2017-10-02T21:57:41Z

I wish we could fix this issue. I would like to contribute but the huge codebase looks intimidating.

Here is what I noticed so far:

1. read_csv() is in essence the same as read_table(sep=',').
2. If usecols is a list of integers and index_col is included in usecols, everything works fine.
3. If usecols is a list of names and index_col is included in usecols, we get the behaviour reported in #12408 ("index_col result is unexpected when usecols is used to skip a column"). This suggests that usecols_key does not know about index_col, and fixing this should help to make things right.
4. If we fix (3), we may reduce both remaining cases, usecols=list of names and usecols=callable, to usecols=list of ints, because _evaluate_cols uses usecols_key too.

And here is the reduced reproduction test case for (4):

import pandas as pd
from io import StringIO

data = """\
Gene	Control_1	Control_2	Tumour_1	Tumour_2
TP53	6	6	7	6
BRCA2	6	7	7	9\
"""
expected_result = {
    'Control_1': {'TP53': 6, 'BRCA2': 6},
    'Control_2': {'TP53': 6, 'BRCA2': 7},
}

# when index_col is in usecols: 
df = pd.read_table(StringIO(data), usecols=[0, 1, 2], index_col=0, header=0)
assert df.to_dict() == expected_result   # evaluates to True and passes :)

# when index_col is not in usecols:
df = pd.read_table(StringIO(data), usecols=[1, 2], index_col=0, header=0)
assert df.to_dict() == expected_result   # fails :(

For the latter case df is malformed:

           Control_2
Control_1           
6                  6
6                  7

and to_dict() returns {'Control_2': {6: 7}}.

Here is my question (@jreback ?): is the latter case (when index_col is not in usecols) a correct use of usecols and index_col together? Am I correct to expect that it should work the same as when index_col is in usecols (which it currently does not)?

jreback · 2017-10-03T12:29:27Z

is your example on master? we have had a number of fixes related to usecols recently.

cc @gfyoung

krassowski · 2017-10-03T13:13:48Z

I just cloned the repo and tested with 0.21.0.dev0+573.g9e67f4370 and the latter case:

pd.read_table(StringIO(data), usecols=[1, 2], index_col=0, header=0)

still does not work as I would expect.

Importantly, the test cases from #12408 and from this issue run perfectly fine on the version from master :)
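For reference, the name-based combination discussed in this issue (index_col given by name and also listed in usecols) can be sketched as follows. This is a minimal illustration, not one of the original test cases: the column names and values are made up, and the result shown assumes a recent pandas version.

```python
from io import StringIO

import pandas as pd

# Hypothetical data, not taken from the reports above.
data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"

# Skip column "a" and use "c" (by name) as the index.
df = pd.read_csv(StringIO(data), usecols=["b", "c", "d"], index_col="c")

print(df.index.name)     # name of the resulting index
print(list(df.columns))  # remaining data columns
```

Referring to the index column by name avoids any question about whether a position is counted before or after usecols is applied.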

jreback · 2017-10-03T13:17:04Z

@krassowski

so what cases could we close if we have some validation tests?
you could also add the above as an xfail test (and we can then point directly to it).

want to do a PR?

gfyoung · 2017-10-03T15:47:31Z

usecols is a very powerful parameter. It has a lot of say as to how the other parameters behave because it tells us the columns on which to operate in our data. In fact, the reproducible examples above do not strike me as buggy.

usecols=[1, 2] means the data you're using is just columns 1 and 2. Thus, index_col=0 refers to the 1st column in your group of selected columns, column 1.

Then again, given the confusion, we should either make this clear that usecols is a first-class parameter OR reconsider this behavior for subsequent versions.
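To make that interpretation concrete, here is a small sketch that reuses the table from the reproduction earlier in the thread. It uses read_csv with sep="\t" instead of read_table, and the variable names narrow and wide are only for illustration. It assumes the semantics described above, where an integer index_col is counted within the columns kept by usecols.

```python
from io import StringIO

import pandas as pd

# Same table as in the reproduction above, with the tabs written explicitly.
data = (
    "Gene\tControl_1\tControl_2\tTumour_1\tTumour_2\n"
    "TP53\t6\t6\t7\t6\n"
    "BRCA2\t6\t7\t7\t9\n"
)

# usecols is applied first, so only Control_1 and Control_2 are kept;
# index_col=0 then picks the first of the kept columns, Control_1.
narrow = pd.read_csv(StringIO(data), sep="\t", usecols=[1, 2], index_col=0)
print(narrow.index.name)
print(list(narrow.columns))

# Listing the intended index column in usecols keeps Gene as the index.
wide = pd.read_csv(StringIO(data), sep="\t", usecols=[0, 1, 2], index_col=0)
print(wide.index.name)
print(wide.to_dict())
```

So if the index should come from a column of the original file, that column has to be part of usecols as well.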

krassowski · 2017-10-03T16:21:12Z

@gfyoung that is really what I started to think after I submitted my comment. Thanks for the clarification.

Probably a sentence or two in the docs could improve the situation significantly.

krassowski · 2017-10-03T16:26:23Z

"want to do a PR?"

@jreback, given what @gfyoung wrote, I am starting to think that the changes should be made to the documentation and that the xfail test is not really needed. Is it?

gfyoung · 2017-10-03T16:30:52Z

@krassowski : Add the test anyways (more tests are good), just without xfail. Doc changes alongside that will be perfect for a PR.

laserson · 2017-10-06T18:07:52Z

Just ran into this and reported here:
#2654 (comment)

awhan mentioned this issue Dec 17, 2014

BUG: "index_col=False" not working when "usecols" is specified in read_csv #9082

Closed

jreback added IO CSV read_csv, to_csv Bug labels Dec 17, 2014

jreback added this to the 0.16.0 milestone Dec 17, 2014

jreback mentioned this issue Jan 2, 2015

BUG: "index_col=False" not working when "usecols" is specified in read_csv (GH9082) #9176

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Feb 22, 2016

index_col result is unexpected when usecols is used to skip a column #12408

Closed

jreback added Difficulty Intermediate labels Oct 3, 2017

gfyoung mentioned this issue Oct 6, 2017

read_csv in combination with index_col and usecols #2654

Closed

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

phofl mentioned this issue Dec 18, 2021

Add tests for usecols and index col combinations #44951

Merged

3 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Dec 18, 2021

jreback closed this as completed in #44951 Dec 18, 2021
