index_col and usecols do not work reliably together in read_csv #9098

Closed
awhan opened this issue Dec 17, 2014 · 10 comments · Fixed by #44951
Labels
Bug IO CSV read_csv, to_csv
Comments

@awhan

awhan commented Dec 17, 2014

This code shows 3 situations.

import pandas as pd
from io import StringIO
import random
import sys

def fun(s,n,u):
    random.seed(s)
    names = [str(e) for e in range(1, n)]
    data = ','.join([str(e) for e in names])

    usecols = random.sample(names, u)
    index_col = random.choice(usecols)
    print('usecols', usecols)
    print('index_col', index_col)

    try:
        df = pd.read_csv(StringIO(data), names=names, usecols=usecols, index_col=index_col, header=None)
        print(df)
    except Exception:
        print(sys.exc_info())
        df = pd.read_csv(StringIO(data), names=names, usecols=usecols, header=None)
        df.set_index(index_col, inplace=True)
        print(df)

    print('--------------------------------------------------')


fun(123, 10, 4) # exception
fun(123, 10, 5) # works
fun(123, 20, 4) # works BUT index name and value are not proper

Here are the results.
fun(123, 10, 4): an exception occurs, but when index_col is omitted and set_index is used afterwards, it works OK.

usecols ['1', '5', '9', '4']
index_col 9
(<class 'IndexError'>, IndexError('list index out of range',), <traceback object at 0x7f129001a348>)
   1  4  5
9
9  1  4  5
--------------------------------------------------

fun(123, 10, 5): this worked OK.

usecols ['1', '5', '9', '4', '3']
index_col 1
   3  4  5  9
1
1  3  4  5  9
--------------------------------------------------

fun(123, 20, 4): this ran without an exception, but it picked up the wrong value for the index.

usecols ['2', '9', '3', '14']
index_col 3
   2  3  14
3
9  2  3  14
--------------------------------------------------

pandas.__version__ is '0.15.2'
64 bit archlinux
$ python --version
Python 3.4.2
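
The fallback used in the except branch above (read with usecols only, then call set_index) can be isolated as a minimal sketch. The sample data and the helper name read_with_index are made up for illustration and are not part of pandas or of the original report; the sketch only assumes that index_col names one of the columns kept by usecols.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data, not taken from the report above.
CSV_TEXT = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"


def read_with_index(text, usecols, index_col):
    """Select columns first, then promote one of them to the index.

    Hypothetical helper: it avoids passing index_col to read_csv,
    so it does not depend on how usecols and index_col interact.
    """
    df = pd.read_csv(StringIO(text), usecols=usecols)
    return df.set_index(index_col)


df = read_with_index(CSV_TEXT, usecols=["b", "c", "d"], index_col="c")
print(df)
```

Because set_index runs on the already-filtered frame, the index column is looked up by name, so its position in the file does not matter.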

@VelizarVESSELINOV

👍

@krassowski

I wish we could fix this issue. I would like to contribute but the huge codebase looks intimidating.

Here is what I noticed so far:

  1. read_csv() is in essence the same as read_table(sep=',')
  2. If usecols is a list of integers and the index_col is included in usecols everything works fine
  3. If usecols is a list of names and the index_col is included in usecols, we hit #12408 ("index_col result is unexpected when usecols is used to skip a column"). This suggests that usecols_key does not know about the index_col, and fixing this should help to make things right.
  4. If we fix (3), we may be able to reduce both remaining cases (usecols as a list of names and usecols as a callable) to the usecols-as-list-of-ints case, because _evaluate_cols uses usecols_key too.
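
To make points 2 to 4 concrete, here is a hedged sketch of the three ways usecols can be spelled (integer positions, names, and a callable), each one including the index column. The data mirrors the Gene table used in this thread and the variable names are illustrative. On a recent pandas release the three calls are expected to produce the same frame; that is an assumption about current behaviour, not something checked against every version discussed here.

```python
from io import StringIO

import pandas as pd

# Tab-separated sample mirroring the Gene table from this thread.
data = (
    "Gene\tControl_1\tControl_2\tTumour_1\tTumour_2\n"
    "TP53\t6\t6\t7\t6\n"
    "BRCA2\t6\t7\t7\t9\n"
)
wanted = ["Gene", "Control_1", "Control_2"]

# usecols as integer positions, index_col as a position.
by_position = pd.read_csv(
    StringIO(data), sep="\t", usecols=[0, 1, 2], index_col=0
)

# usecols as column names, index_col as a name.
by_name = pd.read_csv(
    StringIO(data), sep="\t", usecols=wanted, index_col="Gene"
)

# usecols as a callable that is applied to each column name.
by_callable = pd.read_csv(
    StringIO(data),
    sep="\t",
    usecols=lambda name: name in wanted,
    index_col="Gene",
)

print(by_position)
print(by_name.equals(by_position), by_callable.equals(by_position))
```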

And here is the reduced reproduction test case for (4):

import pandas as pd
from io import StringIO

data = """\
Gene	Control_1	Control_2	Tumour_1	Tumour_2
TP53	6	6	7	6
BRCA2	6	7	7	9\
"""
expected_result = {
    'Control_1': {'TP53': 6, 'BRCA2': 6},
    'Control_2': {'TP53': 6, 'BRCA2': 7},
}

# when index_col is in usecols: 
df = pd.read_table(StringIO(data), usecols=[0, 1, 2], index_col=0, header=0)
assert df.to_dict() == expected_result   # evaluates to True and passes :)

# when index_col is not in usecols:
df = pd.read_table(StringIO(data), usecols=[1, 2], index_col=0, header=0)
assert df.to_dict() == expected_result   # fails :(

For the latter case df is malformed:

           Control_2
Control_1           
6                  6
6                  7

and to_dict() returns {'Control_2': {6: 7}}.

Here is my question (@jreback ?): is the latter case (when index_col is not in usecols) a correct use of usecols and index_col together? Am I correct to expect that it should work the same as when the index_col is in usecols (which it currently does not)?

@jreback
Contributor

jreback commented Oct 3, 2017

is your example on master? we have had a number of fixes related to usecols recently.

cc @gfyoung

@krassowski

I just cloned the repo and tested with 0.21.0.dev0+573.g9e67f4370 and the latter case:

pd.read_table(StringIO(data), usecols=[1, 2], index_col=0, header=0)

still does not work as I would expect.

Importantly, the test cases from #12408 and from this issue run perfectly fine on the version from master :)

@jreback
Contributor

jreback commented Oct 3, 2017

@krassowski

so what cases could we close if we have some validation tests?
you could also add the above as an xfail test (and we can then point directly to it).

want to do a PR?

@gfyoung
Member

gfyoung commented Oct 3, 2017

usecols is a very powerful parameter. It has a lot of say in how the other parameters behave because it tells us which columns of the data to operate on. In fact, the reproducible examples above do not strike me as buggy.

usecols=[1, 2] means the data you're using is just columns 1 and 2. Thus, index_col=0 refers to the 1st column in your group of selected columns, column 1.

Then again, given the confusion, we should either make it clear that usecols is a first-class parameter OR reconsider this behavior for subsequent versions.
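
A small sketch of the interpretation described above: with integer usecols, an integer index_col is counted within the selected columns, so the intended index column has to be part of usecols. The data is the Gene table from earlier in the thread and the variable names are illustrative. The behaviour shown is the one reported in the comments above, so treat it as an assumption on pandas versions not covered there.

```python
from io import StringIO

import pandas as pd

# Tab-separated sample mirroring the Gene table from this thread.
data = (
    "Gene\tControl_1\tControl_2\tTumour_1\tTumour_2\n"
    "TP53\t6\t6\t7\t6\n"
    "BRCA2\t6\t7\t7\t9\n"
)

# index_col=0 is resolved among the selected columns [1, 2],
# so "Control_1" becomes the index rather than "Gene".
narrow = pd.read_csv(StringIO(data), sep="\t", usecols=[1, 2], index_col=0)
print(narrow.index.name, list(narrow.columns))

# Including column 0 in usecols makes index_col=0 point at "Gene".
wide = pd.read_csv(StringIO(data), sep="\t", usecols=[0, 1, 2], index_col=0)
print(wide.index.name, list(wide.columns))
```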

@krassowski

@gfyoung that is really what I started to think after I submitted my comment. Thanks for the clarification.

Probably a sentence or two in the docs could improve the situation significantly.

@krassowski

> want to do a PR?

@jreback, given what @gfyoung wrote, I am starting to think that the changes should be made to the documentation and that the xfail test is not really needed. Is it?

@gfyoung
Member

gfyoung commented Oct 3, 2017

@krassowski : Add the test anyways (more tests are good), just without xfail. Doc changes alongside that will be perfect for a PR.

@laserson

laserson commented Oct 6, 2017

Just ran into this and reported here:
#2654 (comment)
