Dataframe loading with duplicated columns and usecols #11823

thenx · 2015-12-11T11:35:17Z

I'm using pandas 0.17.1

import pandas as pd
  pd.__version__

Out:'0.17.1'

When column names are duplicated

cols = ['A', 'A', 'B']
with open('pandas.csv', 'w') as f:
  f.write('1,2,3')

we can still load dataframe

pd.read_csv('pandas.csv',
            header=None,
            names=cols,
           )

with explainable behaviour

Out:
     A    A    B
0    2    2    3

Then we might want to load some of the columns with python engine

pd.read_csv('pandas.csv',
            engine='python',
            header=None,
            names=cols,
            usecols=cols
           )

and get different but still explainable result

Out:
     A    B
0    1    3

But then we switch back to c-engine

pd.read_csv('pandas.csv',
            engine='c',
            header=None,
            names=cols,
            usecols=cols
           )

and get the following

Out:
     A    A    B
0    2    2    NaN

which is:
(a) different (which is not good in my opinion)
(b) looks like bug

The text was updated successfully, but these errors were encountered:

jreback · 2015-12-11T15:22:33Z

we had an issue about this IIRC, but can't seem to locate it.

can you replace the top with some self-reproducing examples that demonstrate this? (use StringIO and in-line data)

(its actually an easy fix, just need some tests cases). The issue is that we pass a dict of the results to the DataFrame constructor.

sxwang · 2015-12-18T06:17:32Z

Maybe I'm missing something, but this doesn't appear to be an easy fix ...
The c parser pandas/parser.pyx first converts the usecols to a set, removing duplicates:

in cdef class TextReader, function __init__:

        # suboptimal
        if usecols is not None:
            self.has_usecols = 1
            self.usecols = set(usecols)

But then it seems that when it's reading the file, it stops after len(usecols) values are read. So in the example above, that 3rd column is never read:
In TextReader._convert_column_data:

            elif self.usecols and nused == len(self.usecols):
                # Once we've gathered all requested columns, stop. GH5766
                break

This occurs before the dict of the result is passed to the DataFrame constructor.
I'm wondering why usecols needs to be a set. Apart from that, any suggestions?

jreback · 2015-12-18T14:21:13Z

@sxwang yeh might be a bit more involved :)

You can try to change usecols to a list and see where that takes you

sxwang · 2015-12-22T03:56:10Z

Ok, I think usecols does not need to be a set, I changed it to a list, that seems to fix it for the c engine. I also updated the python engine to handle duplicated columns/usecols and now it should behave the same.

…#11823

gfyoung · 2016-05-02T23:31:01Z

Just jumping in now: while I am +1 for changing usecols to a list, I am -0.5 for declaring the situation described in the issue as being a bug for pandas. If you are using duplicate names and then specifying those same duplicate str names as usecols, that is inherently ambiguous IMO. If you want to be crystal clear, you should be using integer indices instead.

gfyoung · 2016-05-23T23:56:17Z

Having played around with this issue for a little bit, the fix is not very clear-cut, and in fact the changes made in #11882 were not very robust. This is because the underlying code currently assumes that we iterate through the column names in the original file only once because we assume that the column names in usecols are unique. This results in phenomenon like this using the changes in #11882:

>>> data = '1,2,3'
>>> names = ['A', 'A', 'B']
>>> read_csv(StringIO(data), names=names, usecols=[0,1,1], engine='c')
   A  A.1    B
0  1    2  NaN
>>> read_csv(StringIO(data), names=names, usecols=[0,1,1], engine='python')
   A  A.1
0  1    2

That leads to a couple of points and questions:

A complete fix would involve refactoring the column iterations so that it varies depending on whether we have a usecols attribute (i.e. non-null) or not. Is that complexity worth it?
Duplicate string names - allowed or not? I'm in the NOT camp because that is inherently ambiguous, especially if we have duplicate column names. Duplicate integer indices are more acceptable IMO.
How much should we allow the user to duplicate? For example, should we allow usecols=[1] * 1000 if names=['a', 'b']?

OR...

What is the use case exactly for being able to duplicate usecols like this? Can we go for simple documentation change that says we process ONLY unique column names / indices and error if there are duplicates?

kimsey0 · 2019-04-03T11:48:08Z

I've just been hit by this bug as well. I expected pd.read_csv('pandas.csv', header=0, usecols=['A', 'B']) to give the same result as pd.read_csv('pandas.csv', header=0) for a CSV file:

A,A,B
1,2,3

I understand why this can't be changed now due to backwards compatibility, but it would at least be nice to document for parameter usecols.

gfyoung · 2019-04-08T15:57:23Z

@Kimsey : You're more than welcome to submit a PR for this! A test for this might also be useful as well. Duplicate columns in a CSV are quite buggy, so we mangle the second "A".

kimsey0 · 2019-04-08T16:04:54Z

@gfyoung: I'd be happy to send a PR, but what for? Just a documentation change and a test?

gfyoung · 2019-04-08T16:44:01Z

Yep, that’s exactly right.

kimsey0 · 2019-05-03T07:30:29Z

I just came back to this and now I realize, I'm not even sure what behavior I'm supposed to be testing for. When using duplicate names and usecols, since I now know that the behavior I described in my first comment isn't correct, I'd then expect the first column with each name to be kept:

@pytest.mark.parametrize("names,usecols", [
    (["a", "a", "b"], ["a", "b"]),
    (["a", "a", "b"], ["a", "a", "b"])
])
def test_usecols_with_duplicate_names(all_parsers, names, usecols):
    data = """\
1,2,3
4,5,6
7,8,9
10,11,12"""
    parser = all_parsers
    result = parser.read_csv(StringIO(data), names=names, usecols=usecols)

    expected = DataFrame([[1, 3], [4, 6], [7, 9],
                          [10, 12]], columns=["a", "b"])
    tm.assert_frame_equal(result, expected)

However, there's still a difference between the Python engine, in which this passes, and the C engine, in which the actual result is:

In any case, I'd expect the exact same result with header=0 and the names put into the header.

gfyoung · 2019-05-03T07:34:43Z

The discrepancy: Is this on the first iteration of the parameterization (of names and usecols)? How about the second one? Are those consistent?

kimsey0 · 2019-05-03T07:37:21Z

It's the same result (the one above) for both parameterizations.

gfyoung · 2019-05-03T09:01:45Z

Ah, I see what's happening. Can you add the column names directly into StringIO data? Passing in names with a duplicate column name is discouraged.

Thus, you should be only parameterizing on usecols.

kimsey0 · 2019-05-03T09:14:10Z

I have a separate test for usecols with a header which works as expected (keeps first column for each name). Passing in names may be discouraged, but it's not prevented, so shouldn't the behavior still be tested and documented?

gfyoung · 2019-05-03T09:16:06Z

No, it should not. It only remains so that people have time to remove it from their code. You should have seen a warning about this when you ran the test.

kimsey0 · 2019-05-03T09:19:22Z

I see. And you wouldn't want to test for that warning (and the error in the future)?

gfyoung · 2019-05-03T09:25:15Z

We already have tests for this behavior. Warnings like that would absolutely have to have tests.

kimsey0 · 2019-05-03T09:39:05Z

Ah! I searched for "Duplicate names specified" and didn't find anything, but it's just specified as tests for UserWarning. I'll just keep the test with header names then.

jreback added Bug IO CSV read_csv, to_csv Difficulty Novice labels Dec 11, 2015

jreback added this to the Next Major Release milestone Dec 11, 2015

sxwang mentioned this issue Dec 22, 2015

BUG: dataframe loading with duplicated columns and usecols #11823 #11882

Closed

sxwang added a commit to sxwang/pandas that referenced this issue Mar 20, 2016

BUG: dataframe loading with duplicated columns and usecols pandas-dev…

27349b9

…#11823

jreback modified the milestones: 0.18.1, Next Major Release Mar 20, 2016

jreback mentioned this issue Mar 20, 2016

ERR: dont' allow ambiguous usecols #12678

Closed

jreback modified the milestones: 0.18.2, 0.18.1 Apr 18, 2016

jorisvandenbossche added this to the 0.20.0 milestone Aug 29, 2016

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

gfyoung added Docs and removed Bug labels Apr 8, 2019

jbrockmendel removed the Effort Low label Oct 21, 2019

mroeschke added Testing pandas testing functions or related to the test suite and removed good first issue labels Apr 21, 2021

mroeschke added Needs Tests Unit test(s) needed to prevent regressions good first issue and removed Testing pandas testing functions or related to the test suite labels Jul 1, 2022

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

liang3zy22 mentioned this issue Jun 15, 2023

TST: Add test for duplcated columns and usecols GH#11823 #53683

Merged

5 tasks

mroeschke closed this as completed in #53683 Jun 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataframe loading with duplicated columns and usecols #11823

Dataframe loading with duplicated columns and usecols #11823

thenx commented Dec 11, 2015

jreback commented Dec 11, 2015

sxwang commented Dec 18, 2015

jreback commented Dec 18, 2015

sxwang commented Dec 22, 2015

gfyoung commented May 2, 2016 •

edited

Loading

gfyoung commented May 23, 2016 •

edited

Loading

kimsey0 commented Apr 3, 2019

gfyoung commented Apr 8, 2019

kimsey0 commented Apr 8, 2019

gfyoung commented Apr 8, 2019

kimsey0 commented May 3, 2019

gfyoung commented May 3, 2019 •

edited

Loading

kimsey0 commented May 3, 2019

gfyoung commented May 3, 2019 •

edited

Loading

kimsey0 commented May 3, 2019

gfyoung commented May 3, 2019 •

edited

Loading

kimsey0 commented May 3, 2019

gfyoung commented May 3, 2019

kimsey0 commented May 3, 2019

Dataframe loading with duplicated columns and usecols #11823

Dataframe loading with duplicated columns and usecols #11823

Comments

thenx commented Dec 11, 2015

jreback commented Dec 11, 2015

sxwang commented Dec 18, 2015

jreback commented Dec 18, 2015

sxwang commented Dec 22, 2015

gfyoung commented May 2, 2016 • edited Loading

gfyoung commented May 23, 2016 • edited Loading

kimsey0 commented Apr 3, 2019

gfyoung commented Apr 8, 2019

kimsey0 commented Apr 8, 2019

gfyoung commented Apr 8, 2019

kimsey0 commented May 3, 2019

gfyoung commented May 3, 2019 • edited Loading

kimsey0 commented May 3, 2019

gfyoung commented May 3, 2019 • edited Loading

kimsey0 commented May 3, 2019

gfyoung commented May 3, 2019 • edited Loading

kimsey0 commented May 3, 2019

gfyoung commented May 3, 2019

kimsey0 commented May 3, 2019

gfyoung commented May 2, 2016 •

edited

Loading

gfyoung commented May 23, 2016 •

edited

Loading

gfyoung commented May 3, 2019 •

edited

Loading

gfyoung commented May 3, 2019 •

edited

Loading

gfyoung commented May 3, 2019 •

edited

Loading