Skip to content

Dataframe loading with duplicated columns and usecols #11823

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
thenx opened this issue Dec 11, 2015 · 19 comments · Fixed by #53683
Closed

Dataframe loading with duplicated columns and usecols #11823

thenx opened this issue Dec 11, 2015 · 19 comments · Fixed by #53683
Labels
Docs good first issue IO CSV read_csv, to_csv Needs Tests Unit test(s) needed to prevent regressions

Comments

@thenx
Copy link

thenx commented Dec 11, 2015

I'm using pandas 0.17.1

import pandas as pd
  pd.__version__

Out:'0.17.1'

When column names are duplicated

cols = ['A', 'A', 'B']
with open('pandas.csv', 'w') as f:
  f.write('1,2,3')

we can still load dataframe

pd.read_csv('pandas.csv',
            header=None,
            names=cols,
           )

with explainable behaviour

Out:
     A    A    B
0    2    2    3

Then we might want to load some of the columns with python engine

pd.read_csv('pandas.csv',
            engine='python',
            header=None,
            names=cols,
            usecols=cols
           )

and get different but still explainable result

Out:
     A    B
0    1    3

But then we switch back to c-engine

pd.read_csv('pandas.csv',
            engine='c',
            header=None,
            names=cols,
            usecols=cols
           )

and get the following

Out:
     A    A    B
0    2    2    NaN

which is:
(a) different (which is not good in my opinion)
(b) looks like bug

@jreback
Copy link
Contributor

jreback commented Dec 11, 2015

we had an issue about this IIRC, but can't seem to locate it.

can you replace the top with some self-reproducing examples that demonstrate this? (use StringIO and in-line data)

(its actually an easy fix, just need some tests cases). The issue is that we pass a dict of the results to the DataFrame constructor.

@jreback jreback added this to the Next Major Release milestone Dec 11, 2015
@sxwang
Copy link
Contributor

sxwang commented Dec 18, 2015

Maybe I'm missing something, but this doesn't appear to be an easy fix ...
The c parser pandas/parser.pyx first converts the usecols to a set, removing duplicates:

in cdef class TextReader, function __init__:

        # suboptimal
        if usecols is not None:
            self.has_usecols = 1
            self.usecols = set(usecols)

But then it seems that when it's reading the file, it stops after len(usecols) values are read. So in the example above, that 3rd column is never read:
In TextReader._convert_column_data:

            elif self.usecols and nused == len(self.usecols):
                # Once we've gathered all requested columns, stop. GH5766
                break

This occurs before the dict of the result is passed to the DataFrame constructor.
I'm wondering why usecols needs to be a set. Apart from that, any suggestions?

@jreback
Copy link
Contributor

jreback commented Dec 18, 2015

@sxwang yeh might be a bit more involved :)

You can try to change usecols to a list and see where that takes you

@sxwang
Copy link
Contributor

sxwang commented Dec 22, 2015

Ok, I think usecols does not need to be a set, I changed it to a list, that seems to fix it for the c engine. I also updated the python engine to handle duplicated columns/usecols and now it should behave the same.

@gfyoung
Copy link
Member

gfyoung commented May 2, 2016

Just jumping in now: while I am +1 for changing usecols to a list, I am -0.5 for declaring the situation described in the issue as being a bug for pandas. If you are using duplicate names and then specifying those same duplicate str names as usecols, that is inherently ambiguous IMO. If you want to be crystal clear, you should be using integer indices instead.

@gfyoung
Copy link
Member

gfyoung commented May 23, 2016

Having played around with this issue for a little bit, the fix is not very clear-cut, and in fact the changes made in #11882 were not very robust. This is because the underlying code currently assumes that we iterate through the column names in the original file only once because we assume that the column names in usecols are unique. This results in phenomenon like this using the changes in #11882:

>>> data = '1,2,3'
>>> names = ['A', 'A', 'B']
>>> read_csv(StringIO(data), names=names, usecols=[0,1,1], engine='c')
   A  A.1    B
0  1    2  NaN
>>> read_csv(StringIO(data), names=names, usecols=[0,1,1], engine='python')
   A  A.1
0  1    2

That leads to a couple of points and questions:

  1. A complete fix would involve refactoring the column iterations so that it varies depending on whether we have a usecols attribute (i.e. non-null) or not. Is that complexity worth it?

  2. Duplicate string names - allowed or not? I'm in the NOT camp because that is inherently ambiguous, especially if we have duplicate column names. Duplicate integer indices are more acceptable IMO.

  3. How much should we allow the user to duplicate? For example, should we allow usecols=[1] * 1000 if names=['a', 'b']?

OR...

What is the use case exactly for being able to duplicate usecols like this? Can we go for simple documentation change that says we process ONLY unique column names / indices and error if there are duplicates?

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Aug 29, 2016
@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@kimsey0
Copy link
Contributor

kimsey0 commented Apr 3, 2019

I've just been hit by this bug as well. I expected pd.read_csv('pandas.csv', header=0, usecols=['A', 'B']) to give the same result as pd.read_csv('pandas.csv', header=0) for a CSV file:

A,A,B
1,2,3

I understand why this can't be changed now due to backwards compatibility, but it would at least be nice to document for parameter usecols.

@gfyoung
Copy link
Member

gfyoung commented Apr 8, 2019

@Kimsey : You're more than welcome to submit a PR for this! A test for this might also be useful as well. Duplicate columns in a CSV are quite buggy, so we mangle the second "A".

@gfyoung gfyoung added Docs and removed Bug labels Apr 8, 2019
@kimsey0
Copy link
Contributor

kimsey0 commented Apr 8, 2019

@gfyoung: I'd be happy to send a PR, but what for? Just a documentation change and a test?

@gfyoung
Copy link
Member

gfyoung commented Apr 8, 2019

Yep, that’s exactly right.

@kimsey0
Copy link
Contributor

kimsey0 commented May 3, 2019

I just came back to this and now I realize, I'm not even sure what behavior I'm supposed to be testing for. When using duplicate names and usecols, since I now know that the behavior I described in my first comment isn't correct, I'd then expect the first column with each name to be kept:

@pytest.mark.parametrize("names,usecols", [
    (["a", "a", "b"], ["a", "b"]),
    (["a", "a", "b"], ["a", "a", "b"])
])
def test_usecols_with_duplicate_names(all_parsers, names, usecols):
    data = """\
1,2,3
4,5,6
7,8,9
10,11,12"""
    parser = all_parsers
    result = parser.read_csv(StringIO(data), names=names, usecols=usecols)

    expected = DataFrame([[1, 3], [4, 6], [7, 9],
                          [10, 12]], columns=["a", "b"])
    tm.assert_frame_equal(result, expected)

However, there's still a difference between the Python engine, in which this passes, and the C engine, in which the actual result is:

    a   b
0   1   2
1   4   5
2   7   8
3  10  11

In any case, I'd expect the exact same result with header=0 and the names put into the header.

@gfyoung
Copy link
Member

gfyoung commented May 3, 2019

The discrepancy: Is this on the first iteration of the parameterization (of names and usecols)? How about the second one? Are those consistent?

@kimsey0
Copy link
Contributor

kimsey0 commented May 3, 2019

It's the same result (the one above) for both parameterizations.

@gfyoung
Copy link
Member

gfyoung commented May 3, 2019

Ah, I see what's happening. Can you add the column names directly into StringIO data? Passing in names with a duplicate column name is discouraged.

Thus, you should be only parameterizing on usecols.

@kimsey0
Copy link
Contributor

kimsey0 commented May 3, 2019

I have a separate test for usecols with a header which works as expected (keeps first column for each name). Passing in names may be discouraged, but it's not prevented, so shouldn't the behavior still be tested and documented?

@gfyoung
Copy link
Member

gfyoung commented May 3, 2019

No, it should not. It only remains so that people have time to remove it from their code. You should have seen a warning about this when you ran the test.

@kimsey0
Copy link
Contributor

kimsey0 commented May 3, 2019

I see. And you wouldn't want to test for that warning (and the error in the future)?

@gfyoung
Copy link
Member

gfyoung commented May 3, 2019

We already have tests for this behavior. Warnings like that would absolutely have to have tests.

@kimsey0
Copy link
Contributor

kimsey0 commented May 3, 2019

Ah! I searched for "Duplicate names specified" and didn't find anything, but it's just specified as tests for UserWarning. I'll just keep the test with header names then.

@mroeschke mroeschke added Testing pandas testing functions or related to the test suite and removed good first issue labels Apr 21, 2021
@mroeschke mroeschke added Needs Tests Unit test(s) needed to prevent regressions good first issue and removed Testing pandas testing functions or related to the test suite labels Jul 1, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs good first issue IO CSV read_csv, to_csv Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants