-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Dataframe loading with duplicated columns and usecols #11823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
we had an issue about this IIRC, but can't seem to locate it. can you replace the top with some self-reproducing examples that demonstrate this? (use StringIO and in-line data) (its actually an easy fix, just need some tests cases). The issue is that we pass a dict of the results to the DataFrame constructor. |
Maybe I'm missing something, but this doesn't appear to be an easy fix ... in # suboptimal
if usecols is not None:
self.has_usecols = 1
self.usecols = set(usecols) But then it seems that when it's reading the file, it stops after elif self.usecols and nused == len(self.usecols):
# Once we've gathered all requested columns, stop. GH5766
break This occurs before the dict of the result is passed to the DataFrame constructor. |
@sxwang yeh might be a bit more involved :) You can try to change |
Ok, I think |
Just jumping in now: while I am +1 for changing |
Having played around with this issue for a little bit, the fix is not very clear-cut, and in fact the changes made in #11882 were not very robust. This is because the underlying code currently assumes that we iterate through the column names in the original file only once because we assume that the column names in >>> data = '1,2,3'
>>> names = ['A', 'A', 'B']
>>> read_csv(StringIO(data), names=names, usecols=[0,1,1], engine='c')
A A.1 B
0 1 2 NaN
>>> read_csv(StringIO(data), names=names, usecols=[0,1,1], engine='python')
A A.1
0 1 2 That leads to a couple of points and questions:
OR... What is the use case exactly for being able to duplicate |
I've just been hit by this bug as well. I expected
I understand why this can't be changed now due to backwards compatibility, but it would at least be nice to document for parameter |
@Kimsey : You're more than welcome to submit a PR for this! A test for this might also be useful as well. Duplicate columns in a CSV are quite buggy, so we mangle the second "A". |
@gfyoung: I'd be happy to send a PR, but what for? Just a documentation change and a test? |
Yep, that’s exactly right. |
I just came back to this and now I realize, I'm not even sure what behavior I'm supposed to be testing for. When using duplicate names and usecols, since I now know that the behavior I described in my first comment isn't correct, I'd then expect the first column with each name to be kept: @pytest.mark.parametrize("names,usecols", [
(["a", "a", "b"], ["a", "b"]),
(["a", "a", "b"], ["a", "a", "b"])
])
def test_usecols_with_duplicate_names(all_parsers, names, usecols):
data = """\
1,2,3
4,5,6
7,8,9
10,11,12"""
parser = all_parsers
result = parser.read_csv(StringIO(data), names=names, usecols=usecols)
expected = DataFrame([[1, 3], [4, 6], [7, 9],
[10, 12]], columns=["a", "b"])
tm.assert_frame_equal(result, expected) However, there's still a difference between the Python engine, in which this passes, and the C engine, in which the actual result is:
In any case, I'd expect the exact same result with |
The discrepancy: Is this on the first iteration of the parameterization (of |
It's the same result (the one above) for both parameterizations. |
Ah, I see what's happening. Can you add the column names directly into Thus, you should be only parameterizing on |
I have a separate test for |
No, it should not. It only remains so that people have time to remove it from their code. You should have seen a warning about this when you ran the test. |
I see. And you wouldn't want to test for that warning (and the error in the future)? |
We already have tests for this behavior. Warnings like that would absolutely have to have tests. |
Ah! I searched for "Duplicate names specified" and didn't find anything, but it's just specified as tests for |
I'm using pandas 0.17.1
Out:'0.17.1'
When column names are duplicated
we can still load dataframe
with explainable behaviour
Then we might want to load some of the columns with python engine
and get different but still explainable result
But then we switch back to c-engine
and get the following
which is:
(a) different (which is not good in my opinion)
(b) looks like bug
The text was updated successfully, but these errors were encountered: