-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[BUG] Concat duplicates errors (or lack there of) #38654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… duplicates (pandas-dev#36290)" This reverts commit b32febd.
wow let me have a closer look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like this - pretty consistent
can u add test cases for each of the scenarios?
+1 on putting this in 1.2
cc @phofl |
I think i want to merge this for 1.2 @ivirshup can you add some tests cases here (we could merge w/o those). |
We should document that concat does not support duplicates and close all related issues. |
yes abolutely. |
@ivirshup can you update this with some tests that assert this. |
ok i am going to merge this. @ivirshup pls open a new one and we can add tests. |
@meeseeksdev backport 1.2.x |
…#38682) Co-authored-by: Isaac Virshup <[email protected]>
Ah, sorry, took the weekend off for the holiday! I'll get on those tests. Just a point to make on this, I had left the case where dataframes have equal indexes, even with duplicates. E.g: pd.concat([
pd.DataFrame(np.ones((3, 3)), columns=list("aab")),
pd.DataFrame(np.ones((3, 3)), columns=list("aab"))
]) I believe this has worked for a while. It's more obvious what a user intended in this case – even if it's not quite a set or join operation. If this were to be disallowed, I think it could use a deprecation first. |
Indeed, that case has always worked (I assume because the indexes are equal, we don't attempt any reindexing). If we wanted to change that, it would indeed need a deprecation cycle, so let's keep that as a separate issue/PR (but personally I think I am fine with keeping it working as is). |
I would like to ask a question about this behavior on connection with the effect (or lack thereof) of Here is a test case:
This returns the following output:
It appears to me that |
* Revert "[BUG]: Fix ValueError in concat() when at least one Index has duplicates (pandas-dev#36290)" This reverts commit b32febd.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
@jreback
So here is a basic proposal for making
concat
throw a more useful error when there are duplicate values on the index not being concatenated along. Below are some examples of how this error message is improved. I'm including a comparison to 1.2.0rc1, since it includes a pr which was supposed to change some of these errors to working cases, but those cases are extremely limited. That's why the first commit of this PR reverts that change.I haven't gone too far with this since it a) reverts a merged PR and b) affects the release candidate, so should be conservative. I would like to get feedback on whether I should proceed with this approach.
My thinking right now is that a better error and revert should go into 1.2 – no behaviour change from 1.1, just better errors. A question of whether
pd.concat(..., join="inner")
means an intersection of indices (the current and documented behaviour) or an actual inner join seems like something that could benefit from more discussion.If they are set operations, it should probably error on duplicates. If they are relational algebra operations, they should work with duplicates. It would probably also be useful to then add more arguments to
concat
, likemerge
svalidate
.One dataframe has repeated column names (str)
This PR
Pandas v1.1.5
pandas 1.2.0rc1
One dataframe has repeated column names (int)
This PR
Pandas v1.1.5
pandas 1.2.0rc1
Repeated columns (same amount) different column ordering
This PR:
In pandas v1.15
In pandas v1.2.0rc1
Repeated columns of different values in each
This point is what the reverted PR was trying to fix. This demonstrates that it didn't seem to fix what it set out to, and that the intended fix introduced strange behaviour anyways. To do this, I'll use the reverted test case.
Int index
This PR
Pandas v1.1.5
pandas 1.2.0rc1
This behaviour is a bit strange
First, the
concat
documentation defines the options here as set operations, like intersect and union. Is this a union, or is this an outer join?If these are joins, then inner joins only seem to work when they are equivalent to intersections:
str index
This is mostly just to show the behaviour in 1.2.0rc1 is different for string and int indices
This PR
Pandas v1.1.5
pandas 1.2.0rc1