-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Assigning values breaks in different ways when duplicate column names #24798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
IIUC correctly you have quite a few separate problems here - could you make a checklist of them to aid the discussion? As is now it's a little challenging to dissect the entire sample provided |
Updated in an attempt to make it a little bit clearer. |
@user347 can you search to see if any of these are duplicate issues? Your 1a sounds familiar. |
Maybe I'm blind but I searched before posting and again now, and can't find any open issues that seem the same as either of the issues reported here. |
Issue 1A (ValueError: Buffer has wrong number of dimensions (expected 1, got 0)) is fixed on master
|
fixed in #32477 33f67d9 is the first new commit
This PR also fixed Issue 1b
|
Is there more work to be done here? Or is this ticket effectively closed? |
need to add tests for 1A and 1B to prevent regression before closing issue. PRs welcome. |
Happy to take this. I imagine the test (at least for 1a) will live in tests/frame/methods/test_replace.py. This file doesn't seem to be passing tests currently. I'm trying to run
Any advice on tracking this down? Or do you know if someone working on a fix? |
You will need to rebuild the C extensions. Check here for more info: |
Thanks Will, that did it. With the original problem statement in mind I've parametrized the test the run with inplace=[True, False], but it's causing a bit of a problem.
Fails because the B column in Do you think this is expected behavior? Or is there another issue lurking? EDIT: Solved the above issue by not assigning the column when using inplace=True. I've run into a second issue, though. I'm attempting to replace with
expected column B has dtype object. This errors when I compare to Do you think it's better to leave out the pd.NA case, or ignore dtype when comparing the dataframes (at least for the pd.NA parameter)? |
@matteosantama Thanks for looking into this.
The return type is None, when inplace=True, so effectively assigning None to
to parameterise over inplace parameter you will need to use the the following pattern in the test
also note that
you could use the same pattern in the OP to assign column names after creating the dataframe. maybe
|
@simonjayhawkins Oops! Good catch, new here so trying to get my bearings. 1A) still seems to be partially broken.
If you point me in the right direction I can take a crack at it. |
inplace was not included in the OP (only as a comment) and appears to still be broken. can you see if we have an issue for this. (maybe raise one) to close this issue, an more importantly to add tests to prevent regression from master for the working cases fixed in #32477 we can xfail the inplace tests for now if necessary. can you submit your current changes as a PR as this will make it easier to review.
any fixes should be done separately, can we keep the PR scoped to adding tests. |
Understood, PR submitted. It's still failing one test, and the is issue that when a DataFrame column is initialized with a pd.NA entry, the dtype is object, but when you replace with pd.NA the dtype is float64.
|
@simonjayhawkins see if the refs on the checkboxes above make sense. |
@matteosantama did #34302 close this? |
@TomAugspurger #34302 adds a test to detect regression in Series.replace() (issue 1a). |
OK, thanks. 1b was the only one fixed but unchecked, so maybe everything has been fixed. If not, then we'll open a new issue. |
Code Sample, a copy-pastable example if possible
Our primary data with two columns with identical name:
Issue 1a:
Series.replace
throwsValueError
when assigning:Issue 1b: Same
ValueError
as above thrown when assigningnp.nan
withloc
:Issue 2a: assigning
np.nan
withiloc
on column with a duplicate will apply on both columns:Issue 2b: assigning
int
withiloc
will work intv0.22.0
but notv0.23.4
Problem description
The main topic for this issue is assigning different values to a
DataFrame
that contains duplicate column names. List of issues reported:Series.replace
throwsValueError
when assigning (BUG: iloc.__setitem__ with duplicate columns #32477)ValueError
asIssue 1a
thrown when assigning np.nan with loc (TST: GH24798 df.replace() with duplicate columns #34302)np.nan
withiloc
on column with a duplicate will apply on both columns (Replacing a column with iloc replaces another column with same name #22036, BUG: .iloc indexing with duplicates #15686 closed by BUG: iloc.__setitem__ with duplicate columns #32477)int
withiloc
will work inv0.22.0
but notv0.23.4
(Replacing a column with iloc replaces another column with same name #22036, BUG: .iloc indexing with duplicates #15686 closed by BUG: iloc.__setitem__ with duplicate columns #32477)he issue with
iloc
andnp.nan
(above calledIssue 2a
) was reported and closed as fixed here: #13423 per0.18.0
but I'm able to recreate that issue withv0.23.4
.Expected Output
Either the same output as we would expect if we had only unique names in our columns or a
DuplicateColumnWarning
/DuplicateColumnException
whenDataFrame
contains duplicate columns.Output of
pd.show_versions()
The text was updated successfully, but these errors were encountered: