Assigning values breaks in different ways when duplicate column names #24798

user347 · 2019-01-16T12:56:08Z

Code Sample, a copy-pastable example if possible

Our primary data with two columns with identical name:

df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
df.columns = list('AABC')
print(df)
"""
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
"""

Issue 1a: Series.replace throws ValueError when assigning:

print(df['B'].replace(6, np.nan))  # will work as expected with int as well
"""
0    NaN
1    7.0
2    8.0
Name: B, dtype: float64
"""
# ValueError: Buffer has wrong number of dimensions (expected 1, got 0):
df['B'] = df['B'].replace(6, np.nan)  # inplace=True does not raise error, but no change
df['B'] = df['B'].replace(6, 5)

Issue 1b: Same ValueError as above thrown when assigning np.nan with loc:

# ValueError: Buffer has wrong number of dimensions (expected 1, got 0):
df.loc[df['B'] == 6, 'B'] = np.nan

# Assigning int with loc will however work: 
df.loc[df['B'] == 6, 'B'] = 5

Issue 2a: assigning np.nan with iloc on column with a duplicate will apply on both columns:

# Assigning np.nan with iloc on column with a duplicate will apply on both columns:
df.iloc[0, 0] = np.nan
print(df)
"""
     A    A  B   C
0  NaN  NaN  5   9
1  1.0  4.0  7  10
2  2.0  5.0  8  11
"""

Issue 2b: assigning int with iloc will work int v0.22.0 but not v0.23.4

df.iloc[0, 0] = 10
print(df)
"""
0.22.0:
    A  A  B   C
0  10  3  5   9
1   1  4  7  10
2   2  5  8  11

0.23.4:
      A     A  B   C
0  10.0  10.0  5   9
1   1.0   4.0  7  10
2   2.0   5.0  8  11
"""
# Assigning with iloc will not break if BOTH columns contain a nan:
x = pd.DataFrame({'a': np.array([np.nan, 1, 2])})
y = pd.DataFrame({'a': np.array([0, np.nan, 2])})

df = pd.concat([x, y], axis=1)

df.iloc[0, 0] = 10
print(df)
"""
      a    a 
0  10.0  0.0
1   1.0  NaN
2   2.0  2.0
"""

Problem description

The main topic for this issue is assigning different values to a DataFrame that contains duplicate column names. List of issues reported:

Issue 1a: Series.replace throws ValueError when assigning (BUG: iloc.__setitem__ with duplicate columns #32477)
Issue 1b: Same ValueError as Issue 1a thrown when assigning np.nan with loc (TST: GH24798 df.replace() with duplicate columns #34302)
Issue 2a: assigning np.nan with iloc on column with a duplicate will apply on both columns (Replacing a column with iloc replaces another column with same name #22036, BUG: .iloc indexing with duplicates #15686 closed by BUG: iloc.__setitem__ with duplicate columns #32477)
Issue 2b: assigning int with iloc will work in v0.22.0 but not v0.23.4 (Replacing a column with iloc replaces another column with same name #22036, BUG: .iloc indexing with duplicates #15686 closed by BUG: iloc.__setitem__ with duplicate columns #32477)

he issue with iloc and np.nan (above called Issue 2a) was reported and closed as fixed here: #13423 per 0.18.0 but I'm able to recreate that issue with v0.23.4.

Expected Output

Either the same output as we would expect if we had only unique names in our columns or a DuplicateColumnWarning/DuplicateColumnException when DataFrame contains duplicate columns.

Output of `pd.show_versions()`

pandas: 0.23.4 pytest: None pip: 18.0 setuptools: 39.0.1 Cython: None numpy: 1.16.0 scipy: 1.1.0T pyarrow: None xarray: None IPython: 6.3.1 sphinx: None patsy: None dateutil: 2.7.2 pytz: 2018.3 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 2.1.0 openpyxl: 2.5.1 xlrd: 1.1.0 xlwt: None xlsxwriter: None lxml: 4.2.1 bs4: 4.6.0 html5lib: 1.0.1 sqlalchemy: 1.2.5 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2019-01-16T16:19:37Z

IIUC correctly you have quite a few separate problems here - could you make a checklist of them to aid the discussion? As is now it's a little challenging to dissect the entire sample provided

user347 · 2019-01-16T18:20:42Z

Updated in an attempt to make it a little bit clearer.

TomAugspurger · 2019-01-16T20:27:47Z

@user347 can you search to see if any of these are duplicate issues? Your 1a sounds familiar.

user347 · 2019-01-17T08:17:14Z

@user347 can you search to see if any of these are duplicate issues? Your 1a sounds familiar.

Maybe I'm blind but I searched before posting and again now, and can't find any open issues that seem the same as either of the issues reported here.

simonjayhawkins · 2020-03-30T16:13:47Z

Issue 1A (ValueError: Buffer has wrong number of dimensions (expected 1, got 0)) is fixed on master

>>> import numpy as np
>>>
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1029.gbdf969cd6'
>>>
>>> df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
>>> df.columns = list("AABC")
>>> print(df)
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
>>>
>>> print(df["B"].replace(6, np.nan))
0    NaN
1    7.0
2    8.0
Name: B, dtype: float64
>>>
>>> df["B"] = df["B"].replace(6, np.nan)
>>> print(df)
   A  A    B   C
0  0  3  NaN   9
1  1  4  7.0  10
2  2  5  8.0  11
>>>
>>> df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
>>> df.columns = list("AABC")
>>> print(df)
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
>>>
>>> df["B"] = df["B"].replace(6, 5)
>>> df
   A  A  B   C
0  0  3  5   9
1  1  4  7  10
2  2  5  8  11
>>>

simonjayhawkins · 2020-03-30T17:28:35Z

Issue 1A (ValueError: Buffer has wrong number of dimensions (expected 1, got 0)) is fixed on master

fixed in #32477

33f67d9 is the first new commit
commit 33f67d9
Author: jbrockmendel [email protected]
Date: Tue Mar 10 19:35:21 2020 -0700

BUG: iloc.__setitem__ with duplicate columns (#32477)

This PR also fixed Issue 1b

>>> import numpy as np
>>>
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+756.g33f67d98a'
>>>
>>> df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
>>> df.columns = list("AABC")
>>> print(df)
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
>>>
>>> # ValueError: Buffer has wrong number of dimensions (expected 1, got 0):
>>> df.loc[df["B"] == 6, "B"] = np.nan
>>> print(df)
   A  A    B   C
0  0  3  NaN   9
1  1  4  7.0  10
2  2  5  8.0  11
>>>

simonjayhawkins · 2020-03-30T17:56:22Z

Issue 2a (and Issue 2b) is a duplicate of #22036 and #15686 and also fixed by #32477 (no additional tests needed)

matteosantama · 2020-05-20T02:39:03Z

Is there more work to be done here? Or is this ticket effectively closed?

simonjayhawkins · 2020-05-20T11:48:40Z

Is there more work to be done here? Or is this ticket effectively closed?

need to add tests for 1A and 1B to prevent regression before closing issue.

PRs welcome.

matteosantama · 2020-05-20T18:19:07Z

Happy to take this. I imagine the test (at least for 1a) will live in tests/frame/methods/test_replace.py. This file doesn't seem to be passing tests currently. I'm trying to run pytest test_replace.py from within the /methods directory.

AttributeError: module 'pandas._libs.tslibs.offsets' has no attribute 'YearOffset'

Any advice on tracking this down? Or do you know if someone working on a fix?

WillAyd · 2020-05-20T18:20:49Z

You will need to rebuild the C extensions. Check here for more info:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#creating-a-python-environment

matteosantama · 2020-05-20T19:20:05Z

Thanks Will, that did it.

With the original problem statement in mind I've parametrized the test the run with inplace=[True, False], but it's causing a bit of a problem.

result = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
result["B"] = result["B"].replace(7, np.nan, inplace=True)
    
expected = pd.DataFrame(
    {"A": [1, 2, 3], "A": [4, 5, 6], "B": [np.nan, 8, 9]}
)

Fails because the B column in result is object dtype, but float64 in expected. Doesn't happen when inplace=False.

Do you think this is expected behavior? Or is there another issue lurking?

EDIT: Solved the above issue by not assigning the column when using inplace=True.

I've run into a second issue, though. I'm attempting to replace with pd.NA as well, and when I initialize expected as

expected = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [pd.NA, 8, 9]})

result = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
result["B"] = result["B"].replace(7, pd.NA, inplace=False)

expected column B has dtype object. This errors when I compare to result, which has dtype float64. I know the whole None/pd.NA/np.nan ecosystem is problematic. In short, replacing with pd.NA seems to maintain dtype, but initializing with pd.NA falls back on object dtype.

Do you think it's better to leave out the pd.NA case, or ignore dtype when comparing the dataframes (at least for the pd.NA parameter)?

simonjayhawkins · 2020-05-21T11:40:17Z

@matteosantama Thanks for looking into this.

With the original problem statement in mind I've parametrized the test the run with inplace=[True, False], but it's causing a bit of a problem.
result = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
result["B"] = result["B"].replace(7, np.nan, inplace=True)
    
expected = pd.DataFrame(
    {"A": [1, 2, 3], "A": [4, 5, 6], "B": [np.nan, 8, 9]}
)
Fails because the B column in result is object dtype, but float64 in expected. Doesn't happen when inplace=False.

Do you think this is expected behavior? Or is there another issue lurking?

The return type is None, when inplace=True, so effectively assigning None to B

>>>
>>> pd.__version__
'1.1.0.dev0+1624.gc5ea16f302'
>>>
>>> result = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
>>> result["B"] = result["B"].replace(7, np.nan, inplace=True)
>>> result
   A     B
0  4  None
1  5  None
2  6  None
>>>

to parameterise over inplace parameter you will need to use the the following pattern in the test

        if inplace:
            result["B"].replace(7, np.nan, inplace=True)
        else:
            result["B"]= result["B"].replace(7, np.nan, inplace=False)

also note that pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]}) does NOT create a dataframe with duplicate column names.

>>> pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
   A  B
0  4  7
1  5  8
2  6  9
>>>

you could use the same pattern in the OP to assign column names after creating the dataframe. maybe

>>> df = pd.DataFrame({"A": [1, 2, 3], "A1": [4, 5, 6], "B": [7, 8, 9]})
>>> df.columns = list("AAB")
>>> df
   A  A  B
0  1  4  7
1  2  5  8
2  3  6  9
>>>

matteosantama · 2020-05-21T16:40:25Z

also note that pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]}) does NOT create a dataframe with duplicate column names.

@simonjayhawkins Oops! Good catch, new here so trying to get my bearings. 1A) still seems to be partially broken.

>>> pd.__version__
'1.1.0.dev0+1622.g23c7e85d8.dirty'
>>> 
>>> result = pd.DataFrame({"A": [1, 2, 3], "A1": [4, 5, 6], "B": [7, 8, 9]})
>>> result.columns = list("AAB")
>>> 
>>> result["B"].replace(7, np.nan, inplace=True)
/home/pandas/pandas/util/_decorators.py:355: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
>>>
>>> result
   A  A  B
0  1  4  7
1  2  5  8
2  3  6  9

If you point me in the right direction I can take a crack at it.

simonjayhawkins · 2020-05-21T18:52:21Z

1A) still seems to be partially broken.

inplace was not included in the OP (only as a comment) and appears to still be broken. can you see if we have an issue for this. (maybe raise one)

to close this issue, an more importantly to add tests to prevent regression from master for the working cases fixed in #32477 we can xfail the inplace tests for now if necessary.

can you submit your current changes as a PR as this will make it easier to review.

If you point me in the right direction I can take a crack at it.

any fixes should be done separately, can we keep the PR scoped to adding tests.

matteosantama · 2020-05-21T21:12:04Z

Understood, PR submitted. It's still failing one test, and the is issue that when a DataFrame column is initialized with a pd.NA entry, the dtype is object, but when you replace with pd.NA the dtype is float64.

>>> pd.__version__
'1.1.0.dev0+1622.g23c7e85d8.dirty'
>>> 
>>> expected = pd.DataFrame({"A": [1, 2, 3], "A1": [4, 5, 6], "B": [pd.NA, 8, 9]})
>>> expected.columns = list("AAB")
>>> 
>>> result = pd.DataFrame({"A": [1, 2, 3], "A1": [4, 5, 6], "B": [7, 8, 9]})
>>> result.columns = list("AAB")
>>> result["B"] = result["B"].replace(7, pd.NA, inplace=False)
>>> 
>>> expected.dtypes
A     int64
A     int64
B    object
dtype: object
>>> 
>>> result.dtypes
A      int64
A      int64
B    float64
dtype: object

jreback · 2020-05-25T22:56:04Z

@simonjayhawkins see if the refs on the checkboxes above make sense.

TomAugspurger · 2020-07-15T19:50:32Z

@matteosantama did #34302 close this?

matteosantama · 2020-07-16T14:39:14Z

@matteosantama did #34302 close this?

@TomAugspurger #34302 adds a test to detect regression in Series.replace() (issue 1a).

TomAugspurger · 2020-07-16T15:19:06Z

OK, thanks. 1b was the only one fixed but unchecked, so maybe everything has been fixed. If not, then we'll open a new issue.

WillAyd added the Needs Info Clarification about behavior needed to assess issue label Jan 16, 2019

TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions labels Jan 16, 2019

rok mentioned this issue Jul 6, 2019

BUG: Enable changing dtype of column where column names are not unique #27256

Closed

4 tasks

mroeschke removed the Needs Info Clarification about behavior needed to assess issue label Mar 8, 2020

simonjayhawkins added the Needs Tests Unit test(s) needed to prevent regressions label Mar 30, 2020

simonjayhawkins added good first issue and removed Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves labels Mar 30, 2020

simonjayhawkins added this to the Contributions Welcome milestone May 20, 2020

matteosantama mentioned this issue May 25, 2020

TST: GH24798 df.replace() with duplicate columns #34302

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.1 May 25, 2020

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 25, 2020

TomAugspurger closed this as completed Jul 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assigning values breaks in different ways when duplicate column names #24798

Assigning values breaks in different ways when duplicate column names #24798

user347 commented Jan 16, 2019 •

edited by TomAugspurger

Loading

WillAyd commented Jan 16, 2019

user347 commented Jan 16, 2019

TomAugspurger commented Jan 16, 2019

user347 commented Jan 17, 2019

simonjayhawkins commented Mar 30, 2020

simonjayhawkins commented Mar 30, 2020

simonjayhawkins commented Mar 30, 2020 •

edited

Loading

matteosantama commented May 20, 2020

simonjayhawkins commented May 20, 2020

matteosantama commented May 20, 2020

WillAyd commented May 20, 2020

matteosantama commented May 20, 2020 •

edited

Loading

simonjayhawkins commented May 21, 2020

matteosantama commented May 21, 2020

simonjayhawkins commented May 21, 2020

matteosantama commented May 21, 2020 •

edited

Loading

jreback commented May 25, 2020

TomAugspurger commented Jul 15, 2020

matteosantama commented Jul 16, 2020

TomAugspurger commented Jul 16, 2020

Assigning values breaks in different ways when duplicate column names #24798

Assigning values breaks in different ways when duplicate column names #24798

Comments

user347 commented Jan 16, 2019 • edited by TomAugspurger Loading

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

WillAyd commented Jan 16, 2019

user347 commented Jan 16, 2019

TomAugspurger commented Jan 16, 2019

user347 commented Jan 17, 2019

simonjayhawkins commented Mar 30, 2020

simonjayhawkins commented Mar 30, 2020

simonjayhawkins commented Mar 30, 2020 • edited Loading

matteosantama commented May 20, 2020

simonjayhawkins commented May 20, 2020

matteosantama commented May 20, 2020

WillAyd commented May 20, 2020

matteosantama commented May 20, 2020 • edited Loading

simonjayhawkins commented May 21, 2020

matteosantama commented May 21, 2020

simonjayhawkins commented May 21, 2020

matteosantama commented May 21, 2020 • edited Loading

jreback commented May 25, 2020

TomAugspurger commented Jul 15, 2020

matteosantama commented Jul 16, 2020

TomAugspurger commented Jul 16, 2020

user347 commented Jan 16, 2019 •

edited by TomAugspurger

Loading

Output of `pd.show_versions()`

simonjayhawkins commented Mar 30, 2020 •

edited

Loading

matteosantama commented May 20, 2020 •

edited

Loading

matteosantama commented May 21, 2020 •

edited

Loading