Skip to content

Assigning values breaks in different ways when duplicate column names #24798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
4 tasks done
user347 opened this issue Jan 16, 2019 · 20 comments
Closed
4 tasks done

Assigning values breaks in different ways when duplicate column names #24798

user347 opened this issue Jan 16, 2019 · 20 comments
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@user347
Copy link

user347 commented Jan 16, 2019

Code Sample, a copy-pastable example if possible

Our primary data with two columns with identical name:

df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
df.columns = list('AABC')
print(df)
"""
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
"""

Issue 1a: Series.replace throws ValueError when assigning:

print(df['B'].replace(6, np.nan))  # will work as expected with int as well
"""
0    NaN
1    7.0
2    8.0
Name: B, dtype: float64
"""
# ValueError: Buffer has wrong number of dimensions (expected 1, got 0):
df['B'] = df['B'].replace(6, np.nan)  # inplace=True does not raise error, but no change
df['B'] = df['B'].replace(6, 5)

Issue 1b: Same ValueError as above thrown when assigning np.nan with loc:

# ValueError: Buffer has wrong number of dimensions (expected 1, got 0):
df.loc[df['B'] == 6, 'B'] = np.nan

# Assigning int with loc will however work: 
df.loc[df['B'] == 6, 'B'] = 5

Issue 2a: assigning np.nan with iloc on column with a duplicate will apply on both columns:

# Assigning np.nan with iloc on column with a duplicate will apply on both columns:
df.iloc[0, 0] = np.nan
print(df)
"""
     A    A  B   C
0  NaN  NaN  5   9
1  1.0  4.0  7  10
2  2.0  5.0  8  11
"""

Issue 2b: assigning int with iloc will work int v0.22.0 but not v0.23.4

df.iloc[0, 0] = 10
print(df)
"""
0.22.0:
    A  A  B   C
0  10  3  5   9
1   1  4  7  10
2   2  5  8  11

0.23.4:
      A     A  B   C
0  10.0  10.0  5   9
1   1.0   4.0  7  10
2   2.0   5.0  8  11
"""
# Assigning with iloc will not break if BOTH columns contain a nan:
x = pd.DataFrame({'a': np.array([np.nan, 1, 2])})
y = pd.DataFrame({'a': np.array([0, np.nan, 2])})

df = pd.concat([x, y], axis=1)

df.iloc[0, 0] = 10
print(df)
"""
      a    a 
0  10.0  0.0
1   1.0  NaN
2   2.0  2.0
"""

Problem description

The main topic for this issue is assigning different values to a DataFrame that contains duplicate column names. List of issues reported:

he issue with iloc and np.nan (above called Issue 2a) was reported and closed as fixed here: #13423 per 0.18.0 but I'm able to recreate that issue with v0.23.4.

Expected Output

Either the same output as we would expect if we had only unique names in our columns or a DuplicateColumnWarning/DuplicateColumnException when DataFrame contains duplicate columns.

Output of pd.show_versions()

pandas: 0.23.4 pytest: None pip: 18.0 setuptools: 39.0.1 Cython: None numpy: 1.16.0 scipy: 1.1.0T pyarrow: None xarray: None IPython: 6.3.1 sphinx: None patsy: None dateutil: 2.7.2 pytz: 2018.3 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 2.1.0 openpyxl: 2.5.1 xlrd: 1.1.0 xlwt: None xlsxwriter: None lxml: 4.2.1 bs4: 4.6.0 html5lib: 1.0.1 sqlalchemy: 1.2.5 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
@WillAyd
Copy link
Member

WillAyd commented Jan 16, 2019

IIUC correctly you have quite a few separate problems here - could you make a checklist of them to aid the discussion? As is now it's a little challenging to dissect the entire sample provided

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Jan 16, 2019
@user347
Copy link
Author

user347 commented Jan 16, 2019

Updated in an attempt to make it a little bit clearer.

@TomAugspurger
Copy link
Contributor

@user347 can you search to see if any of these are duplicate issues? Your 1a sounds familiar.

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions labels Jan 16, 2019
@user347
Copy link
Author

user347 commented Jan 17, 2019

@user347 can you search to see if any of these are duplicate issues? Your 1a sounds familiar.

Maybe I'm blind but I searched before posting and again now, and can't find any open issues that seem the same as either of the issues reported here.

@mroeschke mroeschke removed the Needs Info Clarification about behavior needed to assess issue label Mar 8, 2020
@simonjayhawkins
Copy link
Member

Issue 1A (ValueError: Buffer has wrong number of dimensions (expected 1, got 0)) is fixed on master

>>> import numpy as np
>>>
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1029.gbdf969cd6'
>>>
>>> df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
>>> df.columns = list("AABC")
>>> print(df)
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
>>>
>>> print(df["B"].replace(6, np.nan))
0    NaN
1    7.0
2    8.0
Name: B, dtype: float64
>>>
>>> df["B"] = df["B"].replace(6, np.nan)
>>> print(df)
   A  A    B   C
0  0  3  NaN   9
1  1  4  7.0  10
2  2  5  8.0  11
>>>
>>> df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
>>> df.columns = list("AABC")
>>> print(df)
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
>>>
>>> df["B"] = df["B"].replace(6, 5)
>>> df
   A  A  B   C
0  0  3  5   9
1  1  4  7  10
2  2  5  8  11
>>>

@simonjayhawkins
Copy link
Member

Issue 1A (ValueError: Buffer has wrong number of dimensions (expected 1, got 0)) is fixed on master

fixed in #32477

33f67d9 is the first new commit
commit 33f67d9
Author: jbrockmendel [email protected]
Date: Tue Mar 10 19:35:21 2020 -0700

BUG: iloc.__setitem__ with duplicate columns (#32477)

This PR also fixed Issue 1b

>>> import numpy as np
>>>
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+756.g33f67d98a'
>>>
>>> df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
>>> df.columns = list("AABC")
>>> print(df)
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
>>>
>>> # ValueError: Buffer has wrong number of dimensions (expected 1, got 0):
>>> df.loc[df["B"] == 6, "B"] = np.nan
>>> print(df)
   A  A    B   C
0  0  3  NaN   9
1  1  4  7.0  10
2  2  5  8.0  11
>>>

@simonjayhawkins simonjayhawkins added the Needs Tests Unit test(s) needed to prevent regressions label Mar 30, 2020
@simonjayhawkins
Copy link
Member

simonjayhawkins commented Mar 30, 2020

Issue 2a (and Issue 2b) is a duplicate of #22036 and #15686 and also fixed by #32477 (no additional tests needed)

@simonjayhawkins simonjayhawkins added good first issue and removed Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves labels Mar 30, 2020
@matteosantama
Copy link
Contributor

Is there more work to be done here? Or is this ticket effectively closed?

@simonjayhawkins
Copy link
Member

Is there more work to be done here? Or is this ticket effectively closed?

need to add tests for 1A and 1B to prevent regression before closing issue.

PRs welcome.

@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone May 20, 2020
@matteosantama
Copy link
Contributor

Happy to take this. I imagine the test (at least for 1a) will live in tests/frame/methods/test_replace.py. This file doesn't seem to be passing tests currently. I'm trying to run pytest test_replace.py from within the /methods directory.

AttributeError: module 'pandas._libs.tslibs.offsets' has no attribute 'YearOffset'

Any advice on tracking this down? Or do you know if someone working on a fix?

@WillAyd
Copy link
Member

WillAyd commented May 20, 2020

You will need to rebuild the C extensions. Check here for more info:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#creating-a-python-environment

@matteosantama
Copy link
Contributor

matteosantama commented May 20, 2020

Thanks Will, that did it.

With the original problem statement in mind I've parametrized the test the run with inplace=[True, False], but it's causing a bit of a problem.

result = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
result["B"] = result["B"].replace(7, np.nan, inplace=True)
    
expected = pd.DataFrame(
    {"A": [1, 2, 3], "A": [4, 5, 6], "B": [np.nan, 8, 9]}
)

Fails because the B column in result is object dtype, but float64 in expected. Doesn't happen when inplace=False.

Do you think this is expected behavior? Or is there another issue lurking?

EDIT: Solved the above issue by not assigning the column when using inplace=True.

I've run into a second issue, though. I'm attempting to replace with pd.NA as well, and when I initialize expected as

expected = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [pd.NA, 8, 9]})

result = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
result["B"] = result["B"].replace(7, pd.NA, inplace=False)

expected column B has dtype object. This errors when I compare to result, which has dtype float64. I know the whole None/pd.NA/np.nan ecosystem is problematic. In short, replacing with pd.NA seems to maintain dtype, but initializing with pd.NA falls back on object dtype.

Do you think it's better to leave out the pd.NA case, or ignore dtype when comparing the dataframes (at least for the pd.NA parameter)?

@simonjayhawkins
Copy link
Member

@matteosantama Thanks for looking into this.

With the original problem statement in mind I've parametrized the test the run with inplace=[True, False], but it's causing a bit of a problem.

result = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
result["B"] = result["B"].replace(7, np.nan, inplace=True)
    
expected = pd.DataFrame(
    {"A": [1, 2, 3], "A": [4, 5, 6], "B": [np.nan, 8, 9]}
)

Fails because the B column in result is object dtype, but float64 in expected. Doesn't happen when inplace=False.

Do you think this is expected behavior? Or is there another issue lurking?

The return type is None, when inplace=True, so effectively assigning None to B

>>>
>>> pd.__version__
'1.1.0.dev0+1624.gc5ea16f302'
>>>
>>> result = pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
>>> result["B"] = result["B"].replace(7, np.nan, inplace=True)
>>> result
   A     B
0  4  None
1  5  None
2  6  None
>>>

to parameterise over inplace parameter you will need to use the the following pattern in the test

        if inplace:
            result["B"].replace(7, np.nan, inplace=True)
        else:
            result["B"]= result["B"].replace(7, np.nan, inplace=False)

also note that pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]}) does NOT create a dataframe with duplicate column names.

>>> pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]})
   A  B
0  4  7
1  5  8
2  6  9
>>>

you could use the same pattern in the OP to assign column names after creating the dataframe. maybe

>>> df = pd.DataFrame({"A": [1, 2, 3], "A1": [4, 5, 6], "B": [7, 8, 9]})
>>> df.columns = list("AAB")
>>> df
   A  A  B
0  1  4  7
1  2  5  8
2  3  6  9
>>>

@matteosantama
Copy link
Contributor

also note that pd.DataFrame({"A": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]}) does NOT create a dataframe with duplicate column names.

@simonjayhawkins Oops! Good catch, new here so trying to get my bearings. 1A) still seems to be partially broken.

>>> pd.__version__
'1.1.0.dev0+1622.g23c7e85d8.dirty'
>>> 
>>> result = pd.DataFrame({"A": [1, 2, 3], "A1": [4, 5, 6], "B": [7, 8, 9]})
>>> result.columns = list("AAB")
>>> 
>>> result["B"].replace(7, np.nan, inplace=True)
/home/pandas/pandas/util/_decorators.py:355: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
>>>
>>> result
   A  A  B
0  1  4  7
1  2  5  8
2  3  6  9

If you point me in the right direction I can take a crack at it.

@simonjayhawkins
Copy link
Member

1A) still seems to be partially broken.

inplace was not included in the OP (only as a comment) and appears to still be broken. can you see if we have an issue for this. (maybe raise one)

to close this issue, an more importantly to add tests to prevent regression from master for the working cases fixed in #32477 we can xfail the inplace tests for now if necessary.

can you submit your current changes as a PR as this will make it easier to review.

If you point me in the right direction I can take a crack at it.

any fixes should be done separately, can we keep the PR scoped to adding tests.

@matteosantama
Copy link
Contributor

matteosantama commented May 21, 2020

Understood, PR submitted. It's still failing one test, and the is issue that when a DataFrame column is initialized with a pd.NA entry, the dtype is object, but when you replace with pd.NA the dtype is float64.

>>> pd.__version__
'1.1.0.dev0+1622.g23c7e85d8.dirty'
>>> 
>>> expected = pd.DataFrame({"A": [1, 2, 3], "A1": [4, 5, 6], "B": [pd.NA, 8, 9]})
>>> expected.columns = list("AAB")
>>> 
>>> result = pd.DataFrame({"A": [1, 2, 3], "A1": [4, 5, 6], "B": [7, 8, 9]})
>>> result.columns = list("AAB")
>>> result["B"] = result["B"].replace(7, pd.NA, inplace=False)
>>> 
>>> expected.dtypes
A     int64
A     int64
B    object
dtype: object
>>> 
>>> result.dtypes
A      int64
A      int64
B    float64
dtype: object

@jreback jreback modified the milestones: Contributions Welcome, 1.1 May 25, 2020
@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 25, 2020
@jreback
Copy link
Contributor

jreback commented May 25, 2020

@simonjayhawkins see if the refs on the checkboxes above make sense.

@TomAugspurger
Copy link
Contributor

@matteosantama did #34302 close this?

@matteosantama
Copy link
Contributor

@matteosantama did #34302 close this?

@TomAugspurger #34302 adds a test to detect regression in Series.replace() (issue 1a).

@TomAugspurger
Copy link
Contributor

OK, thanks. 1b was the only one fixed but unchecked, so maybe everything has been fixed. If not, then we'll open a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

7 participants