BUG: DataFrame.combine_first() works incorrectly with string indexes #37519


Closed
3 tasks done
sfilatov96 opened this issue Oct 30, 2020 · 2 comments · Fixed by #37568
Labels
Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Milestone
1.2

Comments


sfilatov96 commented Oct 30, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from pandas import NA, DataFrame

df_data = {'a': {4: '162', 7: '85'}, 'b': {4: NA, 7: NA}}
df2_data = {'a': {139: '85'}, 'b': {139: NA}}
df = DataFrame(df_data).astype({
    "a": 'string',
    "b": 'string',
})
df2 = DataFrame(df2_data).astype({
    "a": 'string',
    "b": 'string',
})
sum_df = df.set_index(['a', 'b']).combine_first(
    df2.set_index(['a', 'b']),
).reset_index()
# Print the result instead of `assert sum_df`, which raises ValueError
# because the truth value of a DataFrame is ambiguous.
print(sum_df)

Problem description

The resulting sum_df contains duplicate rows such as ('85', nan). This behavior is incorrect: combine_first() should merge such rows into one.
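The original sample ends in a bare assert on the DataFrame, which cannot express "no duplicate keys". A check along the following lines might state the expectation more directly. This is only a sketch: `duplicated_key_rows` is a hypothetical helper name, and the `buggy` frame is hand-built to mimic the reported output rather than produced by running combine_first().

```python
import pandas as pd


def duplicated_key_rows(frame, keys):
    """Return the rows whose key columns repeat an earlier row.

    DataFrame.duplicated treats missing values in the same column as equal,
    so a ('85', NA) key that appears twice is reported as a duplicate.
    """
    mask = frame.duplicated(subset=keys, keep="first")
    return frame[mask]


# Hand-built stand-in for the result described in the report (an assumption,
# not actual combine_first() output).
buggy = pd.DataFrame({"a": ["162", "85", "85"], "b": [pd.NA, pd.NA, pd.NA]})
extra = duplicated_key_rows(buggy, ["a", "b"])
print(len(extra))
```

In a test, `assert duplicated_key_rows(sum_df, ['a', 'b']).empty` would then fail on the reported behavior and pass once the rows are merged.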

sfilatov96 added the Bug and Needs Triage labels Oct 30, 2020
Contributor

nagesh-chowdaiah commented Oct 31, 2020

@sfilatov96

df.set_index(['a', 'b']) returns an empty DataFrame whose index holds the values of a and b. Each index entry is a tuple, so (85, nan) from the first df and (85, nan) from the second df are different objects. Hence the data is duplicated.

    import pandas as pd
    from pandas import NA, DataFrame

    df_data = {'a': {4: '162', 7: '85'}, 'b': {4: NA, 7: NA}}
    df2_data = {'a': {139: '85'}, 'b': {139: NA}}
    df = DataFrame(df_data).astype({
        "a": 'string',
        "b": 'string',
    })
    df = df.set_index(['a', 'b'])
    df2 = DataFrame(df2_data).astype({
        "a": 'string',
        "b": 'string',
    })
    df2 = df2.set_index(['a', 'b'])
    print(df)
    print(df2)

    # Concatenate both frames and print the id() of each index entry.
    df3 = pd.concat([df, df2])
    for i in df3.index:
        print(id(i))
>>> print(df)
Empty DataFrame
Columns: []
Index: [(162, nan), (85, nan)]
>>> print(df2)
Empty DataFrame
Columns: []
Index: [(85, nan)]

Addresses (id() values) of all three index entries after concatenation:

1901858368896
1901853613952
1901841593216
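A caveat on reading these addresses: id() reports object identity, and any two tuples built separately have different ids even when their contents are equal, so distinct addresses alone do not show that two labels fail to match. The plain-Python sketch below (no pandas involved, variable names are illustrative) separates identity from equality and also shows why NaN is the awkward case for tuple comparison.

```python
# Two tuples built at runtime with equal contents.
first = tuple(["85", None])
second = tuple(["85", None])

same_object = first is second   # identity: what comparing id() values tests
same_value = first == second    # equality: compares the contents

# NaN is the awkward case: a float NaN is never equal to a different NaN
# object, so tuples holding distinct NaN objects compare unequal.
nan_a, nan_b = float("nan"), float("nan")
nan_tuples_equal = ("85", nan_a) == ("85", nan_b)

print(same_object, same_value, nan_tuples_equal)  # False True False
```

Whether pandas treats two labels as the same key depends on how it compares their values (including its handling of missing values), not on the addresses of the tuples produced while iterating over the index.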

Member

phofl commented Oct 31, 2020

That is not really the issue. Without the astype call, we get the result expected by @sfilatov96:

from pandas import NA, DataFrame

df_data = {'a': ["162", "85"], 'b': [NA] * 2}
df2_data = {'a': ["85"], 'b': [NA]}
df = DataFrame(df_data)
df2 = DataFrame(df2_data)
sum_df = df.set_index(['a', 'b']).combine_first(
    df2.set_index(['a', 'b']),
).reset_index()
print(sum_df)

returns

     a    b
0  162  NaN
1   85  NaN

phofl added the Reshaping label and removed the Needs Triage label Nov 1, 2020
jreback added this to the 1.2 milestone Nov 2, 2020
4 participants