BUG: DataFrame.combine_first() works incorrectly with string indexes #37519

sfilatov96 · 2020-10-30T12:42:12Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

from pandas import NA, DataFrame


def test_astype():
    df_data = {'a': {4: '162', 7: '85'}, 'b': {4: NA, 7: NA}}
    df2_data = {'a': {139: '85'}, 'b': {139: NA}}
    # Cast both key columns to the nullable string dtype.
    df = DataFrame(df_data).astype({
        "a": 'string',
        "b": 'string',
    })
    df2 = DataFrame(df2_data).astype({
        "a": 'string',
        "b": 'string',
    })
    sum_df = df.set_index(['a', 'b']).combine_first(
        df2.set_index(['a', 'b']),
    ).reset_index()
    # Expected: ('85', <NA>) appears once; with the bug it appears twice.
    assert not sum_df.duplicated().any()

Problem description

The resulting sum_df contains duplicate rows such as ('85', nan). This behavior is incorrect: combine_first() should merge such rows into one.


nagesh-chowdaiah · 2020-10-31T14:32:14Z

@sfilatov96

df.set_index(['a', 'b']) returns an empty DataFrame whose index holds the values of a and b as tuples. Since each index entry is a tuple, (85, nan) in the first df and (85, nan) in the second df are different objects. Hence the data is duplicated.

    import pandas as pd
    from pandas import NA, DataFrame

    df_data = {'a': {4: '162', 7: '85'}, 'b': {4: NA, 7: NA}}
    df2_data = {'a': {139: '85'}, 'b': {139: NA}}
    df = DataFrame(df_data).astype({
        "a": 'string',
        "b": 'string',
    })
    df = df.set_index(['a', 'b'])
    df2 = DataFrame(df2_data).astype({
        "a": 'string',
        "b": 'string',
    })
    df2 = df2.set_index(['a', 'b'])
    print(df)
    print(df2)

    # Print the object id of every index entry after concatenation.
    df3 = pd.concat([df, df2])
    for i in df3.index:
        print(id(i))

>>> print(df)
Empty DataFrame
Columns: []
Index: [(162, nan), (85, nan)]
>>> print(df2)
Empty DataFrame
Columns: []
Index: [(85, nan)]

Addresses (id) of all three index entries after concatenation:

1901858368896
1901853613952
1901841593216
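
For context on how to read these numbers: id() reports object identity, and any two separately built tuples have different ids while both are alive, even when they compare equal. The sketch below uses only plain Python; make_key is a hypothetical helper, and pandas' own index alignment is more involved than a set lookup, so this shows only that differing ids do not by themselves mean differing labels.

```python
def make_key(a, b):
    """Build a fresh tuple, the way each index entry is a separate object."""
    return (a, b)


nan = float("nan")
first = make_key("85", nan)
second = make_key("85", nan)

# Two distinct objects, so their ids differ.
print(id(first) == id(second))

# Value comparison still treats them as the same key, because both tuples
# hold the very same nan object.
print(first == second)
print(len({first, second}))
```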

phofl · 2020-10-31T22:45:13Z

That is not really the issue. Without the astype call, we get the result expected by @sfilatov96:

from pandas import NA, DataFrame

df_data = {'a': ["162", "85"], 'b': [NA] * 2}
df2_data = {'a': ["85"], 'b': [NA]}
df = DataFrame(df_data)
df2 = DataFrame(df2_data)
sum_df = df.set_index(['a', 'b']).combine_first(
    df2.set_index(['a', 'b']),
).reset_index()
print(sum_df)

returns

     a    b
0  162  NaN
1   85  NaN
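
To check which behavior a given pandas version shows, one option is a small helper that reports key rows occurring more than once after combine_first(). This is a minimal sketch: duplicated_keys is a hypothetical name, not a pandas API, and it reuses the data from this thread with the astype step included.

```python
import pandas as pd


def duplicated_keys(left, right, keys):
    """Combine two frames on `keys` and return the key rows that repeat."""
    combined = (
        left.set_index(keys)
        .combine_first(right.set_index(keys))
        .reset_index()
    )
    # keep=False marks every member of a duplicated group, not just the repeats.
    return combined[combined.duplicated(subset=keys, keep=False)]


string_cols = {"a": "string", "b": "string"}
df = pd.DataFrame({"a": ["162", "85"], "b": [pd.NA, pd.NA]}).astype(string_cols)
df2 = pd.DataFrame({"a": ["85"], "b": [pd.NA]}).astype(string_cols)

# An empty frame here means ('85', <NA>) was merged into a single row.
print(duplicated_keys(df, df2, ["a", "b"]))
```

A non-empty result would list the repeated ('85', <NA>) rows described in the report.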

sfilatov96 added the Bug and Needs Triage labels Oct 30, 2020

phofl mentioned this issue Nov 1, 2020

BUG: Fix bug in combine_first with string dtype and only NA #37568

Merged

5 tasks

phofl added the Reshaping label and removed the Needs Triage label Nov 1, 2020

jreback added this to the 1.2 milestone Nov 2, 2020

jreback closed this as completed in #37568 Nov 2, 2020
