BUG: Inconsistent column order result using Left Merge depending if left dataframe is empty or not #43087

fjossandon · 2021-08-17T20:23:33Z

I have checked that this issue has not already been reported.
There is a similar report but using "outer join" instead of "left join": pd.merge for outer with empty dataframe on the left side results in merged dataframe with the key column not being first #9937
I have confirmed this bug exists on the latest version of pandas.
Is present at least in the last version I can install with pip3 "1.1.5".
(optional) I have confirmed this bug exists on the master branch of pandas.
Partially, the relevant underlying code that I identified in master looks identical to my version, although I have not upgrade it

Code Sample, a copy-pastable example

First the regular and expected behavior

left_df = pandas.DataFrame({
    "id": [1, 2],
    "full_name": ["Long Name", "Long Name 2"],
    "short_name": ["Short", "Short 2"],
    "counts": [1000, 500],
    "percent": [20, 10],
    "constant": [5, 1],
})
right_df = pandas.DataFrame({
    "full_name": ["Long Name"],
    "copies": [20000],
    "norm_copies": [10000],
    "unit": ["abcd"],
    "calculated_value": [2000],
})
merged_df = pandas.merge(
    left_df,
    right_df,
    how="left",
    on="full_name",
)
print(merged_df)

   id    full_name short_name  counts  percent  constant   copies  norm_copies  unit  calculated_value
0   1    Long Name      Short    1000       20         5  20000.0      10000.0  abcd            2000.0
1   2  Long Name 2    Short 2     500       10         1      NaN          NaN   NaN               NaN

Note that it preserves the left_df order and adds the new columns at the end

Then when the left_df is empty

left_df = pandas.DataFrame({
    "id": [],
    "full_name": [],
    "short_name": [],
    "counts": [],
    "percent": [],
    "constant": [],
})
right_df = pandas.DataFrame({
    "full_name": [],
    "copies": [],
    "norm_copies": [],
    "unit": [],
    "calculated_value": [],
})
merged_df = pandas.merge(
    left_df,
    right_df,
    how="left",
    on="full_name",
)
print(merged_df)

Empty DataFrame
Columns: [id, short_name, counts, percent, constant, full_name, copies, norm_copies, unit, calculated_value]
Index: []

Problem description

There is an inconsistency in the column order depending only if the left_df is empty or not. When its not empty the merged dataframe have the expected column ordering, keeping the ON full_name column in its original place, while when its empty the right_df overrides this and imposes its own column ordering at the end of the merged dataframe, moving the ON full_name column to a different place.

My own problem was that some tests I made for my code expects consistency on the merged dataframe structure (which is later transformed in exported tables), but now I see that the ordering changes on different scenarios. Of course I can try to compensate this problem by always reordering columns after the fact, or some other adjustments to ensure a consistent output, but the main point is that merged column order should follow a consistent behavior (either the original order which I prefer, or the new order imposed by the right dataframe) regardless of the left dataframe being empty or not.

I'm not sure about what is the "right" behavior in the outer join case I referred above, but at least in a left join the column order of the left dataframe should mandate in all cases.

Expected Output

Preserve original column order

Empty DataFrame
Columns: [id, full_name, short_name, counts, percent, constant, copies, norm_copies, unit, calculated_value]
Index: []

I looked deeper with my debugger and traced to the relevant merge code, https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py#L1097-L1102:

                        if lk is not None and lk == rk:
                            # avoid key upcast in corner case (length-0)
                            if len(left) > 0:
                                right_drop.append(rk)
                            else:
                                left_drop.append(lk)

In this place the code decides which redundant column (the ON column) to remove, If the left dataframe is empty, it adds it to a drop list that is executed later, removing the column from there, while if not empty it removes from the right dataframe, which is exactly what I'm observing.

Doing some archaelogy, I went back until I found the commit where this was first implemented 9 years ago, 0da0066:

Fix IndexError in left merges on a DataFrame with 0-length (merge with empty DataFrame IndexError #1628)

Apparently, it always dropped from the right dataframe before but changed because of an error with the left dataframe being empty. So the relevant question is if that would still be a problem today, since in my example the right dataframe was also empty but I saw no error with that... could it be that the fix is not needed anymore because the old error is resolved in other ways and go back to a consistent behavior??

I made a temporal local edit of my files, removing the IF to always drop from the dataframe to test this case, and it worked without errors!:

                        if lk is not None and lk == rk:
                            right_drop.append(rk)

print(merged_df)
Empty DataFrame
Columns: [id, full_name, short_name, counts, percent, constant, copies, norm_copies, unit, calculated_value]
Index: []

The columns have the correct order.

Since the removal of the IF was successful, I decided to also test the code in the old issue report with the error associated to this IF mentioned above:

>>> from pandas import *
>>> a = DataFrame(columns=['column1'], data=[1])
>>> b = DataFrame(columns=['column1'])
>>> a.merge(b, on=['column1'], how='left')
   column1
0        1

And no errors!! So it means that fixing the column order by removing that IF will not cause a regression on that other report.

Please consider implementing this if it does not cause any test regression, it would be an easy fix.

Best regards!

Output of `pd.show_versions()`

>>> pandas.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90~18.04.1-Ubuntu SMP Tue Jul 13 19:40:02 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : es_ES.UTF-8
LOCALE : es_ES.UTF-8

pandas : 1.1.5
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.2.4
setuptools : 39.0.1
Cython : 0.29.20
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

Dinkarkumar · 2021-08-18T01:32:32Z

Issue is resolved is quite well, i have also tested it and it worked completely fine on my system too. Great work.

fjossandon added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 17, 2021

phofl added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2021

lukemanley mentioned this issue Sep 6, 2023

BUG: merge with left and/or right empty returning mis-ordered columns #55028

Merged

7 tasks

mroeschke closed this as completed in #55028 Sep 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Inconsistent column order result using Left Merge depending if left dataframe is empty or not #43087

BUG: Inconsistent column order result using Left Merge depending if left dataframe is empty or not #43087

fjossandon commented Aug 17, 2021

INSTALLED VERSIONS

Dinkarkumar commented Aug 18, 2021

BUG: Inconsistent column order result using Left Merge depending if left dataframe is empty or not #43087

BUG: Inconsistent column order result using Left Merge depending if left dataframe is empty or not #43087

Comments

fjossandon commented Aug 17, 2021

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

Dinkarkumar commented Aug 18, 2021

Output of `pd.show_versions()`