Skip to content

BUG: Inconsistent column order result using Left Merge depending if left dataframe is empty or not #43087

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
fjossandon opened this issue Aug 17, 2021 · 1 comment · Fixed by #55028
Closed
3 tasks done
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@fjossandon
Copy link


Code Sample, a copy-pastable example

First the regular and expected behavior

left_df = pandas.DataFrame({
    "id": [1, 2],
    "full_name": ["Long Name", "Long Name 2"],
    "short_name": ["Short", "Short 2"],
    "counts": [1000, 500],
    "percent": [20, 10],
    "constant": [5, 1],
})
right_df = pandas.DataFrame({
    "full_name": ["Long Name"],
    "copies": [20000],
    "norm_copies": [10000],
    "unit": ["abcd"],
    "calculated_value": [2000],
})
merged_df = pandas.merge(
    left_df,
    right_df,
    how="left",
    on="full_name",
)
print(merged_df)

   id    full_name short_name  counts  percent  constant   copies  norm_copies  unit  calculated_value
0   1    Long Name      Short    1000       20         5  20000.0      10000.0  abcd            2000.0
1   2  Long Name 2    Short 2     500       10         1      NaN          NaN   NaN               NaN

Note that it preserves the left_df order and adds the new columns at the end

Then when the left_df is empty

left_df = pandas.DataFrame({
    "id": [],
    "full_name": [],
    "short_name": [],
    "counts": [],
    "percent": [],
    "constant": [],
})
right_df = pandas.DataFrame({
    "full_name": [],
    "copies": [],
    "norm_copies": [],
    "unit": [],
    "calculated_value": [],
})
merged_df = pandas.merge(
    left_df,
    right_df,
    how="left",
    on="full_name",
)
print(merged_df)

Empty DataFrame
Columns: [id, short_name, counts, percent, constant, full_name, copies, norm_copies, unit, calculated_value]
Index: []

Problem description

There is an inconsistency in the column order depending only if the left_df is empty or not. When its not empty the merged dataframe have the expected column ordering, keeping the ON full_name column in its original place, while when its empty the right_df overrides this and imposes its own column ordering at the end of the merged dataframe, moving the ON full_name column to a different place.

My own problem was that some tests I made for my code expects consistency on the merged dataframe structure (which is later transformed in exported tables), but now I see that the ordering changes on different scenarios. Of course I can try to compensate this problem by always reordering columns after the fact, or some other adjustments to ensure a consistent output, but the main point is that merged column order should follow a consistent behavior (either the original order which I prefer, or the new order imposed by the right dataframe) regardless of the left dataframe being empty or not.

I'm not sure about what is the "right" behavior in the outer join case I referred above, but at least in a left join the column order of the left dataframe should mandate in all cases.

Expected Output

Preserve original column order

Empty DataFrame
Columns: [id, full_name, short_name, counts, percent, constant, copies, norm_copies, unit, calculated_value]
Index: []

I looked deeper with my debugger and traced to the relevant merge code, https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py#L1097-L1102:

                        if lk is not None and lk == rk:
                            # avoid key upcast in corner case (length-0)
                            if len(left) > 0:
                                right_drop.append(rk)
                            else:
                                left_drop.append(lk)

In this place the code decides which redundant column (the ON column) to remove, If the left dataframe is empty, it adds it to a drop list that is executed later, removing the column from there, while if not empty it removes from the right dataframe, which is exactly what I'm observing.

Doing some archaelogy, I went back until I found the commit where this was first implemented 9 years ago, 0da0066:

Apparently, it always dropped from the right dataframe before but changed because of an error with the left dataframe being empty. So the relevant question is if that would still be a problem today, since in my example the right dataframe was also empty but I saw no error with that... could it be that the fix is not needed anymore because the old error is resolved in other ways and go back to a consistent behavior??

I made a temporal local edit of my files, removing the IF to always drop from the dataframe to test this case, and it worked without errors!:

                        if lk is not None and lk == rk:
                            right_drop.append(rk)

print(merged_df)
Empty DataFrame
Columns: [id, full_name, short_name, counts, percent, constant, copies, norm_copies, unit, calculated_value]
Index: []

The columns have the correct order.

Since the removal of the IF was successful, I decided to also test the code in the old issue report with the error associated to this IF mentioned above:

>>> from pandas import *
>>> a = DataFrame(columns=['column1'], data=[1])
>>> b = DataFrame(columns=['column1'])
>>> a.merge(b, on=['column1'], how='left')
   column1
0        1

And no errors!! So it means that fixing the column order by removing that IF will not cause a regression on that other report.

Please consider implementing this if it does not cause any test regression, it would be an easy fix.

Best regards!

Output of pd.show_versions()

>>> pandas.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90~18.04.1-Ubuntu SMP Tue Jul 13 19:40:02 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : es_ES.UTF-8
LOCALE : es_ES.UTF-8

pandas : 1.1.5
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.2.4
setuptools : 39.0.1
Cython : 0.29.20
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : None

@fjossandon fjossandon added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 17, 2021
@Dinkarkumar
Copy link

Issue is resolved is quite well, i have also tested it and it worked completely fine on my system too. Great work.

@phofl phofl added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants