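Relevant merge code from pandas/core/reshape/merge.py (lines 1097-1102 on master, as linked in this report):

```python
if lk is not None and lk == rk:
    # avoid key upcast in corner case (length-0)
    if len(left) > 0:
        right_drop.append(rk)
    else:
        left_drop.append(lk)
```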
I have checked that this issue has not already been reported.
There is a similar report, but for an outer join instead of a left join: "pd.merge for outer with empty dataframe on the left side results in merged dataframe with the key column not being first" (#9937).
I have confirmed this bug exists on the latest version of pandas.
It is present at least in the latest version I can install with pip3, 1.1.5.
(optional) I have confirmed this bug exists on the master branch of pandas.
Partially: the relevant underlying code that I identified in master looks identical to my version, although I have not upgraded.
Code Sample, a copy-pastable example
First the regular and expected behavior
Note that it preserves the left_df order and adds the new columns at the end
Then when the left_df is empty
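A minimal sketch of the two scenarios follows. It is a hypothetical reconstruction, not the reporter's original snippet: only the key name full_name comes from this report, and the other column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical frames; only the key name "full_name" is taken from the report.
left_df = pd.DataFrame(
    {"id": [1, 2], "full_name": ["Ana", "Luis"], "age": [30, 41]}
)
right_df = pd.DataFrame(
    {"city": ["Lima", "Quito"], "full_name": ["Ana", "Luis"]}
)

# Non-empty left frame: the key stays where left_df has it and the
# new right-hand columns are appended at the end.
merged = left_df.merge(right_df, how="left", on="full_name")
print(list(merged.columns))

# Empty left frame with the same columns and dtypes.
empty_left = left_df.iloc[0:0]
merged_empty = empty_left.merge(right_df, how="left", on="full_name")

# On the affected versions described in this report, the key column is
# reported to move after the right-hand columns; other versions may differ.
print(list(merged_empty.columns))
```

Comparing the two printed lists on a given pandas version shows whether that version is affected.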
Problem description
The column order is inconsistent, depending only on whether left_df is empty. When it is not empty, the merged dataframe has the expected column ordering, keeping the ON column full_name in its original place. When it is empty, right_df overrides this and imposes its own column ordering at the end of the merged dataframe, moving the ON column full_name to a different place.
My own problem is that some tests I wrote for my code expect a consistent structure in the merged dataframe (which is later transformed into exported tables), but the ordering changes between scenarios. Of course I can compensate by always reordering columns after the fact, or make other adjustments to ensure a consistent output, but the main point is that the merged column order should behave consistently (either the original order, which I prefer, or the new order imposed by the right dataframe) regardless of whether the left dataframe is empty.
I'm not sure what the "right" behavior is in the outer join case I referred to above, but at least in a left join the column order of the left dataframe should prevail in all cases.
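The "reorder after the fact" workaround mentioned above could look roughly like the sketch below. The helper name left_first_order is hypothetical, and it assumes the non-key columns of the two frames do not overlap, so that merge adds no suffixes.

```python
def left_first_order(left_cols, merged_cols):
    """Return merged_cols reordered so the left frame's columns come first.

    Left columns keep their original order; any remaining merged columns
    follow in the order they already have.
    """
    left_present = [c for c in left_cols if c in merged_cols]
    extras = [c for c in merged_cols if c not in left_cols]
    return left_present + extras


# Column order as reported for the empty-left case (made-up column names).
merged_cols = ["id", "age", "city", "full_name"]
print(left_first_order(["id", "full_name", "age"], merged_cols))
# ['id', 'full_name', 'age', 'city']
```

With real dataframes this would be applied as merged = merged[left_first_order(list(left_df.columns), list(merged.columns))].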
Expected Output
Preserve original column order
I looked deeper with my debugger and traced the behavior to the relevant merge code at https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py#L1097-L1102.
Here the code decides which redundant copy of the ON column to remove. If the left dataframe is empty, it adds the left copy to a drop list that is applied later, removing the column from the left dataframe; if it is not empty, it drops the column from the right dataframe instead. That is exactly what I'm observing.
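As a rough illustration of why the choice of drop list changes the result, the sketch below models the merged column order as the kept left columns followed by the kept right columns. This is a simplified model, not pandas's actual implementation; the function name and all column names other than full_name are made up.

```python
def merged_column_order(left_cols, right_cols, key, left_len):
    """Simplified model of the drop decision; not the real pandas code."""
    left_drop, right_drop = [], []
    if left_len > 0:
        # Non-empty left frame: drop the duplicate key from the right side.
        right_drop.append(key)
    else:
        # Empty left frame: drop the key from the left side instead.
        left_drop.append(key)
    kept_left = [c for c in left_cols if c not in left_drop]
    kept_right = [c for c in right_cols if c not in right_drop]
    return kept_left + kept_right


left_cols = ["id", "full_name", "age"]
right_cols = ["city", "full_name"]

print(merged_column_order(left_cols, right_cols, "full_name", left_len=2))
# ['id', 'full_name', 'age', 'city']
print(merged_column_order(left_cols, right_cols, "full_name", left_len=0))
# ['id', 'age', 'city', 'full_name']
```

In this model, the surviving copy of the key decides where full_name ends up, which matches the ordering difference described in this report.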
Doing some archaeology, I went back until I found the commit where this was first implemented, 9 years ago: 0da0066.
Apparently, the code previously always dropped from the right dataframe, and this was changed because of an error that occurred when the left dataframe was empty. So the relevant question is whether that would still be a problem today, since in my example the right dataframe was also empty and I saw no error. Could it be that the fix is no longer needed because the old error has been resolved in other ways, so the code could go back to a consistent behavior?
I made a temporary local edit to my files, removing the if so that the column is always dropped from the right dataframe, and it worked without errors!
The columns have the correct order.
Since removing the if worked, I also ran the code from the old issue report whose error led to this if being added, and there were no errors! This means that fixing the column order by removing that if will not cause a regression for that other report.
Please consider implementing this if it does not cause any test regressions; it would be an easy fix.
Best regards!
Output of pd.show_versions()
INSTALLED VERSIONS
commit : b5958ee
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90~18.04.1-Ubuntu SMP Tue Jul 13 19:40:02 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : es_ES.UTF-8
LOCALE : es_ES.UTF-8
pandas : 1.1.5
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.2.4
setuptools : 39.0.1
Cython : 0.29.20
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : None