Suggestion: do not sort by default after doing index.difference esp. when performing difference on columns / or print warning #18282
Labels: Compat, Duplicate Report, Reshaping
Code Sample, a copy-pastable example if possible
Problem description
I assumed that DataFrame.columns.difference behaves the same as the longer expression given above under workaround.
It took me a long time to discover that Index.difference not only removes the given columns but also sorts the remaining ones in lexical order.
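A minimal sketch of the behavior described above, using a made-up three-column frame (the column names are only for illustration). It compares Index.difference with a boolean mask built from Index.isin, which is one possible order-preserving alternative and not necessarily the workaround from the original code sample:

```python
import pandas as pd

# Hypothetical frame whose columns are deliberately not in lexical order.
df = pd.DataFrame([[1, 2, 3]], columns=["c", "a", "b"])

# Index.difference drops "a" but also returns the remaining labels sorted.
sorted_cols = df.columns.difference(["a"])

# A boolean mask drops "a" and keeps the original column order.
ordered_cols = df.columns[~df.columns.isin(["a"])]

print(list(sorted_cols))   # ['b', 'c']
print(list(ordered_cols))  # ['c', 'b']
```

Later pandas releases also accept a sort=False argument on Index.difference; that argument is not available in the 0.21.0 version reported here.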
This can cause major issues, e.g. when converting the table to a NumPy array via DataFrame.values without tracking the column order.
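To illustrate how this can go wrong, here is a small sketch with invented column names and data. After selecting columns through Index.difference, the positions in the resulting array no longer match the positions in the original frame:

```python
import pandas as pd

# Hypothetical data: "z" comes first in the original frame.
df = pd.DataFrame({"z": [1, 2], "drop_me": [0, 0], "a": [10, 20]})

kept = df.columns.difference(["drop_me"])
values = df[kept].values

# Column 0 of the array now holds "a", not "z" as in the original frame.
print(list(kept))    # ['a', 'z']
print(values[:, 0])  # [10 20]
```

Any downstream code that indexes the array by the original column positions would silently read the wrong feature.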
It took me a long time to figure out that the degraded performance of my system was caused by this call. I now know that the sorting is documented behavior, but it may not be the ideal default.
By reporting this, I would like to open the question for discussion.
Output of pd.show_versions()
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: None
numpy: 1.13.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None