
loc() does not swap two rows in multi-index pandas dataframe #22797


Closed
eisthfroyalblue opened this issue Sep 21, 2018 · 7 comments · Fixed by #28933
Labels
Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex
Milestone

Comments

@eisthfroyalblue

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])

df.loc[['b', 'a'], :]  # does not swap the 'a' and 'b' rows

# df.reindex(index=['b', 'a'], level=0)  # this works

Problem description

df.loc[['b','a'],:] does not swap the rows 'a' and 'b', nor does the equivalent call on a Series.

Expected Output

The output should be the same as the one obtained via df.reindex(index=['b', 'a'], level=0)
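
For reference, the expected result can be reproduced with the reindex call mentioned above. This is a minimal sketch that assumes a pandas version in which reindex accepts the level argument as used in the report; the names swapped and outer are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Same frame as in the report above.
df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Reorder the outer index level explicitly; the inner level is not part of the key.
swapped = df.reindex(index=["b", "a"], level=0)

# Outer-level labels of the result, in row order.
outer = list(swapped.index.get_level_values(0))
print(outer)
print(swapped)
```

The check on the outer-level labels is what a regression test for this issue would likely compare against the key order.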

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.14.0
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
pandas_gbq: None
pandas_datareader: 0.7.0

@WillAyd
Member

WillAyd commented Sep 21, 2018

I don't think we make guarantees about the order of values returned from a .loc operation, so I am inclined to say this is not a bug, but let's see what others say.

@WillAyd added the Indexing and Needs Discussion labels Sep 21, 2018
@eisthfroyalblue
Author

For a single-indexed table, we can reorder the rows using .loc instead of .reindex():

df = pd.DataFrame([[1,2,3],[4,5,6]], index=['a','b'])

df.loc[['b','a']]

Out[9]: 
   0  1  2
b  4  5  6
a  1  2  3

In his book "Python for Data Analysis" (2nd ed.), Wes McKinney says "you can reindex more succinctly by label-indexing with loc, and many users prefer to use it", although he used only a single-index table for illustration.

This may not be a bug. However, it is natural to expect the same effect for a multi-index table.
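
The single-index equivalence described in this comment can be checked directly. The sketch below uses illustrative variable names (df_single, by_loc, by_reindex) and assumes every requested label exists; in recent pandas versions the two calls differ for missing labels, where .loc raises a KeyError and reindex inserts NaN rows.

```python
import pandas as pd

df_single = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["a", "b"])

# Label-based selection with a list follows the order of the list ...
by_loc = df_single.loc[["b", "a"]]

# ... and so does reindex when all labels are present.
by_reindex = df_single.reindex(["b", "a"])

print(list(by_loc.index))
print(by_loc.equals(by_reindex))
```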

@normanius
Copy link

I personally think that this inconsistency between single- and multi-index dataframes is dangerous.

In python, a list not only represents a collection of objects, but also sets an ordering of these objects. Not respecting this ordering under certain conditions (single- vs. multi-index) is non-intuitive.

The inconsistency is certainly not emphasized enough in the documentation. I didn't come across a corresponding warning in the article on advanced indexing, nor is there any note in the documentation of .loc.

By the way: SO brought me here, see my related SO question.

@pl-phan

pl-phan commented Jul 23, 2019

I got mixed up too: using .loc[mylist] on a multi-index dataframe did not preserve the order of mylist. I realized it much later than I should have.

It is surely strange that the order is preserved for a single index but not for a multi-index.

Even if it doesn't qualify as a bug, is it possible to add a warning?
It would be emitted only when .loc[key] is used on a multi-index structure and key is an iterable, just to warn that the order might not be preserved (and maybe to recommend using .reindex(index=key, level=0) instead).

I tried to locate where in the code the warning should be raised, without success. :(
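
Until such a warning exists in pandas itself, the idea can be approximated in user code. The helper below, select_level0_in_order, is hypothetical and not part of pandas; it is only a sketch of the suggestion above, assuming the keys refer to the outermost index level.

```python
import warnings

import pandas as pd


def select_level0_in_order(obj, keys):
    """Hypothetical helper (not a pandas API): select outer-level keys in the given order."""
    keys = list(keys)
    if isinstance(obj.index, pd.MultiIndex):
        # Warn, as suggested in the comment above, then fall back to reindex.
        warnings.warn(
            ".loc with a list key may not preserve the key order on a MultiIndex "
            "in affected pandas versions; using reindex(level=0) instead.",
            UserWarning,
            stacklevel=2,
        )
        return obj.reindex(index=keys, level=0)
    # Single index: .loc already follows the order of the keys.
    return obj.loc[keys]


s = pd.Series(range(4), index=[["a", "a", "b", "b"], [1, 2, 1, 2]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = select_level0_in_order(s, ["b", "a"])

print(out)
```

Wrapping the call rather than patching .loc keeps the workaround explicit at the call site, at the cost of having to remember to use it.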

@TomAugspurger
Contributor

It's a bug. Not sure where though, so investigation would be welcome.

@TomAugspurger added the MultiIndex label and removed the Needs Discussion label Jul 30, 2019
@TomAugspurger added this to the Contributions Welcome milestone Jul 30, 2019
@nrebena
Contributor

nrebena commented Sep 29, 2019

So, after investigation (thanks, pdb), what happens is that we obtain the indexes to return in the order the keys are given, but the way the indexes are stitched together reorders them.

More precisely, the culprit seems to be the or operator in pandas/core/indexes/multi.py:3043, which sorts the results.

I managed to get the expected result for the given example by using union(..., sort=False) instead, but there is still something to do so that the following works when multiple levels are to be ordered.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'],
    ['Green', 'Red', 'Green']])

df.loc[(['b','a'],[2, 1]),:]

# out
     Ohio     Colorado
    Green Red    Green
b 1     6   7        8
  2     9  10       11
a 1     0   1        2
  2     3   4        5
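
The sorting effect described in this comment can be illustrated with plain Index objects. This is a sketch of the general behavior of Index.union, not the actual code path in multi.py; locs_b and locs_a are illustrative names standing in for the positions found for each key.

```python
import pandas as pd

# Positions found for key 'b' (rows 2 and 3) and for key 'a' (rows 0 and 1),
# collected in the order the keys were given.
locs_b = pd.Index([2, 3])
locs_a = pd.Index([0, 1])

# Default behavior (sort=None): the combined positions come back sorted,
# so the key order is lost.
sorted_union = locs_b.union(locs_a)

# With sort=False the positions stay in the order they were combined.
ordered_union = locs_b.union(locs_a, sort=False)

print(list(sorted_union))
print(list(ordered_union))
```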

nrebena added a commit to nrebena/pandas that referenced this issue Oct 6, 2019
Testing return order of MultiIndex.loc

MultiIndex.loc tries to return the result in the same order as the given
key.
nrebena added a commit to nrebena/pandas that referenced this issue Oct 6, 2019
From issue pandas-dev#22797. Loc did not respect order for MultiIndex DataFrame.
It does for single index.

When possible, the returned rows now respect the given order of the keys.
nrebena added a commit to nrebena/pandas that referenced this issue Oct 11, 2019
From issue pandas-dev#22797. When given a list like object as indexer, the
returned result did not respect the order of the indexer, but the order
of the MultiIndex levels.
@nrebena
Contributor

nrebena commented Dec 16, 2019

So, working on this issue, I ran into the following problem: if we want to maintain the order given for each level, how should we treat slice(None)? An example follows:

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

df.loc[(slice(None), [2, 1]), :]
# Actual output, same as df: the order of the slice(None) level takes absolute precedence
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

# Alternative: the second level takes priority over the first level
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
b 2     9  10       11
a 1     0   1        2
b 1     6   7        8

# or
# Keep order of first level, then second level
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
  1     0   1        2
b 2     9  10       11
  1     6   7        8

Ideas are welcome.
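
While the intended semantics are open, either candidate ordering can be produced unambiguously by spelling out the full target index and reindexing against it. The sketch below is one way to do that under the assumption that all requested (first, second) pairs exist; second_first and first_then_second are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Candidate 1: the second level takes priority over the first level.
second_first = pd.MultiIndex.from_tuples(
    [(k0, k1) for k1 in [2, 1] for k0 in ["a", "b"]]
)

# Candidate 2: keep the order of the first level, then the second level.
first_then_second = pd.MultiIndex.from_product([["a", "b"], [2, 1]])

# Reindexing against a full MultiIndex returns rows in exactly that order.
out_second_first = df.reindex(second_first)
out_first_then_second = df.reindex(first_then_second)

print(out_second_first)
print(out_first_then_second)
```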

@jreback modified the milestones: Contributions Welcome, 1.1 Jan 26, 2020