Multi index slicing working only for the last label in the first level index #12697

Closed
ssunkara1 opened this issue Mar 22, 2016 · 3 comments · Fixed by #15255
Labels
Bug, Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex
Comments

@ssunkara1

When I try to slice a multi-indexed DataFrame, the slice interval is ignored for every label in the first index level except the last one. This only happens when the slice is longer than some threshold (around 30 in the cases I observed, though the value is not the same across DataFrames).

This bug seems to have been introduced in version 0.17.1, as this works fine in version 0.16.4.

Code Sample

import pandas as pd
import numpy as np

freq = ['a', 'b', 'c', 'd']
idx = pd.MultiIndex.from_product([freq, np.arange(500)])

dfmi = pd.DataFrame(np.random.randn(2000), index=idx, columns=['Test'])
sliced_df = dfmi.loc[pd.IndexSlice[:, 30:70], :]

# print() with a single argument behaves the same on Python 2 and Python 3
print(sliced_df.loc['a'])
print(sliced_df.loc['d'])

Current Output

         Test
0   -2.288252
1    0.501113
2   -0.581190
3    0.366600
..        ...
496 -1.124694
497 -0.106180
498 -0.348668
499  0.659645

[500 rows x 1 columns]
        Test
30  0.079055
31  2.455371
32  0.014673
33  0.966548
..        ...
67  0.997713
68  1.235465
69 -0.320166
70 -0.968143

Expected Output

        Test
30 -1.025443
31 -1.305710
32  0.614858
33 -0.606788
34 -0.673230
.. ...
68 -1.129218
69 -1.747830
70 -0.611186
        Test
30 -0.679267
31 -0.590352
32  1.000755
33 -0.106813
34 -1.214385
.. ...
68 -1.467416
69 -0.008881
70  0.040510
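
The expected behaviour can also be stated as a quick check. This is only a sketch that reuses the names from the code sample above; going by the report, the assertion would be expected to fail on affected versions for labels such as 'a' and to pass once the bug is fixed.

```python
import numpy as np
import pandas as pd

freq = ['a', 'b', 'c', 'd']
idx = pd.MultiIndex.from_product([freq, np.arange(500)])
dfmi = pd.DataFrame(np.random.randn(2000), index=idx, columns=['Test'])

sliced_df = dfmi.loc[pd.IndexSlice[:, 30:70], :]

# Label-based slices are inclusive, so every first-level label
# should keep exactly the inner labels 30..70 (41 rows each).
for key in freq:
    inner = sliced_df.loc[key].index
    assert list(inner) == list(range(30, 71)), key
```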

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-63-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.22.1
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.0.0-dev
sphinx: 1.4a1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.4
matplotlib: 1.5.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.38.0

@jreback jreback added the Bug, Indexing, MultiIndex, and Difficulty Intermediate labels Mar 23, 2016
@jreback
Contributor

jreback commented Mar 23, 2016

So this is chained indexing on a MultiIndex. A fully qualified selection works:

In [33]: dfmi.loc[pd.IndexSlice['d', 30:70], :]
Out[33]: 
          Test
d 30  0.244715
  31  0.692720
  32  0.100552
  33 -0.319824
  34 -0.380916
...        ...
  66  4.098519
  67 -1.592476
  68  1.057753
  69 -1.329330
  70  0.048740

[41 rows x 1 columns]

In [34]: dfmi.loc[pd.IndexSlice['a', 30:70], :]
Out[34]: 
          Test
a 30  0.295775
  31 -1.544894
  32  0.098592
  33 -1.111341
  34 -0.165816
...        ...
  66 -0.241730
  67  0.545091
  68  0.958949
  69  0.186351
  70  0.792855

[41 rows x 1 columns]

There might be something odd going on with the indexers. It would be helpful if you could take a look and see whether you can spot where this is happening. You can just use your test script and step through the debugger.
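
Since the fully qualified form works, one possible stopgap (a sketch, not something proposed in this thread) is to select each first-level label explicitly and concatenate the pieces. The names `pieces` and `workaround` are illustrative only.

```python
import numpy as np
import pandas as pd

freq = ['a', 'b', 'c', 'd']
idx = pd.MultiIndex.from_product([freq, np.arange(500)])
dfmi = pd.DataFrame(np.random.randn(2000), index=idx, columns=['Test'])

# Fully qualified selection per first-level label, as in In [33] / In [34].
pieces = [dfmi.loc[pd.IndexSlice[key, 30:70], :] for key in freq]

# Stitch the per-label results back into one frame with the same MultiIndex layout.
workaround = pd.concat(pieces)

print(workaround.loc['a'].shape)
print(workaround.loc['d'].shape)
```

This trades a single vectorised selection for one lookup per label, so it is only reasonable when the first level has few distinct labels.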

@jreback jreback added this to the 0.18.1 milestone Mar 23, 2016
@ruoyu0088

I found the problem. It's the line in convert_indexer() in MultiIndex._get_level_indexer():

m[np.in1d(labels,r,assume_unique=True)] = True

Here is the documentation for the assume_unique parameter of in1d():

    assume_unique : bool, optional
        If True, the input arrays are both assumed to be unique, which
        can speed up the calculation.  Default is False.

But labels is not unique in this case, so setting assume_unique=False fixes the problem:

m[np.in1d(labels,r,assume_unique=False)] = True
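
To illustrate why the labels are not unique, here is a minimal standalone sketch. It does not reproduce the pandas internals: `labels` and `r` are hand-built stand-ins for the level codes and the selected range, and `np.isin` is used in place of `in1d` (it accepts the same `assume_unique` argument) on the assumption that newer NumPy releases prefer it.

```python
import numpy as np

# Stand-in for the inner-level codes of a 4 x 500 product index:
# every inner code 0..499 appears once per outer label, i.e. 4 times.
labels = np.tile(np.arange(500), 4)

# Stand-in for the codes selected by the inclusive slice 30:70.
r = np.arange(30, 71)

# The codes contain duplicates, so the assume_unique=True precondition does not hold.
has_duplicates = len(np.unique(labels)) < len(labels)

# Build the boolean mask without assuming uniqueness.
m = np.zeros(len(labels), dtype=bool)
m[np.isin(labels, r, assume_unique=False)] = True

# Independent reference mask computed by direct comparison.
expected = (labels >= 30) & (labels <= 70)
print(has_duplicates, int(m.sum()), bool((m == expected).all()))
```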

@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 26, 2016
@jreback
Contributor

jreback commented Apr 26, 2016

@ruoyu0088 if you want to do a PR for this, that would be great (and set assume_unique=level.is_unique, I think).

4 participants