reorder_categories() on categorical index of interval values produces bizarre results #23452

Closed
tmelconian opened this issue Nov 1, 2018 · 3 comments · Fixed by #36648
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@tmelconian

Problem description

Categorical indexes whose values are intervals produce bizarre results (every label becomes NaN) when reordered with .reorder_categories. This is important because such indexes and columns are produced by .cut, so they occur in analysis even if the user never creates them directly.
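To illustrate how such columns arise without the user asking for them, here is a minimal sketch. It assumes a reasonably recent pandas; the five values are simply the first total_bill entries shown in the realistic example further down, and the bin count of 3 is arbitrary.

```python
import pandas as pd

# Illustrative values only: the first few total_bill entries from the report.
bills = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59], name="total_bill")

# pd.cut bins the values and returns a categorical Series ...
binned = pd.cut(bills, 3)
print(binned.dtype)

# ... whose categories are intervals, held in an IntervalIndex.
print(type(binned.cat.categories).__name__)
```

Any later call to .cat.reorder_categories on such a column therefore goes through the interval code path described in this report.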

Minimal Example

With strings, this works as expected:

d = pd.DataFrame(index=pd.CategoricalIndex(['a','b','c'],ordered=True), data={'x':[1,2,3]})
d.index=d.index.reorder_categories(d.index.categories[::-1])
print(d.sort_index())
   x
c  3
b  2
a  1

With intervals, it does not:

e=pd.DataFrame(index=pd.CategoricalIndex([pd.Interval(0,1), pd.Interval(1,2), pd.Interval(2,3)],ordered=True),data={'x':[1,2,3]})
e.index=e.index.reorder_categories(e.index.categories[::-1])
print(e.sort_index())
     x
NaN  1
NaN  2
NaN  3

Expected Output

        x
(2, 3]  3
(1, 2]  2
(0, 1]  1

More Realistic Example

import seaborn as sns  # provides the 'tips' example dataset

tips=sns.load_dataset('tips')
tips['total_bin']=pd.cut(tips['total_bill'],10)
tips.total_bin.cat.reorder_categories(tips.total_bin.cat.categories[::-1],inplace=True)
tips.head()
   total_bill   tip     sex smoker  day    time  size total_bin
0       16.99  1.01  Female     No  Sun  Dinner     2       NaN
1       10.34  1.66    Male     No  Sun  Dinner     3       NaN
2       21.01  3.50    Male     No  Sun  Dinner     3       NaN
3       23.68  3.31    Male     No  Sun  Dinner     2       NaN
4       24.59  3.61  Female     No  Sun  Dinner     4       NaN

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jschendel
Member

Thanks. I think the root cause of this issue is a bug in IntervalIndex.get_indexer, which CategoricalIndex eventually calls if you trace through the code:

In [2]: ii1 = pd.interval_range(0, 3)

In [3]: ii1
Out[3]:
IntervalIndex([(0, 1], (1, 2], (2, 3]],
              closed='right',
              dtype='interval[int64]')

In [4]: ii2 = ii1[::-1]

In [5]: ii2
Out[5]:
IntervalIndex([(2, 3], (1, 2], (0, 1]],
              closed='right',
              dtype='interval[int64]')

In [6]: ii2.get_indexer(ii1)
Out[6]: array([-1, -1, -1], dtype=int64)

The output in Out[6] should be [2, 1, 0].
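As a rough cross-check of that expected value, the sketch below computes the positions with a plain dictionary lookup and compares them with get_indexer. It assumes a pandas version in which the bug is already fixed. The last step is an inference rather than something confirmed in this thread: since code -1 marks a missing value in a Categorical, an all -1 indexer would plausibly explain the all-NaN index in the report.

```python
import pandas as pd

ii1 = pd.interval_range(0, 3)   # (0, 1], (1, 2], (2, 3]
ii2 = ii1[::-1]                 # (2, 3], (1, 2], (0, 1]

# Reference result: position of each interval of ii1 within ii2, or -1 if absent.
position = {interval: pos for pos, interval in enumerate(ii2)}
reference = [position.get(interval, -1) for interval in ii1]
print(reference)  # [2, 1, 0]

# On a pandas version without the bug this should agree with the reference.
print(ii2.get_indexer(ii1).tolist())

# Code -1 is the missing-value marker of a Categorical, so an all -1 indexer
# would turn every label into NaN.
all_missing = pd.Categorical.from_codes([-1, -1, -1], categories=ii2)
print(all_missing.isna().all())  # True
```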

Coincidentally, I'm currently working on a fix for another issue that also resolves the bug above (among other things), and it does indeed fix the issue you've described.

On my WIP branch:

In [2]: ii1 = pd.interval_range(0, 3)

In [3]: ii2 = ii1[::-1]

In [4]: ii2.get_indexer(ii1)
Out[4]: array([2, 1, 0], dtype=int64)

In [5]: e=pd.DataFrame(index=pd.CategoricalIndex([pd.Interval(0,1), pd.Interval(1,2), pd.Interval(2,3)],ordered=True),data={'x':[1,2,3]})
   ...: e.index=e.index.reorder_categories(e.index.categories[::-1])
   ...: e.sort_index()
   ...:
Out[5]:
        x
(2, 3]  3
(1, 2]  2
(0, 1]  1

Not quite ready to open up a PR yet, but I'm planning for my WIP branch to be incorporated in the next release (0.24.0), so this bug should be fixed then too.

@jschendel jschendel added the Interval Interval data type label Nov 1, 2018
@jschendel jschendel added this to the 0.24.0 milestone Nov 1, 2018
@jschendel jschendel added the Categorical Categorical Data Type label Nov 1, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 25, 2018
@mroeschke
Member

This looks fixed on master. Could use a test.

In [54]: e=pd.DataFrame(index=pd.CategoricalIndex([pd.Interval(0,1), pd.Interval(1,2), pd.Interval(2,3)]
    ...: ,ordered=True),data={'x':[1,2,3]})
    ...: e.index=e.index.reorder_categories(e.index.categories[::-1])
    ...: print(e.sort_index())
        x
(2, 3]  3
(1, 2]  2
(0, 1]  1

In [55]: pd.__version__
Out[55]: '1.1.0.dev0+1390.gf3fdab389'
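A regression test for this could look roughly like the sketch below. It is not the test that was eventually merged; the function names are hypothetical, and it only restates the minimal example from the report as assertions.

```python
import pandas as pd


def reorder_and_sort():
    # Build the frame from the report: an ordered CategoricalIndex of intervals.
    index = pd.CategoricalIndex(
        [pd.Interval(0, 1), pd.Interval(1, 2), pd.Interval(2, 3)], ordered=True
    )
    df = pd.DataFrame({"x": [1, 2, 3]}, index=index)
    # Reverse the category order, then sort by the new order.
    df.index = df.index.reorder_categories(df.index.categories[::-1])
    return df.sort_index()


def test_reorder_categories_interval_index():
    # Hypothetical test name; checks the "Expected Output" from the report.
    result = reorder_and_sort()
    assert not result.index.isna().any()
    assert list(result.index) == [
        pd.Interval(2, 3),
        pd.Interval(1, 2),
        pd.Interval(0, 1),
    ]
    assert list(result["x"]) == [3, 2, 1]
```

Run under pytest, this would fail on a version with the get_indexer bug (the index would be all NaN) and pass once the labels survive the reordering.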

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Categorical Categorical Data Type Interval Interval data type labels Apr 27, 2020
@fokoid
Contributor

fokoid commented Sep 25, 2020

take
