Skip to content

BUG: CategoricalIndex allows reindexing with non-unique CategoricalIndex #23963

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 2, 2018

Conversation

qwhelan
Copy link
Contributor

@qwhelan qwhelan commented Nov 28, 2018

CategoricalIndex.reindex() throws a ValueError if the target index is non-unique and not also a CategoricalIndex. This means a non-unique CategoricalIndex can be used to reindex, resulting in duplicated data:

In [1]: import pandas as pd
        categorical_index = pd.CategoricalIndex(['bar', 'foo', 'bar', 'foo'], ['bar', 'foo'])
        s = pd.Series(range(4), index=categorical_index)

In [2]: s
Out[2]: bar    0
        foo    1
        bar    2
        foo    3
        dtype: int64

In [3]: s.reindex(categorical_index.as_ordered())
Out[3]: bar    0
        bar    2
        foo    1
        foo    3
        bar    0
        bar    2
        foo    1
        foo    3
        dtype: int64

This PR removes the extraneous type check and adds tests checking that an exception is thrown for non-unique target indexes, regardless of type.

@pep8speaks
Copy link

pep8speaks commented Nov 28, 2018

Hello @qwhelan! Thanks for updating the PR.

Comment last updated on November 28, 2018 at 22:26 Hours UTC

@codecov
Copy link

codecov bot commented Nov 28, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@6dd130a). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #23963   +/-   ##
=========================================
  Coverage          ?    92.3%           
=========================================
  Files             ?      161           
  Lines             ?    51558           
  Branches          ?        0           
=========================================
  Hits              ?    47592           
  Misses            ?     3966           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.7% <100%> (?)
#single 42.43% <57.14%> (?)
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.88% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6dd130a...9550880. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you a whatsnew entry, and replicate the df.round() (that should work now) in the OP

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves Categorical Categorical Data Type labels Nov 28, 2018
@qwhelan qwhelan force-pushed the non_unique_categoricalindex branch from 54ab47e to 8366590 Compare November 28, 2018 21:19
@qwhelan
Copy link
Contributor Author

qwhelan commented Nov 28, 2018

@jreback I stumbled across this bug while writing some asv benchmarks and only noticed #21809 after opening this PR, so it ended up being a few more lines to handle the df.round() case.

As we're now raising on non-unique CategoricalIndexes, we need to whitelist non-unique targets that are also the current index to allow df.round() to run. As the construction of the result involves a .reindex() call, we'd otherwise get ValueErrors when calling df.round() on a non-uniquely indexed DataFrame.

This is already done in Index.reindex(), so this whitelisting is bringing CategoricalIndex back in line with the base implementation.

@qwhelan qwhelan force-pushed the non_unique_categoricalindex branch from 8366590 to eec207a Compare November 28, 2018 21:34
@qwhelan
Copy link
Contributor Author

qwhelan commented Nov 28, 2018

Additionally, this is the fixed output from #21809:

print(dfb.round(3))
         a      b      c
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101

@qwhelan qwhelan force-pushed the non_unique_categoricalindex branch 2 times, most recently from 6450827 to 0087ccf Compare November 29, 2018 00:36
idx = pd.CategoricalIndex(['low'] * 3 + ['hi'] * 3)
dfb = pd.DataFrame(np.random.rand(6, 3), columns=list('abc'),
index=idx)
assert dfb.shape == (6, 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you construct the result and use tm.assert_frame_equal here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a tm.assert_frame_equal() check but keeping the explicit dims check as I don't want to inadvertently rely on CategoricalIndex.reindex.

@qwhelan qwhelan force-pushed the non_unique_categoricalindex branch from 0087ccf to 95337f9 Compare November 29, 2018 21:49
@qwhelan
Copy link
Contributor Author

qwhelan commented Nov 29, 2018

@jreback Changes implemented, the one failing test is a flaky hypothesis failure in an unrelated file.

@jreback jreback added this to the 0.24.0 milestone Dec 2, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cosmestic change in tests. ping when green.

@qwhelan qwhelan force-pushed the non_unique_categoricalindex branch from 95337f9 to 9550880 Compare December 2, 2018 18:33
@qwhelan
Copy link
Contributor Author

qwhelan commented Dec 2, 2018

@jreback Green other than the unrelated flaky test

@jreback jreback merged commit 5259e33 into pandas-dev:master Dec 2, 2018
@jreback
Copy link
Contributor

jreback commented Dec 2, 2018

thanks @qwhelan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CategoricalIndex reidex duplicates values
3 participants