Skip to content

BUG: cut does not respect order of passed labels #16459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ProsperousHeart opened this issue May 23, 2017 · 8 comments
Closed

BUG: cut does not respect order of passed labels #16459

ProsperousHeart opened this issue May 23, 2017 · 8 comments
Labels
Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@ProsperousHeart
Copy link
Contributor

ProsperousHeart commented May 23, 2017

This issue was found when working on #16432

    Examples
    --------
    >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
    ([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
    Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < 
    ...        (6.533, 9.7]], array([ 0.1905    ,  3.36666667,  6.53333333,
    ...         9.7       ]))

    >>> result = pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
    ...                          3, labels=["good","medium","bad"])
    [good, good, good, medium, bad, good]
    Categories (3, object): [good < medium < bad]

Problem description

When running pytest -x --doctest-modules core/reshape/tile.py against the work inside of the "reshape-3439" branch, we see the following:

(pandas_dev) C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas>pytest -x --doctest-modules core/reshape/tile.py
============================= test session starts =============================
platform win32 -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas, inifile:
plugins: cov-2.3.1
collected 2 items

core\reshape\tile.py F

================================== FAILURES ===================================
___________________ [doctest] pandas.core.reshape.tile.cut ____________________
072     the resulting Categorical object
073
074
075     Examples
076     --------
077     >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
078     ([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
079     Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < (6.533, 9.7]], array([ 0.1905    ,  3.36666667,  6.53333333,  9.7       ]))
080
081     >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
Expected:
    [good, good, good, medium, bad, good]
    Categories (3, object): [good < medium < bad]
Got:
    [good, good, good, medium, bad, good]
    Categories (3, object): [bad < good < medium]

C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas\core\reshape\tile.py:81: DocTestFailure

This is backwards.

Expected Output

Should clear this - this is not an issue.

Output of pd.show_versions()

Unable to run the pd.show_versions()
@jorisvandenbossche jorisvandenbossche added Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version labels May 23, 2017
@jorisvandenbossche jorisvandenbossche added this to the 0.20.2 milestone May 23, 2017
@jorisvandenbossche jorisvandenbossche changed the title DocTest Shows Wrong Order BUG: cut does not respect order of passed labels May 23, 2017
@jorisvandenbossche
Copy link
Member

In [3]: pd.__version__
Out[3]: '0.19.2'

In [4]: pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
Out[4]: 
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]
In [12]: pd.__version__
Out[12]: '0.21.0.dev+68.gcc734ba'

In [13]: pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
Out[13]: 
[good, good, good, medium, bad, good]
Categories (3, object): [bad < good < medium]

@economy
Copy link
Contributor

economy commented May 23, 2017

I'm looking into this.

core/reshape/tile.py line 255: looks like pd.Categorical(ordered=True) orders the labels alphabetically regardless of how they're passed in.

In [14]: labels = ['okay', 'good', 'bad']
In [15]: pd.Categorical(labels, ordered=True)

Out [15]: [okay, good, bad]
Categories (3, object): [bad < good < okay]

@economy
Copy link
Contributor

economy commented May 23, 2017

If I have a categorical, and I want it to be ordered, and I pass a list of strings as labels, it doesn't seem like I'd ever want the sorted version of those labels to be my rank ordering. Is there a case where I'd actually want to sort those strings alphabetically?

@TomAugspurger
Copy link
Contributor

@economy pd.Categorical(labels, categories=labels, ordered=True). I thought it respected the observed order with ordered=True when categories=None, but I guess not.

@economy
Copy link
Contributor

economy commented May 23, 2017

So, if I'm reading this correctly, updating the test to explicitly set categories fixes the issue and follows the appropriate way to use pd.Categorical

@TomAugspurger
Copy link
Contributor

I think we need a new test in test_tile.py You can just copy the example #16459 (comment) and assert that it matches the output from 0.19.2 (with the order "good" < "medium" < "bad".

Ideally you would

  1. checkout a new branch
  2. write the test
  3. verify that the test fails
  4. write the fix (by passing categories=labels
  5. document the change

@economy
Copy link
Contributor

economy commented May 23, 2017

@TomAugspurger, I can create a new test for this, but pd.cut() still doesn't act as expected.

If I update core/reshape/tile.py line 255 to call pd.Categorical using the categories attribute correctly, the original test passes and output for pd.cut() is as described (expected) above.

@TomAugspurger
Copy link
Contributor

the original test passes and output for pd.cut() is as described (expected) above.

Do you mean the one reported by @ProsperousHeart? That is a doctest, which aren't setup to run automatically (yet, @ProsperousHeart is getting them fixed up so that we can do that). In the meantime we need a regular unit test to verify that the fix works.

economy added a commit to economy/pandas that referenced this issue May 23, 2017
TomAugspurger pushed a commit to economy/pandas that referenced this issue May 30, 2017
@jreback jreback closed this as completed in d419be4 Jun 1, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Jun 1, 2017
closes pandas-dev#16459

Author: economy <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <[email protected]>

Closes pandas-dev#16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()

(cherry picked from commit d419be4)
TomAugspurger pushed a commit that referenced this issue Jun 4, 2017
closes #16459

Author: economy <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <[email protected]>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()

(cherry picked from commit d419be4)
Kiv pushed a commit to Kiv/pandas that referenced this issue Jun 11, 2017
closes pandas-dev#16459

Author: economy <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <[email protected]>

Closes pandas-dev#16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()
stangirala pushed a commit to stangirala/pandas that referenced this issue Jun 11, 2017
closes pandas-dev#16459

Author: economy <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <[email protected]>

Closes pandas-dev#16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants