Skip to content

BUG: Categorical.searchsorted(): use provided categorical order #14697

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

nathalier
Copy link
Contributor

@nathalier nathalier commented Nov 20, 2016

Previously, it used lexical order instead of the provided categorical
order.

Tests updated accordingly.

Closes #14522

Documentation also needs to be updated.
Could you please review? If it's OK I'll append doc update to this pull request.
Thanks!

@codecov-io
Copy link

codecov-io commented Nov 20, 2016

Current coverage is 84.57% (diff: 100%)

Merging #14697 into master will increase coverage by <.01%

@@             master     #14697   diff @@
==========================================
  Files           144        144          
  Lines         51057      51059     +2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43180      43183     +3   
+ Misses         7877       7876     -1   
  Partials          0          0          

Powered by Codecov. Last update f1cfe5b...86b42d0

@@ -1548,50 +1548,47 @@ def test_memory_usage(self):

def test_searchsorted(self):
# https://github.com/pandas-dev/pandas/issues/8420
s1 = pd.Series(['apple', 'bread', 'bread', 'cheese', 'milk'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check these for Series (as well as Categorical)

@jreback jreback added Bug Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 21, 2016
@jreback
Copy link
Contributor

jreback commented Nov 21, 2016

add a whatsnew note in 0.19.2


return self.codes.searchsorted(values_as_codes, sorter=sorter)
if -1 in values_as_codes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this check after you search otherwise you end up scanning the data twice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, could you explain it please?
I don't see how it's possible to avoid this check and additional data scanning, if we want exception to be thrown for values not from categories list.
searchsorted() returnes 0 for -1 code (absent category), as well as for smallest category value present in the list. Thus, searchsorted() does not allow to differentiate between the value from categories which should be put on 0th position and value not from categories (0 is also returned).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see here: https://github.com/pandas-dev/pandas/blob/master/pandas/computation/pytables.py#L203. The idea IS to use searchsorted. Then check the 0's (only). If they are not actual categories, then you would raise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't searching for 0's in the result of searchsorted() generally the same as searching for -1 among codes to be inserted?

@jreback
Copy link
Contributor

jreback commented Dec 16, 2016

can you rebase

Previously, it used lexical order instead of the provided categorical
order.

Tests updated accordingly.

Closes pandas-dev#14522
@@ -96,3 +96,5 @@ Bug Fixes
- Bug in ``.plot(kind='kde')`` which did not drop missing values to generate the KDE Plot, instead generating an empty plot. (:issue:`14821`)

- Bug in ``unstack()`` if called with a list of column(s) as an argument, regardless of the dtypes of all columns, they get coerced to ``object`` (:issue:`11847`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to 0.20.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and put under Other API changes

self.assert_numpy_array_equal(res, exp)
self.assert_numpy_array_equal(res, chk)
# https://github.com/pandas-dev/pandas/issues/14522

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a 1-line about what the guarantees are here

@jreback jreback added this to the 0.20.0 milestone Dec 20, 2016
@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

lgtm. minor doc-changes. pls also review categorical.rst if any changes are needed.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

thanks!

@jreback jreback closed this in 0252385 Dec 30, 2016
@nathalier
Copy link
Contributor Author

Sorry I was unavailable for the last two weeks..
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Categorical.searchsorted() uses lexical order instead of the provided categorical order
3 participants