Skip to content

PERF: faster categorical ops for equal or larger than scalar #29820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 25, 2019

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Nov 23, 2019

np.nan is always encoded to -1 in a Categorical, so encoded smaller than all other possible values.

So, if we're checking if a Categorical is equal or larger than a scalar, it is not necessary to find locations of -1, as those will always be False already.

Performance

>>> n = 1_000_000
>>> c = pd.Categorical([np.nan] * n + ['b'] * n + ['c'] * n,
                       dtype=pd.CategoricalDtype(['a', 'b', 'c']), ordered=True)
>>> %timeit c == 'b'
9.79 ms ± 32.4 µs per loop  # master
3.94 ms ± 38.7 µs per loop  # this PR
# the improvement isn't available for less than etc. ops
>>> %timeit c <= 'b'
10.1 ms ± 44.1 µs per loop  # both master and this PR

@topper-123 topper-123 changed the title PERF: faster categorical ops for equal or larger scalars PERF: faster categorical ops for equal or larger than scalar Nov 23, 2019
@gfyoung gfyoung added Categorical Categorical Data Type Performance Memory or execution speed performance labels Nov 23, 2019
- Performance improvement when comparing a :meth:`Categorical` with a scalar and the scalar is not found in the categories (:issue:`29750`)
- Performance improvement when comparing a :class:`Categorical` with a scalar and the scalar is not found in the categories (:issue:`29750`)
- Performance improvement when checking if values in a :class:`Categorical` are equal, equal or larger or larger than a given scalar.
The improvement is not present if checking if the :class:`Categorical` is less than or less than or equal than the scalar (:issue:`xxxxx`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we have an issue number here? Otherwise, we should use the PR number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, added.

if opname not in {"eq", "__eq__", "ge", "__ge__", "gt", "__gt__"}:
# check for NaN needed if we are not equal or larger
mask = self._codes == -1
ret[mask] = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a performance test for this?

Copy link
Contributor Author

@topper-123 topper-123 Nov 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, no ASV's for this ATM. I actually can't get ASV to run locally, maybe a Windows issue?

Anyway, I've added I've a ASV test set, but haven't been able to run it myself, unforfunately. Isn't there a web page, where we post ASVs?

# check for NaN in self
mask = self._codes == -1
ret[mask] = False
if opname not in {"eq", "__eq__", "ge", "__ge__", "gt", "__gt__"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very strange that you are checking for eq and eq it’s just eq

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, changed.

def time_union(self):
union_categoricals([self.a, self.b])


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved this down so the constructor checks are first, which seems logical.

@jreback jreback added this to the 1.0 milestone Nov 25, 2019
@jreback jreback merged commit de3db0a into pandas-dev:master Nov 25, 2019
@jreback
Copy link
Contributor

jreback commented Nov 25, 2019

thanks @topper-123

@topper-123 topper-123 deleted the categorical_scalar_perf branch November 25, 2019 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants