Skip to content

PERF: custom ops for RangeIndex.[all|any|__contains__] #26617

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Jun 2, 2019

Follow-up to #26565. Make RangeIndex._data be created in fewer cases.

Performance examples (but the larger gain is from memory savings by not creating the _data array):

>>>  rng = pd.RangeIndex(-5_000_000, 0, 6)
>>> %timeit rng.all()
789 µs ± 5.24 µs per loop  # master
216 ns ± 0.957 ns per loop  # this PR
>>> %timeit rng.any()
763 µs ± 6.08 µs per loop  # master
124 ns ± 9.15 ns per loop  # this PR
>>> %timeit (-2 in rng)
380 ns ± 1.85 ns per loop  # master
689 ns ± 44.7 ns per loop  # this PR, slightly slower

@topper-123 topper-123 added Index Related to the Index class or subclasses Performance Memory or execution speed performance labels Jun 2, 2019
@topper-123 topper-123 added this to the 0.25.0 milestone Jun 2, 2019
def __contains__(self, key):
hash(key)
try:
key = com.ensure_python_int(key)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Python int is needed, else Python iterates over the range, making the operation very slow.

@topper-123 topper-123 force-pushed the RangeIndex_all_any_contains branch from 8bca43c to cfd68e5 Compare June 2, 2019 09:27
@codecov
Copy link

codecov bot commented Jun 2, 2019

Codecov Report

Merging #26617 into master will decrease coverage by <.01%.
The diff coverage is 92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26617      +/-   ##
==========================================
- Coverage   91.85%   91.85%   -0.01%     
==========================================
  Files         174      174              
  Lines       50722    50741      +19     
==========================================
+ Hits        46593    46606      +13     
- Misses       4129     4135       +6
Flag Coverage Δ
#multiple 90.39% <92%> (-0.01%) ⬇️
#single 41.79% <72%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/common.py 98.48% <100%> (+0.06%) ⬆️
pandas/core/indexes/base.py 96.71% <100%> (-0.01%) ⬇️
pandas/core/indexes/range.py 97.6% <87.5%> (-0.46%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 437efa6...cfd68e5. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 2, 2019

Codecov Report

Merging #26617 into master will decrease coverage by 1.46%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26617      +/-   ##
==========================================
- Coverage   91.87%   90.41%   -1.47%     
==========================================
  Files         174      174              
  Lines       50661    50668       +7     
==========================================
- Hits        46547    45809     -738     
- Misses       4114     4859     +745
Flag Coverage Δ
#multiple 90.41% <100%> (ø) ⬆️
#single ?
Impacted Files Coverage Δ
pandas/core/indexes/base.py 96.71% <100%> (-0.01%) ⬇️
pandas/core/indexes/range.py 98.52% <100%> (+0.04%) ⬆️
pandas/io/feather_format.py 21.05% <0%> (-68.43%) ⬇️
pandas/core/computation/pytables.py 62.5% <0%> (-27.75%) ⬇️
pandas/io/pytables.py 63.82% <0%> (-26.48%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/computation/common.py 84.21% <0%> (-5.27%) ⬇️
pandas/core/computation/expr.py 94.5% <0%> (-3.03%) ⬇️
pandas/io/clipboard/clipboards.py 31.88% <0%> (-2.9%) ⬇️
pandas/io/formats/printing.py 84.49% <0%> (-1.07%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 891a419...4aea46d. Read the comment docs.

@@ -490,3 +490,13 @@ def f(x):
f = mapper

return f


def ensure_python_int(value: Union[int, Any]) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put in dtypes.cast but i don’t think u need this as we already have ensure_platform_int

Copy link
Contributor Author

@topper-123 topper-123 Jun 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure_platform_int returns a numpy int(32|64), so won't work here:

>>> rng = range(1, 1_000_000)
>>> %timeit 999_999 in rng
133 ns ± 2.46 ns per loop
>>> %timeit np.int64(999_999) in rng
306 ms ± 66.3 ms per loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok in any event move it to collocate with the ensure _ routines

@topper-123 topper-123 force-pushed the RangeIndex_all_any_contains branch 3 times, most recently from 4826efc to 3ca546e Compare June 2, 2019 15:14
@@ -1333,3 +1334,13 @@ def maybe_cast_to_integer_array(arr, dtype, copy=False):
if is_integer_dtype(dtype) and (is_float_dtype(arr) or
is_object_dtype(arr)):
raise ValueError("Trying to coerce float values to integers")


def ensure_python_int(value: Union[int, Any]) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave the wrong direction, can you put in core/dtypes/common (right below the other ensure_*)

and can you add a doc-string (I know its trivial)

the typing seems odd here, shouldn't this be Union[int, np.integer] ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure. In principle a float or a np.floating and possibly others could for some subvalues pass the test (e.g. if value is 1.0). But ok, passing a float would be a code smell, so just as well make the programmer see passing those as typing errors.

def __contains__(self, key):
hash(key)
try:
key = cast.ensure_python_int(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import directly

def all(self):
return 0 not in self._range

def any(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to wait for this until #26581 is merged, as mypy doesn't understand/see attributes defined in __new__ or _simple_new . If that one is merged, I can go through this again wrt. typing.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments

@topper-123 topper-123 force-pushed the RangeIndex_all_any_contains branch 3 times, most recently from 34706ce to fa4c649 Compare June 3, 2019 05:34
@topper-123
Copy link
Contributor Author

Comments addressed. The typing issue must wait until #26581 is merged for mypy to see _range.

@topper-123 topper-123 force-pushed the RangeIndex_all_any_contains branch 2 times, most recently from 3d52e46 to 949d452 Compare June 5, 2019 14:39
@jreback
Copy link
Contributor

jreback commented Jun 5, 2019

can you add the issue number onto the one you already have in Performance section; merge master as well. ping on green.

@topper-123 topper-123 force-pushed the RangeIndex_all_any_contains branch from 949d452 to 71ab9ed Compare June 5, 2019 23:34
@topper-123 topper-123 force-pushed the RangeIndex_all_any_contains branch from 71ab9ed to 4aea46d Compare June 5, 2019 23:38
@topper-123 topper-123 merged commit 649ad5c into pandas-dev:master Jun 6, 2019
@topper-123 topper-123 deleted the RangeIndex_all_any_contains branch June 19, 2019 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants