Skip to content

PERF: Using Numpy C-API arange #32681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 14, 2020
Merged

Conversation

ShaharNaveh
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This PR was opened as @jbrockmendel suggested (ref #32177 (comment))


Benchmarks:

Master:

In [1]: import pandas._libs.internals as internals

In [2]: %timeit internals.BlockPlacement(slice(1_000_000)).as_array
1.55 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

PR:

In [1]: import pandas._libs.internals as internals

In [2]: %timeit internals.BlockPlacement(slice(1_000_000)).as_array
1.46 ms ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@jbrockmendel
Copy link
Member

Since the perf numbers are pretty similar, can you report a few rounds of each

@ShaharNaveh
Copy link
Member Author

Since the perf numbers are pretty similar, can you report a few rounds of each

Sure!


In [1]: import pandas._libs.internals as internals                                                            

In [2]: %timeit internals.BlockPlacement(slice(1)).as_array                                                   
1.46 µs ± 45.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master
586 ns ± 35.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR

In [3]: %timeit internals.BlockPlacement(slice(10)).as_array                                                  
1.65 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master
832 ns ± 19.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR

In [4]: %timeit internals.BlockPlacement(slice(100)).as_array                                                 
1.74 µs ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master
879 ns ± 12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR

In [5]: %timeit internals.BlockPlacement(slice(1_000)).as_array                                               
2.94 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) # master
2.1 µs ± 77.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR

In [6]: %timeit internals.BlockPlacement(slice(10_000)).as_array                                              
10.1 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) # master
9.39 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) # PR

In [7]: %timeit internals.BlockPlacement(slice(100_000)).as_array                                             
81 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # master
83 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # PR

In [8]: %timeit internals.BlockPlacement(slice(1_000_000)).as_array                                           
1.72 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # master
1.56 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # PR

@jbrockmendel
Copy link
Member

thats a nice improvement for the small cases, thanks

@@ -105,7 +107,9 @@ cdef class BlockPlacement:
Py_ssize_t start, stop, end, _
if not self._has_array:
start, stop, step, _ = slice_get_indices_ex(self._as_slice)
self._as_array = np.arange(start, stop, step, dtype=np.int64)
# NOTE: this is the C-optimized equivalent of
# np.arange(start, stop, step, dtype=np.int64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: i like to add an extra space at the beginning of multi-line comments specifically to avoid confusing this for commented-out code. quotation marks or backticks or ... are also good

@jbrockmendel
Copy link
Member

lgtm, thanks for accomodating my nitpicking

@ShaharNaveh
Copy link
Member Author

@jbrockmendel I have a list of your nitpicks and requests, I try to address them when I have the time :)

@WillAyd
Copy link
Member

WillAyd commented Mar 13, 2020

Why does this generate anything different from Cython? Shouldn't it generate the same result?

@jbrockmendel
Copy link
Member

Why does this generate anything different from Cython? Shouldn't it generate the same result?

The generated C code has to call np.arange in python-space (and lookup np.int64, and ...)

@WillAyd
Copy link
Member

WillAyd commented Mar 13, 2020 via email

@topper-123 topper-123 added Performance Memory or execution speed performance Internals Related to non-user accessible pandas implementation labels Mar 14, 2020
@topper-123 topper-123 added this to the 1.1 milestone Mar 14, 2020
@jreback jreback merged commit 5c7a901 into pandas-dev:master Mar 14, 2020
@jreback
Copy link
Contributor

jreback commented Mar 14, 2020

thanks I agree we should not just wholesale convert calls to use the c-api, but in specific cases where we are actually returning a numpy array (as opposed to iteration), ok

@ShaharNaveh ShaharNaveh deleted the PERF-arange branch March 18, 2020 12:47
SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants