CLN: Use generators when objects are re-iterated over in core/internals #58319
one question but generally lgtm
@@ -1525,7 +1526,9 @@ def _insert_update_mgr_locs(self, loc) -> None:
         When inserting a new Block at location 'loc', we increment
         all of the mgr_locs of blocks above that by one.
         """
-        for blkno, count in _fast_count_smallints(self.blknos[loc:]):
+        # Faster version of set(arr) for sequences of small numbers
+        blknos = np.bincount(self.blknos[loc:]).nonzero()[0]
What is the advantage of this over np.unique?
As mentioned in the code comment, it appears to be more performant:
In [3]: import numpy as np
In [4]: arr = np.array([1])
In [5]: %timeit np.unique(arr)
1.95 µs ± 8.22 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [7]: %timeit np.bincount(arr).nonzero()[0]
416 ns ± 56.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [8]: arr = np.arange(1000)
In [9]: %timeit np.unique(arr)
5.97 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [10]: %timeit np.bincount(arr).nonzero()[0]
3.51 µs ± 16.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [11]: arr = np.arange(100_000)
In [12]: %timeit np.unique(arr)
607 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [13]: %timeit np.bincount(arr).nonzero()[0]
294 µs ± 3.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
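For readers following the timings, a minimal sketch of why the two expressions are interchangeable here, assuming the input is a 1-D array of small non-negative integers (as block numbers are). The helper name `unique_small_ints` and the sample array are made up for illustration and are not part of pandas.

```python
import numpy as np


def unique_small_ints(arr: np.ndarray) -> np.ndarray:
    """Sorted distinct values of a 1-D array of small non-negative ints."""
    # bincount tallies how often each value in 0..arr.max() occurs;
    # nonzero()[0] keeps the values that occur at least once.
    return np.bincount(arr).nonzero()[0]


# Hypothetical block numbers, chosen only for this example.
blknos = np.array([2, 0, 2, 1, 1, 0, 2])
result = unique_small_ints(blknos)

# Both approaches return the same sorted distinct values.
assert np.array_equal(result, np.unique(blknos))
print(result)  # [0 1 2]
```

The difference is in the work done: `np.unique` sorts the input, while `np.bincount` makes a single counting pass, which is consistent with the timings above.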
It's a nice idea, but unless this is getting called in a loop I'd have a slight preference to just go with np.unique since it's more expressive. Seems like we might be relying on some implementation details with bincount that could be hard to generalize.
Though if you really want this, I'd also say it's not a blocker for me.
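To make the generalization concern concrete, here is a small sketch of two cases where the bincount approach stops being a drop-in replacement for np.unique. The arrays and the helper name `distinct_via_bincount` are illustrative assumptions, not code from this PR.

```python
import numpy as np


def distinct_via_bincount(arr: np.ndarray) -> np.ndarray:
    """Distinct values via bincount; only valid for non-negative ints."""
    return np.bincount(arr).nonzero()[0]


# Case 1: negative values. np.unique handles them, bincount raises.
neg = np.array([-1, 0, 3])
try:
    distinct_via_bincount(neg)
    negative_ok = True
except ValueError:
    negative_ok = False

# Case 2: large, sparse values. bincount allocates max(arr) + 1 counters,
# so memory and time scale with the largest value, not with len(arr).
sparse = np.array([0, 1_000_000])
counts_len = len(np.bincount(sparse))

print(negative_ok)  # False
print(counts_len)   # 1000001
```

Neither case should arise for block numbers, which are small non-negative integers, so this only matters if the pattern is reused elsewhere.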
Since this was just transferring over the logic from _fast_count_smallints, I would prefer keeping this as-is in this PR, but I'm not opposed to changing this to np.unique in the future.
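A rough sketch of what "transferring over the logic" could look like. The old loop unpacked `(blkno, count)` pairs, so the helper presumably paired each distinct value with its count. The function `count_small_ints` below is a hypothetical stand-in written for this example, not the actual pandas source.

```python
import numpy as np


def count_small_ints(arr: np.ndarray):
    """Hypothetical stand-in: yield (value, count) for each distinct value."""
    counts = np.bincount(arr)
    nz = counts.nonzero()[0]
    return zip(nz, counts[nz])


# Hypothetical block numbers, chosen only for this example.
blknos = np.array([2, 0, 2, 1, 1, 0, 2])

# Old shape of the loop: values together with their counts.
pairs = [(int(v), int(c)) for v, c in count_small_ints(blknos)]

# New shape: only the distinct values, for callers that ignore the count.
values_only = np.bincount(blknos).nonzero()[0]

print(pairs)        # [(0, 2), (1, 2), (2, 3)]
print(values_only)  # [0 1 2]
```

If the counts are not needed at the call site, inlining the values-only expression avoids building the pairs at all.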
…ls (pandas-dev#58319)
* Make _split generator
* More iterators
* Remove typing
No description provided.