BUG: Fix edge cases in merge_asof() by comparing factorized keys (#13709) #13836
Conversation
does this have the same perf characteristics? (I was manually doing the perf comparisons) ... prob should have some asv's for this ...
lc = left_count[i]
rc = right_count[i]

if rc == 0:
really simplified!
Current coverage is 85.23% (diff: 100%)

@@           master   #13836   diff @@
======================================
  Files         140      140
  Lines       50455    50449     -6
  Methods         0        0
  Messages        0        0
  Branches        0        0
======================================
- Hits        43014    43001    -13
- Misses       7441     7448     +7
  Partials        0        0
@jreback I had thought about perf issues, but since the logic is a lot smaller (it avoids the call to groupsort_indexer(), for example), it has to be faster.
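Here's a rough Python sketch of what the factorized-key approach does (the real implementation is Cython; the names and shapes here are illustrative, not the actual code):

import numpy as np
import pandas as pd

def asof_join_by_sketch(left_on, right_on, left_by, right_by):
    # Factorize both by-key columns together so equal keys share a code.
    # Assumes both sides are already sorted by the on-key, as merge_asof()
    # requires.
    codes, _ = pd.factorize(np.concatenate([left_by, right_by]))
    lby, rby = codes[:len(left_by)], codes[len(left_by):]

    last_seen = np.full(codes.max() + 1, -1, dtype=np.int64)  # per key code
    right_indexer = np.full(len(left_on), -1, dtype=np.int64)

    j = 0
    for i in range(len(left_on)):
        # advance the right pointer through rows at or before this left time,
        # remembering the latest right position for each key code
        while j < len(right_on) and right_on[j] <= left_on[i]:
            last_seen[rby[j]] = j
            j += 1
        right_indexer[i] = last_seen[lby[i]]
    return right_indexer

Because equal by-keys share an integer code, the inner loop only ever compares integers, and no grouping pass is needed up front.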
actually I suspect it's much slower. the bottleneck never was the cython code; that is why the check_duplicates helps.
Can you show me the code you used for the perf tests? I'll run them manually and report back.
notebook is here. these used the original trades.csv, quotes.csv that you had sent me;
they are not in the repo. we should prob generate some for test purposes.
I've refreshed my data sample since that notebook, but my DataFrames are a similar size.
I ran the numbers with a fresh build both before and after my code change. Before:
After:
So my code change is moderately faster, which is what I expect. The Python logic didn't really change since the
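For reference, the harness I use looks roughly like this (the frames below are randomly generated stand-ins for the private trades/quotes samples, which aren't in the repo; sizes and columns are illustrative):

import timeit
import numpy as np
import pandas as pd

n = 100000
tickers = ['AAPL', 'GOOG', 'MSFT']
# merge_asof() requires the on-key to be sorted, hence np.sort
trades = pd.DataFrame({'time': np.sort(np.random.randint(0, n // 20, n)),
                       'ticker': np.random.choice(tickers, n),
                       'price': np.random.rand(n)})
quotes = pd.DataFrame({'time': np.sort(np.random.randint(0, n // 20, n)),
                       'ticker': np.random.choice(tickers, n),
                       'bid': np.random.rand(n)})

# average seconds per call over 10 runs
print(timeit.timeit(
    lambda: pd.merge_asof(trades, quotes, on='time', by='ticker'),
    number=10) / 10)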
@@ -436,7 +427,7 @@ def _merger(x, y):
     if by is not None:
         result, groupby = _groupby_and_merge(by, on, left, right,
                                              lambda x, y: _merger(x, y),
-                                             check_duplicates=check_duplicates)
+                                             check_duplicates=False)
Note that check_duplicates is always False now, since it is no longer needed.
try it with duplicates
Are you referring to duplicate timestamps and tickers? That's already in my sample data:
And of course the regression tests with duplicate data pass.
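For concreteness, a toy case with duplicate timestamps within a ticker, in the spirit of the sample data (the values are made up):

import pandas as pd

quotes = pd.DataFrame({'time': [1, 1, 2, 2, 3],
                       'ticker': ['A', 'A', 'A', 'B', 'B'],
                       'bid': [10.0, 10.1, 10.2, 20.0, 20.1]})
trades = pd.DataFrame({'time': [2, 3],
                       'ticker': ['A', 'B'],
                       'price': [10.15, 20.05]})

# duplicate times within a ticker are fine; each trade picks up the
# most recent quote for its ticker at or before its own time
print(pd.merge_asof(trades, quotes, on='time', by='ticker'))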
can you add an asv (prob need to generate data given a seed)
Will do.
I added a couple of simple benchmarks. The new version slightly outperforms, as I expected.
np.random.seed(0)
one_count = 200000
two_count = 1000000
self.df1 = pd.DataFrame({'time': np.random.random_integers(0, one_count/20, one_count),
use np.random.randint(....) as random_integers is deprecated
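e.g. something along these lines (the value columns are illustrative, and note that randint's upper bound is exclusive where random_integers' was inclusive, hence the + 1):

import numpy as np
import pandas as pd

class MergeAsof(object):
    def setup(self):
        np.random.seed(0)
        one_count = 200000
        two_count = 1000000
        # randint's high is exclusive, so add 1 to match random_integers;
        # sort so the on-key is monotonic, as merge_asof() requires
        self.df1 = pd.DataFrame(
            {'time': np.random.randint(0, one_count // 20 + 1, one_count),
             'value1': np.random.randn(one_count)}).sort_values('time')
        self.df2 = pd.DataFrame(
            {'time': np.random.randint(0, two_count // 20 + 1, two_count),
             'value2': np.random.randn(two_count)}).sort_values('time')

    def time_on(self):
        pd.merge_asof(self.df1, self.df2, on='time')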
@chrisaycock lgtm. just a very small change in asv. ping when green.
Updated as per your notes. Thanks.
@jreback It's green!
thank you sir!
closes #13709
Also removes the now-unnecessary check_duplicates argument.