
BUG: Fix edge cases in merge_asof() by comparing factorized keys (#13709) #13836

Merged: merged 1 commit into pandas-dev:master on Aug 1, 2016

Conversation

@chrisaycock (Contributor) commented Jul 29, 2016

closes #13709

Also removes unnecessary check_duplicates.
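
For reference, a minimal sketch of the kind of edge case #13709 describes (hypothetical data; the frames, values, and tolerance are illustrative, not taken from the original report):

import pandas as pd

# Right side has duplicated timestamps; before this fix, merge_asof()
# could match a row outside the requested tolerance in cases like this.
trades = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-15 13:30:00.030']),
    'price': [720.50]})
quotes = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-15 13:30:00.000',
                            '2016-07-15 13:30:00.000']),
    'bid': [720.00, 720.10]})

# With a 10ms tolerance, the 30ms-old quotes should NOT match,
# so the bid column of the result should be NaN.
result = pd.merge_asof(trades, quotes, on='time',
                       tolerance=pd.Timedelta('10ms'))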

@jreback (Contributor) commented Jul 29, 2016

does this have the same perf characteristics? (I was manually doing the perf comparisons)... prob should have some asv's for this...

@jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Jul 29, 2016

Diff excerpt under review:

lc = left_count[i]
rc = right_count[i]

if rc == 0:

Review comment (Contributor) on the diff above:

really simplified!

@codecov-io commented Jul 29, 2016

Current coverage is 85.23% (diff: 100%)

Merging #13836 into master will decrease coverage by 0.01%

@@             master     #13836   diff @@
==========================================
  Files           140        140          
  Lines         50455      50449     -6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43014      43001    -13   
- Misses         7441       7448     +7   
  Partials          0          0          

Powered by Codecov. Last update 59f2557...c2e049c

@chrisaycock (Contributor, Author):

@jreback I had thought about perf issues, but since the logic is a lot smaller (avoids the call to groupsort_indexer(), for example), it has to be faster.

@jreback (Contributor) commented Jul 29, 2016

actually I suspect it's much slower

the bottleneck was never the Cython code, but rather the grouping / indexing

that is why the check_duplicates flag helps

@chrisaycock (Contributor, Author):

Can you show me the code you used for the perf tests? I'll run them manually and report back.

@jreback (Contributor) commented Jul 29, 2016

notebook is here

these used the original trades.csv, quotes.csv that you had sent me:

-rw-rw-r--@ 1 jreback  staff  17202707 Jun  2 18:19 trades.csv
-rw-rw-r--@ 1 jreback  staff  47853000 Jun  2 18:19 quotes.csv

they are not in the repo; we should prob generate some for test purposes
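
A hedged sketch of how comparable data could be generated from a seed (sizes, tickers, and the random-walk price model are all illustrative):

import numpy as np
import pandas as pd

np.random.seed(1234)
n = 100000
start = pd.Timestamp('2016-07-15 13:30')
tickers = np.array(['AAPL', 'GOOG', 'MSFT'])

# merge_asof() requires the 'on' key to be sorted, so sort the offsets
quotes = pd.DataFrame({
    'time': start + pd.to_timedelta(
        np.sort(np.random.randint(0, 60 * 1000, n)), unit='ms'),
    'ticker': tickers[np.random.randint(0, len(tickers), n)],
    'bid': 100 + np.random.randn(n).cumsum() / 100.0})
quotes['ask'] = quotes['bid'] + 0.01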

@chrisaycock (Contributor, Author):

I've refreshed my data sample since that notebook, but my DataFrames are a similar size.

In [3]: trades.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319018 entries, 0 to 319017
Data columns (total 4 columns):
time        319018 non-null datetime64[ns]
ticker      319018 non-null category
price       316669 non-null float64
quantity    319018 non-null int64
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 7.6 MB

In [4]: quotes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1032780 entries, 0 to 1032779
Data columns (total 4 columns):
time      1032780 non-null datetime64[ns]
ticker    1032780 non-null category
bid       1030431 non-null float64
ask       1030431 non-null float64
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 24.6 MB

I ran the numbers with a fresh build both before and after my code change.

Before:

In [5]: %timeit -n 1 -r 1 pd.merge_asof(trades, quotes, on='time', by='ticker')
1 loop, best of 1: 970 ms per loop

After:

In [5]: %timeit -n 1 -r 1 pd.merge_asof(trades, quotes, on='time', by='ticker')
1 loop, best of 1: 835 ms per loop

So my code change is moderately faster, which is what I expected. The Python logic didn't really change, since check_duplicates is now gone (equivalent to the False case). And since the Cython logic was never much of a bottleneck, changes there provide only modest performance improvements.

@@ -436,7 +427,7 @@ def _merger(x, y):
     if by is not None:
         result, groupby = _groupby_and_merge(by, on, left, right,
                                              lambda x, y: _merger(x, y),
-                                             check_duplicates=check_duplicates)
+                                             check_duplicates=False)

@chrisaycock (Contributor, Author) commented on the diff above:

Note that check_duplicates is always False now, since it is no longer needed.

@jreback (Contributor) commented Jul 29, 2016

try it with duplicates

@chrisaycock (Contributor, Author):

Are you referring to duplicate timestamps and tickers? That's already in my sample data:

In [16]: quotes
Out[16]:
                           time ticker     bid     ask
0       2016-07-15 13:30:00.000   GOOG  724.24  725.73
1       2016-07-15 13:30:00.002   MSFT   53.83   53.95
2       2016-07-15 13:30:00.002   MSFT   53.83   53.95
3       2016-07-15 13:30:00.004   GOOG  722.75  725.73
4       2016-07-15 13:30:00.036   GOOG  725.48  725.73
5       2016-07-15 13:30:00.036   AAPL   98.92   99.05
6       2016-07-15 13:30:00.036   GOOG  722.75  725.73
7       2016-07-15 13:30:00.036   AAPL   98.92   99.05
8       2016-07-15 13:30:00.037   MSFT   53.84   53.95
9       2016-07-15 13:30:00.037   MSFT   53.84   53.95
...                         ...    ...     ...     ...

And of course the regression tests with duplicate data pass.
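
For illustration, a self-contained version of that duplicate-timestamp check (hypothetical test data modeled on the sample above; not the PR's actual regression test):

import pandas as pd

quotes_dup = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-15 13:30:00.002',
                            '2016-07-15 13:30:00.002',
                            '2016-07-15 13:30:00.037']),
    'ticker': ['MSFT', 'MSFT', 'MSFT'],
    'bid': [53.83, 53.83, 53.84],
    'ask': [53.95, 53.95, 53.95]})
trades_dup = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-15 13:30:00.023']),
    'ticker': ['MSFT'],
    'price': [53.90]})

# The trade should match the last quote at or before 13:30:00.023,
# i.e. one of the duplicated 13:30:00.002 rows, not the later row.
result = pd.merge_asof(trades_dup, quotes_dup, on='time', by='ticker')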

@jreback (Contributor) commented Jul 29, 2016

can u add an asv (prob need to generate data given a seed)

@chrisaycock (Contributor, Author):

Will do.

@chrisaycock (Contributor, Author) commented Jul 29, 2016

I added a couple of simple benchmarks. The new version slightly outperforms, as I expected.

· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash aab3e3cc:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..............................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                              539.69ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           77.95ms
[ 50.00%] · For pandas commit hash 2166ac13:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                              708.91ms
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           97.48ms
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Diff excerpt under review (asv benchmark setup):

np.random.seed(0)
one_count = 200000
two_count = 1000000
self.df1 = pd.DataFrame({'time': np.random.random_integers(0, one_count/20, one_count),

Review comment (Contributor) on the diff above:

use np.random.randint(....) as random_integers is deprecated
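
A hedged sketch of the benchmark with that change applied (the setup/time_* structure follows asv conventions and the class name mirrors the output above, but the body is illustrative rather than the PR's final code):

import numpy as np
import pandas as pd

class merge_asof_noby(object):

    def setup(self):
        np.random.seed(0)
        one_count = 200000
        two_count = 1000000
        # np.random.randint() samples the half-open range [low, high),
        # replacing the deprecated, inclusive random_integers()
        self.df1 = pd.DataFrame(
            {'time': np.sort(np.random.randint(0, one_count // 20, one_count)),
             'value1': np.random.randn(one_count)})
        self.df2 = pd.DataFrame(
            {'time': np.sort(np.random.randint(0, two_count // 20, two_count)),
             'value2': np.random.randn(two_count)})

    def time_merge_asof_noby(self):
        pd.merge_asof(self.df1, self.df2, on='time')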

@jreback added this to the 0.19.0 milestone on Aug 1, 2016
@jreback (Contributor) commented Aug 1, 2016

@chrisaycock lgtm. just a very small change in asv. ping when green.

Commit message:

Also removes unnecessary check_duplicates.

Added asv benchmarks for merge_asof()

@chrisaycock (Contributor, Author):

Updated as per your notes. Thanks.

@chrisaycock (Contributor, Author):

@jreback It's green!

@jreback merged commit 9b2797d into pandas-dev:master on Aug 1, 2016
@jreback (Contributor) commented Aug 1, 2016

thank you sir!

Labels: Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Development

Successfully merging this pull request may close these issues:

pd.merge_asof() matches out of tolerance when timestamps are duplicated (#13709)

3 participants