
BUG: Fix edge cases in merge_asof() by comparing factorized keys (#13709) #13836

Merged: merged 1 commit into pandas-dev:master on Aug 1, 2016

Conversation

@chrisaycock (Contributor) commented Jul 29, 2016

closes #13709

Also removes unnecessary check_duplicates.
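
For reference, a minimal sketch of the kind of edge case #13709 describes (hypothetical data; the frames, values, and tolerance are illustrative, not taken from the original report):

import pandas as pd

# Right side has duplicated timestamps; before this fix, merge_asof()
# could match a row outside the requested tolerance in cases like this.
trades = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-15 13:30:00.030']),
    'price': [720.50]})
quotes = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-15 13:30:00.000',
                            '2016-07-15 13:30:00.000']),
    'bid': [720.00, 720.10]})

# With a 10ms tolerance, the 30ms-old quotes should NOT match,
# so the bid column of the result should be NaN.
result = pd.merge_asof(trades, quotes, on='time',
                       tolerance=pd.Timedelta('10ms'))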

@jreback (Contributor) commented Jul 29, 2016

does this have the same perf characteristics? (I was manually doing the perf comparisons)... prob should have some asv's for this...

@jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Jul 29, 2016

Diff excerpt under review:

lc = left_count[i]
rc = right_count[i]

if rc == 0:

Review comment (Contributor) on the diff above:

really simplified!

@codecov-io commented Jul 29, 2016

Current coverage is 85.23% (diff: 100%)

Merging #13836 into master will decrease coverage by 0.01%

@@             master     #13836   diff @@
==========================================
  Files           140        140          
  Lines         50455      50449     -6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43014      43001    -13   
- Misses         7441       7448     +7   
  Partials          0          0          

Powered by Codecov. Last update 59f2557...c2e049c

@chrisaycock (Contributor, Author):

@jreback I had thought about perf issues, but since the logic is a lot smaller (avoids the call to groupsort_indexer(), for example), it has to be faster.

@jreback (Contributor) commented Jul 29, 2016

actually I suspect it's much slower

the bottleneck was never the Cython code, but rather the grouping / indexing

that is why the check_duplicates flag helps

@chrisaycock (Contributor, Author):

Can you show me the code you used for the perf tests? I'll run them manually and report back.

@jreback (Contributor) commented Jul 29, 2016

notebook is here

these used the original trades.csv, quotes.csv that you had sent me:

-rw-rw-r--@ 1 jreback  staff  17202707 Jun  2 18:19 trades.csv
-rw-rw-r--@ 1 jreback  staff  47853000 Jun  2 18:19 quotes.csv

they are not in the repo; we should prob generate some for test purposes
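
A hedged sketch of how comparable data could be generated from a seed (sizes, tickers, and the random-walk price model are all illustrative):

import numpy as np
import pandas as pd

np.random.seed(1234)
n = 100000
start = pd.Timestamp('2016-07-15 13:30')
tickers = np.array(['AAPL', 'GOOG', 'MSFT'])

# merge_asof() requires the 'on' key to be sorted, so sort the offsets
quotes = pd.DataFrame({
    'time': start + pd.to_timedelta(
        np.sort(np.random.randint(0, 60 * 1000, n)), unit='ms'),
    'ticker': tickers[np.random.randint(0, len(tickers), n)],
    'bid': 100 + np.random.randn(n).cumsum() / 100.0})
quotes['ask'] = quotes['bid'] + 0.01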

@chrisaycock (Contributor, Author):

I've refreshed my data sample since that notebook, but my DataFrames are a similar size.

In [3]: trades.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319018 entries, 0 to 319017
Data columns (total 4 columns):
time        319018 non-null datetime64[ns]
ticker      319018 non-null category
price       316669 non-null float64
quantity    319018 non-null int64
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 7.6 MB

In [4]: quotes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1032780 entries, 0 to 1032779
Data columns (total 4 columns):
time      1032780 non-null datetime64[ns]
ticker    1032780 non-null category
bid       1030431 non-null float64
ask       1030431 non-null float64
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 24.6 MB

I ran the numbers with a fresh build both before and after my code change.

Before:

In [5]: %timeit -n 1 -r 1 pd.merge_asof(trades, quotes, on='time', by='ticker')
1 loop, best of 1: 970 ms per loop

After:

In [5]: %timeit -n 1 -r 1 pd.merge_asof(trades, quotes, on='time', by='ticker')
1 loop, best of 1: 835 ms per loop

So my code change is moderately faster, which is what I expected. The Python logic didn't really change, since check_duplicates is now gone (equivalent to the False case). And since the Cython logic was never much of a bottleneck, changes there provide only modest performance improvements.

@@ -436,7 +427,7 @@ def _merger(x, y):
     if by is not None:
         result, groupby = _groupby_and_merge(by, on, left, right,
                                              lambda x, y: _merger(x, y),
-                                             check_duplicates=check_duplicates)
+                                             check_duplicates=False)

@chrisaycock (Contributor, Author) commented on the diff above:

Note that check_duplicates is always False now, since it is no longer needed.

@jreback (Contributor) commented Jul 29, 2016

try it with duplicates

@chrisaycock (Contributor, Author):

Are you referring to duplicate timestamps and tickers? That's already in my sample data:

In [16]: quotes
Out[16]:
                           time ticker     bid     ask
0       2016-07-15 13:30:00.000   GOOG  724.24  725.73
1       2016-07-15 13:30:00.002   MSFT   53.83   53.95
2       2016-07-15 13:30:00.002   MSFT   53.83   53.95
3       2016-07-15 13:30:00.004   GOOG  722.75  725.73
4       2016-07-15 13:30:00.036   GOOG  725.48  725.73
5       2016-07-15 13:30:00.036   AAPL   98.92   99.05
6       2016-07-15 13:30:00.036   GOOG  722.75  725.73
7       2016-07-15 13:30:00.036   AAPL   98.92   99.05
8       2016-07-15 13:30:00.037   MSFT   53.84   53.95
9       2016-07-15 13:30:00.037   MSFT   53.84   53.95
...                         ...    ...     ...     ...

And of course the regression tests with duplicate data pass.
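
For illustration, a self-contained version of that duplicate-timestamp check (hypothetical test data modeled on the sample above; not the PR's actual regression test):

import pandas as pd

quotes_dup = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-15 13:30:00.002',
                            '2016-07-15 13:30:00.002',
                            '2016-07-15 13:30:00.037']),
    'ticker': ['MSFT', 'MSFT', 'MSFT'],
    'bid': [53.83, 53.83, 53.84],
    'ask': [53.95, 53.95, 53.95]})
trades_dup = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-15 13:30:00.023']),
    'ticker': ['MSFT'],
    'price': [53.90]})

# The trade should match the last quote at or before 13:30:00.023,
# i.e. one of the duplicated 13:30:00.002 rows, not the later row.
result = pd.merge_asof(trades_dup, quotes_dup, on='time', by='ticker')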

@jreback (Contributor) commented Jul 29, 2016

can u add an asv (prob need to generate data given a seed)

@chrisaycock (Contributor, Author):

Will do.

@chrisaycock (Contributor, Author) commented Jul 29, 2016

I added a couple of simple benchmarks. The new version slightly outperforms, as I expected.

· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash aab3e3cc:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..............................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                              539.69ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           77.95ms
[ 50.00%] · For pandas commit hash 2166ac13:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                              708.91ms
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           97.48ms
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Diff excerpt under review (asv benchmark setup):

np.random.seed(0)
one_count = 200000
two_count = 1000000
self.df1 = pd.DataFrame({'time': np.random.random_integers(0, one_count/20, one_count),

Review comment (Contributor) on the diff above:

use np.random.randint(....) as random_integers is deprecated
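
A hedged sketch of the benchmark with that change applied (the setup/time_* structure follows asv conventions and the class name mirrors the output above, but the body is illustrative rather than the PR's final code):

import numpy as np
import pandas as pd

class merge_asof_noby(object):

    def setup(self):
        np.random.seed(0)
        one_count = 200000
        two_count = 1000000
        # np.random.randint() samples the half-open range [low, high),
        # replacing the deprecated, inclusive random_integers()
        self.df1 = pd.DataFrame(
            {'time': np.sort(np.random.randint(0, one_count // 20, one_count)),
             'value1': np.random.randn(one_count)})
        self.df2 = pd.DataFrame(
            {'time': np.sort(np.random.randint(0, two_count // 20, two_count)),
             'value2': np.random.randn(two_count)})

    def time_merge_asof_noby(self):
        pd.merge_asof(self.df1, self.df2, on='time')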

@jreback added this to the 0.19.0 milestone on Aug 1, 2016
@jreback (Contributor) commented Aug 1, 2016

@chrisaycock lgtm. just a very small change in asv. ping when green.

Commit message:

Also removes unnecessary check_duplicates.

Added asv benchmarks for merge_asof()

@chrisaycock (Contributor, Author):

Updated as per your notes. Thanks.

@chrisaycock (Contributor, Author):

@jreback It's green!

@jreback merged commit 9b2797d into pandas-dev:master on Aug 1, 2016
@jreback (Contributor) commented Aug 1, 2016

thank you sir!

Labels: Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Development

Successfully merging this pull request may close these issues:

pd.merge_asof() matches out of tolerance when timestamps are duplicated (#13709)

3 participants