ENH: merge_asof() has type specializations and can take multiple 'by' parameters (#13936) #14783

Closed

chrisaycock wants to merge 12 commits into master from chrisaycock:GH13936

Conversation

@chrisaycock

Hmm, AppVeyor failed with

Build execution time has reached the maximum allowed time for your plan (60 minutes).

I notice the same thing happened on #14771 and #14757.

gfyoung commented Dec 1, 2016

@chrisaycock : Rebase onto master to re-trigger AppVeyor. Don't know exactly why AppVeyor does this.

@chrisaycock

@gfyoung Thanks, re-running now.

@jreback added the Dtype Conversions and Reshaping labels Dec 4, 2016
Other enhancements
^^^^^^^^^^^^^^^^^^

- ``pd.merge_asof()`` can take multiple columns in ``by`` parameter and has specialized types (:issue:`13936`)
Contributor

specialized dtypes, and elaborate on what this does (e.g. perf)

@chrisaycock Dec 4, 2016

Yes, it's to improve performance rather than casting all integer types to int64 (see my benchmarks pasted below). I can add a description to the whatsnew line.

by_dtypes = [('PyObjectHashTable', 'object'), ('Int64HashTable', 'int64_t')]
# by_dtype, table_type, init_table, s1, s2, s3, g1, g2
by_dtypes = [('int64_t', 'Int64HashTable', 'Int64HashTable(right_size)',
'.set_item(', ', ', ')',
Contributor

why are you changing this for the object table? it makes the code much more complicated. We don't use Python objects anywhere in Cython (instead we use the PyObjectHashTable).

Contributor Author

The PyObjectHashTable turns out to be slower than dict; see the by_object test:

[  0.00%] · For pandas commit hash c33c4cbe:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 10.00%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int                                        23.11ms
[ 20.00%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object                                  25.29ms
[ 30.00%] ··· Running join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby                                13.11ms
[ 40.00%] ··· Running join_merge.merge_asof_multiby.time_merge_asof_multiby                                     470.48ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                             9.54ms
[ 50.00%] · For pandas commit hash 725453de:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 60.00%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int                                        23.18ms
[ 70.00%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object                                  39.56ms
[ 80.00%] ··· Running join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby                                15.71ms
[ 90.00%] ··· Running join_merge.merge_asof_multiby.time_merge_asof_multiby                                       failed
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                             9.38ms
    before     after       ratio
  [725453de] [c33c4cbe]
-   15.71ms    13.11ms      0.83  join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby
-   39.56ms    25.29ms      0.64  join_merge.merge_asof_by_object.time_merge_asof_by_object

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I stumbled on this when I tried the dict out of desperation, since PyObjectHashTable wasn't able to handle tuples the way I wanted.

I did end up adding the stringify() function, which solved a lot of problems, though it has its own significant overhead, as evidenced by the spike in runtime for the multiby test. I could switch back to PyObjectHashTable, since stringify() now simplifies a lot of the code. I just wish there were a fast way to hash NumPy arrays.

jreback commented Dec 10, 2016

pls update, as @jorisvandenbossche changed the asv benchmarks a bit

codecov-io commented Dec 12, 2016

Current coverage is 85.31% (diff: 100%)

Merging #14783 into master will increase coverage by <.01%

@@             master     #14783   diff @@
==========================================
  Files           144        144          
  Lines         51004      51023    +19   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43511      43532    +21   
+ Misses         7493       7491     -2   
  Partials          0          0          

Last update a8cabb8...ffcf0c2

@chrisaycock

@jreback I've updated the benchmarks:

· Running 10 total benchmarks (2 commits * 1 environments * 5 benchmarks)
[  0.00%] · For pandas commit hash fafbb022:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 10.00%] ··· Running join_merge.MergeAsof.time_by_int                                                           21.26ms
[ 20.00%] ··· Running join_merge.MergeAsof.time_by_object                                                        25.12ms
[ 30.00%] ··· Running join_merge.MergeAsof.time_multiby                                                         471.15ms
[ 40.00%] ··· Running join_merge.MergeAsof.time_noby                                                              9.62ms
[ 50.00%] ··· Running join_merge.MergeAsof.time_on_int32                                                         12.68ms
[ 50.00%] · For pandas commit hash 14e48153:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 60.00%] ··· Running join_merge.MergeAsof.time_by_int                                                           21.49ms
[ 70.00%] ··· Running join_merge.MergeAsof.time_by_object                                                        38.56ms
[ 80.00%] ··· Running join_merge.MergeAsof.time_multiby                                                           failed
[ 90.00%] ··· Running join_merge.MergeAsof.time_noby                                                              9.37ms
[100.00%] ··· Running join_merge.MergeAsof.time_on_int32                                                         15.42ms   
    before     after       ratio
  [14e48153] [fafbb022]
-   15.42ms    12.68ms      0.82  join_merge.MergeAsof.time_on_int32
-   38.56ms    25.12ms      0.65  join_merge.MergeAsof.time_by_object

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Please advise on the dict vs PyObjectHashTable. Should I revert that change? The dict definitely makes the code more complicated, but the benchmark improves significantly.

jreback commented Dec 12, 2016

@chrisaycock this adds significant complexity to the code, and I'm not in favor of this approach. That said, you might be able to use / create something like #14859, which does something very similar. IOW, you make a new hashtable type that has specialized properties (IIRC you are using tuples here?)

@chrisaycock

@jreback I'll just remove that change for now. The point of this PR is to get the "multi-by", so I'll focus on that. Thanks!

jreback commented Dec 12, 2016

@chrisaycock absolutely.

And FYI, you should generally just categorize your object dtypes anyhow. (We could even do that under the hood for merging purposes.)

pls make a new issue w.r.t. these points if you would like
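
A minimal sketch of the categorization idea, with a made-up frame (column names are illustrative, not from this PR):

import pandas as pd

# quotes table with a repeated string key stored as object dtype
quotes = pd.DataFrame({'ticker': ['MSFT', 'AAPL', 'MSFT', 'AAPL'],
                       'bid': [51.95, 98.00, 51.97, 98.01]})

# categorizing replaces per-row Python string comparisons with
# integer-code comparisons, which is what can make merges on the key cheaper
quotes['ticker'] = quotes['ticker'].astype('category')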

result = np.empty(n, dtype=np.object)

for i in range(n):
    result[i] = xt[i].tostring()
Contributor

why do you need this again? (I see you are using it), but what is the input that you are giving it?

Contributor

maybe a doc-string would help

Contributor Author

I've added a couple of comments to address this. When the by parameter has multiple entries, we want to store the entire array in the hash table. Unfortunately, NumPy arrays aren't hashable. After lots of digging, the fastest thing to do is to convert the values to a string.
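
A minimal sketch of that idea outside Cython (the PR's stringify() lives in the generated join code; names here are illustrative): for a fixed-width dtype, the array's raw buffer is a stable, hashable stand-in for the row.

import numpy as np

last_seen = {}  # stand-in for the hash table keyed by the 'by' values

row = np.array([17, 42], dtype=np.int64)  # one row of a multi-column 'by'
key = row.tostring()       # raw buffer bytes (.tobytes() in newer NumPy)
last_seen[key] = 7         # the array itself would raise TypeError here

# equal fixed-width rows always serialize to equal bytes, so lookups work
assert last_seen[np.array([17, 42], dtype=np.int64).tostring()] == 7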

@@ -452,6 +563,69 @@ def test_on_float(self):

        assert_frame_equal(result, expected)

    def test_on_specialized_type(self):
        # GH13936
        for dtype in [np.uint16, np.uint32, np.uint64,
Contributor

are int8/uint8 not allowed (OK by me)? Just test that you raise on them (or do they work some other way). Same for float16 (it should just raise).

@chrisaycock

AppVeyor is producing a very odd error. It keeps failing for test_multiby_heterogeneous_types, but computes a different resulting value each time!

The first run computes:

[nan, 51.95, nan, nan, 720.5]

The second run computes:

[nan, nan, nan, nan, 720.5]

The third run computes:

[nan, 51.95, nan, nan, nan]

There are no code differences between the second and third run! (I changed a line in the whatsnew to force a re-run in AppVeyor).

I have been unable to reproduce the error on my machine in either Py2 or Py3.

Given that the values change each run, I wonder if there is some element of randomness. The test_multiby_heterogeneous_types test invokes stringify() on a two-dimensional NumPy array in which each row contains a number and a string. But since I can't reproduce the error on my system (the results are always the same), I'm kind of at a loss for how to track this down.

jreback commented Dec 12, 2016

@chrisaycock I can run your PR on Windows tomorrow. But in any event, you should simply use hash_array instead of stringifying things (IIUC what you are doing)

@chrisaycock

@jreback I'm trying to use the array as a key, like

In [19]: xs = np.arange(3)

In [20]: h = PyObjectHashTable()

In [21]: h.set_item(xs, 1)
...
TypeError: unhashable type: 'numpy.ndarray'

The NumPy array is not hashable, so neither PyObjectHashTable nor dict can take it.

As for why I'm doing it this way, my asof_join logic stores the last-seen index for the by parameter as the function linearly scans the array of timestamps. That's why the algo works in a single pass, which makes it extremely fast.

To handle an array in the by parameter, I need to hash multiple values at once. Since a NumPy array isn't hashable, I went with turning it into a string. (.tostring() returns the raw bytes, which is sufficient for what I need.)

This has worked reasonably well in testing, at least on my Mac.
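
Not the PR's Cython code, but a pure-Python sketch of the single-pass algorithm described above (a backward asof-join keyed by a single 'by' column; names are illustrative). Both sides must be sorted by the 'on' values:

import numpy as np

def asof_join_backward_by(left_on, right_on, left_by, right_by):
    # for each left row: index of the last right row whose 'on' value
    # is <= the left 'on' value and whose 'by' key matches; -1 if none
    result = np.full(len(left_on), -1, dtype=np.int64)
    last_seen = {}  # by key -> index of most recent eligible right row
    r = 0
    for i in range(len(left_on)):
        # advance the right cursor over every row still in the past
        while r < len(right_on) and right_on[r] <= left_on[i]:
            last_seen[right_by[r]] = r
            r += 1
        result[i] = last_seen.get(left_by[i], -1)
    return result

asof_join_backward_by([2, 5], [1, 3, 4], ['a', 'a'], ['a', 'b', 'a'])
# -> array([0, 2]); each right row is visited exactly once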

jreback commented Dec 13, 2016

In [9]: from pandas.tools.hashing import hash_array

In [10]: arr = np.arange(3)

In [11]: arr
Out[11]: array([0, 1, 2])

In [13]: np.bitwise_xor.reduce(hash_array(arr))
Out[13]: 10178186328804430191

I had a PR to make this a convenience function, but it's pretty trivial.

jreback commented Dec 13, 2016

@chrisaycock so this passed for me on Windows (just noticed it failed on 3.4 on Travis, though).

so I suspect something is not stable in the multi-grouping...

@chrisaycock

@jreback Thanks for checking. I'll dig some more into this.

@chrisaycock

Just a note for posterity: After a lot of trial and error, I was able to get .tostring() to return different values for the same input:

In [291]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[291]: b'PH!\x00\x01\x00\x00\x00\x10qL\x0c\x01\x00\x00\x00'

In [292]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[292]: b'PH!\x00\x01\x00\x00\x00\xe0%>\x0c\x01\x00\x00\x00'

In [293]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[293]: b'PH!\x00\x01\x00\x00\x00\xf0!>\x0c\x01\x00\x00\x00'

In [294]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[294]: b'PH!\x00\x01\x00\x00\x000Wq\x0c\x01\x00\x00\x00'

In [295]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[295]: b'PH!\x00\x01\x00\x00\x00@Zq\x0c\x01\x00\x00\x00'

So the culprit is definitely this approach: an object array stores pointers to its Python objects, and .tostring() returns those raw pointer bytes, which differ from one allocation to the next. I also tried Categorical, which worked but was 20 times slower. Since multi-by is already 20 times slower than what I want, I abandoned this path.

I will continue digging.

chrisaycock commented Dec 15, 2016

@jreback Using a tuple representation solved the problem. (TravisCI failed an unrelated test with "Network is unreachable".) Any further thoughts on this PR?
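
A minimal illustration of why the tuple representation is stable where .tostring() on an object array is not: tuples hash by element value, not by the pointer bytes serialized above.

import numpy as np

table = {}
row = np.array([0, 'QSDU'], dtype=object)

table[tuple(row)] = 7  # key is (0, 'QSDU'), identical across runs
assert table[tuple(np.array([0, 'QSDU'], dtype=object))] == 7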


_type_casters = {
    'int64_t': _ensure_int64,
    'double': _ensure_float64,
    'object': _ensure_object,
}

_cyton_types = {
Contributor

_cython_types ?

@@ -221,6 +221,117 @@ def test_missing_right_by(self):
        expected.loc[expected.ticker == 'MSFT', ['bid', 'ask']] = np.nan
        assert_frame_equal(result, expected)

    def test_multiby(self):
Contributor

can you add a test for raising on an invalid merge type (I think only float16)?

Contributor Author

Good idea. I've added it to test_on_specialized_type() and test_on_specialized_type_by_int().

@jreback added this to the 0.19.2 milestone Dec 15, 2016
jreback commented Dec 15, 2016

just a couple of minor changes. ping when green.

@chrisaycock

@jreback It's all green!

jreback commented Dec 16, 2016

thanks!

@jreback closed this in e7df751 Dec 16, 2016
@chrisaycock deleted the GH13936 branch December 17, 2016 03:50
jreback commented Dec 17, 2016

@chrisaycock very odd, some builds failed: https://travis-ci.org/pandas-dev/pandas/jobs/184684273

but these did clear up subsequently. Any ideas?

@chrisaycock

@jreback That's weird. The type specialization was the easy part; the "multi-by" logic was what caused all the grief on this PR.

I can't reproduce the error from the tip of the master branch in either 2.7 or 3.5 on my Mac. I will dig into this and see what I can find.

jreback commented Dec 17, 2016

I suppose it's a Cython caching problem.

@chrisaycock

@jreback Hmm, OK. Is there a way to prevent that in future tests? That NoneType error means that _asof_by_function() and _asof_function() can't find the Cython specialization for the desired data type. I should probably put a better error message in there instead of relying on the NoneType message anyway.
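
A hypothetical sketch of what that clearer failure could look like; _asof_by_functions here stands in for the PR's lookup table of generated Cython routines, and MergeError for pandas' existing merge exception:

class MergeError(ValueError):
    pass  # stand-in for pandas' merge exception

_asof_by_functions = {}  # ('on dtype', 'by dtype') -> Cython routine

def _asof_by_function(on_type, by_type):
    # name the unsupported dtype combination up front instead of
    # returning None and failing later with a cryptic NoneType error
    func = _asof_by_functions.get((on_type, by_type))
    if func is None:
        raise MergeError("merge_asof has no type specialization for "
                         "on=%r, by=%r" % (on_type, by_type))
    return func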

By the way, did test_on_specialized_type() change in between the time I submitted the PR and now?

I had used an else statement:

https://github.com/pandas-dev/pandas/pull/14783/files#diff-e00646757b932a2684fda4588f99009cR678

Now there's a continue statement:

https://github.com/chrisaycock/pandas/blob/master/pandas/tools/tests/test_merge_asof.py#L678

The effects are the same, so the nature of the test didn't change. I just think it's strange that I can't find anything in git blame.

jreback commented Dec 17, 2016

I made a tiny change in the tests when committing.

Yeah, I think this is a cache issue.

@chrisaycock

@jreback Alright, thanks. I'll add a better error message for a missing type specialization in a future commit.

jreback commented Dec 17, 2016

I don't think that's necessary.

It's compiled code, so it can't be missing.

ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)

closes pandas-dev#13936

Author: Christopher C. Aycock <[email protected]>

Closes pandas-dev#14783 from chrisaycock/GH13936 and squashes the following commits:

ffcf0c2 [Christopher C. Aycock] Added test to reject float16; fixed typos
1f208a8 [Christopher C. Aycock] Use tuple representation instead of strings
77eb47b [Christopher C. Aycock] Merge master branch into GH13936
89256f0 [Christopher C. Aycock] Test 8-bit integers and raise error on 16-bit floats; add comments
0ad1687 [Christopher C. Aycock] Fixed whatsnew
2bce3cc [Christopher C. Aycock] Revert dict back to PyObjectHashTable in response to code review
fafbb02 [Christopher C. Aycock] Updated benchmarks to reflect new ASV setup
5eeb7d9 [Christopher C. Aycock] Merge master into GH13936
c33c4cb [Christopher C. Aycock] Merge branch 'master' into GH13936
46cc309 [Christopher C. Aycock] Update documentation
f01142c [Christopher C. Aycock] Merge master branch
75157fc [Christopher C. Aycock] merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Dec 24, 2016
ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)

(cherry picked from commit e7df751; same squashed commit message as above)