ENH: merge_asof() has type specializations and can take multiple 'by' parameters (#13936) #14783
Conversation
@chrisaycock : Rebase onto

@gfyoung Thanks, re-running now.
Other enhancements
^^^^^^^^^^^^^^^^^^

- ``pd.merge_asof()`` can take multiple columns in ``by`` parameter and has specialized types (:issue:`13936`)
specialized dtypes, and elaborate on what this does (e.g. perf)
Yes, it's to improve performance rather than cast all integer types to int64 (see my benchmarks pasted below). I can add a description to the whatsnew line.
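To illustrate the point about not upcasting: a minimal sketch (with made-up data) of `merge_asof()` on narrower integer keys, which the specialized paths can process directly instead of first casting everything to int64:

```python
import numpy as np
import pandas as pd

# Hypothetical data: int32 'on' keys.
left = pd.DataFrame({
    "time": np.array([1, 5, 10], dtype=np.int32),
    "left_val": ["a", "b", "c"],
})
right = pd.DataFrame({
    "time": np.array([1, 2, 3, 6, 7], dtype=np.int32),
    "right_val": [1, 2, 3, 6, 7],
})

# For each left row, take the last right row whose 'time' is <= left 'time'.
merged = pd.merge_asof(left, right, on="time")
print(merged["right_val"].tolist())  # [1, 3, 7]
```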
by_dtypes = [('PyObjectHashTable', 'object'), ('Int64HashTable', 'int64_t')]

# by_dtype, table_type, init_table, s1, s2, s3, g1, g2
by_dtypes = [('int64_t', 'Int64HashTable', 'Int64HashTable(right_size)',
              '.set_item(', ', ', ')',
why are you changing this for object table? it makes the code much more complicated. We don't use python objects anywhere in cython (instead we use the PyObjectHashTable).
The `PyObjectHashTable` turns out to be slower than `dict`; see the `by_object` test:
[ 0.00%] · For pandas commit hash c33c4cbe:
[ 0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...................................
[ 0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 10.00%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int 23.11ms
[ 20.00%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object 25.29ms
[ 30.00%] ··· Running join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby 13.11ms
[ 40.00%] ··· Running join_merge.merge_asof_multiby.time_merge_asof_multiby 470.48ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby 9.54ms
[ 50.00%] · For pandas commit hash 725453de:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 60.00%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int 23.18ms
[ 70.00%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object 39.56ms
[ 80.00%] ··· Running join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby 15.71ms
[ 90.00%] ··· Running join_merge.merge_asof_multiby.time_merge_asof_multiby failed
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby 9.38ms
before after ratio
[725453de] [c33c4cbe]
- 15.71ms 13.11ms 0.83 join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby
- 39.56ms 25.29ms 0.64 join_merge.merge_asof_by_object.time_merge_asof_by_object
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
I stumbled on this when I tried the `dict` out of desperation, since `PyObjectHashTable` wasn't able to handle tuples the way I wanted. I did end up adding the `stringify()` function, which solved a lot of problems, though it has its own significant overhead, as evidenced by the spike in runtime for the multiby test. I could switch back to `PyObjectHashTable` since `stringify()` now simplifies a lot of the code. I just wish there were a fast way to hash NumPy arrays.
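For reference, a small illustration (made-up data) of why tuples end up being the practical key type here: ndarrays are unhashable, while tuples hash by value:

```python
import numpy as np

row = np.array([1, 2], dtype=np.int64)

# ndarrays are unhashable: using one as a dict key raises TypeError.
try:
    {row: "value"}
except TypeError as exc:
    print("unhashable:", exc)

# Converting each row to a tuple yields a stable, hashable key.
seen = {}
data = np.array([[1, 2], [3, 4], [1, 2]])
for i, r in enumerate(data):
    seen[tuple(r)] = i  # later rows overwrite earlier ones (last-seen index)

print(seen)  # {(1, 2): 2, (3, 4): 1}
```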
pls update as @jorisvandenbossche changed the asv's a bit
Current coverage is 85.31% (diff: 100%)

@@            master   #14783    diff @@
==========================================
  Files          144      144
  Lines        51004    51023     +19
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
+ Hits         43511    43532     +21
+ Misses        7493     7491      -2
  Partials         0        0
@jreback I've updated the benchmarks:

Please advise on the
@chrisaycock this adds significant complexity to the code and I'm not in favor of this approach. That said, you might be able to use / create something like #14859, which does something very similar. IOW, you make a new hashtable type that has specialized properties (IIRC you are using tuples here?)
@jreback I'll just remove that change for now. The point of this PR is to get the "multi-by", so I'll focus on that. Thanks!

@chrisaycock absolutely. and FYI, you should generally just categorize your object dtypes anyhow. (we could even do that under the hood for merging purposes). pls make a new issue w.r.t. these points if you would like
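A minimal sketch of that suggestion (column names made up): casting an object by-column to category up front, so comparisons and grouping run on small integer codes rather than Python strings:

```python
import pandas as pd

df = pd.DataFrame({"ticker": ["MSFT", "AAPL", "MSFT", "GOOG"],
                   "bid": [50.0, 100.0, 51.0, 720.0]})

# 'category' replaces repeated strings with integer codes, which are
# cheaper to store and faster to group/compare on than objects.
df["ticker"] = df["ticker"].astype("category")

print(df["ticker"].cat.codes.tolist())  # [2, 0, 2, 1] (AAPL=0, GOOG=1, MSFT=2)
```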
result = np.empty(n, dtype=np.object)

for i in range(n):
    result[i] = xt[i].tostring()
why do you need this again? (I see you are using it), but what is the input that you are giving it?
maybe a doc-string would help
I've added a couple of comments to address this. When the `by` parameter has multiple entries, then we want to store the entire array in the hash table. Unfortunately, NumPy arrays aren't hashable. After lots of digging, the fastest thing to do is to convert to a string.
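A rough Python stand-in for the `stringify()` idea described above (the helper name is from the PR; the data is made up): serialize each row's raw bytes so the row can act as a hash-table key. This is only sound for fixed-width numeric dtypes, where equal values imply equal bytes:

```python
import numpy as np

def stringify(row):
    # Raw buffer bytes of the row; safe as a key for numeric dtypes only.
    return row.tobytes()

by_values = np.array([[1, 10], [2, 20], [1, 10]], dtype=np.int64)

table = {}
for i, row in enumerate(by_values):
    table[stringify(row)] = i  # keep the last-seen index per key

# The two identical rows collapse to one key.
print(len(table))  # 2
```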
oh if all u need is hashing we just added this:
https://github.com/pandas-dev/pandas/blob/master/pandas/tools/hashing.py
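For reference, that helper is exposed as `pd.util.hash_pandas_object` in later pandas versions; it produces one 64-bit hash per row, so identical rows hash identically:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": ["x", "y", "x"]})

# One uint64 hash per row (index excluded); rows 0 and 2 are equal,
# so they receive the same hash value.
hashes = pd.util.hash_pandas_object(df, index=False)
print(hashes.iloc[0] == hashes.iloc[2])  # True
```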
@@ -452,6 +563,69 @@ def test_on_float(self):

        assert_frame_equal(result, expected)

    def test_on_specialized_type(self):
        # GH13936
        for dtype in [np.uint16, np.uint32, np.uint64,
are int8/uint8 not allowed (ok by me)? just test that you raise on them (or do they work some other way)? same about float16 (should just raise)
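A sketch of such a raising test (made-up frames; the catch is kept broad here rather than asserting the exact exception class pandas uses for unsupported asof key dtypes):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"on": np.array([1, 2], dtype=np.float16), "lv": [1, 2]})
right = pd.DataFrame({"on": np.array([1, 2], dtype=np.float16), "rv": [3, 4]})

# float16 has no specialized asof-join path, so the merge should fail
# rather than silently produce wrong results.
try:
    pd.merge_asof(left, right, on="on")
    raised = False
except Exception:
    raised = True

print(raised)
```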
AppVeyor is producing a very odd error. It keeps failing for

The first run computes:

The second run computes:

The third run computes:

There are no code differences between the second and third run! (I changed a line in the whatsnew to force a re-run in AppVeyor.) I have been unable to reproduce the error on my machine in either Py2 or Py3. Given that the values change each run, I wonder if there is some element of randomness. The
@chrisaycock I can run your PR on windows tomorrow. but in any event, you should simply use the
@jreback I'm trying to use the array as a key, like

The NumPy array is not hashable, so both

As for why I'm doing it this way, my asof_join logic stores the last-seen index for the

To handle an array in the

This has worked reasonably well in testing, at least on my Mac.
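The scan described above can be sketched in pure Python (function and variable names here are illustrative, not the PR's Cython code). Both tables must be sorted by the 'on' key; we walk the right table once, recording the most recent right row index per 'by' group, and each left row takes the last-seen index for its group:

```python
def asof_join_by(left_on, right_on, left_by, right_by):
    last_seen = {}  # by-key -> latest right row index seen so far
    result = []
    r = 0
    for i in range(len(left_on)):
        # advance the right pointer while right_on[r] <= left_on[i]
        while r < len(right_on) and right_on[r] <= left_on[i]:
            last_seen[right_by[r]] = r
            r += 1
        result.append(last_seen.get(left_by[i], -1))  # -1 = no match
    return result

left_on = [1, 5, 10]
left_by = ["A", "B", "A"]
right_on = [1, 2, 3, 6, 7]
right_by = ["A", "A", "B", "A", "B"]

print(asof_join_by(left_on, right_on, left_by, right_by))  # [0, 2, 3]
```

With a multi-column `by`, the only change is that `last_seen` is keyed on a composite value per row, which is exactly where the hashability question arises.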
I had a PR to make this a convenience function, but it's pretty trivial
@chrisaycock so this passed for me on windows (just noticed it failed on 3.4 on travis though). so I suspect something is not stable in the multi-grouping...
@jreback Thanks for checking. I'll dig some more into this.
Just a note for posterity: After a lot of trial and error, I was able to get

In [291]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[291]: b'PH!\x00\x01\x00\x00\x00\x10qL\x0c\x01\x00\x00\x00'
In [292]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[292]: b'PH!\x00\x01\x00\x00\x00\xe0%>\x0c\x01\x00\x00\x00'
In [293]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[293]: b'PH!\x00\x01\x00\x00\x00\xf0!>\x0c\x01\x00\x00\x00'
In [294]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[294]: b'PH!\x00\x01\x00\x00\x000Wq\x0c\x01\x00\x00\x00'
In [295]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[295]: b'PH!\x00\x01\x00\x00\x00@Zq\x0c\x01\x00\x00\x00'

So the culprit is definitely this approach. I also tried the

I will continue digging.
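The run-to-run drift above has a simple explanation: an object array's buffer holds PyObject pointers, so `tostring()` (now `tobytes()`) serializes memory addresses, while tuples compare and hash by value. A small demonstration (the helper is made up; the strings are built at runtime so the two arrays hold distinct but equal objects, as real data would):

```python
import numpy as np

def make_row():
    # "".join(...) builds a fresh string object each call
    return np.array([0, "".join(["Q", "S", "D", "U"])], dtype=object)

a, b = make_row(), make_row()

# Equal arrays, but the buffers hold different string addresses:
print(a.tobytes() == b.tobytes())  # False

# Tuples compare and hash by value, so they make stable dict keys:
print(tuple(a) == tuple(b))            # True
print({tuple(a): "hit"}[tuple(b)])     # hit
```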
@jreback Using a tuple representation solved the problem. (TravisCI failed an unrelated test with "Network is unreachable".) Any further thoughts on this PR?
_type_casters = {
    'int64_t': _ensure_int64,
    'double': _ensure_float64,
    'object': _ensure_object,
}

_cyton_types = {
_cython_types ?
@@ -221,6 +221,117 @@ def test_missing_right_by(self):

        expected.loc[expected.ticker == 'MSFT', ['bid', 'ask']] = np.nan
        assert_frame_equal(result, expected)

    def test_multiby(self):
can you add a test for the raising on an invalid merge type (I think only float16?)
Good idea. I've added it to `test_on_specialized_type()` and `test_on_specialized_type_by_int()`.
just a couple of minor changes. ping when green.

@jreback It's all green!

thanks!
@chrisaycock very odd, some builds failed: https://travis-ci.org/pandas-dev/pandas/jobs/184684273 but these did clear up subsequently. any ideas?
@jreback That's weird. The type specialization was the easy part; the "multi-by" logic was what caused all the grief on this PR. I can't reproduce the error from the tip of the master branch in either 2.7 or 3.5 on my Mac. I will dig into this and see what I can find.
i suppose it's a cython caching problem
@jreback Hmm, ok. Is there a way to prevent that in future tests? That

By the way, did

I had used an

Now there's a

The effects are the same, so the nature of the test didn't change. I just think it's strange that I can't find anything in
i made a tiny change in the tests when committing. yeah i think this is a cache issue -

@jreback Alright, thanks. I'll add a better error statement for missing type specialization in a future commit.

i don't think that's necessary. it's compiled code so can't be missing
… parameters (pandas-dev#13936)

closes pandas-dev#13936

Author: Christopher C. Aycock <[email protected]>

Closes pandas-dev#14783 from chrisaycock/GH13936 and squashes the following commits:

ffcf0c2 [Christopher C. Aycock] Added test to reject float16; fixed typos
1f208a8 [Christopher C. Aycock] Use tuple representation instead of strings
77eb47b [Christopher C. Aycock] Merge master branch into GH13936
89256f0 [Christopher C. Aycock] Test 8-bit integers and raise error on 16-bit floats; add comments
0ad1687 [Christopher C. Aycock] Fixed whatsnew
2bce3cc [Christopher C. Aycock] Revert dict back to PyObjectHashTable in response to code review
fafbb02 [Christopher C. Aycock] Updated benchmarks to reflect new ASV setup
5eeb7d9 [Christopher C. Aycock] Merge master into GH13936
c33c4cb [Christopher C. Aycock] Merge branch 'master' into GH13936
46cc309 [Christopher C. Aycock] Update documentation
f01142c [Christopher C. Aycock] Merge master branch
75157fc [Christopher C. Aycock] merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
… parameters (pandas-dev#13936) (cherry picked from commit e7df751)
git diff upstream/master | flake8 --diff