ENH: merge_asof() has type specializations and can take multiple 'by' parameters (#13936) #14783

Closed

chrisaycock wants to merge 12 commits into master from chrisaycock:GH13936

Conversation

@chrisaycock

Hmm, AppVeyor failed with

Build execution time has reached the maximum allowed time for your plan (60 minutes).

I notice the same thing happened on #14771 and #14757.

gfyoung commented Dec 1, 2016

@chrisaycock : Rebase onto master to re-trigger AppVeyor. Don't know exactly why AppVeyor does this.

@chrisaycock

@gfyoung Thanks, re-running now.

@jreback added the Dtype Conversions and Reshaping labels Dec 4, 2016
Other enhancements
^^^^^^^^^^^^^^^^^^

- ``pd.merge_asof()`` can take multiple columns in ``by`` parameter and has specialized types (:issue:`13936`)
Contributor

specialized dtypes, and elaborate on what this does (e.g. perf)

@chrisaycock Dec 4, 2016

Yes, it's to improve performance rather than casting all integer types to int64 (see my benchmarks pasted below). I can add a description to the whatsnew line.

by_dtypes = [('PyObjectHashTable', 'object'), ('Int64HashTable', 'int64_t')]
# by_dtype, table_type, init_table, s1, s2, s3, g1, g2
by_dtypes = [('int64_t', 'Int64HashTable', 'Int64HashTable(right_size)',
'.set_item(', ', ', ')',
Contributor

why are you changing this for the object table? it makes the code much more complicated. We don't use Python objects anywhere in Cython (instead we use the PyObjectHashTable).

Contributor Author

The PyObjectHashTable turns out to be slower than dict; see the by_object test:

[  0.00%] · For pandas commit hash c33c4cbe:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 10.00%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int                                        23.11ms
[ 20.00%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object                                  25.29ms
[ 30.00%] ··· Running join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby                                13.11ms
[ 40.00%] ··· Running join_merge.merge_asof_multiby.time_merge_asof_multiby                                     470.48ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                             9.54ms
[ 50.00%] · For pandas commit hash 725453de:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 60.00%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int                                        23.18ms
[ 70.00%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object                                  39.56ms
[ 80.00%] ··· Running join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby                                15.71ms
[ 90.00%] ··· Running join_merge.merge_asof_multiby.time_merge_asof_multiby                                       failed
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                             9.38ms
    before     after       ratio
  [725453de] [c33c4cbe]
-   15.71ms    13.11ms      0.83  join_merge.merge_asof_int32_noby.time_merge_asof_int32_noby
-   39.56ms    25.29ms      0.64  join_merge.merge_asof_by_object.time_merge_asof_by_object

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I stumbled on this when I tried the dict out of desperation, since PyObjectHashTable wasn't able to handle tuples the way I wanted.

I did end up adding the stringify() function, which solved a lot of problems, though it has its own significant overhead, as evidenced by the spike in runtime for the multiby test. I could switch back to PyObjectHashTable, since stringify() now simplifies a lot of the code. I just wish there were a fast way to hash NumPy arrays.

jreback commented Dec 10, 2016

pls update, as @jorisvandenbossche changed the asv benchmarks a bit

codecov-io commented Dec 12, 2016

Current coverage is 85.31% (diff: 100%)

Merging #14783 into master will increase coverage by <.01%

@@             master     #14783   diff @@
==========================================
  Files           144        144          
  Lines         51004      51023    +19   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43511      43532    +21   
+ Misses         7493       7491     -2   
  Partials          0          0          

Last update a8cabb8...ffcf0c2

@chrisaycock

@jreback I've updated the benchmarks:

· Running 10 total benchmarks (2 commits * 1 environments * 5 benchmarks)
[  0.00%] · For pandas commit hash fafbb022:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 10.00%] ··· Running join_merge.MergeAsof.time_by_int                                                           21.26ms
[ 20.00%] ··· Running join_merge.MergeAsof.time_by_object                                                        25.12ms
[ 30.00%] ··· Running join_merge.MergeAsof.time_multiby                                                         471.15ms
[ 40.00%] ··· Running join_merge.MergeAsof.time_noby                                                              9.62ms
[ 50.00%] ··· Running join_merge.MergeAsof.time_on_int32                                                         12.68ms
[ 50.00%] · For pandas commit hash 14e48153:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 60.00%] ··· Running join_merge.MergeAsof.time_by_int                                                           21.49ms
[ 70.00%] ··· Running join_merge.MergeAsof.time_by_object                                                        38.56ms
[ 80.00%] ··· Running join_merge.MergeAsof.time_multiby                                                           failed
[ 90.00%] ··· Running join_merge.MergeAsof.time_noby                                                              9.37ms
[100.00%] ··· Running join_merge.MergeAsof.time_on_int32                                                         15.42ms   
    before     after       ratio
  [14e48153] [fafbb022]
-   15.42ms    12.68ms      0.82  join_merge.MergeAsof.time_on_int32
-   38.56ms    25.12ms      0.65  join_merge.MergeAsof.time_by_object

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Please advise on the dict vs PyObjectHashTable. Should I revert that change? The dict definitely makes the code more complicated, but the benchmark improves significantly.

jreback commented Dec 12, 2016

@chrisaycock this adds significant complexity to the code, and I'm not in favor of this approach. That said, you might be able to use / create something like #14859, which does something very similar. IOW, you make a new hashtable type that has specialized properties (IIRC you are using tuples here?)

@chrisaycock

@jreback I'll just remove that change for now. The point of this PR is to get the "multi-by", so I'll focus on that. Thanks!

jreback commented Dec 12, 2016

@chrisaycock absolutely.

And FYI, you should generally just categorize your object dtypes anyhow. (We could even do that under the hood for merging purposes.)

pls make a new issue w.r.t. these points if you would like
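
A minimal sketch of the categorization idea, with a made-up frame (column names are illustrative, not from this PR):

import pandas as pd

# quotes table with a repeated string key stored as object dtype
quotes = pd.DataFrame({'ticker': ['MSFT', 'AAPL', 'MSFT', 'AAPL'],
                       'bid': [51.95, 98.00, 51.97, 98.01]})

# categorizing replaces per-row Python string comparisons with
# integer-code comparisons, which is what can make merges on the key cheaper
quotes['ticker'] = quotes['ticker'].astype('category')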

result = np.empty(n, dtype=np.object)

for i in range(n):
    result[i] = xt[i].tostring()
Contributor

why do you need this again? (I see you are using it), but what is the input that you are giving it?

Contributor

maybe a doc-string would help

Contributor Author

I've added a couple of comments to address this. When the by parameter has multiple entries, we want to store the entire array in the hash table. Unfortunately, NumPy arrays aren't hashable. After lots of digging, the fastest thing to do is to convert the values to a string.
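
A minimal sketch of that idea outside Cython (the PR's stringify() lives in the generated join code; names here are illustrative): for a fixed-width dtype, the array's raw buffer is a stable, hashable stand-in for the row.

import numpy as np

last_seen = {}  # stand-in for the hash table keyed by the 'by' values

row = np.array([17, 42], dtype=np.int64)  # one row of a multi-column 'by'
key = row.tostring()       # raw buffer bytes (.tobytes() in newer NumPy)
last_seen[key] = 7         # the array itself would raise TypeError here

# equal fixed-width rows always serialize to equal bytes, so lookups work
assert last_seen[np.array([17, 42], dtype=np.int64).tostring()] == 7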

@@ -452,6 +563,69 @@ def test_on_float(self):

        assert_frame_equal(result, expected)

    def test_on_specialized_type(self):
        # GH13936
        for dtype in [np.uint16, np.uint32, np.uint64,
Contributor

are int8/uint8 not allowed (OK by me)? Just test that you raise on them (or do they work some other way). Same for float16 (it should just raise).

@chrisaycock

AppVeyor is producing a very odd error. It keeps failing for test_multiby_heterogeneous_types, but computes a different resulting value each time!

The first run computes:

[nan, 51.95, nan, nan, 720.5]

The second run computes:

[nan, nan, nan, nan, 720.5]

The third run computes:

[nan, 51.95, nan, nan, nan]

There are no code differences between the second and third run! (I changed a line in the whatsnew to force a re-run in AppVeyor).

I have been unable to reproduce the error on my machine in either Py2 or Py3.

Given that the values change each run, I wonder if there is some element of randomness. The test_multiby_heterogeneous_types test invokes stringify() on a two-dimensional NumPy array in which each row contains a number and a string. But since I can't reproduce the error on my system (the results are always the same), I'm kind of at a loss for how to track this down.

jreback commented Dec 12, 2016

@chrisaycock I can run your PR on Windows tomorrow. But in any event, you should simply use hash_array instead of stringifying things (IIUC what you are doing)

@chrisaycock

@jreback I'm trying to use the array as a key, like

In [19]: xs = np.arange(3)

In [20]: h = PyObjectHashTable()

In [21]: h.set_item(xs, 1)
...
TypeError: unhashable type: 'numpy.ndarray'

The NumPy array is not hashable, so neither PyObjectHashTable nor dict can take it.

As for why I'm doing it this way, my asof_join logic stores the last-seen index for the by parameter as the function linearly scans the array of timestamps. That's why the algo works in a single pass, which makes it extremely fast.

To handle an array in the by parameter, I need to hash multiple values at once. Since a NumPy array isn't hashable, I went with turning it into a string. (.tostring() returns the raw bytes, which is sufficient for what I need.)

This has worked reasonably well in testing, at least on my Mac.
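
Not the PR's Cython code, but a pure-Python sketch of the single-pass algorithm described above (a backward asof-join keyed by a single 'by' column; names are illustrative). Both sides must be sorted by the 'on' values:

import numpy as np

def asof_join_backward_by(left_on, right_on, left_by, right_by):
    # for each left row: index of the last right row whose 'on' value
    # is <= the left 'on' value and whose 'by' key matches; -1 if none
    result = np.full(len(left_on), -1, dtype=np.int64)
    last_seen = {}  # by key -> index of most recent eligible right row
    r = 0
    for i in range(len(left_on)):
        # advance the right cursor over every row still in the past
        while r < len(right_on) and right_on[r] <= left_on[i]:
            last_seen[right_by[r]] = r
            r += 1
        result[i] = last_seen.get(left_by[i], -1)
    return result

asof_join_backward_by([2, 5], [1, 3, 4], ['a', 'a'], ['a', 'b', 'a'])
# -> array([0, 2]); each right row is visited exactly once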

jreback commented Dec 13, 2016

In [9]: from pandas.tools.hashing import hash_array

In [10]: arr = np.arange(3)

In [11]: arr
Out[11]: array([0, 1, 2])

In [13]: np.bitwise_xor.reduce(hash_array(arr))
Out[13]: 10178186328804430191

I had a PR to make this a convenience function, but it's pretty trivial.

jreback commented Dec 13, 2016

@chrisaycock so this passed for me on Windows (just noticed it failed on 3.4 on Travis, though).

so I suspect something is not stable in the multi-grouping...

@chrisaycock

@jreback Thanks for checking. I'll dig some more into this.

@chrisaycock

Just a note for posterity: After a lot of trial and error, I was able to get .tostring() to return different values for the same input:

In [291]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[291]: b'PH!\x00\x01\x00\x00\x00\x10qL\x0c\x01\x00\x00\x00'

In [292]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[292]: b'PH!\x00\x01\x00\x00\x00\xe0%>\x0c\x01\x00\x00\x00'

In [293]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[293]: b'PH!\x00\x01\x00\x00\x00\xf0!>\x0c\x01\x00\x00\x00'

In [294]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[294]: b'PH!\x00\x01\x00\x00\x000Wq\x0c\x01\x00\x00\x00'

In [295]: np.array([0, 'QSDU'], dtype=object).tostring()
Out[295]: b'PH!\x00\x01\x00\x00\x00@Zq\x0c\x01\x00\x00\x00'

So the culprit is definitely this approach: an object array stores pointers to its Python objects, and .tostring() returns those raw pointer bytes, which differ from one allocation to the next. I also tried Categorical, which worked but was 20 times slower. Since multi-by is already 20 times slower than what I want, I abandoned this path.

I will continue digging.

chrisaycock commented Dec 15, 2016

@jreback Using a tuple representation solved the problem. (TravisCI failed an unrelated test with "Network is unreachable".) Any further thoughts on this PR?
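
A minimal illustration of why the tuple representation is stable where .tostring() on an object array is not: tuples hash by element value, not by the pointer bytes serialized above.

import numpy as np

table = {}
row = np.array([0, 'QSDU'], dtype=object)

table[tuple(row)] = 7  # key is (0, 'QSDU'), identical across runs
assert table[tuple(np.array([0, 'QSDU'], dtype=object))] == 7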


_type_casters = {
    'int64_t': _ensure_int64,
    'double': _ensure_float64,
    'object': _ensure_object,
}

_cyton_types = {
Contributor

_cython_types ?

@@ -221,6 +221,117 @@ def test_missing_right_by(self):
        expected.loc[expected.ticker == 'MSFT', ['bid', 'ask']] = np.nan
        assert_frame_equal(result, expected)

    def test_multiby(self):
Contributor

can you add a test for raising on an invalid merge type (I think only float16)?

Contributor Author

Good idea. I've added it to test_on_specialized_type() and test_on_specialized_type_by_int().

@jreback added this to the 0.19.2 milestone Dec 15, 2016
jreback commented Dec 15, 2016

just a couple of minor changes. ping when green.

@chrisaycock

@jreback It's all green!

jreback commented Dec 16, 2016

thanks!

@jreback closed this in e7df751 Dec 16, 2016
@chrisaycock deleted the GH13936 branch December 17, 2016 03:50
jreback commented Dec 17, 2016

@chrisaycock very odd, some builds failed: https://travis-ci.org/pandas-dev/pandas/jobs/184684273

but these did clear up subsequently. Any ideas?

@chrisaycock

@jreback That's weird. The type specialization was the easy part; the "multi-by" logic was what caused all the grief on this PR.

I can't reproduce the error from the tip of the master branch in either 2.7 or 3.5 on my Mac. I will dig into this and see what I can find.

jreback commented Dec 17, 2016

I suppose it's a Cython caching problem.

@chrisaycock

@jreback Hmm, OK. Is there a way to prevent that in future tests? That NoneType error means that _asof_by_function() and _asof_function() can't find the Cython specialization for the desired data type. I should probably put a better error message in there instead of relying on the NoneType message anyway.
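
A hypothetical sketch of what that clearer failure could look like; _asof_by_functions here stands in for the PR's lookup table of generated Cython routines, and MergeError for pandas' existing merge exception:

class MergeError(ValueError):
    pass  # stand-in for pandas' merge exception

_asof_by_functions = {}  # ('on dtype', 'by dtype') -> Cython routine

def _asof_by_function(on_type, by_type):
    # name the unsupported dtype combination up front instead of
    # returning None and failing later with a cryptic NoneType error
    func = _asof_by_functions.get((on_type, by_type))
    if func is None:
        raise MergeError("merge_asof has no type specialization for "
                         "on=%r, by=%r" % (on_type, by_type))
    return func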

By the way, did test_on_specialized_type() change in between the time I submitted the PR and now?

I had used an else statement:

https://github.com/pandas-dev/pandas/pull/14783/files#diff-e00646757b932a2684fda4588f99009cR678

Now there's a continue statement:

https://github.com/chrisaycock/pandas/blob/master/pandas/tools/tests/test_merge_asof.py#L678

The effects are the same, so the nature of the test didn't change. I just think it's strange that I can't find anything in git blame.

jreback commented Dec 17, 2016

I made a tiny change in the tests when committing.

Yeah, I think this is a cache issue.

@chrisaycock

@jreback Alright, thanks. I'll add a better error message for a missing type specialization in a future commit.

jreback commented Dec 17, 2016

I don't think that's necessary.

It's compiled code, so it can't be missing.

ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)

closes pandas-dev#13936

Author: Christopher C. Aycock <[email protected]>

Closes pandas-dev#14783 from chrisaycock/GH13936 and squashes the following commits:

ffcf0c2 [Christopher C. Aycock] Added test to reject float16; fixed typos
1f208a8 [Christopher C. Aycock] Use tuple representation instead of strings
77eb47b [Christopher C. Aycock] Merge master branch into GH13936
89256f0 [Christopher C. Aycock] Test 8-bit integers and raise error on 16-bit floats; add comments
0ad1687 [Christopher C. Aycock] Fixed whatsnew
2bce3cc [Christopher C. Aycock] Revert dict back to PyObjectHashTable in response to code review
fafbb02 [Christopher C. Aycock] Updated benchmarks to reflect new ASV setup
5eeb7d9 [Christopher C. Aycock] Merge master into GH13936
c33c4cb [Christopher C. Aycock] Merge branch 'master' into GH13936
46cc309 [Christopher C. Aycock] Update documentation
f01142c [Christopher C. Aycock] Merge master branch
75157fc [Christopher C. Aycock] merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Dec 24, 2016
ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)

(cherry picked from commit e7df751; same squashed commit message as above)