Skip to content

merge_asof fails when using left_index/right_index with a tolerance parameter #15135

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ajcr opened this issue Jan 15, 2017 · 5 comments
Closed
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@ajcr
Copy link
Contributor

ajcr commented Jan 15, 2017

Problem description

Using the DataFrames trades and quotes from the merge_asof documentation as an example, there's an issue when using merge_asof to merge on indexes using left_index/right_index (rather than column names) and additionally specifying a tolerance parameter.

Admittedly, the left_index/right_index parameters are not documented for this function (in the prototype description), but they are still valid parameters (and have an example in the docs). It could be seen as inconsistent that they sometimes work as expected and sometimes raise a cryptic error message.

For example, first set the 'time' column of each DataFrame as the index:

>>> quotes = quotes.set_index('time')
>>> trades = trades.set_index('time')

Then pd.merge_asof(trades, quotes, left_index=True, right_index=True, by='ticker') works exactly as pd.asof_merge(trades, quotes, on='time', by='ticker') (as expected).

However, pd.merge_asof(trades, quotes, left_index=True, right_index=True, by='ticker', tolerance=pd.Timedelta('2ms')) (specifying a tolerance) raises an error:

/.../anaconda3/lib/python3.4/site-packages/pandas/tools/merge.py in _get_merge_keys(self)
   1147 
   1148             else:
-> 1149                 raise MergeError("key must be integer or timestamp")
   1150 
   1151         # validate allow_exact_matches

MergeError: key must be integer or timestamp

I would have expected it work exactly as pd.merge_asof(trades, quotes, on='time', by='ticker', tolerance=pd.Timedelta('2ms')) works. Failing this, an error telling me that I cannot use a tolerance when merging on indexes would be welcome.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.19.2
nose: 1.3.4
pip: 9.0.1
setuptools: 31.0.1.post20161215
Cython: 0.23.4
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 5.1.0
sphinx: 1.2.3
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.6.4
bs4: 4.3.2
html5lib: 0.99999
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.36.0
pandas_datareader: None

@ajcr
Copy link
Contributor Author

ajcr commented Jan 15, 2017

If you patch pandas/tools/merge.py at around line 1131 with something like

if self.tolerance is not None:

    if self.left_index:
        lt = self.left.index
    else:
        lt = left_join_keys[-1]

Then the expected result is obtained (i.e. the merge is successful).

Happy to turn this into a PR if you agree it's the way to go.

@jreback
Copy link
Contributor

jreback commented Jan 15, 2017

hmm, that does look buggy.

cc @chrisaycock

@jreback jreback added Bug Difficulty Intermediate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 15, 2017
@jreback jreback added this to the 0.20.0 milestone Jan 15, 2017
@jreback
Copy link
Contributor

jreback commented Jan 15, 2017

FYI these are documented in 0.19.2, when they were added (left_index, right_index) (you show'd an older link)

@chrisaycock
Copy link
Contributor

Hmm, left_join_keys is returned from _get_merge_keys() (starting at line 755). That function should have set this correctly.

_get_merge_keys() first checks if both left_on and right_on have been set, then goes through the cases of whether just one of them is set (and assumes that left_index/right_index would have been set because of prior error checking). That's where you see lines:

right_keys = [self.right.index.values]

and later:

left_keys = [self.left.index.values]

I don't see a terminating else that I would expect at line 857 that would handle when both left_index and right_index are set. Perhaps this was never needed for non-asof joins.

I can look into this later today to see if the fix is required there, though I would want to run a bunch of regression tests.

By the way, if I set the index in just one DataFrame, then it works, leading evidence to my above hypothesis:

In [16]: pd.merge_asof(trades, quotes, left_on='time', right_index=True, by='ticker', tolerance=pd.Timedelta('2ms'))
Out[16]: 
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

and

In [22]: pd.merge_asof(trades, quotes, left_index=True, right_on='time', by='ticker', tolerance=pd.Timedelta('2ms'))
Out[22]: 
  ticker   price  quantity                    time     bid     ask
1   MSFT   51.95        75 2016-05-25 13:30:00.023   51.95   51.96
7   MSFT   51.95       155 2016-05-25 13:30:00.038     NaN     NaN
4   GOOG  720.77       100 2016-05-25 13:30:00.048  720.50  720.93
4   GOOG  720.92       100 2016-05-25 13:30:00.048  720.50  720.93
7   AAPL   98.00       100 2016-05-25 13:30:00.048     NaN     NaN

Good catch, @ajcr.

@chrisaycock
Copy link
Contributor

My proposal turned-out to be pretty difficult because of the integrated by logic. That's probably why it wasn't there in the first place. I just went with @ajcr's solution instead.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
…of() (pandas-dev#15135)

closes pandas-dev#15135

Author: Christopher C. Aycock <[email protected]>

Closes pandas-dev#15139 from chrisaycock/GH15135 and squashes the following commits:

17c902a [Christopher C. Aycock] BUG: Allow tolerance when left_index/right_index enabled for merge_asof() (pandas-dev#15135)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants