Commit 6d8c04c

ENH: add pd.asof_merge
closes #1870
xref #2941

http://nbviewer.jupyter.org/gist/jreback/5f089d308750c89b2a7d7446b790c056 is a notebook of example usage and timings

Author: Jeff Reback <[email protected]>

Closes #13358 from jreback/asof and squashes the following commits:

4592fa2 [Jeff Reback] TST: reorg tests/series/test_timeseries -> test_asof
1 parent fca35fb commit 6d8c04c

30 files changed

+1975
-278
lines changed

doc/source/api.rst

+3
Original file line numberDiff line numberDiff line change
@@ -151,6 +151,8 @@ Data manipulations
151151
cut
152152
qcut
153153
merge
154+
merge_ordered
155+
merge_asof
154156
concat
155157
get_dummies
156158
factorize
@@ -943,6 +945,7 @@ Time series-related
943945
:toctree: generated/
944946

945947
DataFrame.asfreq
948+
DataFrame.asof
946949
DataFrame.shift
947950
DataFrame.first_valid_index
948951
DataFrame.last_valid_index

doc/source/merging.rst

+129-37
@@ -104,7 +104,7 @@ some configurable handling of "what to do with the other axes":
104104
- ``ignore_index`` : boolean, default False. If True, do not use the index
105105
values on the concatenation axis. The resulting axis will be labeled 0, ...,
106106
n - 1. This is useful if you are concatenating objects where the
107-
concatenation axis does not have meaningful indexing information. Note
107+
concatenation axis does not have meaningful indexing information. Note
108108
the index values on the other axes are still respected in the join.
109109
- ``copy`` : boolean, default True. If False, do not copy data unnecessarily.
110110

@@ -544,12 +544,12 @@ Here's a description of what each argument is for:
544544
can be avoided are somewhat pathological but this option is provided
545545
nonetheless.
546546
- ``indicator``: Add a column to the output DataFrame called ``_merge``
547-
with information on the source of each row. ``_merge`` is Categorical-type
548-
and takes on a value of ``left_only`` for observations whose merge key
549-
only appears in ``'left'`` DataFrame, ``right_only`` for observations whose
550-
merge key only appears in ``'right'`` DataFrame, and ``both`` if the
551-
observation's merge key is found in both.
552-
547+
with information on the source of each row. ``_merge`` is Categorical-type
548+
and takes on a value of ``left_only`` for observations whose merge key
549+
only appears in ``'left'`` DataFrame, ``right_only`` for observations whose
550+
merge key only appears in ``'right'`` DataFrame, and ``both`` if the
551+
observation's merge key is found in both.
552+
553553
.. versionadded:: 0.17.0
554554

555555

@@ -718,7 +718,7 @@ The merge indicator
718718
df2 = DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
719719
merge(df1, df2, on='col1', how='outer', indicator=True)
720720
721-
The ``indicator`` argument will also accept string arguments, in which case the indicator function will use the value of the passed string as the name for the indicator column.
721+
The ``indicator`` argument will also accept string arguments, in which case the indicator function will use the value of the passed string as the name for the indicator column.
722722

723723
.. ipython:: python
724724
@@ -1055,34 +1055,6 @@ them together on their indexes. The same is true for ``Panel.join``.
10551055
labels=['left', 'right', 'right2'], vertical=False);
10561056
plt.close('all');
10571057
1058-
.. _merging.ordered_merge:
1059-
1060-
Merging Ordered Data
1061-
~~~~~~~~~~~~~~~~~~~~
1062-
1063-
New in v0.8.0 is the ordered_merge function for combining time series and other
1064-
ordered data. In particular it has an optional ``fill_method`` keyword to
1065-
fill/interpolate missing data:
1066-
1067-
.. ipython:: python
1068-
1069-
left = DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
1070-
'lv': [1, 2, 3, 4],
1071-
's': ['a', 'b', 'c', 'd']})
1072-
1073-
right = DataFrame({'k': ['K1', 'K2', 'K4'],
1074-
'rv': [1, 2, 3]})
1075-
1076-
result = ordered_merge(left, right, fill_method='ffill', left_by='s')
1077-
1078-
.. ipython:: python
1079-
:suppress:
1080-
1081-
@savefig merging_ordered_merge.png
1082-
p.plot([left, right], result,
1083-
labels=['left', 'right'], vertical=True);
1084-
plt.close('all');
1085-
10861058
.. _merging.combine_first.update:
10871059

10881060
Merging together values within Series or DataFrame columns
@@ -1132,4 +1104,124 @@ values inplace:
11321104
@savefig merging_update.png
11331105
p.plot([df1_copy, df2], df1,
11341106
labels=['df1', 'df2'], vertical=False);
1135-
plt.close('all');
1107+
plt.close('all');
1108+
1109+
.. _merging.time_series:
1110+
1111+
Timeseries friendly merging
1112+
---------------------------
1113+
1114+
.. _merging.merge_ordered:
1115+
1116+
Merging Ordered Data
1117+
~~~~~~~~~~~~~~~~~~~~
1118+
1119+
The ``pd.merge_ordered()`` function allows combining time series and other
1120+
ordered data. In particular it has an optional ``fill_method`` keyword to
1121+
fill/interpolate missing data:
1122+
1123+
.. ipython:: python
1124+
1125+
left = DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
1126+
'lv': [1, 2, 3, 4],
1127+
's': ['a', 'b', 'c', 'd']})
1128+
1129+
right = DataFrame({'k': ['K1', 'K2', 'K4'],
1130+
'rv': [1, 2, 3]})
1131+
1132+
result = pd.merge_ordered(left, right, fill_method='ffill', left_by='s')
1133+
1134+
.. ipython:: python
1135+
:suppress:
1136+
1137+
@savefig merging_ordered_merge.png
1138+
p.plot([left, right], result,
1139+
labels=['left', 'right'], vertical=True);
1140+
plt.close('all');
1141+
1142+
.. _merging.merge_asof:
1143+
1144+
Merging AsOf
1145+
~~~~~~~~~~~~
1146+
1147+
.. versionadded:: 0.18.2
1148+
1149+
A ``pd.merge_asof()`` is similar to an ordered left-join except that we
1150+
match on nearest key rather than equal keys.
1151+
1152+
For each row in the ``left`` DataFrame, we select the last row in the ``right``
1153+
DataFrame whose ``on`` key is less than or equal to the left's key. Both DataFrames must
1154+
be sorted by the key.
1155+
1156+
Optionally an asof merge can perform a group-wise merge. This matches the ``by`` key exactly,
1157+
in addition to the nearest match on the ``on`` key.
1158+
1159+
For example, we might have ``trades`` and ``quotes`` and we want to ``asof`` merge them.
1160+
1161+
.. ipython:: python
1162+
1163+
trades = pd.DataFrame({
1164+
'time': pd.to_datetime(['20160525 13:30:00.023',
1165+
'20160525 13:30:00.038',
1166+
'20160525 13:30:00.048',
1167+
'20160525 13:30:00.048',
1168+
'20160525 13:30:00.048']),
1169+
'ticker': ['MSFT', 'MSFT',
1170+
'GOOG', 'GOOG', 'AAPL'],
1171+
'price': [51.95, 51.95,
1172+
720.77, 720.92, 98.00],
1173+
'quantity': [75, 155,
1174+
100, 100, 100]},
1175+
columns=['time', 'ticker', 'price', 'quantity'])
1176+
1177+
quotes = pd.DataFrame({
1178+
'time': pd.to_datetime(['20160525 13:30:00.023',
1179+
'20160525 13:30:00.023',
1180+
'20160525 13:30:00.030',
1181+
'20160525 13:30:00.041',
1182+
'20160525 13:30:00.048',
1183+
'20160525 13:30:00.049',
1184+
'20160525 13:30:00.072',
1185+
'20160525 13:30:00.075']),
1186+
'ticker': ['GOOG', 'MSFT', 'MSFT',
1187+
'MSFT', 'GOOG', 'AAPL', 'GOOG',
1188+
'MSFT'],
1189+
'bid': [720.50, 51.95, 51.97, 51.99,
1190+
720.50, 97.99, 720.50, 52.01],
1191+
'ask': [720.93, 51.96, 51.98, 52.00,
1192+
720.93, 98.01, 720.88, 52.03]},
1193+
columns=['time', 'ticker', 'bid', 'ask'])
1194+
1195+
.. ipython:: python
1196+
1197+
trades
1198+
quotes
1199+
1200+
By default, each trade is matched with the most recent quote at or before the trade time.
1201+
1202+
.. ipython:: python
1203+
1204+
pd.merge_asof(trades, quotes,
1205+
on='time',
1206+
by='ticker')
1207+
1208+
We only asof within ``2ms`` between the quote time and the trade time.
1209+
1210+
.. ipython:: python
1211+
1212+
pd.merge_asof(trades, quotes,
1213+
on='time',
1214+
by='ticker',
1215+
tolerance=pd.Timedelta('2ms'))
1216+
1217+
We only asof within ``10ms`` between the quote time and the trade time and we exclude exact matches on time.
1218+
Note that though we exclude the exact matches (of the quotes), prior quotes DO propagate to that point
1219+
in time.
1220+
1221+
.. ipython:: python
1222+
1223+
pd.merge_asof(trades, quotes,
1224+
on='time',
1225+
by='ticker',
1226+
tolerance=pd.Timedelta('10ms'),
1227+
allow_exact_matches=False)
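
The matching rule described above can be sketched in plain Python. This is only an illustration of the semantics, not how pandas implements ``merge_asof``; the helper name ``asof_lookup`` is hypothetical, and treating the ``tolerance`` bound as inclusive is an assumption of this sketch. The input values are the small ``left``/``right`` frames from the whatsnew entry in this commit.

```python
from bisect import bisect_left, bisect_right


def asof_lookup(left_keys, right_keys, right_vals,
                tolerance=None, allow_exact_matches=True):
    """Hypothetical helper: for each left key, pick the value of the last
    right row whose key is <= the left key (strictly < when
    allow_exact_matches is False). Both key lists must be sorted."""
    out = []
    for key in left_keys:
        # bisect_right counts right keys <= key; bisect_left counts keys < key.
        if allow_exact_matches:
            pos = bisect_right(right_keys, key)
        else:
            pos = bisect_left(right_keys, key)
        if pos == 0:
            out.append(None)  # no eligible right row precedes this key
            continue
        match = pos - 1
        # Assumption of this sketch: a gap equal to the tolerance still matches.
        if tolerance is not None and key - right_keys[match] > tolerance:
            out.append(None)
            continue
        out.append(right_vals[match])
    return out


left_a = [1, 5, 10]
right_a = [1, 2, 3, 6, 7]
right_val = [1, 2, 3, 6, 7]

default = asof_lookup(left_a, right_a, right_val)
strict = asof_lookup(left_a, right_a, right_val, allow_exact_matches=False)
near = asof_lookup(left_a, right_a, right_val, tolerance=2)
```

Here ``None`` plays the role of the ``NaN`` that a real merge would place in the right-hand columns when no row qualifies.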

doc/source/whatsnew/v0.18.2.txt

+93-1
@@ -19,6 +19,97 @@ Highlights include:
1919
New features
2020
~~~~~~~~~~~~
2121

22+
.. _whatsnew_0182.enhancements.asof_merge:
23+
24+
``pd.merge_asof()`` for asof-style time-series joining
25+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
26+
27+
A long-time requested feature has been added through the :func:`merge_asof` function, to
28+
support asof style joining of time-series (:issue:`1870`). Full documentation is
29+
:ref:`here <merging.merge_asof>`.
30+
31+
The :func:`merge_asof` function performs an asof merge, which is similar to a left-join
32+
except that we match on nearest key rather than equal keys.
33+
34+
.. ipython:: python
35+
36+
left = pd.DataFrame({'a': [1, 5, 10],
37+
'left_val': ['a', 'b', 'c']})
38+
right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
39+
'right_val': [1, 2, 3, 6, 7]})
40+
41+
left
42+
right
43+
44+
We typically want to match exactly when possible, and use the most
45+
recent value otherwise.
46+
47+
.. ipython:: python
48+
49+
pd.merge_asof(left, right, on='a')
50+
51+
We can also match rows ONLY with prior data, excluding exact matches.
52+
53+
.. ipython:: python
54+
55+
pd.merge_asof(left, right, on='a', allow_exact_matches=False)
56+
57+
58+
In a typical time-series example, we have ``trades`` and ``quotes`` and we want to ``asof-join`` them.
59+
This also illustrates using the ``by`` parameter to group data before merging.
60+
61+
.. ipython:: python
62+
63+
trades = pd.DataFrame({
64+
'time': pd.to_datetime(['20160525 13:30:00.023',
65+
'20160525 13:30:00.038',
66+
'20160525 13:30:00.048',
67+
'20160525 13:30:00.048',
68+
'20160525 13:30:00.048']),
69+
'ticker': ['MSFT', 'MSFT',
70+
'GOOG', 'GOOG', 'AAPL'],
71+
'price': [51.95, 51.95,
72+
720.77, 720.92, 98.00],
73+
'quantity': [75, 155,
74+
100, 100, 100]},
75+
columns=['time', 'ticker', 'price', 'quantity'])
76+
77+
quotes = pd.DataFrame({
78+
'time': pd.to_datetime(['20160525 13:30:00.023',
79+
'20160525 13:30:00.023',
80+
'20160525 13:30:00.030',
81+
'20160525 13:30:00.041',
82+
'20160525 13:30:00.048',
83+
'20160525 13:30:00.049',
84+
'20160525 13:30:00.072',
85+
'20160525 13:30:00.075']),
86+
'ticker': ['GOOG', 'MSFT', 'MSFT',
87+
'MSFT', 'GOOG', 'AAPL', 'GOOG',
88+
'MSFT'],
89+
'bid': [720.50, 51.95, 51.97, 51.99,
90+
720.50, 97.99, 720.50, 52.01],
91+
'ask': [720.93, 51.96, 51.98, 52.00,
92+
720.93, 98.01, 720.88, 52.03]},
93+
columns=['time', 'ticker', 'bid', 'ask'])
94+
95+
.. ipython:: python
96+
97+
trades
98+
quotes
99+
100+
An asof merge joins on the ``on`` key, typically an ordered datetimelike field, and
101+
in this case we are using a grouper in the ``by`` field. This is like a left-outer join, except
102+
that forward filling happens automatically, taking the most recent non-NaN value.
103+
104+
.. ipython:: python
105+
106+
pd.merge_asof(trades, quotes,
107+
on='time',
108+
by='ticker')
109+
110+
This returns a merged DataFrame with the entries in the same order as the original left
111+
passed DataFrame (``trades`` in this case), with the fields of ``quotes`` merged in.
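
To make the role of ``by`` concrete, the following plain-Python sketch mimics the grouped lookup on the ``trades`` and ``quotes`` data above. It is an illustration under stated assumptions rather than the pandas implementation: times are reduced to integer milliseconds after 13:30:00, the helper ``asof_by`` is hypothetical, and ``None`` stands in for ``NaN``.

```python
from bisect import bisect_right
from collections import defaultdict

# (milliseconds after 13:30:00, ticker, bid, ask), sorted by time
quotes = [
    (23, 'GOOG', 720.50, 720.93),
    (23, 'MSFT', 51.95, 51.96),
    (30, 'MSFT', 51.97, 51.98),
    (41, 'MSFT', 51.99, 52.00),
    (48, 'GOOG', 720.50, 720.93),
    (49, 'AAPL', 97.99, 98.01),
    (72, 'GOOG', 720.50, 720.88),
    (75, 'MSFT', 52.01, 52.03),
]

# (milliseconds after 13:30:00, ticker), sorted by time
trades = [(23, 'MSFT'), (38, 'MSFT'), (48, 'GOOG'), (48, 'GOOG'), (48, 'AAPL')]


def asof_by(trades, quotes):
    """Hypothetical helper: for each trade, return the (bid, ask) of the
    latest quote for the same ticker at or before the trade time."""
    times = defaultdict(list)
    rows = defaultdict(list)
    # Split the quotes per ticker; each group stays sorted by time.
    for t, ticker, bid, ask in quotes:
        times[ticker].append(t)
        rows[ticker].append((bid, ask))
    result = []
    for t, ticker in trades:
        # Number of quotes for this ticker with time <= trade time.
        pos = bisect_right(times.get(ticker, []), t)
        result.append(rows[ticker][pos - 1] if pos else (None, None))
    return result


matched = asof_by(trades, quotes)
```

The output keeps one entry per trade, in the order of ``trades``. The AAPL trade gets no quote because the only AAPL quote arrives one millisecond after it.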
112+
22113
.. _whatsnew_0182.enhancements.read_csv_dupe_col_names_support:
23114

24115
``pd.read_csv`` has improved support for duplicate column names
@@ -124,8 +215,8 @@ Other enhancements
124215
idx.where([True, False, True])
125216

126217
- ``Categorical.astype()`` now accepts an optional boolean argument ``copy``, effective when dtype is categorical (:issue:`13209`)
218+
- ``DataFrame`` has gained the ``.asof()`` method to return the last non-NaN values according to the selected subset (:issue:`13358`)
127219
- Consistent with the Python API, ``pd.read_csv()`` will now interpret ``+inf`` as positive infinity (:issue:`13274`)
128-
129220
- The ``DataFrame`` constructor will now respect key ordering if a list of ``OrderedDict`` objects are passed in (:issue:`13304`)
130221
- ``pd.read_html()`` has gained support for the ``decimal`` option (:issue:`12907`)
131222
- A ``union_categorical`` function has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`)
@@ -335,6 +426,7 @@ Deprecations
335426
- ``compact_ints`` and ``use_unsigned`` have been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13320`)
336427
- ``buffer_lines`` has been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13360`)
337428
- ``as_recarray`` has been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13373`)
429+
- top-level ``pd.ordered_merge()`` has been renamed to ``pd.merge_ordered()`` and the original name will be removed in a future version (:issue:`13358`)
338430

339431
.. _whatsnew_0182.performance:
340432

pandas/__init__.py

+2-1
@@ -43,7 +43,8 @@
4343
from pandas.io.api import *
4444
from pandas.computation.api import *
4545

46-
from pandas.tools.merge import merge, concat, ordered_merge
46+
from pandas.tools.merge import (merge, concat, ordered_merge,
47+
merge_ordered, merge_asof)
4748
from pandas.tools.pivot import pivot_table, crosstab
4849
from pandas.tools.plotting import scatter_matrix, plot_params
4950
from pandas.tools.tile import cut, qcut
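
The import change above keeps the old name ``ordered_merge`` available next to the new ``merge_ordered``, in line with the deprecation note in the whatsnew entry. How pandas wires up the old name is not shown in this excerpt; the sketch below is one common way to keep a deprecated alias working. Both function bodies are stand-ins, and the choice of ``FutureWarning`` is an assumption.

```python
import warnings


def merge_ordered(left, right, **kwargs):
    # Stand-in for the real function: it only echoes its inputs.
    return ("merge_ordered", left, right, kwargs)


def ordered_merge(left, right, **kwargs):
    """Hypothetical deprecated alias that forwards to merge_ordered."""
    warnings.warn("ordered_merge is deprecated, use merge_ordered instead",
                  FutureWarning, stacklevel=2)
    return merge_ordered(left, right, **kwargs)


# Calling the old name still works but emits a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = ordered_merge("left", "right", fill_method="ffill")
```

With ``stacklevel=2`` the warning is attributed to the caller's line rather than to the alias itself, which makes the deprecated call site easier to find.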
