Enhancement: 'Joiner'-Function for pd.merge/DataFrame.join #8962
Comments
Can you post a small but complete copy-pastable example? |
Sure, here goes:

import pandas as pd
import numpy as np

df_left = pd.DataFrame([[1, 3, 4], [2, 3, 5], [np.nan, 2, 8]], index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
df_right = pd.DataFrame([[4, 2], [3, 6]], index=['b', 'c'], columns=['one', 'two'])

# What I do as of now:
# ================
df_merged = pd.merge(df_left, df_right, left_index=True, right_index=True, how='left')
# Coalescing two columns (like SQL's NVL/COALESCE functions)
df_merged['one'] = df_merged.one_x.fillna(df_merged.one_y)
# Setting a column's value depending on another (three might be the rowcount)
df_merged['two'] = df_merged.apply(lambda row: row.two_y if row.three <= 3 else row.two_x, axis=1)
# possibly many more of these
df_final = df_merged.drop(['one_x', 'one_y', 'two_x', 'two_y'], axis=1)

# What I would like to do:
# =================
def my_join_func(left, right):
    out = pd.Series(dtype='float64')
    out['one'] = right.one if pd.isna(left.one) else left.one  # note: `left.one == np.nan` is always False, so test with pd.isna
    out['two'] = right.two if left.three <= 3 else left.two
    return out

df_final = pd.merge(df_left, df_right, join_func=my_join_func, left_index=True, right_index=True, how='left')
# or
df_left.join(df_right, join_func=my_join_func, how='left') |
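For context, something close to the proposed join_func can already be approximated in user code today by wrapping the merge-then-combine pattern. This is only a sketch; merge_with_joiner and my_joiner are hypothetical names, not pandas API:

import numpy as np
import pandas as pd

def merge_with_joiner(left, right, joiner, **merge_kwargs):
    # Hypothetical helper: do the suffixed merge, then let the caller
    # build the final columns from the '_x'/'_y' pairs in one place.
    merged = pd.merge(left, right, suffixes=('_x', '_y'), **merge_kwargs)
    return joiner(merged)

def my_joiner(m):
    out = pd.DataFrame(index=m.index)
    out['one'] = m['one_x'].fillna(m['one_y'])                       # COALESCE(l.one, r.one)
    out['two'] = np.where(m['three'] <= 3, m['two_y'], m['two_x'])   # CASE WHEN three <= 3 THEN r.two ELSE l.two
    out['three'] = m['three']
    return out

df_final = merge_with_joiner(df_left, df_right, my_joiner,
                             left_index=True, right_index=True, how='left')

The difference from the request above is that the joiner works column-wise on the merged frame rather than row by row, which is also what keeps it vectorized.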
this looks like #7480 yes? (though I like your example a bit better). |
Not really, I'm looking for an easier way to specify which column to take from which dataset in the join operation, like in the SELECT clause of a SQL join:

SELECT COALESCE(l.one, r.one) AS one,
       CASE WHEN r.three <= 3 THEN r.two ELSE l.two END AS two,
       three
FROM df_left AS l
LEFT JOIN df_right AS r ON l.index = r.index |
having a |
You figure that
is more efficient? |
no, I mean which will do an if-then-else
or you can do |
Please consider the next example: we have two dataframes, something like this. I want to do something like that:

select *
from DF1 a
inner join DF2 b
on a.some_id = b.some_id
where b.threshold_date > a.date

In pandas I made this, because there is no conditional join:

foo = []
for id, df_tmp in df1.groupby('SOME_ID'):
    th_date = df2.query('SOME_ID == @id')['THRESHOLD_DATE'].iloc[0]
    tmp = df_tmp.query('DATE >= @th_date')
    foo.append(tmp)
final_df = pd.concat(foo)

And it is very slow, around 5-7 seconds for a 15K-row dataset, because of the loop in it. How can I do this in a pythonic/pandas way without a conditional join? |
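As a side note, one way to avoid the Python-level loop with existing pandas is to merge on the key first and then filter on the date condition. A minimal sketch, assuming the column names from the snippet above and that df2 has one THRESHOLD_DATE per SOME_ID:

import pandas as pd

# Equi-join on the key, then apply the inequality as a vectorized filter.
merged = df1.merge(df2[['SOME_ID', 'THRESHOLD_DATE']], on='SOME_ID', how='inner')
final_df = (
    merged.loc[merged['DATE'] >= merged['THRESHOLD_DATE']]
          .drop(columns='THRESHOLD_DATE')
          .reset_index(drop=True)
)

Memory grows with the size of the equi-join result, but all the work is vectorized, so it should be much faster than the groupby loop.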
This thing's speed is ~2 sec for the same dataset:

(df1.groupby('SOME_ID')
    .apply(lambda x: x[x['DATE'] >= df2.query('SOME_ID == @x.name')['THRESHOLD_DATE'].iloc[0]])
).reset_index(drop=True)

But it is definitely not more pythonic or more readable. |
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.join.html |
Seems there are many issues related to conditional joining, and this is the main one tracking it, so I'll follow up here. I recently opened (and closed, since it duplicates this and others) an issue with a proposed working pseudo-solution for this.

The main idea is that you perform a lazy cross join and then filter based on a provided function in a memory-efficient way. In my example the "filter in a memory-efficient way" part was done by breaking the two frames into chunks and cross joining/filtering the cartesian products of the chunks piece by piece, so that you never blow up your memory. However, as stated in that issue, I don't think chunking is the ultimate solution; that was just for quick demo purposes. It could instead use some optimized Cython code that runs through the cartesian product of rows of the two frames and applies the provided filter pair by pair (or at some other optimal interval).

In the proposed solution the API looks like this:
Where lazy_cross_join returns a
(where the dfx passed into the condition lambda is the "deferred/lazy" cross join result) Issue/proposed pseudo-solution below: |
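To make the chunking idea above concrete, here is a rough sketch (chunked_conditional_join is a hypothetical helper name, not a pandas or proposed API):

import pandas as pd

def chunked_conditional_join(left, right, condition, chunk_size=1000):
    # Cross join `left` against `right` one chunk of `left` at a time,
    # keeping only rows that satisfy `condition`, so the full cartesian
    # product never has to exist in memory at once.
    # (how='cross' requires pandas >= 1.2)
    pieces = []
    for i in range(0, len(left), chunk_size):
        left_chunk = left.iloc[i:i + chunk_size]
        cross = left_chunk.merge(right, how='cross', suffixes=('_l', '_r'))
        pieces.append(cross[condition(cross)])
    return pd.concat(pieces, ignore_index=True)

# e.g. chunked_conditional_join(df1, df2, lambda d: d['DATE'] >= d['THRESHOLD_DATE'])

This is the chunked demo version; as noted above, an optimized (e.g. Cython) row-pair filter would be the more serious implementation.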
A conditional join, similar to pyspark, would be very useful. The pyspark join expression API is simple and clean. |
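For readers unfamiliar with it, the pyspark join referenced above accepts an arbitrary Column expression as the join condition. A small illustration (the column names are just the ones used earlier in this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, '2021-01-05')], ['SOME_ID', 'DATE'])
df2 = spark.createDataFrame([(1, '2021-01-01')], ['SOME_ID', 'THRESHOLD_DATE'])

# The join condition is any boolean Column expression, not just an equality key.
cond = (df1.SOME_ID == df2.SOME_ID) & (df1.DATE >= df2.THRESHOLD_DATE)
joined = df1.join(df2, on=cond, how='inner')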
@aa1371 @trianta2 pyjanitor has a conditional_join function that covers non-equi joins. There is also a pending PR with a significant performance improvement, implemented in numba. Some benchmark results are listed in the PR. |
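Roughly, usage looks like the sketch below, assuming the column names from earlier in the thread; conditions are passed as (left_column, right_column, operator) tuples (check the pyjanitor docs for the exact, current signature):

import pandas as pd
import janitor  # registers the conditional_join accessor on DataFrames

out = df1.conditional_join(
    df2,
    ('SOME_ID', 'SOME_ID', '=='),
    ('DATE', 'THRESHOLD_DATE', '>='),
    how='inner',
)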
Would it be possible to add a parameter to pd.merge/DataFrame.join which accepts a function that does the actual merging of the two joined rows?
As I often use merge/join to impute missing or statistically unreliable data, I find myself writing code like the following over and over again:
Which is rather cumbersome and cluttered.