Enhancement: 'Joiner'-Function for pd.merge/DataFrame.join #8962

Open
cstotzer opened this issue Dec 2, 2014 · 15 comments
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@cstotzer

cstotzer commented Dec 2, 2014

Would it be possible to add a parameter to pd.merge/DataFrame.join that accepts a function which performs the actual merging of the two joined rows?

def my_joiner(left, right):
    left.a = right.a if pd.isna(left.a) else left.a  # NaN cannot be compared with ==
    left.b = right.b if left.cnt < 3 else left.b
    ...
    return left

pd.merge(df1, df2, joiner=my_joiner)

As I often use merge/join to impute missing or statistically unreliable data, I find myself writing code like the following over and over again:

df3 = pd.merge(df1, df2)
df3['a'] = df3.a_x.fillna(df3.a_y)
df3['b'] = df3.apply(lambda x: x.b_x if x.cnt > 3 else x.b_y, axis=1)
...
df3 = df3.drop(['a_x', 'a_y', 'b_x', 'b_y'], axis=1)

This is rather cumbersome and cluttered.

@cstotzer cstotzer changed the title 'Joiner'-Function for pd.merge/DataFrame.join Enhancement: 'Joiner'-Function for pd.merge/DataFrame.join Dec 2, 2014
@jreback
Contributor

jreback commented Dec 2, 2014

Can you post a small but complete copy-pastable example?

@cstotzer
Author

cstotzer commented Dec 2, 2014

Sure, here goes:

import pandas as pd
import numpy as np

df_left = pd.DataFrame([[1, 3, 4], [2, 3, 5], [np.nan, 2, 8]],
                       index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
df_right = pd.DataFrame([[4, 2], [3, 6]], index=['b', 'c'], columns=['one', 'two'])

# What I do as of now:
# ================
df_merged = pd.merge(df_left, df_right, left_index=True, right_index=True, how='left')
# Coalescing two columns (like SQL's NVL/COALESCE functions)
df_merged['one'] = df_merged.one_x.fillna(df_merged.one_y)
# Setting a column's value depending on another (three might be the row count)
df_merged['two'] = df_merged.apply(lambda row: row.two_y if row.three <= 3 else row.two_x, axis=1)
# possibly many more of these
df_final = df_merged.drop(['one_x', 'one_y', 'two_x', 'two_y'], axis=1)

# What I would like to do:
# =================
def my_join_func(left, right):
    out = pd.Series(dtype='float64')
    out['one'] = right.one if pd.isna(left.one) else left.one
    out['two'] = right.two if left.three <= 3 else left.two
    return out

df_final = pd.merge(df_left, df_right, join_func=my_join_func, left_index=True, right_index=True, how='left')
# or
df_left.join(df_right, join_func=my_join_func, how='left')

@jreback
Contributor

jreback commented Dec 2, 2014

This looks like #7480, yes? (Though I like your example a bit better.)

@cstotzer
Author

cstotzer commented Dec 2, 2014

Not really. I'm looking for an easier way to specify which column to take from which dataset in the join operation, like in the SELECT clause of a SQL join:

SELECT COALESCE(l.one, r.one) AS one, 
       CASE WHEN r.three <= 3 THEN r.two ELSE l.two END AS two,
       three
  FROM df_left AS l
  LEFT JOIN df_right AS r ON l.index = r.index

@jreback
Contributor

jreback commented Dec 2, 2014

Having a joiner function is not going to be efficient at all. Can you not pre-filter?

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Dec 2, 2014
@cstotzer
Author

cstotzer commented Dec 2, 2014

You figure that

pd.merge(left, right, ...).apply(func) 

is more efficient?

@jreback
Contributor

jreback commented Dec 2, 2014

No, .apply is not efficient. I mean that using a Python function in the middle of a vectorized operation is not efficient.

Use .where instead, which will do an if-then-else:

result = result.where(......)

or you can do pd.merge(df_left.where(.....)....) (e.g. pre-filter the unwanted values to NaN)
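A concrete sketch of that suggestion, reusing df_left and df_right from the example above (the column logic mirrors the earlier apply-based version; .where is the vectorized if-then-else):

import pandas as pd

df_merged = pd.merge(df_left, df_right, left_index=True, right_index=True, how='left')
# COALESCE: take one_x unless it is NaN, then fall back to one_y
df_merged['one'] = df_merged.one_x.fillna(df_merged.one_y)
# vectorized if-then-else: keep two_x where three > 3, otherwise take two_y
df_merged['two'] = df_merged.two_x.where(df_merged.three > 3, df_merged.two_y)
df_final = df_merged.drop(['one_x', 'one_y', 'two_x', 'two_y'], axis=1)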

@bscheetz

bscheetz commented May 8, 2018

@jreback @cstotzer should this be closed? Given that it's a goal of pandas to minimize complexity in function calls, and that this functionality exists (in a vectorized form) via the DataFrame/Series .where methods, it seems like this is settled.

@banderlog

banderlog commented Jun 1, 2020

Please consider the following example:

We have two dataframes, something like this:

# DF1 (each SOME_ID has a few rows, sorted by date)
          SOME_ID        DATE   F0             F1              F2
1516     775335.0   2019-09-16  625.0          10.0            2.5   
1537     775335.0   2019-10-07  400.0          10.1            2.5   
1565     775335.0   2019-11-04  521.0           9.8            2.5   
1593     775335.0   2019-12-02  264.0          10.2            2.5   
1628     775335.0   2020-01-06  566.0           9.6            5.0
...
9999     789969.0   2020-02-16  666.0           9.9            5.5
# DF2 (each SOME_ID has a single row)
         SOME_ID  THRESHOLD_DATE
0       775335.0  2020-02-03
1       782756.0  2019-06-03
2       787437.0  2019-12-02
3       789498.0  2019-10-30
4       789969.0  2019-01-25
..           ...         ...

I want to do something like this:

select *
from DF1 a
inner join DF2 b
  on a.some_id = b.some_id
where b.threshold_date > a.date

In pandas I wrote this, because there is no conditional join:

foo = []
for some_id, df_tmp in df1.groupby('SOME_ID'):
    th_date = df2.query('SOME_ID == @some_id')['THRESHOLD_DATE'].iloc[0]
    tmp = df_tmp.query('DATE >= @th_date')
    foo.append(tmp)
final_df = pd.concat(foo)

And it is very slow, around 5-7 seconds for a 15K-row dataset, because of the loop.

How do I do this in a pythonic/pandas way without a conditional join?

@banderlog

This one's speed is ~2 sec for the same dataset:

(df1.groupby('SOME_ID').apply(lambda x: x[x['DATE'] >= df2.query('SOME_ID == @x.name')['THRESHOLD_DATE'].iloc[0]])).reset_index(drop=True)

But it is definitely not more pythonic or more readable.
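A vectorized alternative (a sketch, assuming the equi-join result fits in memory, and keeping the same >= comparison used above): do the equi-join on SOME_ID first, then apply the inequality as a plain boolean filter.

import pandas as pd

# Equi-join first, then filter; no Python-level loop over groups.
merged = df1.merge(df2, on='SOME_ID', how='inner')
final_df = merged[merged['DATE'] >= merged['THRESHOLD_DATE']].reset_index(drop=True)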

@mroeschke
Member

xref #7480 and #10309 for specific cases of merging based on flexible conditions (that can be rewritten as a joiner function)

@ms7463
Contributor

ms7463 commented Jul 1, 2021

It seems there are many issues related to conditional joining, and this is the main one tracking it, so I'll follow up here.

I recently opened (and closed, as a duplicate of this and others) an issue with a proposed working pseudo-solution. The main idea is that you perform a lazy cross join and then filter based on a provided function in a memory-efficient way. In my example, the memory-efficient filtering was done by breaking the two frames into chunks and cross-joining/filtering the Cartesian products of the chunks piece by piece, so that memory never blows up.

However, as stated in that issue, I don't think chunking is the ultimate solution; that was just for quick demo purposes. It could instead use some optimized Cython code that runs through the Cartesian product of rows of the two frames and applies the provided filter pair by pair (or at some other optimal interval).

In the proposed solution the API looks like this:

pd.lazy_cross_join(df, dates).loc[lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)]

Here lazy_cross_join returns a LazyMerge result, and the actual joining and filtering are deferred until the .loc accessor is called on it. However, this idea would still work with the API originally proposed here:

pd.merge(df1, df2, condition=lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date), how='cross')

(where the dfx passed into the condition lambda is the "deferred/lazy" cross-join result)

Issue/proposed pseudo-solution below:
#41150
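For illustration, a minimal sketch of the chunking idea described above (the helper name and chunk_size are hypothetical; pd.merge's how='cross' requires pandas >= 1.2):

import pandas as pd

def chunked_cross_join_filter(left, right, condition, chunk_size=1000):
    # Hypothetical helper: cross-join chunk pairs and filter each piece
    # immediately, so the full Cartesian product never materializes at once.
    pieces = []
    for i in range(0, len(left), chunk_size):
        lchunk = left.iloc[i:i + chunk_size]
        for j in range(0, len(right), chunk_size):
            rchunk = right.iloc[j:j + chunk_size]
            # overlapping column names get _x/_y suffixes, as in a normal merge
            prod = lchunk.merge(rchunk, how='cross')
            pieces.append(prod[condition(prod)])
    return pd.concat(pieces, ignore_index=True)

# Usage mirroring the proposed API above:
# chunked_cross_join_filter(df, dates,
#     lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date))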

@trianta2

A conditional join, similar to PySpark's, would be very useful. The PySpark join expression API is simple and clean.
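For reference, a sketch of such a join in PySpark (pdf1/pdf2 stand in for the pandas frames from the earlier example; the names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(pdf1)
df2 = spark.createDataFrame(pdf2)

# The join condition is an arbitrary boolean column expression,
# mixing equality and inequality predicates.
joined = df1.join(
    df2,
    on=(df1.SOME_ID == df2.SOME_ID) & (df1.DATE < df2.THRESHOLD_DATE),
    how='inner',
)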

@samukweku
Contributor

samukweku commented Aug 26, 2022

@aa1371 @trianta2 pyjanitor has a conditional_join function that covers non-equi joins. There is also a pending PR with a significant performance improvement, implemented in numba. Some benchmark results are listed in the PR.
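A sketch of the pyjanitor API, to the best of my understanding (conditions are (left_column, right_column, operator) tuples; column names follow the earlier example):

import pandas as pd
import janitor  # noqa: F401 -- registers the conditional_join DataFrame accessor

out = df1.conditional_join(
    df2,
    ('SOME_ID', 'SOME_ID', '=='),
    ('DATE', 'THRESHOLD_DATE', '<'),
    how='inner',
)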
