Enhancement: 'Joiner'-Function for pd.merge/DataFrame.join #8962
Comments
Can you post a small but complete copy-pastable example? |
Sure, here goes:

import pandas as pd
import numpy as np

df_left = pd.DataFrame([[1, 3, 4], [2, 3, 5], [np.nan, 2, 8]], index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
df_right = pd.DataFrame([[4, 2], [3, 6]], index=['b', 'c'], columns=['one', 'two'])

# What I do as of now:
# ================
df_merged = pd.merge(df_left, df_right, left_index=True, right_index=True, how='left')
# Coalescing two columns (like SQL's NVL/COALESCE functions)
df_merged['one'] = df_merged.one_x.fillna(df_merged.one_y)
# Setting a column's value depending on another (three might be the rowcount)
df_merged['two'] = df_merged.apply(lambda row: row.two_y if row.three <= 3 else row.two_x, axis=1)
# possibly many more of these
df_final = df_merged.drop(['one_x', 'one_y', 'two_x', 'two_y'], axis=1)

# What I would like to do:
# =================
def my_join_func(left, right):
    out = pd.Series(dtype='float64')
    out['one'] = right.one if pd.isna(left.one) else left.one  # note: `left.one == np.nan` is always False, so test with pd.isna
    out['two'] = right.two if left.three <= 3 else left.two
    return out

df_final = pd.merge(df_left, df_right, join_func=my_join_func, left_index=True, right_index=True, how='left')
# or
df_left.join(df_right, join_func=my_join_func, how='left') |
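For context, something close to the proposed join_func can already be approximated in user code today by wrapping the merge-then-combine pattern. This is only a sketch; merge_with_joiner and my_joiner are hypothetical names, not pandas API:

import numpy as np
import pandas as pd

def merge_with_joiner(left, right, joiner, **merge_kwargs):
    # Hypothetical helper: do the suffixed merge, then let the caller
    # build the final columns from the '_x'/'_y' pairs in one place.
    merged = pd.merge(left, right, suffixes=('_x', '_y'), **merge_kwargs)
    return joiner(merged)

def my_joiner(m):
    out = pd.DataFrame(index=m.index)
    out['one'] = m['one_x'].fillna(m['one_y'])                       # COALESCE(l.one, r.one)
    out['two'] = np.where(m['three'] <= 3, m['two_y'], m['two_x'])   # CASE WHEN three <= 3 THEN r.two ELSE l.two
    out['three'] = m['three']
    return out

df_final = merge_with_joiner(df_left, df_right, my_joiner,
                             left_index=True, right_index=True, how='left')

The difference from the request above is that the joiner works column-wise on the merged frame rather than row by row, which is also what keeps it vectorized.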
this looks like #7480 yes? (though I like your example a bit better). |
Not really, I'm looking for an easier way to specify which column to take from which dataset in the join operation, like in the SELECT clause of a SQL join:

SELECT COALESCE(l.one, r.one) AS one,
       CASE WHEN r.three <= 3 THEN r.two ELSE l.two END AS two,
       three
FROM df_left AS l
LEFT JOIN df_right AS r ON l.index = r.index |
having a |
You figure that
is more efficient? |
no, I mean which will do an if-then-else
or you can do |
Please consider the next example: we have two dataframes, something like this. I want to do something like that:

select *
from DF1 a
inner join DF2 b
on a.some_id = b.some_id
where b.threshold_date > a.date

In pandas I made this, because there is no conditional join:

foo = []
for id, df_tmp in df1.groupby('SOME_ID'):
    th_date = df2.query('SOME_ID == @id')['THRESHOLD_DATE'].iloc[0]
    tmp = df_tmp.query('DATE >= @th_date')
    foo.append(tmp)
final_df = pd.concat(foo)

And it is very slow, around 5-7 seconds for a 15K-row dataset, because of the loop in it. How can I do this in a pythonic/pandas way without a conditional join? |
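As a side note, one way to avoid the Python-level loop with existing pandas is to merge on the key first and then filter on the date condition. A minimal sketch, assuming the column names from the snippet above and that df2 has one THRESHOLD_DATE per SOME_ID:

import pandas as pd

# Equi-join on the key, then apply the inequality as a vectorized filter.
merged = df1.merge(df2[['SOME_ID', 'THRESHOLD_DATE']], on='SOME_ID', how='inner')
final_df = (
    merged.loc[merged['DATE'] >= merged['THRESHOLD_DATE']]
          .drop(columns='THRESHOLD_DATE')
          .reset_index(drop=True)
)

Memory grows with the size of the equi-join result, but all the work is vectorized, so it should be much faster than the groupby loop.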
This thing's speed is ~2 sec for the same dataset:

(df1.groupby('SOME_ID')
    .apply(lambda x: x[x['DATE'] >= df2.query('SOME_ID == @x.name')['THRESHOLD_DATE'].iloc[0]])
).reset_index(drop=True)

But it is definitely not more pythonic or more readable. |
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.join.html |
Seems there are many issues related to conditional joining, and this is the main one tracking it, so I'll follow up here. I recently opened (and closed, since it duplicates this and others) an issue with a proposed working pseudo-solution for this.

The main idea is that you perform a lazy cross join and then filter based on a provided function in a memory-efficient way. In my example the "filter in a memory-efficient way" part was done by breaking the two frames into chunks and cross joining/filtering the cartesian products of the chunks piece by piece, so that you never blow up your memory. However, as stated in that issue, I don't think chunking is the ultimate solution; that was just for quick demo purposes. It could instead use some optimized Cython code that runs through the cartesian product of rows of the two frames and applies the provided filter pair by pair (or at some other optimal interval).

In the proposed solution the API looks like this:
Where lazy_cross_join returns a
(where the dfx passed into the condition lambda is the "deferred/lazy" cross join result) Issue/proposed pseudo-solution below: |
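To make the chunking idea above concrete, here is a rough sketch (chunked_conditional_join is a hypothetical helper name, not a pandas or proposed API):

import pandas as pd

def chunked_conditional_join(left, right, condition, chunk_size=1000):
    # Cross join `left` against `right` one chunk of `left` at a time,
    # keeping only rows that satisfy `condition`, so the full cartesian
    # product never has to exist in memory at once.
    # (how='cross' requires pandas >= 1.2)
    pieces = []
    for i in range(0, len(left), chunk_size):
        left_chunk = left.iloc[i:i + chunk_size]
        cross = left_chunk.merge(right, how='cross', suffixes=('_l', '_r'))
        pieces.append(cross[condition(cross)])
    return pd.concat(pieces, ignore_index=True)

# e.g. chunked_conditional_join(df1, df2, lambda d: d['DATE'] >= d['THRESHOLD_DATE'])

This is the chunked demo version; as noted above, an optimized (e.g. Cython) row-pair filter would be the more serious implementation.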
A conditional join, similar to pyspark, would be very useful. The pyspark join expression API is simple and clean. |
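For readers unfamiliar with it, the pyspark join referenced above accepts an arbitrary Column expression as the join condition. A small illustration (the column names are just the ones used earlier in this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, '2021-01-05')], ['SOME_ID', 'DATE'])
df2 = spark.createDataFrame([(1, '2021-01-01')], ['SOME_ID', 'THRESHOLD_DATE'])

# The join condition is any boolean Column expression, not just an equality key.
cond = (df1.SOME_ID == df2.SOME_ID) & (df1.DATE >= df2.THRESHOLD_DATE)
joined = df1.join(df2, on=cond, how='inner')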
@aa1371 @trianta2 pyjanitor has a conditional_join function that covers non-equi joins. There is also a pending PR with a significant performance improvement, implemented in numba. Some benchmark results are listed in the PR. |
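Roughly, usage looks like the sketch below, assuming the column names from earlier in the thread; conditions are passed as (left_column, right_column, operator) tuples (check the pyjanitor docs for the exact, current signature):

import pandas as pd
import janitor  # registers the conditional_join accessor on DataFrames

out = df1.conditional_join(
    df2,
    ('SOME_ID', 'SOME_ID', '=='),
    ('DATE', 'THRESHOLD_DATE', '>='),
    how='inner',
)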
Would it be possible to add a parameter to pd.merge/DataFrame.join which accepts a function that does the actual merging of the two joined rows?
As I often use merge/join to impute missing or statistically unreliable data, I find myself writing code like the following over and over again:
Which is rather cumbersome and cluttered.