Skip to content

DataFrame self-joins #2996

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
willfurnass opened this issue Mar 9, 2013 · 7 comments
Closed

DataFrame self-joins #2996

willfurnass opened this issue Mar 9, 2013 · 7 comments

Comments

@willfurnass
Copy link
Contributor

Given the following DataFrame

area point test value
A 11 0 1234234
A 11 1 12341234
A 16 0 234234
A 16 1 2343
A 16 2 234234
C 4 0 234234
C 4 1 234234

it would be nice if there were a way of grouping say columns area and point and comparing the value per test > 1 with the value for test - 1.

This can be done by iterating over df.groupby(['area', 'point', 'test']) and using the sorting provided by groupby() on the specified columns to compare current and previous values. However, it would be neat if this could also be done in a more Pandas-esque way using something akin to a SQL self-join.

NB request first made in pystatsmodels Google Group; was asked by Wes to create a Github issue for this.

@wesm
Copy link
Member

wesm commented Mar 9, 2013

Doesn't this do what you want?

In [25]: merge(df, df, left_on=['point', 'area', 'test'], right_on=['point', 'area', df['test'] - 1])
Out[25]: 
  area  point  test  test_x  value_x  test_y   value_y
0    A     11     0       0  1234234       1  12341234
1    A     16     0       0   234234       1      2343
2    A     16     1       1     2343       2    234234
3    C      4     0       0   234234       1    234234

or

In [24]: merge(df, df, left_on='test', right_on=df['test'] - 1)
Out[24]: 
    test area_x  point_x  test_x   value_x area_y  point_y  test_y   value_y
0      0      A       11       0   1234234      A       11       1  12341234
1      0      A       11       0   1234234      A       16       1      2343
2      0      A       11       0   1234234      C        4       1    234234
3      0      A       16       0    234234      A       11       1  12341234
4      0      A       16       0    234234      A       16       1      2343
5      0      A       16       0    234234      C        4       1    234234
6      0      C        4       0    234234      A       11       1  12341234
7      0      C        4       0    234234      A       16       1      2343
8      0      C        4       0    234234      C        4       1    234234
9      1      A       11       1  12341234      A       16       2    234234
10     1      A       16       1      2343      A       16       2    234234
11     1      C        4       1    234234      A       16       2    234234

@willfurnass
Copy link
Contributor Author

Cheers, that does the trick. I'll close this Issue.

@mizzao
Copy link

mizzao commented Oct 7, 2015

In pandas 0.16.2 I get the following error when trying to attempt a join like this, using the following code to add one column:

df_filtered = df[['gameId', 'round', 'player', 'action']]
df.merge(df_filtered, left_on=['gameId', 'player', 'round'], right_on=['gameId', 'player', df_filtered.round + 1], how='left')

results in

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-597-6ba80432ed68> in <module>()
      1 # Add last action to dataframe
      2 df_filtered = df[['gameId', 'round', 'player', 'action']]
----> 3 pd.DataFrame.merge(df, df_filtered, left_on=['gameId', 'player', 'round'], right_on=['gameId', 'player', df_filtered.round + 1], how='left')

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
   4069                      left_on=left_on, right_on=right_on,
   4070                      left_index=left_index, right_index=right_index, sort=sort,
-> 4071                      suffixes=suffixes, copy=copy)
   4072 
   4073     #----------------------------------------------------------------------

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_index=right_index, sort=sort, suffixes=suffixes,
     37                          copy=copy)
---> 38     return op.get_result()
     39 if __debug__:
     40     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in get_result(self)
    203         result = typ(result_data).__finalize__(self, method='merge')
    204 
--> 205         self._maybe_add_join_keys(result, left_indexer, right_indexer)
    206 
    207         return result

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in _maybe_add_join_keys(self, result, left_indexer, right_indexer)
    249                 right_na_indexer = right_indexer.take(na_indexer)
    250                 key_col.put(na_indexer, com.take_1d(self.right_join_keys[i],
--> 251                                                     right_na_indexer))
    252                 result.insert(i, name, key_col)
    253 

/usr/local/lib/python2.7/dist-packages/pandas/core/common.pyc in take_nd(arr, indexer, axis, out, fill_value, mask_info, allow_fill)
    842 
    843     indexer = _ensure_int64(indexer)
--> 844     func(arr, indexer, out, fill_value)
    845 
    846     if flip_order:

TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got Series)

This error also seems to have been reproduced by another user: http://stackoverflow.com/q/24091385/586086

@jreback
Copy link
Contributor

jreback commented Oct 8, 2015

you would have to show the input frame df_filtered.info() and df.head()

@TomAugspurger
Copy link
Contributor

Also, right_on=['gameId', 'player', df_filtered.round + 1] doesn't really make sense. right_on should be a list of strings, but df_filtered.round + 1 will be a Series.

Oh, and pandas 0.17 added a DataFrame.round method. Use df_filtered['round'].

@mizzao
Copy link

mizzao commented Oct 9, 2015

@TomAugspurger: @wesm used the series in place of strings in his post above. In any case, that syntax produces the same error:

df_test.merge(df_test, left_on=['gameId', 'player', 'round'], right_on=['gameId', 'player', df_test['round'] + 1], how='left')

@jreback, the info you requested (See here for a description of the data):

df_test.head()

    gameId  round   player  action
0   Afom9bWqYBgZXXXN8   1   PvQ8B5kuA9Fbq9N59   1
1   Afom9bWqYBgZXXXN8   1   PJmJgrqusFZ8KRShQ   0
2   Afom9bWqYBgZXXXN8   2   PvQ8B5kuA9Fbq9N59   0
3   Afom9bWqYBgZXXXN8   2   PJmJgrqusFZ8KRShQ   0
4   Afom9bWqYBgZXXXN8   3   PJmJgrqusFZ8KRShQ   0
5   Afom9bWqYBgZXXXN8   3   PvQ8B5kuA9Fbq9N59   0
20  QdZM4yPMnjGj8f25R   1   Q6knaWEruc6BDPQT7   1
21  QdZM4yPMnjGj8f25R   1   xnAjMcWaFRpfBbukz   1
22  QdZM4yPMnjGj8f25R   2   xnAjMcWaFRpfBbukz   1
23  QdZM4yPMnjGj8f25R   2   Q6knaWEruc6BDPQT7   0
24  QdZM4yPMnjGj8f25R   3   Q6knaWEruc6BDPQT7   1
25  QdZM4yPMnjGj8f25R   3   xnAjMcWaFRpfBbukz   1
40  riMD6ctT8DLwdhHpE   1   EKkrMpMqy2PRLm7ur   1
41  riMD6ctT8DLwdhHpE   1   EqbbmngPfZBEmPTzq   1
42  riMD6ctT8DLwdhHpE   2   EKkrMpMqy2PRLm7ur   1
43  riMD6ctT8DLwdhHpE   2   EqbbmngPfZBEmPTzq   1
44  riMD6ctT8DLwdhHpE   3   EqbbmngPfZBEmPTzq   1
45  riMD6ctT8DLwdhHpE   3   EKkrMpMqy2PRLm7ur   1
60  hyEjkAg5K4WpubJA9   1   7CHpY4setLKb9ssnN   1
61  hyEjkAg5K4WpubJA9   1   hbud2J3YvitEhj4xZ   0
62  hyEjkAg5K4WpubJA9   2   hbud2J3YvitEhj4xZ   0
63  hyEjkAg5K4WpubJA9   2   7CHpY4setLKb9ssnN   0
64  hyEjkAg5K4WpubJA9   3   7CHpY4setLKb9ssnN   0
65  hyEjkAg5K4WpubJA9   3   hbud2J3YvitEhj4xZ   1
80  ay5pmpeNcwqHJ8JBH   1   tWA9ZxSnKpZyWwYsQ   1
81  ay5pmpeNcwqHJ8JBH   1   2qiHdJgL4WQe5qrHQ   1
82  ay5pmpeNcwqHJ8JBH   2   2qiHdJgL4WQe5qrHQ   1
83  ay5pmpeNcwqHJ8JBH   2   tWA9ZxSnKpZyWwYsQ   1
84  ay5pmpeNcwqHJ8JBH   3   tWA9ZxSnKpZyWwYsQ   1
85  ay5pmpeNcwqHJ8JBH   3   2qiHdJgL4WQe5qrHQ   1

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 374251 entries, 0 to 374250
Data columns (total 4 columns):
gameId    374251 non-null object
player    374251 non-null object
round     374251 non-null int64
action    374251 non-null int64
dtypes: int64(2), object(2)
memory usage: 14.3+ MB

@jreback
Copy link
Contributor

jreback commented Oct 14, 2015

Is this what you want?

In [39]: df['round2']=df['round']+1

In [40]: df.merge(df, left_on=['gameId', 'player', 'round'], right_on=['gameId', 'player','round2'], how='left')
Out[40]: 
               gameId  round_x             player  action_x  round2_x  round_y  action_y  round2_y
0   Afom9bWqYBgZXXXN8        1  PvQ8B5kuA9Fbq9N59         1         2      NaN       NaN       NaN
1   Afom9bWqYBgZXXXN8        1  PJmJgrqusFZ8KRShQ         0         2      NaN       NaN       NaN
2   Afom9bWqYBgZXXXN8        2  PvQ8B5kuA9Fbq9N59         0         3        1         1         2
3   Afom9bWqYBgZXXXN8        2  PJmJgrqusFZ8KRShQ         0         3        1         0         2
4   Afom9bWqYBgZXXXN8        3  PJmJgrqusFZ8KRShQ         0         4        2         0         3
5   Afom9bWqYBgZXXXN8        3  PvQ8B5kuA9Fbq9N59         0         4        2         0         3
6   QdZM4yPMnjGj8f25R        1  Q6knaWEruc6BDPQT7         1         2      NaN       NaN       NaN
7   QdZM4yPMnjGj8f25R        1  xnAjMcWaFRpfBbukz         1         2      NaN       NaN       NaN
8   QdZM4yPMnjGj8f25R        2  xnAjMcWaFRpfBbukz         1         3        1         1         2
9   QdZM4yPMnjGj8f25R        2  Q6knaWEruc6BDPQT7         0         3        1         1         2
10  QdZM4yPMnjGj8f25R        3  Q6knaWEruc6BDPQT7         1         4        2         0         3
11  QdZM4yPMnjGj8f25R        3  xnAjMcWaFRpfBbukz         1         4        2         1         3
12  riMD6ctT8DLwdhHpE        1  EKkrMpMqy2PRLm7ur         1         2      NaN       NaN       NaN
13  riMD6ctT8DLwdhHpE        1  EqbbmngPfZBEmPTzq         1         2      NaN       NaN       NaN
14  riMD6ctT8DLwdhHpE        2  EKkrMpMqy2PRLm7ur         1         3        1         1         2
15  riMD6ctT8DLwdhHpE        2  EqbbmngPfZBEmPTzq         1         3        1         1         2
16  riMD6ctT8DLwdhHpE        3  EqbbmngPfZBEmPTzq         1         4        2         1         3
17  riMD6ctT8DLwdhHpE        3  EKkrMpMqy2PRLm7ur         1         4        2         1         3
18  hyEjkAg5K4WpubJA9        1  7CHpY4setLKb9ssnN         1         2      NaN       NaN       NaN
19  hyEjkAg5K4WpubJA9        1  hbud2J3YvitEhj4xZ         0         2      NaN       NaN       NaN
20  hyEjkAg5K4WpubJA9        2  hbud2J3YvitEhj4xZ         0         3        1         0         2
21  hyEjkAg5K4WpubJA9        2  7CHpY4setLKb9ssnN         0         3        1         1         2
22  hyEjkAg5K4WpubJA9        3  7CHpY4setLKb9ssnN         0         4        2         0         3
23  hyEjkAg5K4WpubJA9        3  hbud2J3YvitEhj4xZ         1         4        2         0         3
24  ay5pmpeNcwqHJ8JBH        1  tWA9ZxSnKpZyWwYsQ         1         2      NaN       NaN       NaN
25  ay5pmpeNcwqHJ8JBH        1  2qiHdJgL4WQe5qrHQ         1         2      NaN       NaN       NaN
26  ay5pmpeNcwqHJ8JBH        2  2qiHdJgL4WQe5qrHQ         1         3        1         1         2
27  ay5pmpeNcwqHJ8JBH        2  tWA9ZxSnKpZyWwYsQ         1         3        1         1         2
28  ay5pmpeNcwqHJ8JBH        3  tWA9ZxSnKpZyWwYsQ         1         4        2         1         3
29  ay5pmpeNcwqHJ8JBH        3  2qiHdJgL4WQe5qrHQ         1         4        2         1         3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants