BUG: bug in left join on multi-index with sort=True or nulls #9210

Merged: 1 commit merged into pandas-dev:master on Jan 10, 2015

behzadnouri

on master:

In [8]: left
Out[8]:
  1st 2nd  3rd
0   c   c   13
1   b   b   79
2   a   a   27
3   b   b   27
4   c   a   86

In [9]: right
Out[9]:
         4th
1st 2nd
c   a    -86
b   b    -79
c   c    -13
b   b    -27
a   a    -27

sort=True is ignored, and the result is not sorted by the join key:

In [10]: left.join(right, on=['1st', '2nd'], how='left', sort=True)
Out[10]:
  1st 2nd  3rd  4th
0   c   c   13  -13
1   b   b   79  -79
1   b   b   79  -27
2   a   a   27  -27
3   b   b   27  -79
3   b   b   27  -27
4   c   a   86  -86
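
For anyone wanting to cross-check the expected ordering, here is a minimal sketch that rebuilds the two frames above and computes a reference result through merge plus an explicit sort on the join keys. The helper name `left_join_sorted` is made up for illustration and is not part of pandas; it only shows what a left join with sort=True ought to look like.

```python
import pandas as pd

keys = ["1st", "2nd"]

# Same data as In [8] / In [9] above.
left = pd.DataFrame({
    "1st": ["c", "b", "a", "b", "c"],
    "2nd": ["c", "b", "a", "b", "a"],
    "3rd": [13, 79, 27, 27, 86],
})

right = pd.DataFrame(
    {"4th": [-86, -79, -13, -27, -27]},
    index=pd.MultiIndex.from_tuples(
        [("c", "a"), ("b", "b"), ("c", "c"), ("b", "b"), ("a", "a")],
        names=keys,
    ),
)


def left_join_sorted(left, right, on):
    """Hypothetical reference helper: left merge, then sort by the join keys."""
    merged = pd.merge(left, right.reset_index(), on=on, how="left")
    return merged.sort_values(on).reset_index(drop=True)


expected = left_join_sorted(left, right, keys)
print(expected)

# The call this PR is about; with the fix it should come back in key order too.
result = left.join(right, on=keys, how="left", sort=True)
print(result)
```

The reference goes through merge on purpose, since that path does not depend on the multi-index join code being discussed here.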

in addition:

In [44]: left
Out[44]:
   1st  2nd  3rd
0  NaN    a   14
1    a  NaN   10
2    a    b   19
3  NaN  NaN   62
4    a    c   90

In [45]: right
Out[45]:
         4th
1st 2nd
NaN a    -14
a   c    -90
    NaN  -10
    b    -19
NaN NaN  -62

this works:

In [46]: merge(left, right.reset_index(), on=['1st', '2nd'], how='left')
Out[46]:
   1st  2nd  3rd  4th
0  NaN    a   14  -14
1    a  NaN   10  -10
2    a    b   19  -19
3  NaN  NaN   62  -62
4    a    c   90  -90

but this does not:

In [47]: left.join(right, on=['1st', '2nd'], how='left')
Out[47]:
   1st  2nd  3rd  4th
0  NaN    a   14  NaN
1    a  NaN   10  NaN
2    a    b   19  -19
3  NaN  NaN   62  NaN
4    a    c   90  -90
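
A self-contained version of the null-key case, in case it helps with reproducing. This is only a sketch: it rebuilds the frames from In [44] / In [45] and runs both calls side by side. The merge output is the one shown in Out[46]; what the join prints depends on whether the installed pandas includes this fix.

```python
import numpy as np
import pandas as pd

keys = ["1st", "2nd"]

# Same data as In [44] / In [45] above, with NaN in both key columns.
left = pd.DataFrame({
    "1st": [np.nan, "a", "a", np.nan, "a"],
    "2nd": ["a", np.nan, "b", np.nan, "c"],
    "3rd": [14, 10, 19, 62, 90],
})

right = pd.DataFrame(
    {"4th": [-14, -90, -10, -19, -62]},
    index=pd.MultiIndex.from_arrays(
        [[np.nan, "a", "a", "a", np.nan], ["a", "c", np.nan, "b", np.nan]],
        names=keys,
    ),
)

# merge matches null keys against each other
merged = pd.merge(left, right.reset_index(), on=keys, how="left")
print(merged)

# join on the multi-index; on a fixed version this should agree with merge
joined = left.join(right, on=keys, how="left")
print(joined)
```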

also, get_group_index, as called from _get_multiindex_indexer, is subject to integer overflow, and should be avoided there.
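
To make the overflow concern concrete, here is a rough pure-Python sketch of the general idea of flattening per-level integer codes into a single key. It is not the actual get_group_index implementation; `flat_key` and `overflows_int64` are hypothetical names used only for illustration. The point is that the key space grows as the product of the level sizes, so it can exceed what an int64 holds.

```python
INT64_MAX = 2**63 - 1


def flat_key(codes, shape):
    """Combine per-level integer codes into one mixed-radix key (illustrative only)."""
    key = 0
    for code, size in zip(codes, shape):
        key = key * size + code
    return key


def overflows_int64(shape):
    """True if the largest possible flat key for `shape` does not fit in int64."""
    largest = flat_key([n - 1 for n in shape], shape)
    return largest > INT64_MAX


# Two small levels: plenty of room.
print(flat_key([1, 2], [3, 4]), overflows_int64([3, 4]))

# Three levels of 2**21 values each just fit; one more binary level does not.
print(overflows_int64([2**21] * 3), overflows_int64([2**21] * 3 + [2]))
```

Python integers are arbitrary precision, so the sketch can compute the true key and compare it to the int64 limit; fixed-width int64 arithmetic would wrap around instead.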

r 'join|merge' benchmarks:

---------------------------------------------------------------------------------
Test name                                    |  head[ms] |  base[ms] |  ratio   |
---------------------------------------------------------------------------------
join_dataframe_index_multi                   |   35.4387 |   36.5883 |   0.9686 |
join_dataframe_index_single_key_bigger_sort  |   24.7660 |   24.8604 |   0.9962 |
strings_join_split                           |   57.6473 |   57.6183 |   1.0005 |
join_dataframe_index_single_key_small        |   16.6840 |   16.6461 |   1.0023 |
merge_2intkey_sort                           |   61.2427 |   60.5460 |   1.0115 |
join_non_unique_equal                        |    0.9513 |    0.9391 |   1.0130 |
left_outer_join_index                        | 2887.7623 | 2839.2557 |   1.0171 |
i8merge                                      | 1534.6023 | 1506.8540 |   1.0184 |
join_dataframe_index_single_key_bigger       |   25.3410 |   24.8287 |   1.0206 |
merge_2intkey_nosort                         |   21.6643 |   21.2137 |   1.0212 |
join_dataframe_integer_key                   |    3.0307 |    2.9414 |   1.0304 |
join_dataframe_integer_2key                  |    7.7363 |    7.4220 |   1.0423 |
---------------------------------------------------------------------------------
Test name                                    |  head[ms] |  base[ms] |  ratio   |
---------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [f02ef89] : bug in left join on multi-index with sort=True or nulls
Base   [b62754d] : Merge pull request #9206 from robertdavidwest/9203_resubmitted_in_single_commit

9203 SQUASHED - DOCS: doc string edited pandas/core/frame.duplicated()

jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Jan 10, 2015
jreback added this to the 0.16.0 milestone on Jan 10, 2015

jreback commented Jan 10, 2015

any open issues that need to xref this?

@behzadnouri

I could not find one. I noticed the issue when I saw that, on current master, _get_multiindex_indexer does not use the passed-in sort argument and calls into get_group_index without an overflow check.


jreback commented Jan 10, 2015

@behzadnouri ok thanks...just wanted to check

jreback added a commit that referenced this pull request on Jan 10, 2015: BUG: bug in left join on multi-index with sort=True or nulls
jreback merged commit ff090f4 into pandas-dev:master on Jan 10, 2015
behzadnouri deleted the lji branch on January 10, 2015 20:17