Commit bd169dc

Albert Villanova del Moral authored and jreback committed
BUG: Fix index order for Index.intersection()
closes #15582

Author: Albert Villanova del Moral <[email protected]>
Author: Jeff Reback <[email protected]>

Closes #15583 from albertvillanova/fix-15582 and squashes the following commits:

2d4e143 [Albert Villanova del Moral] Fix pytest fixture name collision
64e86a4 [Albert Villanova del Moral] Fix test on right join
73df69e [Albert Villanova del Moral] Address requested changes
8d2e9cc [Albert Villanova del Moral] Address requested changes
968c7f1 [Jeff Reback] DOC/TST: change to use parameterization
9e39794 [Albert Villanova del Moral] Address requested changes
5bf1508 [Albert Villanova del Moral] Address requested changes
654288b [Albert Villanova del Moral] Fix Travis errors
33eb740 [Albert Villanova del Moral] Address requested changes
3c200fe [Albert Villanova del Moral] Add new tests
ef2581e [Albert Villanova del Moral] Fix Travis error
f0d9d03 [Albert Villanova del Moral] Add whatsnew
c96306d [Albert Villanova del Moral] Add sort argument to Index.join
047b513 [Albert Villanova del Moral] Address requested changes
ec836bd [Albert Villanova del Moral] Fix Travis errors
b977278 [Albert Villanova del Moral] Address requested changes
784fe75 [Albert Villanova del Moral] Fix error: line too long
1197b99 [Albert Villanova del Moral] Fix DataFrame column order when read from HDF file
d9e29f8 [Albert Villanova del Moral] Create new DatetimeIndex from the Index.intersection result
e7bcd28 [Albert Villanova del Moral] Fix typo in documentation
a4ead99 [Albert Villanova del Moral] Fix typo
c2a8dc3 [Albert Villanova del Moral] Implement tests
c12bb3f [Albert Villanova del Moral] BUG: Fix index order for Index.intersection()
1 parent 2e64614 commit bd169dc

11 files changed, +309 -137 lines changed

doc/source/whatsnew/v0.20.0.txt

+57
@@ -750,6 +750,62 @@ New Behavior:
       TypeError: Cannot compare 2014-01-01 00:00:00 of
       type <class 'pandas.tslib.Timestamp'> to string column
 
+.. _whatsnew_0200.api_breaking.index_order:
+
+Index.intersection and inner join now preserve the order of the left Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+`:meth:Index.intersection` now preserves the order of the calling ``Index`` (left)
+instead of the other ``Index`` (right) (:issue:`15582`). This affects the inner
+joins (`:meth:DataFrame.join` and `:func:merge`) and the ``.align`` methods.
+
+- ``Index.intersection``
+
+  .. ipython:: python
+
+     left = pd.Index([2, 1, 0])
+     left
+     right = pd.Index([1, 2, 3])
+     right
+
+  Previous Behavior:
+
+  .. code-block:: ipython
+
+     In [4]: left.intersection(right)
+     Out[4]: Int64Index([1, 2], dtype='int64')
+
+  New Behavior:
+
+  .. ipython:: python
+
+     left.intersection(right)
+
+- ``DataFrame.join`` and ``pd.merge``
+
+  .. ipython:: python
+
+     left = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
+     left
+     right = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])
+     right
+
+  Previous Behavior:
+
+  .. code-block:: ipython
+
+     In [4]: left.join(right, how='inner')
+     Out[4]:
+         a    b
+     1  10  100
+     2  20  200
+
+  New Behavior:
+
+  .. ipython:: python
+
+     left.join(right, how='inner')
+
 
 .. _whatsnew_0200.api:
 
@@ -984,6 +1040,7 @@ Bug Fixes
 
 - Bug in ``DataFrame.to_stata()`` and ``StataWriter`` which produces incorrectly formatted files to be produced for some locales (:issue:`13856`)
 - Bug in ``StataReader`` and ``StataWriter`` which allows invalid encodings (:issue:`15723`)
+- Bug with ``sort=True`` in ``DataFrame.join`` and ``pd.merge`` when joining on indexes (:issue:`15582`)
 
 - Bug in ``pd.concat()`` in which concatting with an empty dataframe with ``join='inner'`` was being improperly handled (:issue:`15328`)
 - Bug in ``groupby.agg()`` incorrectly localizing timezone on ``datetime`` (:issue:`15426`, :issue:`10668`, :issue:`13046`)
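
The ordering described in the whatsnew entry above can be checked directly. This is a minimal sketch, assuming a pandas release that includes this fix; the variable names mirror the release note.

```python
import pandas as pd

left = pd.Index([2, 1, 0])
right = pd.Index([1, 2, 3])

# The common elements come back in the order of the calling (left) index,
# not in the order of the argument (right).
result = left.intersection(right)
print(result.tolist())  # [2, 1]
```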

pandas/core/frame.py

+15 -8
@@ -124,10 +124,14 @@
 ----------%s
 right : DataFrame
 how : {'left', 'right', 'outer', 'inner'}, default 'inner'
-    * left: use only keys from left frame (SQL: left outer join)
-    * right: use only keys from right frame (SQL: right outer join)
-    * outer: use union of keys from both frames (SQL: full outer join)
-    * inner: use intersection of keys from both frames (SQL: inner join)
+    * left: use only keys from left frame, similar to a SQL left outer join;
+      preserve key order
+    * right: use only keys from right frame, similar to a SQL right outer join;
+      preserve key order
+    * outer: use union of keys from both frames, similar to a SQL full outer
+      join; sort keys lexicographically
+    * inner: use intersection of keys from both frames, similar to a SQL inner
+      join; preserve the order of the left keys
 on : label or list
     Field names to join on. Must be found in both DataFrames. If on is
     None and not merging on indexes, then it merges on the intersection of
@@ -147,7 +151,8 @@
     Use the index from the right DataFrame as the join key. Same caveats as
     left_index
 sort : boolean, default False
-    Sort the join keys lexicographically in the result DataFrame
+    Sort the join keys lexicographically in the result DataFrame. If False,
+    the order of the join keys depends on the join type (how keyword)
 suffixes : 2-length sequence (tuple, list, ...)
     Suffix to apply to overlapping column names in the left and right
     side, respectively
@@ -4472,16 +4477,18 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
             * left: use calling frame's index (or column if on is specified)
             * right: use other frame's index
             * outer: form union of calling frame's index (or column if on is
-              specified) with other frame's index
+              specified) with other frame's index, and sort it
+              lexicographically
             * inner: form intersection of calling frame's index (or column if
-              on is specified) with other frame's index
+              on is specified) with other frame's index, preserving the order
+              of the calling's one
         lsuffix : string
             Suffix to use from left frame's overlapping columns
         rsuffix : string
             Suffix to use from right frame's overlapping columns
         sort : boolean, default False
             Order result DataFrame lexicographically by the join key. If False,
-            preserves the index order of the calling (left) DataFrame
+            the order of the join key depends on the join type (how keyword)
 
         Notes
         -----
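
The docstring changes above say that, with `sort=False`, the key order depends on the join type. The sketch below illustrates the `inner` and `left` cases on the frames used in the release note, assuming a pandas release that includes this fix. It is an illustration, not part of the commit.

```python
import pandas as pd

left = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
right = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])

# Inner join on the index: keys follow the order of the left frame.
inner = left.join(right, how='inner')

# pd.merge on both indexes goes through the same index join.
merged = pd.merge(left, right, left_index=True, right_index=True,
                  how='inner')

# Left join: the left frame's key order is kept as-is.
left_joined = left.join(right, how='left')

print(inner.index.tolist())        # [2, 1]
print(left_joined.index.tolist())  # [2, 1, 0]
```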

pandas/indexes/base.py

+19 -8
@@ -2089,8 +2089,8 @@ def intersection(self, other):
         """
         Form the intersection of two Index objects.
 
-        This returns a new Index with elements common to the index and `other`.
-        Sortedness of the result is not guaranteed.
+        This returns a new Index with elements common to the index and `other`,
+        preserving the order of the calling index.
 
         Parameters
         ----------
@@ -2128,15 +2128,15 @@ def intersection(self, other):
                 pass
 
         try:
-            indexer = Index(self._values).get_indexer(other._values)
+            indexer = Index(other._values).get_indexer(self._values)
             indexer = indexer.take((indexer != -1).nonzero()[0])
         except:
             # duplicates
-            indexer = Index(self._values).get_indexer_non_unique(
-                other._values)[0].unique()
+            indexer = Index(other._values).get_indexer_non_unique(
+                self._values)[0].unique()
             indexer = indexer[indexer != -1]
 
-        taken = self.take(indexer)
+        taken = other.take(indexer)
         if self.name != other.name:
             taken.name = None
         return taken
@@ -2831,8 +2831,7 @@ def _reindex_non_unique(self, target):
         new_index = self._shallow_copy_with_infer(new_labels, freq=None)
         return new_index, indexer, new_indexer
 
-    def join(self, other, how='left', level=None, return_indexers=False):
-        """
+    _index_shared_docs['join'] = """
         *this is an internal non-public method*
 
         Compute join_index and indexers to conform data
@@ -2844,11 +2843,20 @@ def join(self, other, how='left', level=None, return_indexers=False):
         how : {'left', 'right', 'inner', 'outer'}
         level : int or level name, default None
         return_indexers : boolean, default False
+        sort : boolean, default False
+            Sort the join keys lexicographically in the result Index. If False,
+            the order of the join keys depends on the join type (how keyword)
+
+            .. versionadded:: 0.20.0
 
         Returns
         -------
         join_index, (left_indexer, right_indexer)
         """
+
+    @Appender(_index_shared_docs['join'])
+    def join(self, other, how='left', level=None, return_indexers=False,
+             sort=False):
         from .multi import MultiIndex
         self_is_mi = isinstance(self, MultiIndex)
         other_is_mi = isinstance(other, MultiIndex)
@@ -2929,6 +2937,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
         elif how == 'outer':
             join_index = self.union(other)
 
+        if sort:
+            join_index = join_index.sort_values()
+
         if return_indexers:
             if join_index is self:
                 lindexer = None
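
The swap of `self` and `other` in the indexer lookup is what changes the result order. The sketch below replays both variants with the public `Index.get_indexer` and `Index.take` methods on unique indexes. It is an approximation for illustration only: it leaves out the duplicate-handling branch and the name handling in the real method.

```python
import pandas as pd

left = pd.Index([2, 1, 0])    # plays the role of `self`
right = pd.Index([1, 2, 3])   # plays the role of `other`

# New lookup: find each element of `left` inside `right`, so the indexer
# is laid out in the order of `left`; -1 marks elements that are missing.
indexer = right.get_indexer(left)
indexer = indexer[indexer != -1]
new_result = right.take(indexer)

# Old lookup: find each element of `right` inside `left`, so the result
# follows the order of `right` instead.
old_indexer = left.get_indexer(right)
old_indexer = old_indexer[old_indexer != -1]
old_result = left.take(old_indexer)

# The optional sort step added to Index.join is a plain sort_values() call.
sorted_result = new_result.sort_values()

print(new_result.tolist())     # [2, 1]
print(old_result.tolist())     # [1, 2]
print(sorted_result.tolist())  # [1, 2]
```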

pandas/indexes/range.py

+7 -20
@@ -431,29 +431,16 @@ def union(self, other):
 
         return self._int64index.union(other)
 
-    def join(self, other, how='left', level=None, return_indexers=False):
-        """
-        *this is an internal non-public method*
-
-        Compute join_index and indexers to conform data
-        structures to the new index.
-
-        Parameters
-        ----------
-        other : Index
-        how : {'left', 'right', 'inner', 'outer'}
-        level : int or level name, default None
-        return_indexers : boolean, default False
-
-        Returns
-        -------
-        join_index, (left_indexer, right_indexer)
-        """
+    @Appender(_index_shared_docs['join'])
+    def join(self, other, how='left', level=None, return_indexers=False,
+             sort=False):
         if how == 'outer' and self is not other:
             # note: could return RangeIndex in more circumstances
-            return self._int64index.join(other, how, level, return_indexers)
+            return self._int64index.join(other, how, level, return_indexers,
+                                         sort)
 
-        return super(RangeIndex, self).join(other, how, level, return_indexers)
+        return super(RangeIndex, self).join(other, how, level, return_indexers,
+                                            sort)
 
     def __len__(self):
         """

pandas/io/pytables.py

+1 -1
@@ -4321,7 +4321,7 @@ def _reindex_axis(obj, axis, labels, other=None):
 
     labels = _ensure_index(labels.unique())
     if other is not None:
-        labels = labels & _ensure_index(other.unique())
+        labels = _ensure_index(other.unique()) & labels
     if not labels.equals(ax):
         slicer = [slice(None, None)] * obj.ndim
         slicer[axis] = labels
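
Because the intersection now follows its left operand, swapping the operands changes which side dictates the label order. The commit message ties this swap to the DataFrame column order when reading from an HDF file. A small sketch with made-up labels, using `Index.intersection` rather than the `&` operator, on the assumption that the operator form is not available as a set operation in current pandas releases:

```python
import pandas as pd

# Made-up labels for illustration; not taken from the commit.
labels = pd.Index(['c', 'a', 'b'])
other = pd.Index(['b', 'a', 'd'])

# Order dictated by `labels` (left operand).
by_labels = labels.intersection(other)

# Order dictated by `other` (left operand after the swap).
by_other = other.intersection(labels)

print(by_labels.tolist())  # ['a', 'b']
print(by_other.tolist())   # ['b', 'a']
```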

pandas/tests/frame/test_join.py

+140
@@ -0,0 +1,140 @@
+# -*- coding: utf-8 -*-
+
+import pytest
+import numpy as np
+
+from pandas import DataFrame, Index
+from pandas.tests.frame.common import TestData
+import pandas.util.testing as tm
+
+
+@pytest.fixture
+def frame():
+    return TestData().frame
+
+
+@pytest.fixture
+def left():
+    return DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
+
+
+@pytest.fixture
+def right():
+    return DataFrame({'b': [300, 100, 200]}, index=[3, 1, 2])
+
+
+@pytest.mark.parametrize(
+    "how, sort, expected",
+    [('inner', False, DataFrame({'a': [20, 10],
+                                 'b': [200, 100]},
+                                index=[2, 1])),
+     ('inner', True, DataFrame({'a': [10, 20],
+                                'b': [100, 200]},
+                               index=[1, 2])),
+     ('left', False, DataFrame({'a': [20, 10, 0],
+                                'b': [200, 100, np.nan]},
+                               index=[2, 1, 0])),
+     ('left', True, DataFrame({'a': [0, 10, 20],
+                               'b': [np.nan, 100, 200]},
+                              index=[0, 1, 2])),
+     ('right', False, DataFrame({'a': [np.nan, 10, 20],
+                                 'b': [300, 100, 200]},
+                                index=[3, 1, 2])),
+     ('right', True, DataFrame({'a': [10, 20, np.nan],
+                                'b': [100, 200, 300]},
+                               index=[1, 2, 3])),
+     ('outer', False, DataFrame({'a': [0, 10, 20, np.nan],
+                                 'b': [np.nan, 100, 200, 300]},
+                                index=[0, 1, 2, 3])),
+     ('outer', True, DataFrame({'a': [0, 10, 20, np.nan],
+                                'b': [np.nan, 100, 200, 300]},
+                               index=[0, 1, 2, 3]))])
+def test_join(left, right, how, sort, expected):
+
+    result = left.join(right, how=how, sort=sort)
+    tm.assert_frame_equal(result, expected)
+
+
+def test_join_index(frame):
+    # left / right
+
+    f = frame.loc[frame.index[:10], ['A', 'B']]
+    f2 = frame.loc[frame.index[5:], ['C', 'D']].iloc[::-1]
+
+    joined = f.join(f2)
+    tm.assert_index_equal(f.index, joined.index)
+    expected_columns = Index(['A', 'B', 'C', 'D'])
+    tm.assert_index_equal(joined.columns, expected_columns)
+
+    joined = f.join(f2, how='left')
+    tm.assert_index_equal(joined.index, f.index)
+    tm.assert_index_equal(joined.columns, expected_columns)
+
+    joined = f.join(f2, how='right')
+    tm.assert_index_equal(joined.index, f2.index)
+    tm.assert_index_equal(joined.columns, expected_columns)
+
+    # inner
+
+    joined = f.join(f2, how='inner')
+    tm.assert_index_equal(joined.index, f.index[5:10])
+    tm.assert_index_equal(joined.columns, expected_columns)
+
+    # outer
+
+    joined = f.join(f2, how='outer')
+    tm.assert_index_equal(joined.index, frame.index.sort_values())
+    tm.assert_index_equal(joined.columns, expected_columns)
+
+    tm.assertRaisesRegexp(ValueError, 'join method', f.join, f2, how='foo')
+
+    # corner case - overlapping columns
+    for how in ('outer', 'left', 'inner'):
+        with tm.assertRaisesRegexp(ValueError, 'columns overlap but '
+                                   'no suffix'):
+            frame.join(frame, how=how)
+
+
+def test_join_index_more(frame):
+    af = frame.loc[:, ['A', 'B']]
+    bf = frame.loc[::2, ['C', 'D']]
+
+    expected = af.copy()
+    expected['C'] = frame['C'][::2]
+    expected['D'] = frame['D'][::2]
+
+    result = af.join(bf)
+    tm.assert_frame_equal(result, expected)
+
+    result = af.join(bf, how='right')
+    tm.assert_frame_equal(result, expected[::2])
+
+    result = bf.join(af, how='right')
+    tm.assert_frame_equal(result, expected.loc[:, result.columns])
+
+
+def test_join_index_series(frame):
+    df = frame.copy()
+    s = df.pop(frame.columns[-1])
+    joined = df.join(s)
+
+    # TODO should this check_names ?
+    tm.assert_frame_equal(joined, frame, check_names=False)
+
+    s.name = None
+    tm.assertRaisesRegexp(ValueError, 'must have a name', df.join, s)
+
+
+def test_join_overlap(frame):
+    df1 = frame.loc[:, ['A', 'B', 'C']]
+    df2 = frame.loc[:, ['B', 'C', 'D']]
+
+    joined = df1.join(df2, lsuffix='_df1', rsuffix='_df2')
+    df1_suf = df1.loc[:, ['B', 'C']].add_suffix('_df1')
+    df2_suf = df2.loc[:, ['B', 'C']].add_suffix('_df2')
+
+    no_overlap = frame.loc[:, ['A', 'D']]
+    expected = df1_suf.join(df2_suf).join(no_overlap)
+
+    # column order not necessarily sorted
+    tm.assert_frame_equal(joined, expected.loc[:, joined.columns])