Skip to content

BUG: Fix index order for Index.intersection() #15583

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 23 commits into from
Closed
Show file tree
Hide file tree
Changes from 19 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
64 changes: 64 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -738,6 +738,69 @@ New Behavior:
TypeError: Cannot compare 2014-01-01 00:00:00 of
type <class 'pandas.tslib.Timestamp'> to string column

.. _whatsnew_0200.api_breaking.index_order:

Index order after inner join due to Index intersection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the title about what you have in the first sentence

Index.intersectio now preserves the order of the left Index

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would include 'join' in the title as well. Eg Index.intersection and inner join now preserve the order of the left Index

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``Index.intersection`` now preserves the order of the calling Index (left)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say what this did previously. No need to mention Index.join or DataFrame.merge and say .align methods. simpler is better.

instead of the other Index (right) (:issue:`15582`). This affects the inner
joins (methods ``DataFrame.join`` and ``pd.merge``) and the .align methods.

- ``Index.intersection``

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove refernces to Index.join

.. ipython:: python

idx1 = pd.Index([2, 1, 0])
Copy link
Contributor

@jreback jreback Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you name these left_index & right_index (for clarity)

idx1
idx2 = pd.Index([1, 2, 3])
idx2

Previous Behavior:

.. code-block:: ipython

In [4]: idx1.intersection(idx2)
Out[4]: Int64Index([1, 2], dtype='int64')

New Behavior:

.. ipython:: python

idx1.intersection(idx2)

- ``DataFrame.join`` and ``pd.merge``

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove references to DataFrame.merge

.. ipython:: python

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you name these left and right

df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
df1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would only show .join here. a merge on the indexes is by-definition a join.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: is the inner join of merge where the keys are columns also affected?

If so, I would include the merge example, but by joining on a common column instead of with left/right_index=True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the inner join on columns is not affected. Indeed it was already working as expected, preserving the order of the left column.

df2 = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])
df2

Previous Behavior:

.. code-block:: ipython

In [4]: df1.join(df2, how='inner')
Out[4]:
a b
1 10 100
2 20 200

In [5]: pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
Out[5]:
a b
1 10 100
2 20 200

New Behavior:

.. ipython:: python

df1.join(df2, how='inner')
pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is needed to show the result of both pd.merge(df1, df2) and df1.merge(df2) (you mentioned them all in the text)

(BTW, you don't need to put those empty lines between the different input lines, that has no impact on the result after building the docs)



.. _whatsnew_0200.api:

Expand Down Expand Up @@ -960,6 +1023,7 @@ Bug Fixes

- Bug in ``DataFrame.to_stata()`` and ``StataWriter`` which produces incorrectly formatted files to be produced for some locales (:issue:`13856`)
- Bug in ``StataReader`` and ``StataWriter`` which allows invalid encodings (:issue:`15723`)
- Bug with ``sort=True`` in ``DataFrame.join`` and ``pd.merge`` when joining on indexes (:issue:`15582`)

- Bug in ``pd.concat()`` in which concatting with an empty dataframe with ``join='inner'`` was being improperly handled (:issue:`15328`)
- Bug in ``groupby.agg()`` incorrectly localizing timezone on ``datetime`` (:issue:`15426`, :issue:`10668`, :issue:`13046`)
Expand Down
14 changes: 9 additions & 5 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -127,7 +127,8 @@
* left: use only keys from left frame (SQL: left outer join)
* right: use only keys from right frame (SQL: right outer join)
* outer: use union of keys from both frames (SQL: full outer join)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the resulting order here? Also sorted as for join?

* inner: use intersection of keys from both frames (SQL: inner join)
* inner: use intersection of keys from both frames (SQL: inner join),
preserving the order of the left keys
on : label or list
Field names to join on. Must be found in both DataFrames. If on is
None and not merging on indexes, then it merges on the intersection of
Expand All @@ -147,7 +148,8 @@
Use the index from the right DataFrame as the join key. Same caveats as
left_index
sort : boolean, default False
Sort the join keys lexicographically in the result DataFrame
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword)
suffixes : 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right
side, respectively
Expand Down Expand Up @@ -4463,16 +4465,18 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
* left: use calling frame's index (or column if on is specified)
* right: use other frame's index
* outer: form union of calling frame's index (or column if on is
specified) with other frame's index
specified) with other frame's index, and sort it
lexicographically
* inner: form intersection of calling frame's index (or column if
on is specified) with other frame's index
on is specified) with other frame's index, preserving the order
of the calling's one
lsuffix : string
Suffix to use from left frame's overlapping columns
rsuffix : string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe leave something as "If False, index order depends on the join type (how keyword)" ?

Suffix to use from right frame's overlapping columns
sort : boolean, default False
Order result DataFrame lexicographically by the join key. If False,
preserves the index order of the calling (left) DataFrame
the order of the join key depends on the join type (how keyword)

Notes
-----
Expand Down
25 changes: 17 additions & 8 deletions pandas/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2089,8 +2089,8 @@ def intersection(self, other):
"""
Form the intersection of two Index objects.

This returns a new Index with elements common to the index and `other`.
Sortedness of the result is not guaranteed.
This returns a new Index with elements common to the index and `other`,
preserving the order of the calling index.

Parameters
----------
Expand Down Expand Up @@ -2128,15 +2128,15 @@ def intersection(self, other):
pass

try:
indexer = Index(self._values).get_indexer(other._values)
indexer = Index(other._values).get_indexer(self._values)
indexer = indexer.take((indexer != -1).nonzero()[0])
except:
# duplicates
indexer = Index(self._values).get_indexer_non_unique(
other._values)[0].unique()
indexer = Index(other._values).get_indexer_non_unique(
self._values)[0].unique()
indexer = indexer[indexer != -1]

taken = self.take(indexer)
taken = other.take(indexer)
if self.name != other.name:
taken.name = None
return taken
Expand Down Expand Up @@ -2831,8 +2831,7 @@ def _reindex_non_unique(self, target):
new_index = self._shallow_copy_with_infer(new_labels, freq=None)
return new_index, indexer, new_indexer

def join(self, other, how='left', level=None, return_indexers=False):
"""
_index_shared_docs['join'] = """
*this is an internal non-public method*

Compute join_index and indexers to conform data
Expand All @@ -2844,11 +2843,18 @@ def join(self, other, how='left', level=None, return_indexers=False):
how : {'left', 'right', 'inner', 'outer'}
level : int or level name, default None
return_indexers : boolean, default False
sort : boolean, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded tag


.. versionadded:: 0.20.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a sentence explaining what sort does (I know its repeatative, but this is a different part of the code)


Returns
-------
join_index, (left_indexer, right_indexer)
"""

@Appender(_index_shared_docs['join'])
def join(self, other, how='left', level=None, return_indexers=False,
sort=False):
from .multi import MultiIndex
self_is_mi = isinstance(self, MultiIndex)
other_is_mi = isinstance(other, MultiIndex)
Expand Down Expand Up @@ -2929,6 +2935,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
elif how == 'outer':
join_index = self.union(other)

if sort:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should only be on an outer join right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit there is an ambiguity here.

By default (sort=False) the result index values are not sorted, except for the outer join. This is because Index.union sorts the result index values.

If we set sort=True, the result index values are sorted for all types of join (although this is redundant for the outer join, as it is already sorted).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Is this the expected result?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. can you show a small example with this both turned on and off (for each of the how's). Sorry this is a bit of work, but this is a bit of unexplored API area

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In file tests/frame/test_join.py I have created 2 methods test_join and test_join_sort where I test the joined result for each of the 4 values of how, for default sort=False and sort=True, respectively. Is that OK for you? Or would you need further examples?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't answer my question though, isn't sort ignored for all but inner join? If so then I would set it False (to make the code clear for the others).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, sort=True is only ignored for outer join, because the outer join always sorts the result index. For the other types of join (left, right, inner), the default result is unsorted, as it preserves the left index order (for left and inner) or the right index order (for right), which might not be previously sorted.

By setting sort=True, we force the resulting index order to be sorted, even for the cases of left, right and inner joins.

What do you mean by I would set it False? The parameter sort is set False by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albertvillanova what I mean is that reading the code it is very confusing and not obvious how sort interacts with how.

What I mean is I would explicity set sort=False (and maybe a comment) like this:

        if how == 'left':
            join_index = self
            # sort=True is ignored? if so, then set sort=False
       elif how == 'right':
            join_index = other
           # same
        elif how == 'inner':
            join_index = self.intersection(other)
        elif how == 'outer':
            join_index = self.union(other)
            # or could just sort=False
            assert sort is False, "outer join must explicty have sort=False"

        if sort:
            join_index = join_index.sort_values()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback There is not really an interaction between sort and how. It is just that when sort=False (the default) not additional sorting is done, and the order of the result depends on the type of join. When you set sort=True, you explicitly sort. And that regard, I find the current code of

if sort:
    join_index = join_index.sort_values()

very clear.
The only thing is that when how='outer', you would have sorted twice (as join_index is already sorted at this point in that case). But I wouldn't care about that, since it is not the default, and you can just not set sort to True to prevent that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Great explanation.

join_index = join_index.sort_values()

if return_indexers:
if join_index is self:
lindexer = None
Expand Down
27 changes: 7 additions & 20 deletions pandas/indexes/range.py
Original file line number Diff line number Diff line change
Expand Up @@ -431,29 +431,16 @@ def union(self, other):

return self._int64index.union(other)

def join(self, other, how='left', level=None, return_indexers=False):
"""
*this is an internal non-public method*

Compute join_index and indexers to conform data
structures to the new index.

Parameters
----------
other : Index
how : {'left', 'right', 'inner', 'outer'}
level : int or level name, default None
return_indexers : boolean, default False

Returns
-------
join_index, (left_indexer, right_indexer)
"""
@Appender(_index_shared_docs['join'])
def join(self, other, how='left', level=None, return_indexers=False,
sort=False):
if how == 'outer' and self is not other:
# note: could return RangeIndex in more circumstances
return self._int64index.join(other, how, level, return_indexers)
return self._int64index.join(other, how, level, return_indexers,
sort)

return super(RangeIndex, self).join(other, how, level, return_indexers)
return super(RangeIndex, self).join(other, how, level, return_indexers,
sort)

def __len__(self):
"""
Expand Down
2 changes: 1 addition & 1 deletion pandas/io/pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -4321,7 +4321,7 @@ def _reindex_axis(obj, axis, labels, other=None):

labels = _ensure_index(labels.unique())
if other is not None:
labels = labels & _ensure_index(other.unique())
labels = _ensure_index(other.unique()) & labels
if not labels.equals(ax):
slicer = [slice(None, None)] * obj.ndim
slicer[axis] = labels
Expand Down
140 changes: 140 additions & 0 deletions pandas/tests/frame/test_join.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
# -*- coding: utf-8 -*-

import pytest
import numpy as np

from pandas import DataFrame, Index
from pandas.tests.frame.common import TestData
import pandas.util.testing as tm


@pytest.fixture
def frame():
return TestData().frame


@pytest.fixture
def df1():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change df1->left and df2->right

return DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])


@pytest.fixture
def df2():
return DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])


@pytest.mark.parametrize(
"how, sort, expected",
[('inner', False, DataFrame({'a': [20, 10],
'b': [200, 100]},
index=[2, 1])),
('inner', True, DataFrame({'a': [10, 20],
'b': [100, 200]},
index=[1, 2])),
('left', False, DataFrame({'a': [20, 10, 0],
'b': [200, 100, np.nan]},
index=[2, 1, 0])),
('left', True, DataFrame({'a': [0, 10, 20],
'b': [np.nan, 100, 200]},
index=[0, 1, 2])),
('right', False, DataFrame({'a': [10, 20, np.nan],
'b': [100, 200, 300]},
index=[1, 2, 3])),
('right', True, DataFrame({'a': [10, 20, np.nan],
'b': [100, 200, 300]},
index=[1, 2, 3])),
('outer', False, DataFrame({'a': [0, 10, 20, np.nan],
'b': [np.nan, 100, 200, 300]},
index=[0, 1, 2, 3])),
('outer', True, DataFrame({'a': [0, 10, 20, np.nan],
'b': [np.nan, 100, 200, 300]},
index=[0, 1, 2, 3]))])
def test_join(df1, df2, how, sort, expected):

result = df1.join(df2, how=how, sort=sort)
tm.assert_frame_equal(result, expected)


def test_join_index(frame):
# left / right

f = frame.loc[frame.index[:10], ['A', 'B']]
f2 = frame.loc[frame.index[5:], ['C', 'D']].iloc[::-1]

joined = f.join(f2)
tm.assert_index_equal(f.index, joined.index)
expected_columns = Index(['A', 'B', 'C', 'D'])
tm.assert_index_equal(joined.columns, expected_columns)

joined = f.join(f2, how='left')
tm.assert_index_equal(joined.index, f.index)
tm.assert_index_equal(joined.columns, expected_columns)

joined = f.join(f2, how='right')
tm.assert_index_equal(joined.index, f2.index)
tm.assert_index_equal(joined.columns, expected_columns)

# inner

joined = f.join(f2, how='inner')
tm.assert_index_equal(joined.index, f.index[5:10])
tm.assert_index_equal(joined.columns, expected_columns)

# outer

joined = f.join(f2, how='outer')
tm.assert_index_equal(joined.index, frame.index.sort_values())
tm.assert_index_equal(joined.columns, expected_columns)

tm.assertRaisesRegexp(ValueError, 'join method', f.join, f2, how='foo')

# corner case - overlapping columns
for how in ('outer', 'left', 'inner'):
with tm.assertRaisesRegexp(ValueError, 'columns overlap but '
'no suffix'):
frame.join(frame, how=how)


def test_join_index_more(frame):
af = frame.loc[:, ['A', 'B']]
bf = frame.loc[::2, ['C', 'D']]

expected = af.copy()
expected['C'] = frame['C'][::2]
expected['D'] = frame['D'][::2]

result = af.join(bf)
tm.assert_frame_equal(result, expected)

result = af.join(bf, how='right')
tm.assert_frame_equal(result, expected[::2])

result = bf.join(af, how='right')
tm.assert_frame_equal(result, expected.loc[:, result.columns])


def test_join_index_series(frame):
df = frame.copy()
s = df.pop(frame.columns[-1])
joined = df.join(s)

# TODO should this check_names ?
tm.assert_frame_equal(joined, frame, check_names=False)

s.name = None
tm.assertRaisesRegexp(ValueError, 'must have a name', df.join, s)


def test_join_overlap(frame):
df1 = frame.loc[:, ['A', 'B', 'C']]
df2 = frame.loc[:, ['B', 'C', 'D']]

joined = df1.join(df2, lsuffix='_df1', rsuffix='_df2')
df1_suf = df1.loc[:, ['B', 'C']].add_suffix('_df1')
df2_suf = df2.loc[:, ['B', 'C']].add_suffix('_df2')

no_overlap = frame.loc[:, ['A', 'D']]
expected = df1_suf.join(df2_suf).join(no_overlap)

# column order not necessarily sorted
tm.assert_frame_equal(joined, expected.loc[:, joined.columns])
Loading