Skip to content

BUG: Fix index order for Index.intersection() #15583

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 23 commits into from

Conversation

albertvillanova
Copy link
Contributor

@albertvillanova albertvillanova commented Mar 5, 2017

@jreback
Copy link
Contributor

jreback commented Mar 5, 2017

not sure what you are attempting to do here. This needs quite a few tests to even flesh out what is going on.

@albertvillanova
Copy link
Contributor Author

I have implemented/updated some tests to demonstrate that the bug is at Index.intersection(other).

Before this was returning the intersection of self and other preserving the order of other instead of preserving the order of self.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Mar 7, 2017
@codecov-io
Copy link

codecov-io commented Mar 12, 2017

Codecov Report

Merging #15583 into master will decrease coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15583      +/-   ##
==========================================
- Coverage   91.01%   90.96%   -0.06%     
==========================================
  Files         143      143              
  Lines       49400    49429      +29     
==========================================
- Hits        44963    44961       -2     
- Misses       4437     4468      +31
Impacted Files Coverage Δ
pandas/core/frame.py 97.56% <ø> (-0.4%) ⬇️
pandas/tools/merge.py 93.68% <ø> (ø) ⬆️
pandas/tseries/index.py 95.42% <100%> (+0.02%) ⬆️
pandas/indexes/range.py 92.11% <100%> (+0.02%) ⬆️
pandas/io/pytables.py 93.06% <100%> (ø) ⬆️
pandas/indexes/base.py 96.14% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/sparse/frame.py 94.26% <0%> (-2.43%) ⬇️
pandas/core/categorical.py 95.85% <0%> (-1.05%) ⬇️
pandas/io/json/normalize.py 96.93% <0%> (-0.98%) ⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d3554c...2d4e143. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual changes in the intersection method look good to me.
Can you add a whatsnew notice?

result = self._simple_new(result._values, name=result.name,
tz=result.tz)
if result.freq is None:
result.offset = to_offset(result.inferred_freq)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related to the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as the simple result of Index.intersection inherits all attributes from other, and this is not the expected result. Therefore, it is necessary to create a new DatetimeIndex using only three of the Index.intersection attributes: _values, name and tz.


joined = f.join(f2)
self.assert_index_equal(f.index, joined.index)
self.assertEqual(len(joined.columns), 4)
expected_columns = pd.Index(['A', 'B', 'C', 'D'])
self.assert_index_equal(joined.columns, expected_columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that this is more explicitly tests is good! (instead of only the length)
But just for my understanding, is the behaviour changed for this specific case by the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, for the cases of 'left' (default), 'right' and 'outer' joins, the behavior is the same before and after the PR. Only for the 'inner' join (which uses the Index.intersection), the PR fixes the bug in the result.index order: before the PR, the index order was the one of the other, and after the PR the index order is the one in self.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was talking about the columns, not index. I think the columns of the result are not influenced by inner vs outer join?


joined = f.join(f2, how='right')
self.assert_index_equal(joined.index, f2.index)
self.assertEqual(len(joined.columns), 4)
self.assert_index_equal(joined.columns, expected_columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of a 'right' join, you check that the index is equal to the right frame (f2). This sounds the most logical to do, but, then the docs should be updated? (as the docs say the order of the "calling (left) DataFrame" is preserved).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed that sentence from the doc.


# inner

f = self.frame.reindex(columns=['A', 'B'])[:10]
f2 = self.frame.reindex(columns=['C', 'D'])

joined = f.join(f2, how='inner')
self.assert_index_equal(joined.index, f.index.intersection(f2.index))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this test more explicit by not using intersection? As this was not catched before because intersection itself was not working correctly (I know you fixed both now at the same time, but maybe good to make the tests independent)

joined = f.join(f2, how='outer')
self.assertTrue(tm.equalContents(self.frame.index, joined.index))
self.assertEqual(len(joined.columns), 4)
self.assert_index_equal(joined.index, self.frame.index.sort_values())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always the sorted index? Again, it is not fully clear what to expect based on the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for the 'outer' join, the return DataFrame index is sorted. This is because DataFrame.join uses Index.join for the DataFrame indexes, and the latter uses Index.union when how='outer', which sorts the result Index. Therefore, I agree with you that this should be clarified in the doc.

@jorisvandenbossche
Copy link
Member

A more general question is the notice of "Sortedness of the result is not guaranteed." in the Index.intersection docstring. Whether we should keep that or not.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a sub-section on the whatsnew api-breaking changes. this need an old/new example

@@ -61,37 +61,32 @@ def test_join_index(self):
# left / right

f = self.frame.reindex(columns=['A', 'B'])[:10]
f2 = self.frame.reindex(columns=['C', 'D'])
f2 = self.frame.reindex(columns=['C', 'D'])[5:][::-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use .iloc or .loc here more idiomatic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I use .ix instead? I ask this because the indexing is of both types, label and integer positional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no ix is deprecated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can easily indexing like this:

self.frame.loc[self.frame.index[5::-1], ['C', 'D']]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://pandas-docs.github.io/pandas-docs-travis/indexing.html#ix-indexer-is-deprecated

you are looking at released docs

ix depreciation is not released yet - it's in 0.20.0

expected = Index([5, 3, 4], name='idx')
result = idx1.intersection(idx2)
self.assert_index_equal(result, expected)
self.assertEqual(result.name, expected.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to compare names as assert_index_equal already does this

expected = Index([5, 3, 4], name=None)
result = idx1.intersection(idx2)
self.assert_index_equal(result, expected)
self.assertEqual(result.name, expected.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

if isinstance(result, DatetimeIndex):
if result.freq is None:
result.offset = to_offset(result.inferred_freq)
result = self._simple_new(result._values, name=result.name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use _shallow_copy rather than _simple_new

Copy link
Contributor Author

@albertvillanova albertvillanova Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I use _shallow_copy, the result will inherit all self attributes (e.g., freq) and this might not be the expected result. For example, this test would fail:

base = date_range('6/1/2000', '6/30/2000', freq='D', name='idx')

rng4 = date_range('7/1/2000', '7/31/2000', freq='D', name='idx') 
expected4 = DatetimeIndex([], name='idx')

because result would inherit self freq = 'D'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you misunderstand, simply pass the new freq

result = self._shallow_copy(...., freq=....)

We don't use _simple_new at all (outside of a few specific places), this is not one of them. This is all documented next to these functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might I simply use the constructor DatetimeIndex()? I ask this because I think the result freq is not known in advance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, use _shallow_copy with freq=None, which will not infer it, and then you can reset the freq after

@jreback
Copy link
Contributor

jreback commented Mar 13, 2017

also add the explicit example from the issue. ideally add some more tests in pandas/tests/tools/test_join .

I would actually like to either:

  • separate out the join tests from frame/test_misc_api -> frame/tests/test_join.py
  • move the join tests to tools/test_join where the rest of the join tests are.

Its not imperative that we do this here. but could do as a followup. (or a pre-PR alternatively)

@albertvillanova
Copy link
Contributor Author

I have a question @jreback about the whatsnew notice: is this PR introducing an API-breaking change? I thought I was just fixing a bug about Index order after intersection. The expected bahavior of the intersection is analogue to:

a1.intersection(a2) = [item for item in a1 if item in a2]

thus preserving the order of the caller a1.

@jorisvandenbossche
Copy link
Member

I don't think this needs a subsection in the API changes section. I would regard this as a bug fix (certainly for join).

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

I disagree. This is most certainly an API change. For sure we need a sub-section.

@albertvillanova
Copy link
Contributor Author

I have addressed the following requested changes:

  • I have updated the docs
  • I have used _shallow_copy
  • I have removed redundant assertEqual
  • I have implemented the test example from the Issue
  • I have moved the join tests from tests/frame/test_misc_api.py to tests/frame/test_join.py, because these test the DataFrame join method. I think that in tests/tools it should be implemented only the tests using pd.merge.
  • The only thing I have not done yet is the whatsnew notice

@jorisvandenbossche
Copy link
Member

OK to add the intersection order change to the API changes section, but just as a bullet pointin "Other API Changes"? @jreback can you explain why you think this needs a subsection? Do you think people rely on the order? (The docstring also says that the order should not be relied upon)

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

@jorisvandenbossche any non-trivial change needs an explanation. Sure this is a 'bug-fix'. But if I have code that relied upon the ordering (rightly or wrongly), I would for sure want to know. This is very subtle. You can never over-document whatsnew IMHO.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!
Added small comments to the doc updates

@@ -4485,16 +4485,15 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
* left: use calling frame's index (or column if on is specified)
* right: use other frame's index
* outer: form union of calling frame's index (or column if on is
specified) with other frame's index
specified) with other frame's index, and sort it
* inner: form intersection of calling frame's index (or column if
on is specified) with other frame's index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also note for inner what the order will be? (because you could also think this will be sorted)

* inner: form intersection of calling frame's index (or column if
on is specified) with other frame's index
lsuffix : string
Suffix to use from left frame's overlapping columns
rsuffix : string
Suffix to use from right frame's overlapping columns
sort : boolean, default False
Order result DataFrame lexicographically by the join key. If False,
preserves the index order of the calling (left) DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe leave something as "If False, index order depends on the join type (how keyword)" ?

assert_frame_equal)


class TestDataFrameJoin(tm.TestCase, TestData):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you are writing new tests, pls use the pytest style. don't inherit from tm.TestCase), so don't use self. for anything testing related

e.g. just use

assert_frame_equal and such

from pandas.tests.frame.common import TestData

import pandas.util.testing as tm
from pandas.util.testing import (assertRaisesRegexp,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually prefer NOT to import the testing routines and instead use tm.assert_frame_equal, but as long as its consistent one way of the other is ok.


New Behavior:

.. code-block:: ipython
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ipython:: here (on the new).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I have a naive question: what is the difference between code-block:: ipython and ipython:: python and why should the former used for previous behavior and the latter for new? Is there a way to see this txt file parsed to check the format? Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can build the docs here: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#how-to-build-the-pandas-documentation

http://www.sphinx-doc.org/en/stable/markup/code.html

code-block shows code literally, while ipython actually executes it and displays the results.

So we almost always want to actually execute the code, but a previous behaviour by-definition is not current code, so we show a snippet that is executed previously (meaning you run it in 0.19.2 and paste the output here).

Previous Behavior:

.. code-block:: ipython
In [2]: df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put these 2 in an ipython block at the top (and then show them, e.g.

.. ipython:: python

   df1 = pd.DataFrame(....)
   df1
   df2 = pd.DataFrame(...)
   df2

Previous Behaviour:

.. code-block:: ipython

     # show this part
     In [4]: df1.join(....)


New Behaviour:

.. ipython:: python

    df1.join(....)

@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
elif how == 'outer':
join_index = self.union(other)

if sort:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should only be on an outer join right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit there is an ambiguity here.

By default (sort=False) the result index values are not sorted, except for the outer join. This is because Index.union sorts the result index values.

If we set sort=True, the result index values are sorted for all types of join (although this is redundant for the outer join, as it is already sorted).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Is this the expected result?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. can you show a small example with this both turned on and off (for each of the how's). Sorry this is a bit of work, but this is a bit of unexplored API area

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In file tests/frame/test_join.py I have created 2 methods test_join and test_join_sort where I test the joined result for each of the 4 values of how, for default sort=False and sort=True, respectively. Is that OK for you? Or would you need further examples?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't answer my question though, isn't sort ignored for all but inner join? If so then I would set it False (to make the code clear for the others).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, sort=True is only ignored for outer join, because the outer join always sorts the result index. For the other types of join (left, right, inner), the default result is unsorted, as it preserves the left index order (for left and inner) or the right index order (for right), which might not be previously sorted.

By setting sort=True, we force the resulting index order to be sorted, even for the cases of left, right and inner joins.

What do you mean by I would set it False? The parameter sort is set False by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albertvillanova what I mean is that reading the code it is very confusing and not obvious how sort interacts with how.

What I mean is I would explicity set sort=False (and maybe a comment) like this:

        if how == 'left':
            join_index = self
            # sort=True is ignored? if so, then set sort=False
       elif how == 'right':
            join_index = other
           # same
        elif how == 'inner':
            join_index = self.intersection(other)
        elif how == 'outer':
            join_index = self.union(other)
            # or could just sort=False
            assert sort is False, "outer join must explicty have sort=False"

        if sort:
            join_index = join_index.sort_values()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback There is not really an interaction between sort and how. It is just that when sort=False (the default) not additional sorting is done, and the order of the result depends on the type of join. When you set sort=True, you explicitly sort. And that regard, I find the current code of

if sort:
    join_index = join_index.sort_values()

very clear.
The only thing is that when how='outer', you would have sorted twice (as join_index is already sorted at this point in that case). But I wouldn't care about that, since it is not the default, and you can just not set sort to True to prevent that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Great explanation.

@@ -4485,16 +4485,17 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
* left: use calling frame's index (or column if on is specified)
* right: use other frame's index
* outer: form union of calling frame's index (or column if on is
specified) with other frame's index
specified) with other frame's index, and sort it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a little bit of expl here

Further I am pretty sure the pd.merge, Series.join, DataFrame.merge doc-string needs to be updated as well (as these all go thru merge for the joins we are talking about) (they could also go thru concat for certain ones, but they are not relevant here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right! This also affects DataFrame.merge, pd.merge and Series.align. Their docstrings need to be updated. And also the API-breaking whatsnew notice!

df2 = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])

result = df1.join(df2)
expected = pd.DataFrame({'a': [20, 10, 0], 'b': [200, 100, None]},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't put None when we really use np.nan.

index=[0, 1, 2, 3])
tm.assert_frame_equal(result, expected)

def test_join_index(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these tests on index should be in tests/pandas/tests/indexes/test_base.py. These are done pretty generically, follow the existing pattern (also wouldn't hurt to move these tests and your new ones to a new test class).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only new method I have created is test_join. All the other methods were previously in tests/frame/test_misc_api.pyand I just moved them to this file I have created tests/frame/test_join.py.

Despite their misleading names, the methods test_join_index and a test_join_index_more do not directly test any Index method, but the DataFrame.join functionality and the index and columns in the result DataFrame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that's fine then.

@@ -609,14 +609,14 @@ def test_intersection(self):
# non monotonic
idx1 = Index([5, 3, 2, 4, 1], name='idx')
idx2 = Index([4, 7, 6, 5, 3], name='idx')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the file where all of the index join tests go.

@@ -2801,6 +2802,7 @@ def join(self, other, how='left', level=None, return_indexers=False):
how : {'left', 'right', 'inner', 'outer'}
level : int or level name, default None
return_indexers : boolean, default False
sort : boolean, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded tag

@@ -444,16 +445,19 @@ def join(self, other, how='left', level=None, return_indexers=False):
how : {'left', 'right', 'inner', 'outer'}
level : int or level name, default None
return_indexers : boolean, default False
sort : boolean, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same (actually this doc-string should be a shared_doc string if you want to move it), rather than repeating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback How could I create a shared_docstring? Any existing example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look for shared_docs and u can see
(these r defined in base.py)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albertvillanova added some small comments on the docs

@jreback the Index.join docstring says "this is an internal non-public method". I am not sure of its status (never used that method? But if that is really the case, I would remove it from the examples in the whatsnew.

1 10 100
2 20 200

In [7]: (res1, res2) = df1.align(df2, axis=0, join='inner')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already have an align example above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the former is for Series and the latter is for DataFrame.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is needed to show both, people know series and dataframe methods work similar.
I would just list both Series and DataFrame align in the bullet item of Series.align, and then show eg the series example.


df1.merge(df2, how='inner', left_index=True, right_index=True)

pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is needed to show the result of both pd.merge(df1, df2) and df1.merge(df2) (you mentioned them all in the text)

(BTW, you don't need to put those empty lines between the different input lines, that has no impact on the result after building the docs)

@@ -127,7 +127,8 @@
* left: use only keys from left frame (SQL: left outer join)
* right: use only keys from right frame (SQL: right outer join)
* outer: use union of keys from both frames (SQL: full outer join)
* inner: use intersection of keys from both frames (SQL: inner join)
* inner: use intersection of keys from both frames (SQL: inner join),
preserving the order of the left keys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small rst nitpick (it's very whitespace sensitive ..): the 'preserving ..' should be aligned with the 'inner ..' on the line above

* inner: form intersection of calling frame's index (or column if
on is specified) with other frame's index
on is specified) with other frame's index, preserving the
order of the calling's one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same here (about the whitespace alignment)

@@ -308,6 +308,8 @@ Other enhancements
- ``pd.types.concat.union_categoricals`` gained the ``ignore_ordered`` argument to allow ignoring the ordered attribute of unioned categoricals (:issue:`13410`). See the :ref:`categorical union docs <categorical.union>` for more information.
- ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`)
- ``pd.DataFrame.to_latex`` and ``pd.DataFrame.to_string`` now allow optional header aliases. (:issue:`15536`)
- ``Index.intersection()`` accepts parameter ``sort`` (:issue:`15582`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be join instead of intersection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. I have fixed this.

@@ -0,0 +1,91 @@
# -*- coding: utf-8 -*-

from __future__ import print_function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, these need to go in pandas/tests/tools/test_merge.py. just fit them in where appropriate.

.join is fine here as its a direct / used method on DataFrame. .merge is a direct calling of the pd.merge where LOTS of tests already exist. Test location is very important to avoid future confusion.

In fact I suspect most of these are duplicated tests, pls only add as appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you already added to the correct location. no point in duplicating tests. delete these.

@@ -1355,3 +1355,92 @@ def test_dtype_on_merged_different(self, change, how, left, right):
np.dtype('int64')],
index=['X', 'Y', 'Z'])
assert_series_equal(result, expected)


class TestMergeOnIndexes(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok you added here good.

@@ -926,3 +1055,4 @@ Bug Fixes
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
- Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)
- Bug with ``sort=True`` in ``DataFrame.join``, ``DataFrame.merge`` and ``pd.merge`` when joining on index (:issue:`15582`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to mention DataFrame.merge, pd.merge and .join are the important ones

@@ -308,6 +308,8 @@ Other enhancements
- ``pd.types.concat.union_categoricals`` gained the ``ignore_ordered`` argument to allow ignoring the ordered attribute of unioned categoricals (:issue:`13410`). See the :ref:`categorical union docs <categorical.union>` for more information.
- ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`)
- ``pd.DataFrame.to_latex`` and ``pd.DataFrame.to_string`` now allow optional header aliases. (:issue:`15536`)
- ``Index.join()`` accepts parameter ``sort`` (:issue:`15582`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an internal method, remove this comment

Index order after inner join due to Index intersection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``Index.intersection`` now preserves the order of the calling Index (left)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say what this did previously. No need to mention Index.join or DataFrame.merge and say .align methods. simpler is better.

``DataFrame.align``).

- ``Index.intersection`` and ``Index.join``

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove refernces to Index.join

res2

- ``DataFrame.join``, ``DataFrame.merge`` and ``pd.merge``

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove references to DataFrame.merge


df1.join(df2, how='inner')
df1.merge(df2, how='inner', left_index=True, right_index=True)
(res1, res2) = df1.align(df2, axis=0, join='inner')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the .align examples. This is not that important to show as its a rarely used method. The .join / .intersection are much more important.

@albertvillanova
Copy link
Contributor Author

@jreback and @jorisvandenbossche Is it OK after last changes?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of minor doc edits. changes suggested for testing. This is going to be a whole lot easier to read and less error prone to re-write in a parametrized way. Its pretty straightforward to do this. I put an example up.

@@ -705,6 +706,69 @@ New Behavior:
TypeError: Cannot compare 2014-01-01 00:00:00 of
type <class 'pandas.tslib.Timestamp'> to string column

.. _whatsnew_0200.api_breaking.index_order:

Index order after inner join due to Index intersection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the title about what you have in the first sentence

Index.intersectio now preserves the order of the left Index

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would include 'join' in the title as well. Eg Index.intersection and inner join now preserve the order of the left Index

.. ipython:: python

df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
df1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would only show .join here. a merge on the indexes is by-definition a join.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: is the inner join of merge where the keys are columns also affected?

If so, I would include the merge example, but by joining on a common column instead of with left/right_index=True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the inner join on columns is not affected. Indeed it was already working as expected, preserving the order of the left column.

@@ -926,3 +990,4 @@ Bug Fixes
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
- Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)
- Bug with ``sort=True`` in ``DataFrame.join`` and ``pd.merge`` when joining on indexes (:issue:`15582`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put notes somewhere in the Bug Fix section and not at the end, otherwise you will get merge conflicts a lot.

@@ -2855,11 +2854,18 @@ def join(self, other, how='left', level=None, return_indexers=False):
how : {'left', 'right', 'inner', 'outer'}
level : int or level name, default None
return_indexers : boolean, default False
sort : boolean, default False

.. versionadded:: 0.20.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a sentence explaining what sort does (I know its repeatative, but this is a different part of the code)

@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
elif how == 'outer':
join_index = self.union(other)

if sort:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't answer my question though, isn't sort ignored for all but inner join? If so then I would set it False (to make the code clear for the others).

index=[0, 1, 2, 3])
tm.assert_frame_equal(result, expected)

def test_join_sort(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to integrate these tests and use a parametrization to make this easy; this is not part of the a class test, just a raw function.

something like

@pytest.fixture
def df1():
     return DataFrame(...)

@pytest.fixture
def df2():
     return DataFrame(...)

@pytest.mark.parametrize(
    "how, sort, expected", [
         ('left', False, DataFrame(.....)),
         ('left', True, DataFrame(.....))])
def test_join(how, sort, expected):

     result = df1.join(df2, how=how, sort=sort)
     tm.assert_frame_equal(result, expected)

index=[0, 1, 2, 3])
tm.assert_frame_equal(result, expected)

def test_join_index(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would parametrize here as well.

tm.assert_index_equal(joined.index, self.frame.index.sort_values())
tm.assert_index_equal(joined.columns, expected_columns)

tm.assertRaisesRegexp(ValueError, 'join method', f.join, f2, how='foo')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then make these a separate test (for these error cases)

def test_join_index_errors

import pandas.util.testing as tm


class TestDataFrameJoin(TestData):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of using a class definition make a fixture out of frame

@pytest.fixture
def frame():
     return TestData.frame

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the you don't need a class at all for testing

def test_merge_on_indexes(self):
df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
df2 = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change this to be parametrized like join above. (e.g. matrix of how/sort)

@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
elif how == 'outer':
join_index = self.union(other)

if sort:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albertvillanova what I mean is that reading the code it is very confusing and not obvious how sort interacts with how.

What I mean is I would explicity set sort=False (and maybe a comment) like this:

        if how == 'left':
            join_index = self
            # sort=True is ignored? if so, then set sort=False
       elif how == 'right':
            join_index = other
           # same
        elif how == 'inner':
            join_index = self.intersection(other)
        elif how == 'outer':
            join_index = self.union(other)
            # or could just sort=False
            assert sort is False, "outer join must explicty have sort=False"

        if sort:
            join_index = join_index.sort_values()

@@ -705,6 +706,69 @@ New Behavior:
TypeError: Cannot compare 2014-01-01 00:00:00 of
type <class 'pandas.tslib.Timestamp'> to string column

.. _whatsnew_0200.api_breaking.index_order:

Index order after inner join due to Index intersection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would include 'join' in the title as well. Eg Index.intersection and inner join now preserve the order of the left Index

@@ -127,7 +127,8 @@
* left: use only keys from left frame (SQL: left outer join)
* right: use only keys from right frame (SQL: right outer join)
* outer: use union of keys from both frames (SQL: full outer join)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the resulting order here? Also sorted as for join?

.. ipython:: python

df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
df1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: is the inner join of merge where the keys are columns also affected?

If so, I would include the merge example, but by joining on a common column instead of with left/right_index=True

@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
elif how == 'outer':
join_index = self.union(other)

if sort:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback There is not really an interaction between sort and how. It is just that when sort=False (the default) not additional sorting is done, and the order of the result depends on the type of join. When you set sort=True, you explicitly sort. And that regard, I find the current code of

if sort:
    join_index = join_index.sort_values()

very clear.
The only thing is that when how='outer', you would have sorted twice (as join_index is already sorted at this point in that case). But I wouldn't care about that, since it is not the default, and you can just not set sort to True to prevent that.

@jorisvandenbossche
Copy link
Member

@jreback I would leave the testing / pytest changes for now. That is not that trivial, and this PR has already seen enough back and forth, time to get it merged :-)
I would just try to fix the remaining doc comments and the discussion regarding the sort keyword, and leave clean-up of the (partly existing) tests for a follow-up PR.

@jreback
Copy link
Contributor

jreback commented Mar 25, 2017

@albertvillanova I rebased this and pushed a commit fixing the tests. The reason I suggest this is that it makes for much shorter, more understandable code that is easier to error check. There were several duplicate tests (not really a big deal), but still the same, copy-pasting is in general not a great idea for repetative things.

@jreback
Copy link
Contributor

jreback commented Mar 27, 2017

@albertvillanova I think we just needed a couple of doc updates, otherwise this lgtm.

@jreback jreback added this to the 0.20.0 milestone Mar 27, 2017
@albertvillanova
Copy link
Contributor Author

albertvillanova commented Mar 28, 2017

Great @jreback, I am just addressing your previously requested changes and also adding a slightly modification in the tests to make them exhaustively consider all possible combinations of unsorted left-right indexes and sort=False,True.

@albertvillanova
Copy link
Contributor Author

These are the changes I have made:

  • In doc/source/whatsnew/v0.20.0.txt I have changed the title (following both recommendations) and removed the example of pd.merge
  • In pandas/core/frame.py I have clarified which is the resulting order in the merge_doc docstring
  • In pandas/indexes/base.py I have added what sort does to the join _index_shared_docs dosctring
  • In pandas/tests/frame/test_join.py I have made a little modification so that it is also considered the case of unsorted right index in a right join with sort=False

* left: use only keys from left frame (SQL: left outer join)
* right: use only keys from right frame (SQL: right outer join)
* outer: use union of keys from both frames (SQL: full outer join)
* left: use only keys from left frame (SQL: left outer join), preserving
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you format these lines like

* right: use only keys from right frame
     similar to a SQL: right outer join; preserves key orderings


.. ipython:: python

idx1 = pd.Index([2, 1, 0])
Copy link
Contributor

@jreback jreback Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you name these left_index & right_index (for clarity)

- ``DataFrame.join`` and ``pd.merge``

.. ipython:: python

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you name these left and right



@pytest.fixture
def df1():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change these to df1->left, and df2->right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also changed the order of the right index (like in pandas/tests/frame/test_join.py), so that sort=False is tested for the right join.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I get an error when naming them left and right, because these names are also used in other parts of the code. What should I do?



@pytest.fixture
def df1():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change df1->left and df2->right

@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

@albertvillanova just a couple of minor renames. ping when pushed and green and will merge. thanks for the patience!

you can see that PR's are mostly about getting the details / docs right; oftentimes the code changes are the easiest part!

@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

hmm that last broke something........lmk when you have sorted and are green.

@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

lgtm. ping on green.

@jreback jreback closed this in bd169dc Mar 29, 2017
@jreback
Copy link
Contributor

jreback commented Mar 29, 2017

thanks @albertvillanova nice PR!

as you can see, the docs really matter!

@albertvillanova
Copy link
Contributor Author

Thanks both @jreback and @jorisvandenbossche for all the things I have learned with your comments, ideas and suggestions.

@jorisvandenbossche
Copy link
Member

@albertvillanova Thanks!

mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
closes pandas-dev#15582

Author: Albert Villanova del Moral <[email protected]>
Author: Jeff Reback <[email protected]>

Closes pandas-dev#15583 from albertvillanova/fix-15582 and squashes the following commits:

2d4e143 [Albert Villanova del Moral] Fix pytest fixture name collision
64e86a4 [Albert Villanova del Moral] Fix test on right join
73df69e [Albert Villanova del Moral] Address requested changes
8d2e9cc [Albert Villanova del Moral] Address requested changes
968c7f1 [Jeff Reback] DOC/TST: change to use parameterization
9e39794 [Albert Villanova del Moral] Address requested changes
5bf1508 [Albert Villanova del Moral] Address requested changes
654288b [Albert Villanova del Moral] Fix Travis errors
33eb740 [Albert Villanova del Moral] Address requested changes
3c200fe [Albert Villanova del Moral] Add new tests
ef2581e [Albert Villanova del Moral] Fix Travis error
f0d9d03 [Albert Villanova del Moral] Add whatsnew
c96306d [Albert Villanova del Moral] Add sort argument to Index.join
047b513 [Albert Villanova del Moral] Address requested changes
ec836bd [Albert Villanova del Moral] Fix Travis errors
b977278 [Albert Villanova del Moral] Address requested changes
784fe75 [Albert Villanova del Moral] Fix error: line too long
1197b99 [Albert Villanova del Moral] Fix DataFrame column order when read from HDF file
d9e29f8 [Albert Villanova del Moral] Create new DatetimeIndex from the Index.intersection result
e7bcd28 [Albert Villanova del Moral] Fix typo in documentation
a4ead99 [Albert Villanova del Moral] Fix typo
c2a8dc3 [Albert Villanova del Moral] Implement tests
c12bb3f [Albert Villanova del Moral] BUG: Fix index order for Index.intersection()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Join result takes the index order of the other (right) DataFrame instead of the calling's (left) one
4 participants