BUG: Fix index order for Index.intersection() #15583

albertvillanova · 2017-03-05T20:57:02Z

closes Join result takes the index order of the other (right) DataFrame instead of the calling's (left) one #15582
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2017-03-05T22:32:29Z

not sure what you are attempting to do here. This needs quite a few tests to even flesh out what is going on.

albertvillanova · 2017-03-06T23:23:22Z

I have implemented/updated some tests to demonstrate that the bug is at Index.intersection(other).

Before this was returning the intersection of self and other preserving the order of other instead of preserving the order of self.

codecov-io · 2017-03-12T22:50:34Z

Codecov Report

Merging #15583 into master will decrease coverage by 0.05%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15583      +/-   ##
==========================================
- Coverage   91.01%   90.96%   -0.06%     
==========================================
  Files         143      143              
  Lines       49400    49429      +29     
==========================================
- Hits        44963    44961       -2     
- Misses       4437     4468      +31

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.56% <ø> (-0.4%)`	⬇️
pandas/tools/merge.py	`93.68% <ø> (ø)`	⬆️
pandas/tseries/index.py	`95.42% <100%> (+0.02%)`	⬆️
pandas/indexes/range.py	`92.11% <100%> (+0.02%)`	⬆️
pandas/io/pytables.py	`93.06% <100%> (ø)`	⬆️
pandas/indexes/base.py	`96.14% <100%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/sparse/frame.py	`94.26% <0%> (-2.43%)`	⬇️
pandas/core/categorical.py	`95.85% <0%> (-1.05%)`	⬇️
pandas/io/json/normalize.py	`96.93% <0%> (-0.98%)`	⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d3554c...2d4e143. Read the comment docs.

jorisvandenbossche

The actual changes in the intersection method look good to me.
Can you add a whatsnew notice?

jorisvandenbossche · 2017-03-13T13:11:25Z

pandas/tseries/index.py

+            result = self._simple_new(result._values, name=result.name,
+                                      tz=result.tz)
+            if result.freq is None:
+                result.offset = to_offset(result.inferred_freq)


Is this change related to the PR?

Yes, as the simple result of Index.intersection inherits all attributes from other, and this is not the expected result. Therefore, it is necessary to create a new DatetimeIndex using only three of the Index.intersection attributes: _values, name and tz.

jorisvandenbossche · 2017-03-13T13:16:23Z

pandas/tests/frame/test_misc_api.py


        joined = f.join(f2)
        self.assert_index_equal(f.index, joined.index)
-        self.assertEqual(len(joined.columns), 4)
+        expected_columns = pd.Index(['A', 'B', 'C', 'D'])
+        self.assert_index_equal(joined.columns, expected_columns)


The fact that this is more explicitly tests is good! (instead of only the length)
But just for my understanding, is the behaviour changed for this specific case by the PR?

No, for the cases of 'left' (default), 'right' and 'outer' joins, the behavior is the same before and after the PR. Only for the 'inner' join (which uses the Index.intersection), the PR fixes the bug in the result.index order: before the PR, the index order was the one of the other, and after the PR the index order is the one in self.

Sorry, I was talking about the columns, not index. I think the columns of the result are not influenced by inner vs outer join?

jorisvandenbossche · 2017-03-13T13:17:43Z

pandas/tests/frame/test_misc_api.py


        joined = f.join(f2, how='right')
        self.assert_index_equal(joined.index, f2.index)
-        self.assertEqual(len(joined.columns), 4)
+        self.assert_index_equal(joined.columns, expected_columns)


In the case of a 'right' join, you check that the index is equal to the right frame (f2). This sounds the most logical to do, but, then the docs should be updated? (as the docs say the order of the "calling (left) DataFrame" is preserved).

I have removed that sentence from the doc.

jorisvandenbossche · 2017-03-13T13:18:58Z

pandas/tests/frame/test_misc_api.py


        # inner

-        f = self.frame.reindex(columns=['A', 'B'])[:10]
-        f2 = self.frame.reindex(columns=['C', 'D'])
-
        joined = f.join(f2, how='inner')
        self.assert_index_equal(joined.index, f.index.intersection(f2.index))


Can you make this test more explicit by not using intersection? As this was not catched before because intersection itself was not working correctly (I know you fixed both now at the same time, but maybe good to make the tests independent)

jorisvandenbossche · 2017-03-13T13:21:01Z

pandas/tests/frame/test_misc_api.py

        joined = f.join(f2, how='outer')
-        self.assertTrue(tm.equalContents(self.frame.index, joined.index))
-        self.assertEqual(len(joined.columns), 4)
+        self.assert_index_equal(joined.index, self.frame.index.sort_values())


Is this always the sorted index? Again, it is not fully clear what to expect based on the docs.

Yes, for the 'outer' join, the return DataFrame index is sorted. This is because DataFrame.join uses Index.join for the DataFrame indexes, and the latter uses Index.union when how='outer', which sorts the result Index. Therefore, I agree with you that this should be clarified in the doc.

jorisvandenbossche · 2017-03-13T13:22:26Z

A more general question is the notice of "Sortedness of the result is not guaranteed." in the Index.intersection docstring. Whether we should keep that or not.

jreback

make a sub-section on the whatsnew api-breaking changes. this need an old/new example

jreback · 2017-03-13T13:23:51Z

pandas/tests/frame/test_misc_api.py

@@ -61,37 +61,32 @@ def test_join_index(self):
        # left / right

        f = self.frame.reindex(columns=['A', 'B'])[:10]
-        f2 = self.frame.reindex(columns=['C', 'D'])
+        f2 = self.frame.reindex(columns=['C', 'D'])[5:][::-1]


use .iloc or .loc here more idiomatic

Could I use .ix instead? I ask this because the indexing is of both types, label and integer positional.

no ix is deprecated

you can easily indexing like this:

self.frame.loc[self.frame.index[5::-1], ['C', 'D']]

If .ix is deprecated, this should be pointed out in the doc:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ix.html

http://pandas.pydata.org/pandas-docs/stable/indexing.html

http://pandas-docs.github.io/pandas-docs-travis/indexing.html#ix-indexer-is-deprecated

you are looking at released docs

ix depreciation is not released yet - it's in 0.20.0

jreback · 2017-03-13T13:24:57Z

pandas/tests/indexes/test_base.py

+        expected = Index([5, 3, 4], name='idx')
+        result = idx1.intersection(idx2)
+        self.assert_index_equal(result, expected)
+        self.assertEqual(result.name, expected.name)


you don't need to compare names as assert_index_equal already does this

jreback · 2017-03-13T13:25:05Z

pandas/tests/indexes/test_base.py

+        expected = Index([5, 3, 4], name=None)
+        result = idx1.intersection(idx2)
+        self.assert_index_equal(result, expected)
+        self.assertEqual(result.name, expected.name)


jreback · 2017-03-13T13:25:47Z

pandas/tseries/index.py

-            if isinstance(result, DatetimeIndex):
-                if result.freq is None:
-                    result.offset = to_offset(result.inferred_freq)
+            result = self._simple_new(result._values, name=result.name,


use _shallow_copy rather than _simple_new

If I use _shallow_copy, the result will inherit all self attributes (e.g., freq) and this might not be the expected result. For example, this test would fail:

base = date_range('6/1/2000', '6/30/2000', freq='D', name='idx') rng4 = date_range('7/1/2000', '7/31/2000', freq='D', name='idx') expected4 = DatetimeIndex([], name='idx')

because result would inherit self freq = 'D'.

maybe you misunderstand, simply pass the new freq

result = self._shallow_copy(...., freq=....)

We don't use _simple_new at all (outside of a few specific places), this is not one of them. This is all documented next to these functions.

Might I simply use the constructor DatetimeIndex()? I ask this because I think the result freq is not known in advance.

no, use _shallow_copy with freq=None, which will not infer it, and then you can reset the freq after

jreback · 2017-03-13T13:29:48Z

also add the explicit example from the issue. ideally add some more tests in pandas/tests/tools/test_join .

I would actually like to either:

separate out the join tests from frame/test_misc_api -> frame/tests/test_join.py
move the join tests to tools/test_join where the rest of the join tests are.

Its not imperative that we do this here. but could do as a followup. (or a pre-PR alternatively)

albertvillanova · 2017-03-14T22:34:25Z

I have a question @jreback about the whatsnew notice: is this PR introducing an API-breaking change? I thought I was just fixing a bug about Index order after intersection. The expected bahavior of the intersection is analogue to:

a1.intersection(a2) = [item for item in a1 if item in a2]

thus preserving the order of the caller a1.

jorisvandenbossche · 2017-03-14T22:41:25Z

I don't think this needs a subsection in the API changes section. I would regard this as a bug fix (certainly for join).

jreback · 2017-03-14T22:49:36Z

I disagree. This is most certainly an API change. For sure we need a sub-section.

albertvillanova · 2017-03-14T23:01:24Z

I have addressed the following requested changes:

I have updated the docs
I have used _shallow_copy
I have removed redundant assertEqual
I have implemented the test example from the Issue
I have moved the join tests from tests/frame/test_misc_api.py to tests/frame/test_join.py, because these test the DataFrame join method. I think that in tests/tools it should be implemented only the tests using pd.merge.
The only thing I have not done yet is the whatsnew notice

jorisvandenbossche · 2017-03-14T23:05:57Z

OK to add the intersection order change to the API changes section, but just as a bullet pointin "Other API Changes"? @jreback can you explain why you think this needs a subsection? Do you think people rely on the order? (The docstring also says that the order should not be relied upon)

jreback · 2017-03-14T23:19:29Z

@jorisvandenbossche any non-trivial change needs an explanation. Sure this is a 'bug-fix'. But if I have code that relied upon the ordering (rightly or wrongly), I would for sure want to know. This is very subtle. You can never over-document whatsnew IMHO.

jorisvandenbossche

Thanks for the updates!
Added small comments to the doc updates

jorisvandenbossche · 2017-03-14T23:10:56Z

pandas/core/frame.py

@@ -4485,16 +4485,15 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
            * left: use calling frame's index (or column if on is specified)
            * right: use other frame's index
            * outer: form union of calling frame's index (or column if on is
-                specified) with other frame's index
+                specified) with other frame's index, and sort it
            * inner: form intersection of calling frame's index (or column if
                on is specified) with other frame's index


also note for inner what the order will be? (because you could also think this will be sorted)

jorisvandenbossche · 2017-03-14T23:10:59Z

pandas/core/frame.py

            * inner: form intersection of calling frame's index (or column if
                on is specified) with other frame's index
        lsuffix : string
            Suffix to use from left frame's overlapping columns
        rsuffix : string
            Suffix to use from right frame's overlapping columns
        sort : boolean, default False
-            Order result DataFrame lexicographically by the join key. If False,
-            preserves the index order of the calling (left) DataFrame


Maybe leave something as "If False, index order depends on the join type (how keyword)" ?

jreback · 2017-03-15T13:34:06Z

pandas/tests/frame/test_join.py

+                                 assert_frame_equal)
+
+
+class TestDataFrameJoin(tm.TestCase, TestData):


since you are writing new tests, pls use the pytest style. don't inherit from tm.TestCase), so don't use self. for anything testing related

e.g. just use

assert_frame_equal and such

jreback · 2017-03-15T13:35:00Z

pandas/tests/frame/test_join.py

+from pandas.tests.frame.common import TestData
+
+import pandas.util.testing as tm
+from pandas.util.testing import (assertRaisesRegexp,


I actually prefer NOT to import the testing routines and instead use tm.assert_frame_equal, but as long as its consistent one way of the other is ok.

jreback · 2017-03-15T23:30:35Z

doc/source/whatsnew/v0.20.0.txt

+
+New Behavior:
+
+.. code-block:: ipython


use ipython:: here (on the new).

@jreback I have a naive question: what is the difference between code-block:: ipython and ipython:: python and why should the former used for previous behavior and the latter for new? Is there a way to see this txt file parsed to check the format? Thank you.

you can build the docs here: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#how-to-build-the-pandas-documentation

http://www.sphinx-doc.org/en/stable/markup/code.html

code-block shows code literally, while ipython actually executes it and displays the results.

So we almost always want to actually execute the code, but a previous behaviour by-definition is not current code, so we show a snippet that is executed previously (meaning you run it in 0.19.2 and paste the output here).

jreback · 2017-03-15T23:31:59Z

doc/source/whatsnew/v0.20.0.txt

+Previous Behavior:
+
+.. code-block:: ipython
+   In [2]: df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])


put these 2 in an ipython block at the top (and then show them, e.g.

.. ipython:: python df1 = pd.DataFrame(....) df1 df2 = pd.DataFrame(...) df2 Previous Behaviour: .. code-block:: ipython # show this part In [4]: df1.join(....) New Behaviour: .. ipython:: python df1.join(....)

jreback · 2017-03-15T23:33:24Z

pandas/indexes/base.py

@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
        elif how == 'outer':
            join_index = self.union(other)

+        if sort:


this should only be on an outer join right?

I admit there is an ambiguity here.

By default (sort=False) the result index values are not sorted, except for the outer join. This is because Index.union sorts the result index values.

If we set sort=True, the result index values are sorted for all types of join (although this is redundant for the outer join, as it is already sorted).

@jreback Is this the expected result?

I am not sure. can you show a small example with this both turned on and off (for each of the how's). Sorry this is a bit of work, but this is a bit of unexplored API area

In file tests/frame/test_join.py I have created 2 methods test_join and test_join_sort where I test the joined result for each of the 4 values of how, for default sort=False and sort=True, respectively. Is that OK for you? Or would you need further examples?

this doesn't answer my question though, isn't sort ignored for all but inner join? If so then I would set it False (to make the code clear for the others).

No, sort=True is only ignored for outer join, because the outer join always sorts the result index. For the other types of join (left, right, inner), the default result is unsorted, as it preserves the left index order (for left and inner) or the right index order (for right), which might not be previously sorted.

By setting sort=True, we force the resulting index order to be sorted, even for the cases of left, right and inner joins.

What do you mean by I would set it False? The parameter sort is set False by default.

@albertvillanova what I mean is that reading the code it is very confusing and not obvious how sort interacts with how.

What I mean is I would explicity set sort=False (and maybe a comment) like this:

if how == 'left': join_index = self # sort=True is ignored? if so, then set sort=False elif how == 'right': join_index = other # same elif how == 'inner': join_index = self.intersection(other) elif how == 'outer': join_index = self.union(other) # or could just sort=False assert sort is False, "outer join must explicty have sort=False" if sort: join_index = join_index.sort_values()

@jreback There is not really an interaction between sort and how. It is just that when sort=False (the default) not additional sorting is done, and the order of the result depends on the type of join. When you set sort=True, you explicitly sort. And that regard, I find the current code of

if sort: join_index = join_index.sort_values()

very clear.
The only thing is that when how='outer', you would have sorted twice (as join_index is already sorted at this point in that case). But I wouldn't care about that, since it is not the default, and you can just not set sort to True to prevent that.

@jorisvandenbossche Great explanation.

jreback · 2017-03-15T23:35:14Z

pandas/core/frame.py

@@ -4485,16 +4485,17 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
            * left: use calling frame's index (or column if on is specified)
            * right: use other frame's index
            * outer: form union of calling frame's index (or column if on is
-                specified) with other frame's index
+                specified) with other frame's index, and sort it


add a little bit of expl here

Further I am pretty sure the pd.merge, Series.join, DataFrame.merge doc-string needs to be updated as well (as these all go thru merge for the joins we are talking about) (they could also go thru concat for certain ones, but they are not relevant here)

You are right! This also affects DataFrame.merge, pd.merge and Series.align. Their docstrings need to be updated. And also the API-breaking whatsnew notice!

jreback · 2017-03-15T23:36:24Z

pandas/tests/frame/test_join.py

+        df2 = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])
+
+        result = df1.join(df2)
+        expected = pd.DataFrame({'a': [20, 10, 0], 'b': [200, 100, None]},


don't put None when we really use np.nan.

jreback · 2017-03-15T23:39:05Z

pandas/tests/frame/test_join.py

+                                index=[0, 1, 2, 3])
+        tm.assert_frame_equal(result, expected)
+
+    def test_join_index(self):


all of these tests on index should be in tests/pandas/tests/indexes/test_base.py. These are done pretty generically, follow the existing pattern (also wouldn't hurt to move these tests and your new ones to a new test class).

The only new method I have created is test_join. All the other methods were previously in tests/frame/test_misc_api.pyand I just moved them to this file I have created tests/frame/test_join.py.

Despite their misleading names, the methods test_join_index and a test_join_index_more do not directly test any Index method, but the DataFrame.join functionality and the index and columns in the result DataFrame.

ok that's fine then.

jreback · 2017-03-15T23:39:42Z

pandas/tests/indexes/test_base.py

@@ -609,14 +609,14 @@ def test_intersection(self):
        # non monotonic
        idx1 = Index([5, 3, 2, 4, 1], name='idx')
        idx2 = Index([4, 7, 6, 5, 3], name='idx')


this is the file where all of the index join tests go.

jreback · 2017-03-15T23:40:11Z

pandas/indexes/base.py

@@ -2801,6 +2802,7 @@ def join(self, other, how='left', level=None, return_indexers=False):
        how : {'left', 'right', 'inner', 'outer'}
        level : int or level name, default None
        return_indexers : boolean, default False
+        sort : boolean, default False


add a versionadded tag

jreback · 2017-03-15T23:40:32Z

pandas/indexes/range.py

@@ -444,16 +445,19 @@ def join(self, other, how='left', level=None, return_indexers=False):
        how : {'left', 'right', 'inner', 'outer'}
        level : int or level name, default None
        return_indexers : boolean, default False
+        sort : boolean, default False


same (actually this doc-string should be a shared_doc string if you want to move it), rather than repeating

@jreback How could I create a shared_docstring? Any existing example?

look for shared_docs and u can see
(these r defined in base.py)

jorisvandenbossche

@albertvillanova added some small comments on the docs

@jreback the Index.join docstring says "this is an internal non-public method". I am not sure of its status (never used that method? But if that is really the case, I would remove it from the examples in the whatsnew.

jorisvandenbossche · 2017-03-18T10:37:45Z

doc/source/whatsnew/v0.20.0.txt

+     1  10  100
+     2  20  200
+
+     In [7]: (res1, res2) = df1.align(df2, axis=0, join='inner')


You already have an align example above?

Yes, but the former is for Series and the latter is for DataFrame.

I don't think it is needed to show both, people know series and dataframe methods work similar.
I would just list both Series and DataFrame align in the bullet item of Series.align, and then show eg the series example.

jorisvandenbossche · 2017-03-19T22:30:50Z

doc/source/whatsnew/v0.20.0.txt

+
+     df1.merge(df2, how='inner', left_index=True, right_index=True)
+
+     pd.merge(df1, df2, how='inner', left_index=True, right_index=True)


I don't think it is needed to show the result of both pd.merge(df1, df2) and df1.merge(df2) (you mentioned them all in the text)

(BTW, you don't need to put those empty lines between the different input lines, that has no impact on the result after building the docs)

jorisvandenbossche · 2017-03-19T22:32:40Z

pandas/core/frame.py

@@ -127,7 +127,8 @@
    * left: use only keys from left frame (SQL: left outer join)
    * right: use only keys from right frame (SQL: right outer join)
    * outer: use union of keys from both frames (SQL: full outer join)
-    * inner: use intersection of keys from both frames (SQL: inner join)
+    * inner: use intersection of keys from both frames (SQL: inner join),
+        preserving the order of the left keys


small rst nitpick (it's very whitespace sensitive ..): the 'preserving ..' should be aligned with the 'inner ..' on the line above

jorisvandenbossche · 2017-03-19T22:43:45Z

pandas/core/frame.py

            * inner: form intersection of calling frame's index (or column if
-                on is specified) with other frame's index
+                on is specified) with other frame's index, preserving the
+                order of the calling's one


The same here (about the whitespace alignment)

jorisvandenbossche · 2017-03-19T22:48:07Z

doc/source/whatsnew/v0.20.0.txt

@@ -308,6 +308,8 @@ Other enhancements
 - ``pd.types.concat.union_categoricals`` gained the ``ignore_ordered`` argument to allow ignoring the ordered attribute of unioned categoricals (:issue:`13410`). See the :ref:`categorical union docs <categorical.union>` for more information.
 - ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`)
 - ``pd.DataFrame.to_latex`` and ``pd.DataFrame.to_string`` now allow optional header aliases. (:issue:`15536`)
+- ``Index.intersection()`` accepts parameter ``sort`` (:issue:`15582`)


I think this should be join instead of intersection?

Yes, you are right. I have fixed this.

jreback · 2017-03-20T12:20:13Z

pandas/tests/frame/test_merge.py

@@ -0,0 +1,91 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import print_function


no, these need to go in pandas/tests/tools/test_merge.py. just fit them in where appropriate.

.join is fine here as its a direct / used method on DataFrame. .merge is a direct calling of the pd.merge where LOTS of tests already exist. Test location is very important to avoid future confusion.

In fact I suspect most of these are duplicated tests, pls only add as appropriate.

I see you already added to the correct location. no point in duplicating tests. delete these.

jreback · 2017-03-20T12:20:42Z

pandas/tests/tools/test_merge.py

@@ -1355,3 +1355,92 @@ def test_dtype_on_merged_different(self, change, how, left, right):
                           np.dtype('int64')],
                          index=['X', 'Y', 'Z'])
        assert_series_equal(result, expected)
+
+
+class TestMergeOnIndexes(object):


ok you added here good.

jreback · 2017-03-20T12:22:36Z

doc/source/whatsnew/v0.20.0.txt

@@ -926,3 +1055,4 @@ Bug Fixes
 - Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
 - Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
 - Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)
+- Bug with ``sort=True`` in ``DataFrame.join``, ``DataFrame.merge`` and ``pd.merge`` when joining on index (:issue:`15582`)


no need to mention DataFrame.merge, pd.merge and .join are the important ones

jreback · 2017-03-20T12:22:55Z

doc/source/whatsnew/v0.20.0.txt

@@ -308,6 +308,8 @@ Other enhancements
 - ``pd.types.concat.union_categoricals`` gained the ``ignore_ordered`` argument to allow ignoring the ordered attribute of unioned categoricals (:issue:`13410`). See the :ref:`categorical union docs <categorical.union>` for more information.
 - ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`)
 - ``pd.DataFrame.to_latex`` and ``pd.DataFrame.to_string`` now allow optional header aliases. (:issue:`15536`)
+- ``Index.join()`` accepts parameter ``sort`` (:issue:`15582`)


this is an internal method, remove this comment

jreback · 2017-03-20T12:23:51Z

doc/source/whatsnew/v0.20.0.txt

+Index order after inner join due to Index intersection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``Index.intersection`` now preserves the order of the calling Index (left)


say what this did previously. No need to mention Index.join or DataFrame.merge and say .align methods. simpler is better.

jreback · 2017-03-20T12:24:07Z

doc/source/whatsnew/v0.20.0.txt

+``DataFrame.align``).
+
+- ``Index.intersection`` and ``Index.join``
+


remove refernces to Index.join

jreback · 2017-03-20T12:25:01Z

doc/source/whatsnew/v0.20.0.txt

+     res2
+
+- ``DataFrame.join``, ``DataFrame.merge`` and ``pd.merge``
+


remove references to DataFrame.merge

jreback · 2017-03-20T12:25:51Z

doc/source/whatsnew/v0.20.0.txt

+
+     df1.join(df2, how='inner')
+     df1.merge(df2, how='inner', left_index=True, right_index=True)
+     (res1, res2) = df1.align(df2, axis=0, join='inner')


I would remove the .align examples. This is not that important to show as its a rarely used method. The .join / .intersection are much more important.

albertvillanova · 2017-03-23T06:31:08Z

@jreback and @jorisvandenbossche Is it OK after last changes?

jreback

couple of minor doc edits. changes suggested for testing. This is going to be a whole lot easier to read and less error prone to re-write in a parametrized way. Its pretty straightforward to do this. I put an example up.

jreback · 2017-03-23T12:41:27Z

doc/source/whatsnew/v0.20.0.txt

@@ -705,6 +706,69 @@ New Behavior:
   TypeError: Cannot compare 2014-01-01 00:00:00 of
   type <class 'pandas.tslib.Timestamp'> to string column

+.. _whatsnew_0200.api_breaking.index_order:
+
+Index order after inner join due to Index intersection


Make the title about what you have in the first sentence

Index.intersectio now preserves the order of the left Index

I would include 'join' in the title as well. Eg Index.intersection and inner join now preserve the order of the left Index

jreback · 2017-03-23T12:43:01Z

doc/source/whatsnew/v0.20.0.txt

+  .. ipython:: python
+
+     df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
+     df1


I would only show .join here. a merge on the indexes is by-definition a join.

Question: is the inner join of merge where the keys are columns also affected?

If so, I would include the merge example, but by joining on a common column instead of with left/right_index=True

No, the inner join on columns is not affected. Indeed it was already working as expected, preserving the order of the left column.

jreback · 2017-03-23T12:43:32Z

doc/source/whatsnew/v0.20.0.txt

@@ -926,3 +990,4 @@ Bug Fixes
 - Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
 - Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
 - Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)
+- Bug with ``sort=True`` in ``DataFrame.join`` and ``pd.merge`` when joining on indexes (:issue:`15582`)


put notes somewhere in the Bug Fix section and not at the end, otherwise you will get merge conflicts a lot.

jreback · 2017-03-23T12:45:01Z

pandas/indexes/base.py

@@ -2855,11 +2854,18 @@ def join(self, other, how='left', level=None, return_indexers=False):
        how : {'left', 'right', 'inner', 'outer'}
        level : int or level name, default None
        return_indexers : boolean, default False
+        sort : boolean, default False
+
+            .. versionadded:: 0.20.0


can you add a sentence explaining what sort does (I know its repeatative, but this is a different part of the code)

jreback · 2017-03-23T12:46:38Z

pandas/indexes/base.py

@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
        elif how == 'outer':
            join_index = self.union(other)

+        if sort:


this doesn't answer my question though, isn't sort ignored for all but inner join? If so then I would set it False (to make the code clear for the others).

jreback · 2017-03-23T12:50:45Z

pandas/tests/frame/test_join.py

+                                index=[0, 1, 2, 3])
+        tm.assert_frame_equal(result, expected)
+
+    def test_join_sort(self):


I would like to integrate these tests and use a parametrization to make this easy; this is not part of the a class test, just a raw function.

something like

@pytest.fixture def df1(): return DataFrame(...) @pytest.fixture def df2(): return DataFrame(...) @pytest.mark.parametrize( "how, sort, expected", [ ('left', False, DataFrame(.....)), ('left', True, DataFrame(.....))]) def test_join(how, sort, expected): result = df1.join(df2, how=how, sort=sort) tm.assert_frame_equal(result, expected)

jreback · 2017-03-23T12:51:29Z

pandas/tests/frame/test_join.py

+                                index=[0, 1, 2, 3])
+        tm.assert_frame_equal(result, expected)
+
+    def test_join_index(self):


I would parametrize here as well.

jreback · 2017-03-23T12:51:54Z

pandas/tests/frame/test_join.py

+        tm.assert_index_equal(joined.index, self.frame.index.sort_values())
+        tm.assert_index_equal(joined.columns, expected_columns)
+
+        tm.assertRaisesRegexp(ValueError, 'join method', f.join, f2, how='foo')


then make these a separate test (for these error cases)

def test_join_index_errors

jreback · 2017-03-23T12:53:41Z

pandas/tests/frame/test_join.py

+import pandas.util.testing as tm
+
+
+class TestDataFrameJoin(TestData):


instead of using a class definition make a fixture out of frame

@pytest.fixture def frame(): return TestData.frame

the you don't need a class at all for testing

jreback · 2017-03-23T12:54:38Z

pandas/tests/tools/test_merge.py

+    def test_merge_on_indexes(self):
+        df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
+        df2 = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])
+


I would change this to be parametrized like join above. (e.g. matrix of how/sort)

jreback · 2017-03-24T14:01:16Z

pandas/indexes/base.py

@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
        elif how == 'outer':
            join_index = self.union(other)

+        if sort:


@albertvillanova what I mean is that reading the code it is very confusing and not obvious how sort interacts with how.

What I mean is I would explicity set sort=False (and maybe a comment) like this:

if how == 'left': join_index = self # sort=True is ignored? if so, then set sort=False elif how == 'right': join_index = other # same elif how == 'inner': join_index = self.intersection(other) elif how == 'outer': join_index = self.union(other) # or could just sort=False assert sort is False, "outer join must explicty have sort=False" if sort: join_index = join_index.sort_values()

jorisvandenbossche · 2017-03-25T11:06:46Z

doc/source/whatsnew/v0.20.0.txt

@@ -705,6 +706,69 @@ New Behavior:
   TypeError: Cannot compare 2014-01-01 00:00:00 of
   type <class 'pandas.tslib.Timestamp'> to string column

+.. _whatsnew_0200.api_breaking.index_order:
+
+Index order after inner join due to Index intersection


I would include 'join' in the title as well. Eg Index.intersection and inner join now preserve the order of the left Index

jorisvandenbossche · 2017-03-25T11:16:45Z

pandas/core/frame.py

@@ -127,7 +127,8 @@
    * left: use only keys from left frame (SQL: left outer join)
    * right: use only keys from right frame (SQL: right outer join)
    * outer: use union of keys from both frames (SQL: full outer join)


What is the resulting order here? Also sorted as for join?

jorisvandenbossche · 2017-03-25T11:19:02Z

doc/source/whatsnew/v0.20.0.txt

+  .. ipython:: python
+
+     df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
+     df1


Question: is the inner join of merge where the keys are columns also affected?

If so, I would include the merge example, but by joining on a common column instead of with left/right_index=True

jorisvandenbossche · 2017-03-25T12:26:35Z

pandas/indexes/base.py

@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False):
        elif how == 'outer':
            join_index = self.union(other)

+        if sort:


@jreback There is not really an interaction between sort and how. It is just that when sort=False (the default) not additional sorting is done, and the order of the result depends on the type of join. When you set sort=True, you explicitly sort. And that regard, I find the current code of

if sort: join_index = join_index.sort_values()

very clear.
The only thing is that when how='outer', you would have sorted twice (as join_index is already sorted at this point in that case). But I wouldn't care about that, since it is not the default, and you can just not set sort to True to prevent that.

jorisvandenbossche · 2017-03-25T12:35:58Z

@jreback I would leave the testing / pytest changes for now. That is not that trivial, and this PR has already seen enough back and forth, time to get it merged :-)
I would just try to fix the remaining doc comments and the discussion regarding the sort keyword, and leave clean-up of the (partly existing) tests for a follow-up PR.

jreback · 2017-03-25T15:19:50Z

@albertvillanova I rebased this and pushed a commit fixing the tests. The reason I suggest this is that it makes for much shorter, more understandable code that is easier to error check. There were several duplicate tests (not really a big deal), but still the same, copy-pasting is in general not a great idea for repetative things.

jreback · 2017-03-27T17:34:10Z

@albertvillanova I think we just needed a couple of doc updates, otherwise this lgtm.

albertvillanova · 2017-03-28T07:20:10Z

Great @jreback, I am just addressing your previously requested changes and also adding a slightly modification in the tests to make them exhaustively consider all possible combinations of unsorted left-right indexes and sort=False,True.

albertvillanova · 2017-03-28T12:51:41Z

These are the changes I have made:

In doc/source/whatsnew/v0.20.0.txt I have changed the title (following both recommendations) and removed the example of pd.merge
In pandas/core/frame.py I have clarified which is the resulting order in the merge_doc docstring
In pandas/indexes/base.py I have added what sort does to the join _index_shared_docs dosctring
In pandas/tests/frame/test_join.py I have made a little modification so that it is also considered the case of unsorted right index in a right join with sort=False

jreback · 2017-03-28T12:55:31Z

pandas/core/frame.py

-    * left: use only keys from left frame (SQL: left outer join)
-    * right: use only keys from right frame (SQL: right outer join)
-    * outer: use union of keys from both frames (SQL: full outer join)
+    * left: use only keys from left frame (SQL: left outer join), preserving


nit: can you format these lines like

* right: use only keys from right frame similar to a SQL: right outer join; preserves key orderings

jreback · 2017-03-28T12:56:08Z

doc/source/whatsnew/v0.20.0.txt

+
+  .. ipython:: python
+
+     idx1 = pd.Index([2, 1, 0])


can you name these left_index & right_index (for clarity)

jreback · 2017-03-28T12:56:42Z

doc/source/whatsnew/v0.20.0.txt

+- ``DataFrame.join`` and ``pd.merge``
+
+  .. ipython:: python
+


can you name these left and right

jreback · 2017-03-28T12:57:48Z

pandas/tests/tools/test_merge.py

+
+
+@pytest.fixture
+def df1():


can you change these to df1->left, and df2->right

I have also changed the order of the right index (like in pandas/tests/frame/test_join.py), so that sort=False is tested for the right join.

@jreback I get an error when naming them left and right, because these names are also used in other parts of the code. What should I do?

jreback · 2017-03-28T12:58:05Z

pandas/tests/frame/test_join.py

+
+
+@pytest.fixture
+def df1():


can you change df1->left and df2->right

jreback · 2017-03-28T12:59:01Z

@albertvillanova just a couple of minor renames. ping when pushed and green and will merge. thanks for the patience!

you can see that PR's are mostly about getting the details / docs right; oftentimes the code changes are the easiest part!

jreback · 2017-03-28T19:00:14Z

hmm that last broke something........lmk when you have sorted and are green.

jreback · 2017-03-28T22:46:45Z

lgtm. ping on green.

jreback · 2017-03-29T00:30:29Z

thanks @albertvillanova nice PR!

as you can see, the docs really matter!

albertvillanova · 2017-03-29T04:21:55Z

Thanks both @jreback and @jorisvandenbossche for all the things I have learned with your comments, ideas and suggestions.

jorisvandenbossche · 2017-03-29T08:06:09Z

@albertvillanova Thanks!

closes pandas-dev#15582 Author: Albert Villanova del Moral <[email protected]> Author: Jeff Reback <[email protected]> Closes pandas-dev#15583 from albertvillanova/fix-15582 and squashes the following commits: 2d4e143 [Albert Villanova del Moral] Fix pytest fixture name collision 64e86a4 [Albert Villanova del Moral] Fix test on right join 73df69e [Albert Villanova del Moral] Address requested changes 8d2e9cc [Albert Villanova del Moral] Address requested changes 968c7f1 [Jeff Reback] DOC/TST: change to use parameterization 9e39794 [Albert Villanova del Moral] Address requested changes 5bf1508 [Albert Villanova del Moral] Address requested changes 654288b [Albert Villanova del Moral] Fix Travis errors 33eb740 [Albert Villanova del Moral] Address requested changes 3c200fe [Albert Villanova del Moral] Add new tests ef2581e [Albert Villanova del Moral] Fix Travis error f0d9d03 [Albert Villanova del Moral] Add whatsnew c96306d [Albert Villanova del Moral] Add sort argument to Index.join 047b513 [Albert Villanova del Moral] Address requested changes ec836bd [Albert Villanova del Moral] Fix Travis errors b977278 [Albert Villanova del Moral] Address requested changes 784fe75 [Albert Villanova del Moral] Fix error: line too long 1197b99 [Albert Villanova del Moral] Fix DataFrame column order when read from HDF file d9e29f8 [Albert Villanova del Moral] Create new DatetimeIndex from the Index.intersection result e7bcd28 [Albert Villanova del Moral] Fix typo in documentation a4ead99 [Albert Villanova del Moral] Fix typo c2a8dc3 [Albert Villanova del Moral] Implement tests c12bb3f [Albert Villanova del Moral] BUG: Fix index order for Index.intersection()

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Mar 7, 2017

jorisvandenbossche reviewed Mar 13, 2017

View reviewed changes

jreback requested changes Mar 13, 2017

View reviewed changes

jorisvandenbossche reviewed Mar 15, 2017

View reviewed changes

jreback reviewed Mar 15, 2017

View reviewed changes

jreback requested changes Mar 15, 2017

View reviewed changes

jorisvandenbossche reviewed Mar 19, 2017

View reviewed changes

jreback requested changes Mar 20, 2017

View reviewed changes

jreback requested changes Mar 23, 2017

View reviewed changes

jreback requested changes Mar 24, 2017

View reviewed changes

jorisvandenbossche reviewed Mar 25, 2017

View reviewed changes

Albert Villanova del Moral added 5 commits March 25, 2017 11:00

BUG: Fix index order for Index.intersection()

c12bb3f

Implement tests

c2a8dc3

Fix typo

a4ead99

Fix typo in documentation

e7bcd28

Create new DatetimeIndex from the Index.intersection result

d9e29f8

Albert Villanova del Moral and others added 5 commits March 25, 2017 11:02

Address requested changes

33eb740

Fix Travis errors

654288b

Address requested changes

5bf1508

Address requested changes

9e39794

DOC/TST: change to use parameterization

968c7f1

jreback force-pushed the fix-15582 branch from d9250f2 to 968c7f1 Compare March 25, 2017 15:18

jreback approved these changes Mar 27, 2017

View reviewed changes

jreback added this to the 0.20.0 milestone Mar 27, 2017

jreback added the API Design label Mar 27, 2017

Address requested changes

8d2e9cc

jreback reviewed Mar 28, 2017

View reviewed changes

Albert Villanova del Moral added 2 commits March 28, 2017 15:19

Address requested changes

73df69e

Fix test on right join

64e86a4

Fix pytest fixture name collision

2d4e143

jreback closed this in bd169dc Mar 29, 2017

		assert_frame_equal)


		class TestDataFrameJoin(tm.TestCase, TestData):


		df1.merge(df2, how='inner', left_index=True, right_index=True)

		pd.merge(df1, df2, how='inner', left_index=True, right_index=True)

		@@ -0,0 +1,91 @@
		# -- coding: utf-8 --

		from __future__ import print_function

		``DataFrame.align``).

		- ``Index.intersection`` and ``Index.join``

		res2

		- ``DataFrame.join``, ``DataFrame.merge`` and ``pd.merge``

		import pandas.util.testing as tm


		class TestDataFrameJoin(TestData):

Uh oh!

BUG: Fix index order for Index.intersection() #15583

BUG: Fix index order for Index.intersection() #15583

Uh oh!

Conversation

albertvillanova commented Mar 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Mar 5, 2017

Uh oh!

albertvillanova commented Mar 6, 2017

Uh oh!

codecov-io commented Mar 12, 2017 • edited by codecov bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Mar 13, 2017

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

albertvillanova Mar 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Mar 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

albertvillanova commented Mar 14, 2017

Uh oh!

jorisvandenbossche commented Mar 14, 2017

Uh oh!

jreback commented Mar 14, 2017

albertvillanova commented Mar 5, 2017 •

edited

Loading

codecov-io commented Mar 12, 2017 •

edited by codecov bot

Loading

albertvillanova Mar 13, 2017 •

edited

Loading

jreback commented Mar 13, 2017 •

edited

Loading