-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix index order for Index.intersection() #15583
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
not sure what you are attempting to do here. This needs quite a few tests to even flesh out what is going on. |
I have implemented/updated some tests to demonstrate that the bug is at Index.intersection(other). Before this was returning the intersection of self and other preserving the order of other instead of preserving the order of self. |
Codecov Report
@@ Coverage Diff @@
## master #15583 +/- ##
==========================================
- Coverage 91.01% 90.96% -0.06%
==========================================
Files 143 143
Lines 49400 49429 +29
==========================================
- Hits 44963 44961 -2
- Misses 4437 4468 +31
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual changes in the intersection method look good to me.
Can you add a whatsnew notice?
pandas/tseries/index.py
Outdated
result = self._simple_new(result._values, name=result.name, | ||
tz=result.tz) | ||
if result.freq is None: | ||
result.offset = to_offset(result.inferred_freq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change related to the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as the simple result of Index.intersection inherits all attributes from other, and this is not the expected result. Therefore, it is necessary to create a new DatetimeIndex using only three of the Index.intersection attributes: _values, name and tz.
pandas/tests/frame/test_misc_api.py
Outdated
|
||
joined = f.join(f2) | ||
self.assert_index_equal(f.index, joined.index) | ||
self.assertEqual(len(joined.columns), 4) | ||
expected_columns = pd.Index(['A', 'B', 'C', 'D']) | ||
self.assert_index_equal(joined.columns, expected_columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that this is more explicitly tests is good! (instead of only the length)
But just for my understanding, is the behaviour changed for this specific case by the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, for the cases of 'left' (default), 'right' and 'outer' joins, the behavior is the same before and after the PR. Only for the 'inner' join (which uses the Index.intersection), the PR fixes the bug in the result.index order: before the PR, the index order was the one of the other, and after the PR the index order is the one in self.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was talking about the columns, not index. I think the columns of the result are not influenced by inner vs outer join?
pandas/tests/frame/test_misc_api.py
Outdated
|
||
joined = f.join(f2, how='right') | ||
self.assert_index_equal(joined.index, f2.index) | ||
self.assertEqual(len(joined.columns), 4) | ||
self.assert_index_equal(joined.columns, expected_columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of a 'right' join, you check that the index is equal to the right frame (f2
). This sounds the most logical to do, but, then the docs should be updated? (as the docs say the order of the "calling (left) DataFrame" is preserved).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed that sentence from the doc.
pandas/tests/frame/test_misc_api.py
Outdated
|
||
# inner | ||
|
||
f = self.frame.reindex(columns=['A', 'B'])[:10] | ||
f2 = self.frame.reindex(columns=['C', 'D']) | ||
|
||
joined = f.join(f2, how='inner') | ||
self.assert_index_equal(joined.index, f.index.intersection(f2.index)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this test more explicit by not using intersection
? As this was not catched before because intersection itself was not working correctly (I know you fixed both now at the same time, but maybe good to make the tests independent)
pandas/tests/frame/test_misc_api.py
Outdated
joined = f.join(f2, how='outer') | ||
self.assertTrue(tm.equalContents(self.frame.index, joined.index)) | ||
self.assertEqual(len(joined.columns), 4) | ||
self.assert_index_equal(joined.index, self.frame.index.sort_values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this always the sorted index? Again, it is not fully clear what to expect based on the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for the 'outer' join, the return DataFrame index is sorted. This is because DataFrame.join uses Index.join for the DataFrame indexes, and the latter uses Index.union when how='outer', which sorts the result Index. Therefore, I agree with you that this should be clarified in the doc.
A more general question is the notice of "Sortedness of the result is not guaranteed." in the Index.intersection docstring. Whether we should keep that or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a sub-section on the whatsnew api-breaking changes. this need an old/new example
pandas/tests/frame/test_misc_api.py
Outdated
@@ -61,37 +61,32 @@ def test_join_index(self): | |||
# left / right | |||
|
|||
f = self.frame.reindex(columns=['A', 'B'])[:10] | |||
f2 = self.frame.reindex(columns=['C', 'D']) | |||
f2 = self.frame.reindex(columns=['C', 'D'])[5:][::-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use .iloc
or .loc
here more idiomatic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could I use .ix instead? I ask this because the indexing is of both types, label and integer positional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no ix is deprecated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can easily indexing like this:
self.frame.loc[self.frame.index[5::-1], ['C', 'D']]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If .ix is deprecated, this should be pointed out in the doc:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#ix-indexer-is-deprecated
you are looking at released docs
ix depreciation is not released yet - it's in 0.20.0
pandas/tests/indexes/test_base.py
Outdated
expected = Index([5, 3, 4], name='idx') | ||
result = idx1.intersection(idx2) | ||
self.assert_index_equal(result, expected) | ||
self.assertEqual(result.name, expected.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to compare names as assert_index_equal
already does this
pandas/tests/indexes/test_base.py
Outdated
expected = Index([5, 3, 4], name=None) | ||
result = idx1.intersection(idx2) | ||
self.assert_index_equal(result, expected) | ||
self.assertEqual(result.name, expected.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
pandas/tseries/index.py
Outdated
if isinstance(result, DatetimeIndex): | ||
if result.freq is None: | ||
result.offset = to_offset(result.inferred_freq) | ||
result = self._simple_new(result._values, name=result.name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use _shallow_copy
rather than _simple_new
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I use _shallow_copy, the result will inherit all self attributes (e.g., freq) and this might not be the expected result. For example, this test would fail:
base = date_range('6/1/2000', '6/30/2000', freq='D', name='idx')
rng4 = date_range('7/1/2000', '7/31/2000', freq='D', name='idx')
expected4 = DatetimeIndex([], name='idx')
because result would inherit self freq = 'D'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you misunderstand, simply pass the new freq
result = self._shallow_copy(...., freq=....)
We don't use _simple_new
at all (outside of a few specific places), this is not one of them. This is all documented next to these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might I simply use the constructor DatetimeIndex()? I ask this because I think the result freq is not known in advance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, use _shallow_copy
with freq=None
, which will not infer it, and then you can reset the freq after
also add the explicit example from the issue. ideally add some more tests in I would actually like to either:
Its not imperative that we do this here. but could do as a followup. (or a pre-PR alternatively) |
I have a question @jreback about the whatsnew notice: is this PR introducing an API-breaking change? I thought I was just fixing a bug about Index order after intersection. The expected bahavior of the intersection is analogue to:
thus preserving the order of the caller a1. |
I don't think this needs a subsection in the API changes section. I would regard this as a bug fix (certainly for |
I disagree. This is most certainly an API change. For sure we need a sub-section. |
I have addressed the following requested changes:
|
OK to add the intersection order change to the API changes section, but just as a bullet pointin "Other API Changes"? @jreback can you explain why you think this needs a subsection? Do you think people rely on the order? (The docstring also says that the order should not be relied upon) |
@jorisvandenbossche any non-trivial change needs an explanation. Sure this is a 'bug-fix'. But if I have code that relied upon the ordering (rightly or wrongly), I would for sure want to know. This is very subtle. You can never over-document whatsnew IMHO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
Added small comments to the doc updates
pandas/core/frame.py
Outdated
@@ -4485,16 +4485,15 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='', | |||
* left: use calling frame's index (or column if on is specified) | |||
* right: use other frame's index | |||
* outer: form union of calling frame's index (or column if on is | |||
specified) with other frame's index | |||
specified) with other frame's index, and sort it | |||
* inner: form intersection of calling frame's index (or column if | |||
on is specified) with other frame's index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also note for inner what the order will be? (because you could also think this will be sorted)
pandas/core/frame.py
Outdated
* inner: form intersection of calling frame's index (or column if | ||
on is specified) with other frame's index | ||
lsuffix : string | ||
Suffix to use from left frame's overlapping columns | ||
rsuffix : string | ||
Suffix to use from right frame's overlapping columns | ||
sort : boolean, default False | ||
Order result DataFrame lexicographically by the join key. If False, | ||
preserves the index order of the calling (left) DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe leave something as "If False, index order depends on the join type (how keyword)" ?
pandas/tests/frame/test_join.py
Outdated
assert_frame_equal) | ||
|
||
|
||
class TestDataFrameJoin(tm.TestCase, TestData): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you are writing new tests, pls use the pytest style. don't inherit from tm.TestCase
), so don't use self.
for anything testing related
e.g. just use
assert_frame_equal
and such
pandas/tests/frame/test_join.py
Outdated
from pandas.tests.frame.common import TestData | ||
|
||
import pandas.util.testing as tm | ||
from pandas.util.testing import (assertRaisesRegexp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually prefer NOT to import the testing routines and instead use tm.assert_frame_equal
, but as long as its consistent one way of the other is ok.
doc/source/whatsnew/v0.20.0.txt
Outdated
|
||
New Behavior: | ||
|
||
.. code-block:: ipython |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ipython:: here (on the new).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I have a naive question: what is the difference between code-block:: ipython
and ipython:: python
and why should the former used for previous behavior and the latter for new? Is there a way to see this txt file parsed to check the format? Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can build the docs here: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#how-to-build-the-pandas-documentation
http://www.sphinx-doc.org/en/stable/markup/code.html
code-block
shows code literally, while ipython
actually executes it and displays the results.
So we almost always want to actually execute the code, but a previous behaviour by-definition is not current code, so we show a snippet that is executed previously (meaning you run it in 0.19.2 and paste the output here).
doc/source/whatsnew/v0.20.0.txt
Outdated
Previous Behavior: | ||
|
||
.. code-block:: ipython | ||
In [2]: df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put these 2 in an ipython block at the top (and then show them, e.g.
.. ipython:: python
df1 = pd.DataFrame(....)
df1
df2 = pd.DataFrame(...)
df2
Previous Behaviour:
.. code-block:: ipython
# show this part
In [4]: df1.join(....)
New Behaviour:
.. ipython:: python
df1.join(....)
@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False): | |||
elif how == 'outer': | |||
join_index = self.union(other) | |||
|
|||
if sort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should only be on an outer join right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit there is an ambiguity here.
By default (sort=False
) the result index values are not sorted, except for the outer join. This is because Index.union sorts the result index values.
If we set sort=True
, the result index values are sorted for all types of join (although this is redundant for the outer join, as it is already sorted).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Is this the expected result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure. can you show a small example with this both turned on and off (for each of the how's). Sorry this is a bit of work, but this is a bit of unexplored API area
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In file tests/frame/test_join.py
I have created 2 methods test_join
and test_join_sort
where I test the joined result for each of the 4 values of how
, for default sort=False
and sort=True
, respectively. Is that OK for you? Or would you need further examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't answer my question though, isn't sort ignored for all but inner join? If so then I would set it False (to make the code clear for the others).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, sort=True is only ignored for outer join, because the outer join always sorts the result index. For the other types of join (left, right, inner), the default result is unsorted, as it preserves the left index order (for left and inner) or the right index order (for right), which might not be previously sorted.
By setting sort=True, we force the resulting index order to be sorted, even for the cases of left, right and inner joins.
What do you mean by I would set it False? The parameter sort is set False by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@albertvillanova what I mean is that reading the code it is very confusing and not obvious how sort
interacts with how
.
What I mean is I would explicity set sort=False
(and maybe a comment) like this:
if how == 'left':
join_index = self
# sort=True is ignored? if so, then set sort=False
elif how == 'right':
join_index = other
# same
elif how == 'inner':
join_index = self.intersection(other)
elif how == 'outer':
join_index = self.union(other)
# or could just sort=False
assert sort is False, "outer join must explicty have sort=False"
if sort:
join_index = join_index.sort_values()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback There is not really an interaction between sort
and how
. It is just that when sort=False
(the default) not additional sorting is done, and the order of the result depends on the type of join. When you set sort=True
, you explicitly sort. And that regard, I find the current code of
if sort:
join_index = join_index.sort_values()
very clear.
The only thing is that when how='outer'
, you would have sorted twice (as join_index
is already sorted at this point in that case). But I wouldn't care about that, since it is not the default, and you can just not set sort
to True to prevent that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Great explanation.
pandas/core/frame.py
Outdated
@@ -4485,16 +4485,17 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='', | |||
* left: use calling frame's index (or column if on is specified) | |||
* right: use other frame's index | |||
* outer: form union of calling frame's index (or column if on is | |||
specified) with other frame's index | |||
specified) with other frame's index, and sort it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a little bit of expl here
Further I am pretty sure the pd.merge
, Series.join, DataFrame.merge doc-string needs to be updated as well (as these all go thru merge for the joins we are talking about) (they could also go thru concat for certain ones, but they are not relevant here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right! This also affects DataFrame.merge
, pd.merge
and Series.align
. Their docstrings need to be updated. And also the API-breaking whatsnew notice!
pandas/tests/frame/test_join.py
Outdated
df2 = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3]) | ||
|
||
result = df1.join(df2) | ||
expected = pd.DataFrame({'a': [20, 10, 0], 'b': [200, 100, None]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't put None
when we really use np.nan
.
pandas/tests/frame/test_join.py
Outdated
index=[0, 1, 2, 3]) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_join_index(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of these tests on index should be in tests/pandas/tests/indexes/test_base.py
. These are done pretty generically, follow the existing pattern (also wouldn't hurt to move these tests and your new ones to a new test class).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only new method I have created is test_join
. All the other methods were previously in tests/frame/test_misc_api.py
and I just moved them to this file I have created tests/frame/test_join.py
.
Despite their misleading names, the methods test_join_index
and a test_join_index_more
do not directly test any Index
method, but the DataFrame.join
functionality and the index and columns in the result DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's fine then.
@@ -609,14 +609,14 @@ def test_intersection(self): | |||
# non monotonic | |||
idx1 = Index([5, 3, 2, 4, 1], name='idx') | |||
idx2 = Index([4, 7, 6, 5, 3], name='idx') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the file where all of the index join tests go.
@@ -2801,6 +2802,7 @@ def join(self, other, how='left', level=None, return_indexers=False): | |||
how : {'left', 'right', 'inner', 'outer'} | |||
level : int or level name, default None | |||
return_indexers : boolean, default False | |||
sort : boolean, default False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded tag
pandas/indexes/range.py
Outdated
@@ -444,16 +445,19 @@ def join(self, other, how='left', level=None, return_indexers=False): | |||
how : {'left', 'right', 'inner', 'outer'} | |||
level : int or level name, default None | |||
return_indexers : boolean, default False | |||
sort : boolean, default False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same (actually this doc-string should be a shared_doc string if you want to move it), rather than repeating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback How could I create a shared_docstring? Any existing example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look for shared_docs and u can see
(these r defined in base.py)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@albertvillanova added some small comments on the docs
@jreback the Index.join
docstring says "this is an internal non-public method". I am not sure of its status (never used that method? But if that is really the case, I would remove it from the examples in the whatsnew.
doc/source/whatsnew/v0.20.0.txt
Outdated
1 10 100 | ||
2 20 200 | ||
|
||
In [7]: (res1, res2) = df1.align(df2, axis=0, join='inner') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already have an align example above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the former is for Series and the latter is for DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is needed to show both, people know series and dataframe methods work similar.
I would just list both Series and DataFrame align in the bullet item of Series.align, and then show eg the series example.
doc/source/whatsnew/v0.20.0.txt
Outdated
|
||
df1.merge(df2, how='inner', left_index=True, right_index=True) | ||
|
||
pd.merge(df1, df2, how='inner', left_index=True, right_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is needed to show the result of both pd.merge(df1, df2)
and df1.merge(df2)
(you mentioned them all in the text)
(BTW, you don't need to put those empty lines between the different input lines, that has no impact on the result after building the docs)
pandas/core/frame.py
Outdated
@@ -127,7 +127,8 @@ | |||
* left: use only keys from left frame (SQL: left outer join) | |||
* right: use only keys from right frame (SQL: right outer join) | |||
* outer: use union of keys from both frames (SQL: full outer join) | |||
* inner: use intersection of keys from both frames (SQL: inner join) | |||
* inner: use intersection of keys from both frames (SQL: inner join), | |||
preserving the order of the left keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small rst nitpick (it's very whitespace sensitive ..): the 'preserving ..' should be aligned with the 'inner ..' on the line above
pandas/core/frame.py
Outdated
* inner: form intersection of calling frame's index (or column if | ||
on is specified) with other frame's index | ||
on is specified) with other frame's index, preserving the | ||
order of the calling's one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here (about the whitespace alignment)
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -308,6 +308,8 @@ Other enhancements | |||
- ``pd.types.concat.union_categoricals`` gained the ``ignore_ordered`` argument to allow ignoring the ordered attribute of unioned categoricals (:issue:`13410`). See the :ref:`categorical union docs <categorical.union>` for more information. | |||
- ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`) | |||
- ``pd.DataFrame.to_latex`` and ``pd.DataFrame.to_string`` now allow optional header aliases. (:issue:`15536`) | |||
- ``Index.intersection()`` accepts parameter ``sort`` (:issue:`15582`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be join
instead of intersection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. I have fixed this.
pandas/tests/frame/test_merge.py
Outdated
@@ -0,0 +1,91 @@ | |||
# -*- coding: utf-8 -*- | |||
|
|||
from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, these need to go in pandas/tests/tools/test_merge.py
. just fit them in where appropriate.
.join
is fine here as its a direct / used method on DataFrame. .merge
is a direct calling of the pd.merge
where LOTS of tests already exist. Test location is very important to avoid future confusion.
In fact I suspect most of these are duplicated tests, pls only add as appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you already added to the correct location. no point in duplicating tests. delete these.
pandas/tests/tools/test_merge.py
Outdated
@@ -1355,3 +1355,92 @@ def test_dtype_on_merged_different(self, change, how, left, right): | |||
np.dtype('int64')], | |||
index=['X', 'Y', 'Z']) | |||
assert_series_equal(result, expected) | |||
|
|||
|
|||
class TestMergeOnIndexes(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok you added here good.
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -926,3 +1055,4 @@ Bug Fixes | |||
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`) | |||
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`) | |||
- Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`) | |||
- Bug with ``sort=True`` in ``DataFrame.join``, ``DataFrame.merge`` and ``pd.merge`` when joining on index (:issue:`15582`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to mention DataFrame.merge
, pd.merge
and .join
are the important ones
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -308,6 +308,8 @@ Other enhancements | |||
- ``pd.types.concat.union_categoricals`` gained the ``ignore_ordered`` argument to allow ignoring the ordered attribute of unioned categoricals (:issue:`13410`). See the :ref:`categorical union docs <categorical.union>` for more information. | |||
- ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`) | |||
- ``pd.DataFrame.to_latex`` and ``pd.DataFrame.to_string`` now allow optional header aliases. (:issue:`15536`) | |||
- ``Index.join()`` accepts parameter ``sort`` (:issue:`15582`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an internal method, remove this comment
doc/source/whatsnew/v0.20.0.txt
Outdated
Index order after inner join due to Index intersection | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The ``Index.intersection`` now preserves the order of the calling Index (left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say what this did previously. No need to mention Index.join
or DataFrame.merge
and say .align
methods. simpler is better.
doc/source/whatsnew/v0.20.0.txt
Outdated
``DataFrame.align``). | ||
|
||
- ``Index.intersection`` and ``Index.join`` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove refernces to Index.join
doc/source/whatsnew/v0.20.0.txt
Outdated
res2 | ||
|
||
- ``DataFrame.join``, ``DataFrame.merge`` and ``pd.merge`` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove references to DataFrame.merge
doc/source/whatsnew/v0.20.0.txt
Outdated
|
||
df1.join(df2, how='inner') | ||
df1.merge(df2, how='inner', left_index=True, right_index=True) | ||
(res1, res2) = df1.align(df2, axis=0, join='inner') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the .align examples. This is not that important to show as its a rarely used method. The .join / .intersection are much more important.
@jreback and @jorisvandenbossche Is it OK after last changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of minor doc edits. changes suggested for testing. This is going to be a whole lot easier to read and less error prone to re-write in a parametrized way. Its pretty straightforward to do this. I put an example up.
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -705,6 +706,69 @@ New Behavior: | |||
TypeError: Cannot compare 2014-01-01 00:00:00 of | |||
type <class 'pandas.tslib.Timestamp'> to string column | |||
|
|||
.. _whatsnew_0200.api_breaking.index_order: | |||
|
|||
Index order after inner join due to Index intersection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the title about what you have in the first sentence
Index.intersectio now preserves the order of the left Index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would include 'join' in the title as well. Eg Index.intersection and inner join now preserve the order of the left Index
doc/source/whatsnew/v0.20.0.txt
Outdated
.. ipython:: python | ||
|
||
df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0]) | ||
df1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would only show .join here. a merge on the indexes is by-definition a join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is the inner join of merge
where the keys are columns also affected?
If so, I would include the merge example, but by joining on a common column instead of with left/right_index=True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the inner join on columns is not affected. Indeed it was already working as expected, preserving the order of the left column.
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -926,3 +990,4 @@ Bug Fixes | |||
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`) | |||
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`) | |||
- Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`) | |||
- Bug with ``sort=True`` in ``DataFrame.join`` and ``pd.merge`` when joining on indexes (:issue:`15582`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put notes somewhere in the Bug Fix section and not at the end, otherwise you will get merge conflicts a lot.
pandas/indexes/base.py
Outdated
@@ -2855,11 +2854,18 @@ def join(self, other, how='left', level=None, return_indexers=False): | |||
how : {'left', 'right', 'inner', 'outer'} | |||
level : int or level name, default None | |||
return_indexers : boolean, default False | |||
sort : boolean, default False | |||
|
|||
.. versionadded:: 0.20.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a sentence explaining what sort does (I know its repeatative, but this is a different part of the code)
@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False): | |||
elif how == 'outer': | |||
join_index = self.union(other) | |||
|
|||
if sort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't answer my question though, isn't sort ignored for all but inner join? If so then I would set it False (to make the code clear for the others).
pandas/tests/frame/test_join.py
Outdated
index=[0, 1, 2, 3]) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_join_sort(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to integrate these tests and use a parametrization to make this easy; this is not part of the a class test, just a raw function.
something like
@pytest.fixture
def df1():
return DataFrame(...)
@pytest.fixture
def df2():
return DataFrame(...)
@pytest.mark.parametrize(
"how, sort, expected", [
('left', False, DataFrame(.....)),
('left', True, DataFrame(.....))])
def test_join(how, sort, expected):
result = df1.join(df2, how=how, sort=sort)
tm.assert_frame_equal(result, expected)
pandas/tests/frame/test_join.py
Outdated
index=[0, 1, 2, 3]) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_join_index(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would parametrize here as well.
pandas/tests/frame/test_join.py
Outdated
tm.assert_index_equal(joined.index, self.frame.index.sort_values()) | ||
tm.assert_index_equal(joined.columns, expected_columns) | ||
|
||
tm.assertRaisesRegexp(ValueError, 'join method', f.join, f2, how='foo') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then make these a separate test (for these error cases)
def test_join_index_errors
pandas/tests/frame/test_join.py
Outdated
import pandas.util.testing as tm | ||
|
||
|
||
class TestDataFrameJoin(TestData): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of using a class definition make a fixture out of frame
@pytest.fixture
def frame():
return TestData.frame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the you don't need a class at all for testing
pandas/tests/tools/test_merge.py
Outdated
def test_merge_on_indexes(self): | ||
df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0]) | ||
df2 = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change this to be parametrized like join above. (e.g. matrix of how/sort)
@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False): | |||
elif how == 'outer': | |||
join_index = self.union(other) | |||
|
|||
if sort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@albertvillanova what I mean is that reading the code it is very confusing and not obvious how sort
interacts with how
.
What I mean is I would explicity set sort=False
(and maybe a comment) like this:
if how == 'left':
join_index = self
# sort=True is ignored? if so, then set sort=False
elif how == 'right':
join_index = other
# same
elif how == 'inner':
join_index = self.intersection(other)
elif how == 'outer':
join_index = self.union(other)
# or could just sort=False
assert sort is False, "outer join must explicty have sort=False"
if sort:
join_index = join_index.sort_values()
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -705,6 +706,69 @@ New Behavior: | |||
TypeError: Cannot compare 2014-01-01 00:00:00 of | |||
type <class 'pandas.tslib.Timestamp'> to string column | |||
|
|||
.. _whatsnew_0200.api_breaking.index_order: | |||
|
|||
Index order after inner join due to Index intersection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would include 'join' in the title as well. Eg Index.intersection and inner join now preserve the order of the left Index
pandas/core/frame.py
Outdated
@@ -127,7 +127,8 @@ | |||
* left: use only keys from left frame (SQL: left outer join) | |||
* right: use only keys from right frame (SQL: right outer join) | |||
* outer: use union of keys from both frames (SQL: full outer join) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the resulting order here? Also sorted as for join?
doc/source/whatsnew/v0.20.0.txt
Outdated
.. ipython:: python | ||
|
||
df1 = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0]) | ||
df1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is the inner join of merge
where the keys are columns also affected?
If so, I would include the merge example, but by joining on a common column instead of with left/right_index=True
@@ -2886,6 +2888,9 @@ def join(self, other, how='left', level=None, return_indexers=False): | |||
elif how == 'outer': | |||
join_index = self.union(other) | |||
|
|||
if sort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback There is not really an interaction between sort
and how
. It is just that when sort=False
(the default) not additional sorting is done, and the order of the result depends on the type of join. When you set sort=True
, you explicitly sort. And that regard, I find the current code of
if sort:
join_index = join_index.sort_values()
very clear.
The only thing is that when how='outer'
, you would have sorted twice (as join_index
is already sorted at this point in that case). But I wouldn't care about that, since it is not the default, and you can just not set sort
to True to prevent that.
@jreback I would leave the testing / pytest changes for now. That is not that trivial, and this PR has already seen enough back and forth, time to get it merged :-) |
@albertvillanova I rebased this and pushed a commit fixing the tests. The reason I suggest this is that it makes for much shorter, more understandable code that is easier to error check. There were several duplicate tests (not really a big deal), but still the same, copy-pasting is in general not a great idea for repetative things. |
@albertvillanova I think we just needed a couple of doc updates, otherwise this lgtm. |
Great @jreback, I am just addressing your previously requested changes and also adding a slightly modification in the tests to make them exhaustively consider all possible combinations of unsorted left-right indexes and sort=False,True. |
These are the changes I have made:
|
pandas/core/frame.py
Outdated
* left: use only keys from left frame (SQL: left outer join) | ||
* right: use only keys from right frame (SQL: right outer join) | ||
* outer: use union of keys from both frames (SQL: full outer join) | ||
* left: use only keys from left frame (SQL: left outer join), preserving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you format these lines like
* right: use only keys from right frame
similar to a SQL: right outer join; preserves key orderings
doc/source/whatsnew/v0.20.0.txt
Outdated
|
||
.. ipython:: python | ||
|
||
idx1 = pd.Index([2, 1, 0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you name these left_index & right_index (for clarity)
- ``DataFrame.join`` and ``pd.merge`` | ||
|
||
.. ipython:: python | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you name these left and right
pandas/tests/tools/test_merge.py
Outdated
|
||
|
||
@pytest.fixture | ||
def df1(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change these to df1->left, and df2->right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have also changed the order of the right index (like in pandas/tests/frame/test_join.py), so that sort=False is tested for the right join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I get an error when naming them left and right, because these names are also used in other parts of the code. What should I do?
pandas/tests/frame/test_join.py
Outdated
|
||
|
||
@pytest.fixture | ||
def df1(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change df1->left and df2->right
@albertvillanova just a couple of minor renames. ping when pushed and green and will merge. thanks for the patience! you can see that PR's are mostly about getting the details / docs right; oftentimes the code changes are the easiest part! |
hmm that last broke something........lmk when you have sorted and are green. |
lgtm. ping on green. |
thanks @albertvillanova nice PR! as you can see, the docs really matter! |
Thanks both @jreback and @jorisvandenbossche for all the things I have learned with your comments, ideas and suggestions. |
@albertvillanova Thanks! |
closes pandas-dev#15582 Author: Albert Villanova del Moral <[email protected]> Author: Jeff Reback <[email protected]> Closes pandas-dev#15583 from albertvillanova/fix-15582 and squashes the following commits: 2d4e143 [Albert Villanova del Moral] Fix pytest fixture name collision 64e86a4 [Albert Villanova del Moral] Fix test on right join 73df69e [Albert Villanova del Moral] Address requested changes 8d2e9cc [Albert Villanova del Moral] Address requested changes 968c7f1 [Jeff Reback] DOC/TST: change to use parameterization 9e39794 [Albert Villanova del Moral] Address requested changes 5bf1508 [Albert Villanova del Moral] Address requested changes 654288b [Albert Villanova del Moral] Fix Travis errors 33eb740 [Albert Villanova del Moral] Address requested changes 3c200fe [Albert Villanova del Moral] Add new tests ef2581e [Albert Villanova del Moral] Fix Travis error f0d9d03 [Albert Villanova del Moral] Add whatsnew c96306d [Albert Villanova del Moral] Add sort argument to Index.join 047b513 [Albert Villanova del Moral] Address requested changes ec836bd [Albert Villanova del Moral] Fix Travis errors b977278 [Albert Villanova del Moral] Address requested changes 784fe75 [Albert Villanova del Moral] Fix error: line too long 1197b99 [Albert Villanova del Moral] Fix DataFrame column order when read from HDF file d9e29f8 [Albert Villanova del Moral] Create new DatetimeIndex from the Index.intersection result e7bcd28 [Albert Villanova del Moral] Fix typo in documentation a4ead99 [Albert Villanova del Moral] Fix typo c2a8dc3 [Albert Villanova del Moral] Implement tests c12bb3f [Albert Villanova del Moral] BUG: Fix index order for Index.intersection()
git diff upstream/master | flake8 --diff