Skip to content

Clean multiindex keys #15615

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Conversation

toobaz
Copy link
Member

@toobaz toobaz commented Mar 8, 2017

Most of the change to code is mere refactoring (the small change in pandas/core/frame.py can be undone once we decide about #15438), but I did suppress three obsolete (the same job is already done in the right way in _getitem_iterable) and buggy lines which actually not only caused #15452, but were the real explanation for #15424.

As a side consequence, generators can now be used to index a MultiIndex. This is not documented, as far as I know, but it was already possible with a flat Index, and it was actually also tested, so I added tests for MultiIndex too and to setitem (except that not all of them currently work). Once this PR goes through, I plan to open three separate issues for

  • the failing tests with setitem and generators (of boolean values)
  • df.loc[['missing', 'keys'], :], which still erroneously returns an empty DataFrame rather than raising (as df.loc[['missing', 'keys']] does)
  • my proposal about a change in behaviour for missing keys

Feel free to tell me if any of this is not worth discussing.

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Mar 8, 2017
@codecov-io
Copy link

codecov-io commented Mar 8, 2017

Codecov Report

Merging #15615 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15615      +/-   ##
==========================================
+ Coverage   90.99%   91.01%   +0.01%     
==========================================
  Files         161      143      -18     
  Lines       49303    49387      +84     
==========================================
+ Hits        44863    44949      +86     
+ Misses       4440     4438       -2
Flag Coverage Δ
#multiple ?
#single ?
Impacted Files Coverage Δ
pandas/core/frame.py 97.87% <100%> (+0.06%) ⬆️
pandas/core/indexing.py 94.1% <100%> (+0.17%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/util/testing.py 80.92% <0%> (-19.08%) ⬇️
pandas/tools/plotting.py 71.79% <0%> (-10.03%) ⬇️
pandas/io/json/json.py 90.27% <0%> (-9.73%) ⬇️
pandas/util/_tester.py 35.29% <0%> (-3.6%) ⬇️
pandas/conftest.py 94.11% <0%> (-2.32%) ⬇️
pandas/types/concat.py 98.06% <0%> (-1.94%) ⬇️
pandas/io/pickle.py 79.54% <0%> (-1.71%) ⬇️
... and 194 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ca9fcd...03cec75. Read the comment docs.


# boolean generator (fails)
# s2 = self.series.copy()
# s2[iter(bool_idx)] = values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these future tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes (first of the points I mentioned for future issues)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than commenting simply xfail them

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha! I had missed the move to pytest

@toobaz toobaz force-pushed the clean_multiindex_keys branch from 2fb090f to b250e91 Compare March 10, 2017 17:54
@toobaz
Copy link
Member Author

toobaz commented Mar 10, 2017

(green)

s2.loc[iter(idces)] = values
assert_series_equal(s2, expected)

import pytest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imports go at the top. make 2 separate (non-class based) functions, the 2nd one which will xfail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not sure I understood what you meant by "non-class based")

@toobaz toobaz force-pushed the clean_multiindex_keys branch from b250e91 to b4a1c5b Compare March 11, 2017 01:36
@@ -198,6 +202,27 @@ def test_loc_getitem_array(self):
result = x.loc[scalar]
tm.assert_series_equal(result, expected)

def test_loc_generator(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there does not appear to be an equivalent test on Index with a generator, or even an iterator. Sure they should work. But is this fixed in this PR? or just adding testing to something already works?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR removes some broken code which prevented __getitem__(generator) to work, so yes, strictly speaking it fixes it, but this was a totally unintended consequence. I actually made sure to keep the tests for generators in two independent commits, which you can drop if you don't think they are useful. But there was already at least one test on indexing a Series with generators, I just expanded a bit. No idea about indexing Indexes.

@jreback jreback added this to the 0.20.0 milestone Mar 20, 2017
@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

not sure why this didn't fail. This is a change in behavior (which we should prob do, but needs discussion).

In [1]:         idx = pd.MultiIndex.from_product([['A', 'B', 'C'],
   ...:                                           ['foo', 'bar', 'baz']],
   ...:                                          names=['one', 'two'])
   ...:         s = pd.Series(np.arange(9, dtype='int64'), index=idx).sort_index()
   ...:     
   ...: 

In [2]: s
Out[2]: 
one  two
A    bar    1
     baz    2
     foo    0
B    bar    4
     baz    5
     foo    3
C    bar    7
     baz    8
     foo    6
dtype: int64
In [3]: s.loc[['A', 'D']]
KeyError: "['D'] not in index"

@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

not sure why this didn't fail.

Which commit are you on?!

In [3]: idx = pd.MultiIndex.from_product([['A', 'B', 'C'], ['foo', 'bar', 'baz']],names=['one', 'two'])

In [4]: s = pd.Series(np.arange(9, dtype='int64'), index=idx).sort_index()

In [5]: s.loc[['A', 'D']]
Out[5]: 
one  two
A    bar    1
     baz    2
     foo    0
dtype: int64

In [6]: pd.util.print_versions.get_sys_info()[0]
Out[6]: ('commit', 'b4a1c5b168961324d2260657424df621bbd0a645')

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

I rebased the PR. so rebase and push again.

@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

This is related to 05d70f4. Before:

In [2]: s = pd.Series(range(9), index=pd.MultiIndex.from_product([['A', 'B', 'C'], ['foo', 'bar', 'baz']],names=['one', 'two']))

In [3]: s.loc._getitem_iterable(['A', 'E'])
Out[3]: 
one  two
A    foo    0
     bar    1
     baz    2
dtype: int64

In [4]: pd.util.print_versions.get_sys_info()[0]
Out[4]: ('commit', '998c801f76256990b98d3f0d2ad885ae27c955a1')

After:

In [2]: s = pd.Series(range(9), index=pd.MultiIndex.from_product([['A', 'B', 'C'], ['foo', 'bar', 'baz']],names=['one', 'two']))

In [3]: s.loc._getitem_iterable(['A', 'E'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-62e87e14f902> in <module>()
----> 1 s.loc._getitem_iterable(['A', 'E'])

/home/nobackup/repo/pandas/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1090             # if it cannot handle
   1091             indexer, keyarr = labels._convert_listlike_indexer(
-> 1092                 key, kind=self.name)
   1093             if indexer is not None:
   1094                 return self.obj.take(indexer, axis=axis)

/home/nobackup/repo/pandas/pandas/indexes/multi.py in _convert_listlike_indexer(self, keyarr, kind)
   1598             mask = check == -1
   1599             if mask.any():
-> 1600                 raise KeyError('%s not in index' % keyarr[mask])
   1601 
   1602         return indexer, keyarr

KeyError: "['E'] not in index"

In [4]: pd.util.print_versions.get_sys_info()[0]
Out[4]: ('commit', '05d70f4e617a274813bdb02db69143b5554aa106')

My patch is good as before, but your new code is not coherent with our "desired" behavior for indexers.

Now, I assumed that proceeding step by step was the best way to move forward, but my assumption was clearly wrong if while I patiently wait for my trivial patch to get merged, you do find the time to completely rewrite related indexing methods (deleting a line I had explicitly mentioned as important when submitting my PR - all this in a commit which appears as "DOC: ..." in the logs, just to make it easier to spot it).

So, considering that the changes you made ironically go in a direction I like, we'll probably rather have the "famous" discussion first and then come back on this. I'm opening an issue for this in a minute.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

My patch is good as before, but your new code is not coherent with our "desired" behavior for

your patch fails on master.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

Now, I assumed that proceeding step by step was the best way to move forward, but my assumption was clearly wrong if while I patiently wait for my trivial patch to get merged, you do find the time to

and how is this not step-by-step?

any indexing patch is hardly trivial.

@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

any indexing patch is hardly trivial.

... if you don't understand what you are doing

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

... if you don't understand what you are doing

@toobaz these comments are hardly necessary.

You can put up patches all you like. But they need review and I will get to them when I have time.

@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

You can put up patches all you like. But they need review and I will get to them when I have time.

Patches are a total waste of time if you change the underlying code, and even more if you introduce new bugs there. You have time for your huge patch but not for my three-liner?! (because the other 10 lines were trivial refactoring)

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

Patches are a total waste of time if you change the underlying code, and even more if you introduce new bugs there. You have time for your huge patch but not for my three-liner?! (because the other 10 lines were trivial refactoring)

how so, we have 75 open PR's of which I have 10, some of which have sat here for weeks waiting for review / comment by others. This is how things work. I looked at your patch and it didn't work.

So I asked you to rebase? I am not sure of the problem.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

even more if you introduce new bugs there

How is this even true? It completely passed the test suite. Yours does not.

@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

I am not sure of the problem.

I am, but my time here is currently wasted. See my code above, it's not complicated, you just changed the behavior of a (non-public) method. No idea how to reproduce with user code, currently.

It completely passed the test suite. Yours does not.

It did before you introduced yours, so this is obviously a non-argument.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@toobaz I have asked you to rebase. If you can great.

PR's are merged when they fully pass the test suite.

patches are certainly require rebase a lot sometimes. To be honest I am working on someother PR's which DID require some cleanup in the indexing code.

You should be more appreciative of others's efforts.

@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

You should be more appreciative of others's efforts.

Ah ah, you telling me this. When you'll want to be appreciative of my efforts, feel free to reopen this (which I will be happy to rebase).

@toobaz I have asked you to rebase. If you can great.

Doesn't make sense as of now (you tried yourself, as you said)

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

Ah ah, you telling me this. When you'll want to be appreciative of my efforts, feel free to reopen this (which I will be happy to rebase).

again not sure of what you mean.

Doesn't make sense as of now (you tried yourself, as you said)

also not sure what you mean.

Sure I appreciate your efforts @toobaz but at the same time I ask you to appreciate that others are contributing. It doesn't matter that a patch was put in before yours, this PR still needs rebasing.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

and to clarify I just ran this PR and saw the failing tests, I didn't attempt to fix it.

@jreback
Copy link
Contributor

jreback commented May 13, 2017

needs a rebase and moved whatsnew. if its easier to close and re-submit that is fine too.

@toobaz
Copy link
Member Author

toobaz commented May 13, 2017

Rebasing is fine, but to apply this, s.loc._getitem_iterable(.) must be brought back to its pre- 05d70f4 behavior (on my TODO list but not terribly high), unless we decide for a change of API in #15747.

@gfyoung
Copy link
Member

gfyoung commented Jul 13, 2017

@toobaz : I know it's been some time, but according to GitHub, the only thing you absolutely need to update is the whatsnew. Do you still want to get this merged?

@toobaz
Copy link
Member Author

toobaz commented Jul 13, 2017

@toobaz : I know it's been some time, but according to GitHub, the only thing you absolutely need to update is the whatsnew.

Yes in order to solve conflicts, but then unfortunately tests will fail (see my previous comment). So I see it as pointless, but if instead there's some utility, I can do it easily.

Do you still want to get this merged?

Well, in principle yes, but no problem for me if you want to close, I can open another PR in the future. s.loc._getitem_iterable(.) must be fixed first, and I fear it will take more time than it took me to write this patch.

@gfyoung
Copy link
Member

gfyoung commented Jul 13, 2017

@toobaz : How about this: if you could patch the merge conflict (it's pretty simple to do as you said), and we can let Travis run the tests again just to see where things stand now.

If they fail, I can close this for you. However, if that happens, if you could add a comment regarding s.loc._getitem_iterable(...) in the issue that you posted (i.e. #15452) that this needs to be patched first, that would be useful for future reference. Or, if you think it should standalone as a separate issue, feel free to file a separate that references your initial issue.

How does that sound?

@toobaz
Copy link
Member Author

toobaz commented Jul 14, 2017

: How about this: if you could patch the merge conflict (it's pretty simple to do as you said), and we can let Travis run the tests again just to see where things stand now.

Sure, no problem! Done.

if you could add a comment regarding s.loc._getitem_iterable(...) in the issue that you posted (i.e. #15452) that this needs to be patched first, that would be useful for future reference. Or, if you think it should standalone as a separate issue, feel free to file a separate that references your initial issue.

Done: #16920

@gfyoung
Copy link
Member

gfyoung commented Jul 14, 2017

Alright: Two test failures due to the fact that you're using outdated testing functions (pytest.raises instead of tm.assertRaises). However, you do have one test failure as you predicted because of that issue re: the _get_item issue.

I will close this PR given your previous comments. As you said previously, #16920 will need to be patched before fixing #15452. Feel free to take a stab at the former if you have the time.

@gfyoung gfyoung closed this Jul 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incoherence between Index and MultiIndex when labels in list are not found
4 participants