Slicing with no keys found #10695

Closed
filmackay opened this issue Jul 29, 2015 · 28 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action

Comments

@filmackay

I just upgraded to the latest pandas (0.16), and a new error (not present in 0.15) has hit me when slicing with multiple values (df.ix[[list_of_values]]). I actually think it is more valid to return an empty DataFrame than to throw an error.

The best I've been able to come up with to reproduce the previous behaviour (fail silently, return empty DataFrame) is:

# I want to filter a DataFrame by the first level of the index
df.ix[df.reset_index().IndexValue.isin(list_of_values).values]

Not saying I'm right on the error/empty argument; but is the above the most elegant solution?
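For what it's worth, a slightly cleaner spelling of the same intersection-style filter builds the boolean mask directly from the index level, without the reset_index round trip (a sketch, assuming the level is named 'first'):

```python
import numpy as np
import pandas as pd

# Toy frame with a two-level index; 'first' is the level being filtered on.
idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                 names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['a', 'b'])

list_of_values = ['bar', 'missing']

# Boolean mask built directly from the index level -- no reset_index needed.
# Missing keys are silently dropped; no match at all yields an empty DataFrame.
result = df[df.index.get_level_values('first').isin(list_of_values)]
```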

Perhaps we should consider three distinct slicing operations:

  1. Insists on everything being there (and throws errors)
  2. Creates spaces where there is nothing there (reindex)
  3. Just returns what's present, leaving the rest out (silent error)

I would think anyone indexing would be very aware of which of the above they are expecting?
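For concreteness, the three modes roughly correspond to spellings that already exist (a sketch with a made-up frame and key list; recent pandas raises from `.loc` on any missing label):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
keys = ['a', 'z']          # 'z' is not in the index

# 1. Strict: recent pandas raises KeyError from .loc when any label is missing.
try:
    strict = df.loc[keys]
except KeyError:
    strict = None

# 2. Union ("create spaces"): .reindex keeps every requested key, NaN-filled.
union = df.reindex(keys)

# 3. Intersection ("just what's present"): pre-intersect the keys.
intersect = df.loc[df.index.intersection(keys)]
```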

@sinhrks
Member

sinhrks commented Jul 29, 2015

Thanks for the report. Can you attach a script which includes the sample data preparation?

@sinhrks sinhrks added the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 29, 2015
@jreback
Contributor

jreback commented Jul 29, 2015

this is discussed tangentially in #10549

the discussion is whether .loc is like .reindex or not, or somewhere in between (where we are now). If you have at least a single element that matches, you get behavior like a .reindex; if you have NO elements matching, you get a KeyError.

Some want to always make it a KeyError if not ALL elements match. This I think is too restrictive and makes bugs very hard to find.

The other way is also a problem: if you have no matches it would silently skip a fairly common error condition IMHO.

Note that you are discussing .ix, but it is very unlikely we will change this behavior AT ALL. If anything gets changed it would be .loc, which is much more strict and deterministic without respect to the type of the index data (IOW, .ix has lots and lots of edge cases)

@filmackay
Author

@sinhrks: yes of course (apologies), how about this:

>>> df = pd.DataFrame(np.random.randint(10, size=(8, 8)),
...                   index=pd.MultiIndex.from_product(
...                       [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
...                       names=['first', 'second']))
>>> df 
              0  1  2  3  4  5  6  7
first second                        
bar   one     0  5  5  5  6  2  6  8
      two     2  6  9  0  3  6  7  9
baz   one     9  0  9  9  2  5  7  4
      two     4  8  1  2  9  2  8  1
foo   one     2  7  3  6  5  5  5  2
      two     3  4  6  2  7  7  1  2
qux   one     0  8  5  9  5  5  7  3
      two     7  4  0  7  3  6  8  6

# works great:
>>> df.loc[['bar']]
              0  1  2  3  4  5  6  7
first second
bar   one     9  4  6  0  9  2  0  6
      two     0  2  0  4  2  6  5  4

# not so much:
>>> df.loc[['ba']]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 1189, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 1323, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 972, in _getitem_iterable
    raise KeyError("cannot index a multi-index axis with these keys")
KeyError: 'cannot index a multi-index axis with these keys'

# my workaround: (but this is not "type safe")
>>> df.loc[df.reset_index()['first'].isin(['ba']).values]
Empty DataFrame
Columns: [0, 1, 2, 3, 4, 5, 6, 7]
Index: []

# still works for valid values:
>>> df.loc[df.reset_index()['first'].isin(['bar']).values]
              0  1  2  3  4  5  6  7
first second
bar   one     9  4  6  0  9  2  0  6
      two     0  2  0  4  2  6  5  4

@jreback: apologies, I should have used .loc. Should I treat .ix as deprecated, and avoid using it?

I guess we could have three different indexers, depending on how you want missing values to be handled (.xx_union[['bar', 'ba']] / .xx_intersect[['bar', 'ba']] / .reindex(['bar', 'ba'])), or a parameter that specifies how to handle it (e.g. .xx(missing='error/intersect/union')[['bar', 'ba']]).

Perhaps .reindex could accept an error/intersect/union parameter? Aren't .reindex and .loc fundamentally the same thing, or am I missing a key distinction? One distinction is that .loc also accepts a boolean Series, but here I mean the case of indexing by values.

@jreback
Contributor

jreback commented Jul 29, 2015

@filmackay adding ANOTHER indexer is a non-starter, too complicated already.

however, adding a missing='error'|'intersect'|'union' to .loc/.ix/.reindex (in the function call) might be worthwhile

@jreback
Contributor

jreback commented Jul 29, 2015

.ix is not deprecated as it has a small set of cases where it's useful, but for more consistent behavior (in the presence of integer indexes), .loc is recommended

@jreback jreback added the Needs Discussion Requires discussion from core team before further action label Jul 29, 2015
@filmackay
Author

Is it crazy to think that .loc, .ix and .reindex could be merged into a single indexer? Is the difficulty just legacy, or do you think it's fundamentally flawed to try to encompass all use cases?

@jreback
Contributor

jreback commented Jul 30, 2015

.ix needs to stay like it is as there are a few edge cases which .loc won't ever support,
e.g. simultaneously indexing by label in one dim AND by position in another:
df.ix[[1, 3, 5], ['A', 'B']] where the index is, say, object dtype (otherwise in this example you could use df.loc[df.index[[1, 3, 5]], ['A', 'B']])

.reindex and .loc are pretty equivalent. The discussion is basically whether to go back to where they are different, e.g. .reindex will give you all the values you asked for (but NaN them if they don't exist), while .loc can raise KeyError. BUT the issue is that setting is different from getting.

E.g. since you cannot use .reindex on the LHS of an expression (df.reindex(...) = ... is a Python error), df.loc[...] = ... is the only method. Therefore you don't want a situation where you can get something in a different way than you can set it. So imagine you are setting an index value that doesn't exist: you simply set it, done (this is enlargement). But now should you raise when you get something non-existent? So these are tied.
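The asymmetry described here is easy to see in a couple of lines (a sketch):

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

# Setting a non-existent label just enlarges the Series...
s.loc['c'] = 3        # no error: index is now ['a', 'b', 'c']

# ...while getting a non-existent label raises.
try:
    s.loc['zzz']
    raised = False
except KeyError:
    raised = True
```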

Currently I think we are in a consistent state that is predictable: .reindex is the same as .loc, and getting and setting are mirror images. The downside is that it's very tricky for .loc to figure out when to raise (currently only when NOTHING matches).

So a proposal to add a errors='strict|missing' might make sense.

@Dr-Irv
Contributor

Dr-Irv commented Oct 30, 2015

I have a similar issue with respect to the .xs method. I have a DataFrame df1 indexed with 3 levels (A, B, C) and a column 'flow', and another DataFrame df2 that is indexed by 2 levels (A, B) with a column 'balance'. Note that there are values (a, b) in df2 that are not present in df1. For each pair (a, b) in df2, I want to create a new column in df2 with the values df1.xs((a, b)).flow.sum() + df2.loc[(a, b)].balance. If (a, b) is not in df1, I want the sum to be zero, but you still get the balance value. Right now df1.xs((a, b)).flow.sum() gives a KeyError, so I end up with some funky code to test whether (a, b) is in df1 when computing the sums. If .xs would just return an empty DataFrame when (a, b) is not there, I'd have very clean code.

I can create a new issue for this. I can also provide an example. Let me know if I need to do either.

I would support an idea where adding errors='strict|missing' to .xs would solve the problem.
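A sketch of the kind of workaround this forces today (the frames, names, and values here are invented for illustration):

```python
import pandas as pd

idx1 = pd.MultiIndex.from_tuples([('x', 'p', 1), ('x', 'p', 2), ('y', 'q', 1)],
                                 names=['A', 'B', 'C'])
df1 = pd.DataFrame({'flow': [10, 20, 30]}, index=idx1)

idx2 = pd.MultiIndex.from_tuples([('x', 'p'), ('z', 'r')], names=['A', 'B'])
df2 = pd.DataFrame({'balance': [5, 7]}, index=idx2)

def flow_sum(pair):
    # The explicit membership test .xs currently forces on the caller:
    if pair in df1.index.droplevel('C'):
        return df1.xs(pair)['flow'].sum()
    return 0  # an empty .xs slice would make this branch unnecessary

df2['total'] = [flow_sum(p) + bal for p, bal in df2['balance'].items()]
```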

@Dr-Irv
Contributor

Dr-Irv commented Dec 22, 2015

I have an example that illustrates what I'd like to do, and a proposal. Here's the example:

citypairs = [('Miami', 'Boston'), ('Miami','New York'), ('New York', 'San Francisco'), 
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
index = pd.MultiIndex.from_tuples(citypairs, names=['origin','dest'])
s = pd.Series([i*10+10 for i in range(5)], index=index)
# Compute all of the cities that appear as an origin or a destination
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
osums = { c : s.loc[c,:].sum() for c in cities}
dsums = { c : s.loc[:,c].sum() for c in cities}

The idea in this example is that I have data from a data source that has pairs of cities, and I need to compute sums of the series for each city that appears as an origin and as a destination. In the above code, the computation of osums fails with a KeyError (due to the empty slice for 'San Francisco') and the computation of dsums fails with an IndexingError (due to the empty slice for 'Miami').

My proposal is as follows. Introduce a new indexer called .sloc (slicing loc) that has the exact same behavior as .loc, except that if an error occurs, an empty Series, DataFrame, or Panel is returned (depending on what was being sliced). It is documented this way, with the warning that the user has to recognize that this could cause errors to get missed. Caveat emptor.

@jreback I know you said above that you didn't want to introduce a new indexer, but this solution seems rather clean, and I'd be willing to try to implement it as it would really help my applications.
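The proposed behavior can be prototyped today with a plain helper function rather than a new indexer (`sloc` here is a hypothetical helper, not a pandas API; it assumes `pandas.errors.IndexingError` is available, as in recent versions):

```python
import pandas as pd

def sloc(obj, key):
    """Like obj.loc[key], but return an empty slice instead of raising."""
    try:
        return obj.loc[key]
    except (KeyError, pd.errors.IndexingError):
        return obj.iloc[0:0]  # empty object of the same type

idx = pd.MultiIndex.from_tuples(
    [('Miami', 'Boston'), ('Boston', 'New York')], names=['origin', 'dest'])
s = pd.Series([10, 40], index=idx)

sloc(s, 'Miami').sum()          # 10
sloc(s, 'San Francisco').sum()  # 0 -- empty slice, no KeyError
```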

@jreback
Contributor

jreback commented Dec 22, 2015

this adds unneeded complexity
this is exactly what Categoricals get you

@Dr-Irv
Contributor

Dr-Irv commented Dec 22, 2015

@jreback I tried with categoricals in a MultiIndex and still get an indexing problem if something is missing. It's because the categories aren't pushed down to the MultiIndex. Here's the example. Am I doing something wrong?

citypairs = [('Miami', 'Boston'), ('Miami','New York'), ('New York', 'San Francisco'), 
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
vals = [i*10+10 for i in range(5)]
df = pd.DataFrame({ 'orig' : [p[0] for p in citypairs],
                    'dest' : [p[1] for p in citypairs],
                    'vals' : vals})
df['orig'] = df['orig'].astype("category")
df['dest'] = df['dest'].astype("category")
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
df['orig'] = df['orig'].cat.set_categories(cities)
df['dest'] = df['dest'].cat.set_categories(cities)
df.set_index(['orig','dest'],inplace=True)
osums = { c : df.loc[c,:].sum() for c in cities}
dsums = { c : df.loc[:,c].sum() for c in cities}

Now the error is KeyError: 'the label [San Francisco] is not in the [index]'.

I don't see how adding the extra .sloc adds complexity, since it will work just like .loc but not fail. It makes code like what I initially wrote above quite clean.

@jreback
Contributor

jreback commented Dec 23, 2015

this is what .reindex does

@Dr-Irv
Contributor

Dr-Irv commented Dec 23, 2015

@jreback I apologize for not understanding you, and it's likely I'm missing something, but I can't see how to make .reindex work and have "clean" code. In my original example, I used a MultiIndex consisting of two columns that were strings. You suggested to use Categoricals. I tried that, and I can't get that to work either.

My proposal for a .sloc would be implemented by subclassing the current _LocationIndexer and just catching the exception when __getitem__ is called, returning an empty object instead. So I don't see the complexity you refer to. In my original example, if .sloc were implemented, then just using .sloc rather than .loc would produce nice-looking code.

Incidentally, here is what I want to do using dicts but not pandas, but the solution doesn't scale well when the index has lots of elements in the tuples. The if tests are not nice looking.

citypairs = [('Miami', 'Boston'), ('Miami','New York'), ('New York', 'San Francisco'), 
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
vals = [i*10+10 for i in range(5)]
adict = { z[0] : z[1] for z in zip(citypairs, vals)}
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
dictosums = { c : sum(adict[(c2,i)] for (c2,i) in adict.keys() if c==c2) for c in cities}
dictdsums = { c : sum(adict[(i,c2)] for (i,c2) in adict.keys() if c==c2) for c in cities}

@jorisvandenbossche
Member

@Dr-Irv Maybe not fully related to this discussion, but the summing for each of the levels that you are trying to do can also be achieved using sum and specifying the level to sum over:

In [21]: df
Out[21]:
                        vals
orig     dest
Miami    Boston           10
         New York         20
New York San Francisco    30
Boston   New York         40
         San Francisco    50

In [26]: df.sum(level=0)
Out[26]:
          vals
orig
Boston      90
Miami       30
New York    30

In [27]: df.sum(level=1)
Out[27]:
               vals
dest
Boston           10
New York         60
San Francisco    80
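For readers on newer pandas: `sum(level=...)` was deprecated in 1.3 and removed in 2.0, so the same level-wise sums are now spelled with a groupby on the index level:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
     ('Boston', 'New York'), ('Boston', 'San Francisco')],
    names=['orig', 'dest'])
df = pd.DataFrame({'vals': [10, 20, 30, 40, 50]}, index=idx)

osums = df.groupby(level='orig').sum()   # replaces df.sum(level=0)
dsums = df.groupby(level='dest').sum()   # replaces df.sum(level=1)
```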

@Dr-Irv
Contributor

Dr-Irv commented Dec 23, 2015

@jorisvandenbossche Thanks, but the issue is that the result of that sum does not include the zero values for the missing cities. In the code that I wrote above using dictionaries, the results are:

{'Boston': 90, 'New York': 30, 'San Francisco': 0, 'Miami': 30}
{'Boston': 10, 'New York': 60, 'San Francisco': 80, 'Miami': 0}

Note the zero values for San Francisco and Miami in the two respective sums.

The reason the zero values are needed is because there is other code that needs the sums for all cities.

@jreback
Contributor

jreback commented Dec 23, 2015

@Dr-Irv you can simply fully expand the index levels via .reindex if you want. You must specify the full scheme, as this is not automatic and it's not obvious how you want to do it.

Then you can do whatever you want, including using .fillna(0) if you want zeros instead of missing values.

In [61]: df.reindex(index=pd.MultiIndex.from_product([list(cities),list(cities)]))
Out[61]: 
                             vals
New York      New York        NaN
              San Francisco    30
              Miami           NaN
              Boston          NaN
San Francisco New York        NaN
              San Francisco   NaN
              Miami           NaN
              Boston          NaN
Miami         New York         20
              San Francisco   NaN
              Miami           NaN
              Boston           10
Boston        New York         40
              San Francisco    50
              Miami           NaN
              Boston          NaN

creating another indexer is a complete non-starter as it would make indexing even MORE confusing (we already have [], .ix, .iloc, .loc, .at, .iat). This is WAY more than enough.

@Dr-Irv
Contributor

Dr-Irv commented Dec 23, 2015

@jreback The problem with your solution is when there are 1000 different cities, but the original data has 10,000 city pairs. (The example data comes from the representation of a graph). Your solution above creates a DataFrame with 1,000,000 entries, most of which are unnecessary.

I understand the potential confusion of adding another indexer, but that then brings us back to the possibility of adding the errors=strict|missing optional argument to .loc. Or, how about doing this via a pandas option that determines the behavior of .loc in terms of whether it does the things the current way (reporting KeyError) or just returns empty objects when the key is missing. Having an option might be more elegant, otherwise, you end up with errors=missing code in lots of places. I'd be willing to give this a try in terms of implementing it.

@jreback
Contributor

jreback commented Dec 23, 2015

@Dr-Irv you can't have it both ways, either you have a sparse repr which is what a MultiIndex is, or you don't. Try to solve your problem with existing machinery, rather than resorting to looping and creating new APIs.

To be honest pandas is not very good at representing graphs. Trying to shove things in like this are non-starters.

Not to mention your code above is non-performant.

@Dr-Irv
Contributor

Dr-Irv commented Dec 23, 2015

@jreback But this brings us back to the original problem (which started the discussion above by @filmackay), which is that I have a sparse representation using MultiIndex, but if I slice with a key that is not there, I get an error, as opposed to an empty DataFrame (which I can then apply .sum to).

I think this might be related to the discussion in #4036, as I have use cases where there is a MultiIndex with lots of levels, where I need to do sums for different combinations of the keys (i.e., fixing the keys for some levels, and summing across another level), and there may be combinations that don't exist, so if the .sum over the slice would return 0, the code would be easy to understand and explain.

@jorisvandenbossche
Member

@Dr-Irv again specific for your problem: you can also do the reindex after the summing for each level:

In [54]: df.sum(level=0).reindex(list(cities))
Out[54]:
               vals
orig
New York         30
San Francisco   NaN
Miami            30
Boston           90

I think in this specific case you will find this a better and more performant solution than iterating through the DataFrame. But nonetheless, the original question in this issue can still be relevant of course.

@jreback I agree that adding yet another indexer is not the best way forward, but I was wondering if there would be room for a method (so not an indexer, and so only usable for getting values and not setting) to do this? Some kind of general 'getitem' method; as a method, it would be easier to add keyword arguments to, e.g. to specify what should happen if the label is not found.
Actually there is a DataFrame.get, but this is restricted to the info axis.

@jreback
Contributor

jreback commented Dec 23, 2015

building on @jorisvandenbossche's soln, I think this is what you want.

In [4]: df.sum(level=0).reindex(index=list(cities)).fillna(0)
Out[4]: 
               vals
orig               
New York         30
San Francisco     0
Miami            30
Boston           90

@Dr-Irv
Contributor

Dr-Irv commented Dec 23, 2015

@jreback @jorisvandenbossche Thank you very much for your response. While I appreciate your solution, it isn't as elegant as the one I propose. I'm trying to create something for teaching purposes that looks easy.

Regarding the comment about DataFrame.get, what would really be nice is to have something like that, but for the main (axis=0) axis, with all the same slicing that exists for .loc. In other words, I'd like to be able to do something like NDFrame.rows[a,b,:,d,e] that would return just the rows corresponding to that slice. This would work the same way for Series, DataFrame and Panel. It slices along the first dimension, where you're likely to have a MultiIndex. If it did that, and returned empty objects when the slices were empty, then I get back to clean code for my examples, and I think we address the main point of this issue that was originally raised. Put another way, we create a method that is a shorthand for .loc(axis=0) and document that if a key is specified and is missing, then an empty object is returned.

@jreback
Contributor

jreback commented Dec 23, 2015

@Dr-Irv iterating over values is not idiomatic at all
and completely non-performant

you have a very nice soln above - this is a very pandonic soln
and esp if you are teaching this is the only way to go

individually indexing is NOT a soln

@pjcpjc

pjcpjc commented Jan 27, 2016

Respectfully, can I chime in and agree with Dr Irv?

His .sloc suggestion seems reasonable.

The other way is also a problem: if you have no matches it would silently skip a fairly common error condition IMHO.

The accuracy of this assumption depends on context. It seems reasonable that there are some programming contexts where returning the empty frame is the expected behavior. (For example, if the result is married with an aggregator like sum, which is the specific need here).

dicts have both get and [] for that reason. When I see one, I know a missing entry is not an error, and when I see the other, I know it is. I'm not sure why this issue is different.

@jreback
Contributor

jreback commented Jan 27, 2016

@pjcpjc you can certainly comment!

We cannot extend this already way way too complicated API any more

we have: [], .ix, .iloc, .loc, .iat, .at.

so you want another one?

as I said above, a keyword argument would possibly be ok, but certainly not another indexer.

We have to have a default. This certainly may not work for everyone (and that is why we have
options).

@pjcpjc

pjcpjc commented Jan 27, 2016

Or a keyword argument to .loc that does what Dr. Irv's .sloc does, with the default value for the keyword being consistent with the current behavior.

Whatever you prefer.

We cannot extend this already way way too complicated API any more

I think the spelling in this case is MOAR!! ;)

But seriously, I am familiar with the domain here (arguably I am an expert and Dr Irv is a guru) and he is right on the money in terms of identifying a context in which the natural result is an empty frame and not an error.

I don't think he is being non-performant. He can get what he wants (I think) with a helper function, but that will look awkward when trying to convince other people to jump over from their current legacy language.

We're not trying to make trouble - we're trying to bring optimization programming into the 21st century and use Python + pandas.

@jreback
Contributor

jreback commented Jan 18, 2018

this was closed by #15747. .loc and .reindex are distinct and have separate purposes.

@jreback jreback closed this as completed Jan 18, 2018
@pjcpjc

pjcpjc commented Jan 18, 2018

If anyone cares, I now regret some of the strong language I used above. I think pandas, while awesome, isn't the right vehicle for the sort of idioms that optimization people are accustomed to. Optimization people should either write pandas code pandonically or use different data structures. There is a small mountain of examples over at https://github.com/ticdat/ticdat/tree/master/examples if anyone is interested, including a pandonic example.
