Slicing with no keys found #10695
Thanks for the report. Can you attach a script which includes the sample data preparation?
This is discussed tangentially in #10549. The discussion there is: some want to always make it a KeyError if not ALL elements match. This I think is too restrictive and makes bugs very hard to find. The other way is also a problem: if you have no matches, it would silently skip a fairly common error condition, IMHO. Note that you are discussing
@sinhrks: yes of course (apologies), how about this:
@jreback: apologies, I should have used

I guess we could have three different indexers, depending on how you want missing values to be handled. Perhaps
@filmackay adding ANOTHER indexer is a non-starter; things are too complicated already. However, adding a
Is it crazy to think that
E.g. since you cannot use
Currently I think we are in a consistent state that is predictable. So a proposal to add a
I have a similar issue with respect to the

I can create a new issue for this. I can also provide an example. Let me know if I need to do either. I would support an idea where adding
I have an example that illustrates what I'd like to do, and a proposal. Here's the example:

```python
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
index = pd.MultiIndex.from_tuples(citypairs, names=['origin', 'dest'])
s = pd.Series([i * 10 + 10 for i in range(5)], index=index)

# Compute all of the cities that appear as an origin or a destination
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))

osums = {c: s.loc[c, :].sum() for c in cities}
dsums = {c: s.loc[:, c].sum() for c in cities}
```

The idea in this example is that I have data from a data source that has pairs of cities, and I need to compute sums of the series for each city that appears, by origin and by destination. In the above code, the computations raise a `KeyError`, because 'San Francisco' never appears as an origin and 'Miami' never appears as a destination.
This adds unneeded complexity.
@jreback I tried with categoricals in a MultiIndex and still get an indexing problem if something is missing. It's because the categories aren't pushed down to the MultiIndex. Here's the example. Am I doing something wrong?

```python
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
vals = [i * 10 + 10 for i in range(5)]
df = pd.DataFrame({'orig': [p[0] for p in citypairs],
                   'dest': [p[1] for p in citypairs],
                   'vals': vals})
df['orig'] = df['orig'].astype("category")
df['dest'] = df['dest'].astype("category")
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
# Note: set_categories returns a new Series, so the result must be assigned back
df['orig'] = df['orig'].cat.set_categories(cities)
df['dest'] = df['dest'].cat.set_categories(cities)
df.set_index(['orig', 'dest'], inplace=True)

osums = {c: df.loc[c, :].sum() for c in cities}
dsums = {c: df.loc[:, c].sum() for c in cities}
```

Now the error is

I don't see how adding the extra
This is what `.reindex` does.
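For reference, `.reindex` conforms a Series to the requested labels, inserting `NaN` (or an explicit `fill_value`) for missing labels rather than raising; a minimal sketch:

```python
import pandas as pd

s = pd.Series([10, 20], index=['Boston', 'Miami'])

# Missing labels become NaN by default...
r1 = s.reindex(['Boston', 'San Francisco'])

# ...or take an explicit fill value instead.
r2 = s.reindex(['Boston', 'San Francisco'], fill_value=0)
```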
@jreback I apologize for not understanding you, and it's likely I'm missing something, but I can't see how to make that work.

My proposal for a

Incidentally, here is what I want to do using dicts but not pandas, though the solution doesn't scale well when the index has lots of elements in the tuples. The

```python
citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
vals = [i * 10 + 10 for i in range(5)]
# 'cities' was referenced but not defined in the original snippet
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
adict = {z[0]: z[1] for z in zip(citypairs, vals)}

# sum() of an empty generator is 0, so missing cities get 0 instead of an error
dictosums = {c: sum(adict[(c2, i)] for (c2, i) in adict.keys() if c == c2) for c in cities}
dictdsums = {c: sum(adict[(i, c2)] for (i, c2) in adict.keys() if c == c2) for c in cities}
```
@Dr-Irv Maybe not fully related to this discussion, but the summing for each of the levels that you are trying to do can also be achieved using
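The method name referred to above was lost in this capture; presumably the suggestion is along the lines of a level-wise groupby sum (a sketch, not necessarily the original snippet):

```python
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
index = pd.MultiIndex.from_tuples(citypairs, names=['origin', 'dest'])
s = pd.Series([i * 10 + 10 for i in range(5)], index=index)

# Sum over each index level without any per-key .loc lookups
osums = s.groupby(level='origin').sum()  # Boston 90, Miami 30, New York 30
dsums = s.groupby(level='dest').sum()    # Boston 10, New York 60, San Francisco 80
```

Note that the result only contains cities that actually appear in the given level.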
@jorisvandenbossche Thanks, but the issue is that the result of that sum does not include the zero values for the missing cities. In the code that I wrote above using dictionaries, the results are:

```python
{'Boston': 90, 'New York': 30, 'San Francisco': 0, 'Miami': 30}
{'Boston': 10, 'New York': 60, 'San Francisco': 80, 'Miami': 0}
```

Note the zero values for 'San Francisco' (as an origin) and 'Miami' (as a destination). The reason the zero values are needed is that there is other code that needs the sums for all cities.
@Dr-Irv you can simply fully expand the index levels via

Then you can do whatever you want, including using
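The method named in the comment above was not captured; one reading (an assumption on my part) is to reindex against the full cartesian product of the cities, so that every origin/dest pair exists:

```python
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
s = pd.Series([i * 10 + 10 for i in range(5)],
              index=pd.MultiIndex.from_tuples(citypairs, names=['origin', 'dest']))
cities = sorted(set(p[0] for p in citypairs) | set(p[1] for p in citypairs))

# Expand to the dense origin x dest grid, filling absent pairs with 0
full_index = pd.MultiIndex.from_product([cities, cities], names=['origin', 'dest'])
full = s.reindex(full_index, fill_value=0)

# Per-city .loc lookups now succeed for every city
osums = {c: full.loc[c, :].sum() for c in cities}
```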
Creating another indexer is a complete non-starter, as it would make indexing even MORE confusing (we already have
@jreback The problem with your solution is when there are 1000 different cities, but the original data has 10,000 city pairs. (The example data comes from the representation of a graph.) Your solution above creates a

I understand the potential confusion of adding another indexer, but that then brings us back to the possibility of adding the
@Dr-Irv you can't have it both ways: either you have a sparse repr, which is what a

To be honest, pandas is not very good at representing graphs. Trying to shove things in like this is a non-starter. Not to mention your code above is non-performant.
@jreback But this brings us back to the original problem (which started the discussion above by @filmackay), which is that I have a sparse representation using

I think this might be related to the discussion in #4036, as I have use cases where there is a
@Dr-Irv again, specific to your problem: you can also do the reindex after the summing for each level:
I think in this specific case, you will find this a better and more performant solution than iterating through the dataframe. But nonetheless, the original question in this issue can still be relevant of course.

@jreback I agree that adding yet another indexer is not the best way forward, but I was wondering if there would be room for a method (so not an indexer, and so only usable for getting values and not setting) to do this? Some kind of general 'getitem' method; as this is a method, it would be easier to add keyword arguments to e.g. specify what should happen if the label is not found.
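The "reindex after the summing" suggestion might look something like the following sketch (my reconstruction, not the original snippet), which reproduces the dictionaries shown earlier, zeros included:

```python
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
s = pd.Series([i * 10 + 10 for i in range(5)],
              index=pd.MultiIndex.from_tuples(citypairs, names=['origin', 'dest']))
cities = sorted(set(p[0] for p in citypairs) | set(p[1] for p in citypairs))

# Aggregate per level, then reindex so missing cities appear with 0
osums = s.groupby(level='origin').sum().reindex(cities, fill_value=0)
dsums = s.groupby(level='dest').sum().reindex(cities, fill_value=0)
```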
Building on @jorisvandenbossche's soln, I think this is what you want.
@jreback @jorisvandenbossche Thank you very much for your responses. While I appreciate your solution, it isn't as elegant as the one I propose. I'm trying to create something for teaching purposes that looks easy. Regarding the comment about
@Dr-Irv iterating over values is not idiomatic at all. You have a very nice soln above; this is a very pandonic soln. Individually indexing is NOT a soln.
Respectfully, can I chime in and agree with Dr Irv? His
The accuracy of this assumption depends on context. It seems reasonable that there are some programming contexts where returning the empty frame is the expected behavior (for example, if the result is married with an aggregator). dicts have both
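The tail of the sentence above was lost in this capture; presumably the analogy is to plain Python dicts, which offer both a raising lookup and a defaulting one:

```python
d = {'Boston': 90, 'Miami': 30}

# d[key] raises KeyError for a missing key, like .loc today...
try:
    d['San Francisco']
    raised = False
except KeyError:
    raised = True

# ...while d.get(key, default) returns a fallback instead of raising
sf = d.get('San Francisco', 0)
```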
@pjcpjc you can certainly comment! We cannot extend this already way, way too complicated API any more. We have:

So you want another one? As I said above, a keyword argument would possibly be OK, but certainly not another indexer. We have to have a default. This certainly may not work for everyone (and that is why we have
Or a keyword argument to the existing indexer. Whatever you prefer.
I think the spelling in this case is MOAR!! ;) But seriously, I am familiar with the domain here (arguably I am an expert and Dr Irv is a guru) and he is right on the money in terms of identifying a context in which the natural result is an empty frame and not an error. I don't think he is being non-performant. He can get what he wants (I think) with a helper function, but that will look awkward when trying to convince other people to jump over from their current legacy language. We're not trying to make trouble; we're trying to bring optimization programming into the 21st century and use Python + pandas.
This was closed by #15747.
If anyone cares, I now regret some of the strong language I used above. I think pandas, while awesome, isn't the right vehicle for the sort of idioms that optimization people are accustomed to. Optimization people should either write pandas code pandonically or use different data structures. There is a small mountain of examples over at https://github.com/ticdat/ticdat/tree/master/examples if anyone is interested, including a pandonic example.
I just upgraded to the latest pandas 0.16 (from pandas 0.15), and the new error has hit me when slicing with multiple values (`df.ix[[list_of_values]]`). I actually think it is more valid to return an empty DataFrame than to throw an error.
The best I've been able to come up with to reproduce the previous behaviour (fail silently, return empty DataFrame) is:
Not saying I'm right on the error/empty argument; but is the above the most elegant solution?
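The snippet referred to above was not captured here; one common pattern for getting an empty result instead of a `KeyError` (an assumption on my part, not necessarily the author's code) is to intersect the requested labels with the index first:

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3]}, index=['a', 'b', 'c'])
wanted = ['x', 'y']  # none of these labels exist in the index

# .loc on the intersection yields an empty frame rather than raising
result = df.loc[df.index.intersection(wanted)]
```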
Perhaps we should consider three distinct slicing operations:
I would think anyone indexing would be very aware of what they are expecting from the above?