Unable to retrieve same-index dataframe with the last item per group #7883
Comments
There's one thing you didn't try:
|
These lines from EdChum's answer

last = lambda x: x.iloc[-1]
df.groupby(level='A').transform(last)

are not the same as doing

df.groupby(level='A').transform(last = lambda x: x.iloc[-1])

In the second one you're passing a keyword argument to |
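To make the positional-versus-keyword distinction concrete, here is a minimal pure-Python sketch. `fake_transform` is a hypothetical stand-in with an assumed `(func, *args, **kwargs)` signature; it only illustrates Python's argument binding and is not the pandas implementation.

```python
def fake_transform(func, *args, **kwargs):
    """Hypothetical stand-in that just reports what it was given."""
    return {"func": func, "kwargs": kwargs}


last = lambda x: x[-1]

# Positional call: the lambda is bound to the `func` parameter.
positional = fake_transform(last)

# Keyword call: `last=` does not match `func`, so nothing is bound to `func`
# and Python refuses the call for this hypothetical signature.
try:
    fake_transform(last=lambda x: x[-1])
    keyword_error = None
except TypeError as exc:
    keyword_error = str(exc)
```

With this signature, the keyword form never reaches the function body, because `last=` is treated as an extra keyword argument rather than as the function to apply.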
Thanks @cpcloud I actually mistyped it at the beginning of this issue. I did try (and just tried again):

last = lambda x: x.iloc[-1]
df.groupby(level='A').transform(last)

but I do not get the desired dataframe. I get the original frame. |
@ribonoous Can you show a reproducible example? Hard to debug without seeing the data. Can make random data too, if you can't show the data where it's failing, I just need to be able to reproduce it. |
Thanks @cpcloud. The example I included above is from my current session. I understand that your answer should work, but it's literally not working on my machine as I type it:

N.In <121>: df
N.Out<121>:
C D E
A B
bar one -0.507507 -0.719877 0
three 0.735206 0.476516 0
flux six 0.347799 -0.201347 2
three -0.271447 -0.936737 2
foo five 0.930560 -0.237395 2
one -1.759220 -0.564822 1
two 1.759313 0.277731 2
two -0.506662 0.758981 1
N.In <122>: df.groupby(level='A').transform(lambda x: x.iloc[-1])
N.Out<122>:
C D E
A B
bar one -0.507507 -0.719877 0
three 0.735206 0.476516 0
flux six 0.347799 -0.201347 2
three -0.271447 -0.936737 2
foo five 0.930560 -0.237395 2
one -1.759220 -0.564822 1
two 1.759313 0.277731 2
two -0.506662 0.758981 1

I am on Python 3 in an embedded shell in IPython with the following packages:
|
@cpcloud I just tried it on a fresh IPython session (no terminal embedded):
|
ok let me try |
ok now i can repro |
@ribonoous thx for the example! |
interesting seems to be related to types |
|
Thanks @cpcloud More than happy to help. Hopefully I will be able to contribute fixing the code and not just reporting examples soon. |
let us know if you need any help! |
issue is here: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2799 completely unsafe to do this with multiple dtypes (well it 'works', but no effect). need to do this block by block and will work |
hm darn, well glad you saved me from that rabbit hole |
this should at least raise on mixed dtypes |
@ribonoous care to make a pull request, this is pretty straightforward |
well....not that hard to fix .... |
ok ... in that case @ribonoous offer still stands! |
Thanks @cpcloud I am looking at the line that jreback mentioned. This is my first time looking inside of pandas, as well as my first time contributing to an open source project on GitHub. Would you mind giving me a hint of where to look into? From the discussion I inferred that the problem occurs when the dataframe has mixed

if isinstance(res, Series):
    if res.index.is_(obj.index):
        group.T.values[:] = res
    else:
        group.values[:] = res
    applied.append(group)
else:

? |
@ribonoous the Something like
maybe need to add some timings for this as well. (and maybe a doc warning) |
Sure. I'll outline what I think could be the start of a solution and we can iterate from there. Let's say you have
And you want to execute df.groupby(level='A').transform(lambda x: x.iloc[-1]) Eventually you'll get to for name, group in gen: is iterating over each group generated by the call to Ignoring the fast path/slow path stuff (not relevant ATM since the issue lies with the dtype of group),
Notice that the |
This is actually a simple enough problem that you can gain some understanding about how everything works :) |
I think I understand the problem. I presume I need to iterate through |
yep u need to do this block by block (which by definition are a single dtype) |
@ribonoous lmk if you have made progress on this. I think we have to address this (for 0.15.1 for sure), and to include Categorical (so by definition need to do it block-by-block), see #8065. One possibility is to have a fast path that handles integer-like data the way the current path does (e.g. float/int data), and a 2nd path for multi-dtypes (which basically accumulates the results in a list, then concats at the end). That is slower, but more robust. |
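The second, multi-dtype path described above could look roughly like the sketch below. It works at the user level rather than inside `groupby.py`; the helper name `transform_last_by_dtype` and the sample frame are made up for illustration, and this is not the fix that went into pandas.

```python
import pandas as pd


def transform_last_by_dtype(df, level):
    """Hypothetical helper: broadcast each group's last row, one dtype at a time."""
    # Collect the columns that share a dtype, so every piece is homogeneous.
    cols_by_dtype = {}
    for col, dtype in df.dtypes.items():
        cols_by_dtype.setdefault(str(dtype), []).append(col)

    # Transform each single-dtype piece and accumulate the results in a list.
    pieces = []
    for cols in cols_by_dtype.values():
        piece = df[cols].groupby(level=level).transform(lambda s: s.iloc[-1])
        pieces.append(piece)

    # Concatenate at the end and restore the original column order.
    return pd.concat(pieces, axis=1)[list(df.columns)]


# Made-up sample data with one float column and one int column.
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "three"), ("foo", "five"), ("foo", "one")],
    names=["A", "B"],
)
df = pd.DataFrame({"C": [0.5, 1.5, 2.5, 3.5], "E": [0, 1, 2, 3]}, index=index)
result = transform_last_by_dtype(df, level="A")
```

Splitting by dtype first means no piece ever has to hold floats and ints in the same array, which is the situation the thread identifies as the source of the bug.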
Thanks @jreback I apologize for not having responded earlier. Work has become hectic in recent weeks and I haven't had a chance to look into it. Everyone was very helpful in the thread and I am very sorry I haven't had a chance to pick it up from where @cpcloud left it. I certainly want to learn to contribute, but can't promise anything at this point :((. If any of you decides to fix it, I will make sure to see how to do it, since at a minimum it will definitely be educational. |
looks good in master, needs tests |
The original thread is here.
The goal is to get a dataframe with the same index, with each entry having the last value in the group associated with A. However, none of the following work:
df.groupby(level='A').apply(lambda x: x[-1])
df.groupby(level='A').transform(pd.Series.last)
df.groupby(level='A').transform(lambda x: x.iloc[-1])
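For comparison, here is a small sketch of one way to get the desired same-index result, assuming a reasonably recent pandas release in which `transform` accepts the string alias `"last"`. This alternative is not taken from the thread, and the sample frame is made up. Note that `"last"` picks the last non-null value per group, which can differ from `x.iloc[-1]` when the final row contains NaN.

```python
import pandas as pd

# Made-up frame with a two-level index and mixed float/int columns.
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "three"), ("foo", "five"), ("foo", "two")],
    names=["A", "B"],
)
df = pd.DataFrame({"C": [0.1, 0.2, 0.3, 0.4], "E": [0, 1, 2, 3]}, index=index)

# Broadcast the last row of each level-"A" group back onto the original index.
last_per_group = df.groupby(level="A").transform("last")
```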