Unable to retrieve same-index dataframe with the last item per group #7883

amelio-vazquez-reina · 2014-07-30T20:41:51Z

The original thread is here.

> df
                   C         D  E
A    B                           
bar  one   -1.350006  0.260339  2
     three -0.236451 -0.056614  0
flux six   -0.515571 -0.155078  1
     three -0.365032  1.055669  2
foo  five   1.145811  1.514191  0
     one   -0.108427 -0.643262  2
     two    0.286897  0.405798  2
     two   -1.452286  1.149264  2

The goal is to get a dataframe with the same index with each entry having the last value in the groups associated with A, i.e.

> df
                   C         D  E
A    B                           
bar  one   -0.236451 -0.056614  2
     three -0.236451 -0.056614  2
flux six   -0.365032  1.055669  2
     three -0.365032  1.055669  2
foo  five  -1.452286  1.149264  2
     one   -1.452286  1.149264  2
     two   -1.452286  1.149264  2
     two   -1.452286  1.149264  2

However, none of the following work:

df.groupby(level='A').apply(lambda x: x[-1])
df.groupby(level='A').transform(pd.Series.last)
df.groupby(level='A').transform(lambda x: x.iloc[-1])

The text was updated successfully, but these errors were encountered:

cpcloud · 2014-07-30T20:51:45Z

There's one thing you didn't try:

In [37]: df
Out[37]:
                 C       D       E
A    B
bar  one   -0.9384  0.3450 -0.8715
     three  1.1107  1.9896 -0.4797
flux six    0.5059  0.0494  0.0993
     three  0.3148 -0.7710  0.2409
foo  five   2.3012 -0.3600 -0.9216
     one   -0.3965 -1.0410 -0.0600
     two    1.2348  0.6059 -0.8019
     two    0.0912  0.5812 -0.2724

In [38]: df.groupby(level='A').transform(lambda x: x.iloc[-1])
Out[38]:
                 C       D       E
A    B
bar  one    1.1107  1.9896 -0.4797
     three  1.1107  1.9896 -0.4797
flux six    0.3148 -0.7710  0.2409
     three  0.3148 -0.7710  0.2409
foo  five   0.0912  0.5812 -0.2724
     one    0.0912  0.5812 -0.2724
     two    0.0912  0.5812 -0.2724
     two    0.0912  0.5812 -0.2724

cpcloud · 2014-07-30T20:56:23Z

These lines from EdChum's answer

last = lambda x: x.iloc[-1]
df.groupby(level='A').transform(last)

are not the same as doing

df.groupby(level='A').transform(last = lambda x: x.iloc[-1])

In the second one you're passing a keyword argument to transform which will of course raise a TypeError because it expects at least two positional arguments and you're only passing a keyword argument and no positionals (except the implicit self that comes with every method)

amelio-vazquez-reina · 2014-07-30T21:01:16Z

Thanks @cpcloud I actually mistyped it at the beginning of this issue. I did try (and just tried again):

last = lambda x: x.iloc[-1]
df.groupby(level='A').transform(last)

but I do not get the desired dataframe. I get the original frame.

cpcloud · 2014-07-30T21:02:59Z

@ribonoous Can you show a reproducible example? Hard to debug without seeing the data. Can make random data too, if you can't show the data where it's failing, I just need to be able to reproduce it.

amelio-vazquez-reina · 2014-07-30T21:05:31Z

Thanks @cpcloud . The example I included above is from my current session. I understand that your answer should work, but it's literally not working on my machine as I type it:

N.In <121>: df
N.Out<121>: 
                   C         D  E
A    B                           
bar  one   -0.507507 -0.719877  0
     three  0.735206  0.476516  0
flux six    0.347799 -0.201347  2
     three -0.271447 -0.936737  2
foo  five   0.930560 -0.237395  2
     one   -1.759220 -0.564822  1
     two    1.759313  0.277731  2
     two   -0.506662  0.758981  1

N.In <122>: df.groupby(level='A').transform(lambda x: x.iloc[-1])
N.Out<122>: 
                   C         D  E
A    B                           
bar  one   -0.507507 -0.719877  0
     three  0.735206  0.476516  0
flux six    0.347799 -0.201347  2
     three -0.271447 -0.936737  2
foo  five   0.930560 -0.237395  2
     one   -1.759220 -0.564822  1
     two    1.759313  0.277731  2
     two   -0.506662  0.758981  1

I am on Python 3 in an embedded shell in IPython with the following packages:

$ pip freeze
Cython==0.20.1
DataShape==0.2
Flask==0.10.1
Jinja2==2.7.2
MarkupSafe==0.18
Pillow==2.4.0
PyYAML==3.11
Pygments==1.6
SQLAlchemy==0.9.4
Sphinx==1.2.2
Werkzeug==0.9.6
XlsxWriter==0.5.5
appnope==0.0.6
argcomplete==0.6.7
astropy==0.3.2
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.3.1
binstar==0.5.3
bitarray==0.8.1
blaze==0.5
blz==0.6.2
bokeh==0.4.4
cdecimal==2.3
colorama==0.2.7
configobj==5.0.5
docutils==0.11
elpy==1.4.2
flake8==2.2.2
future==0.12.1
greenlet==0.4.2
h5py==2.3.0
ipython==2.1.0
itsdangerous==0.24
jdcal==1.0
llvmpy==0.12.6
lpsolve55==5.5.2.0
lxml==3.3.5
matplotlib==1.3.1
mccabe==0.2.1
mock==1.0.1
multipledispatch==0.4.3
networkx==1.8.1
nose==1.3.3
numba==0.13.2
numexpr==2.3.1
numpy==1.8.1
openpyxl==1.8.5
pandas==0.14.1
patsy==0.2.1
pep8==1.5.7
ply==3.4
psutil==2.1.1
psycopg2==2.5.3
py==1.4.20
py4j==0.8.1
pycosat==0.6.1
pycparser==2.10
pycrypto==2.6.1
pyflakes==0.8.1
pymc==2.3.4
pyparsing==2.0.1
pytest==2.5.2
python-dateutil==2.1
pytz==2014.4
pyzmq==14.3.0
redis==2.9.1
requests==2.3.0
rope-py3k==0.9.4-1
runipy==0.1.0
scikit-image==0.10.0
scikit-learn==0.15.0
scipy==0.14.0
seaborn==0.4.dev
six==1.7.3
statsmodels==0.5.0
sympy==0.7.5
tables==3.1.1
tornado==3.2.1
ujson==1.33
xlrd==0.9.3

amelio-vazquez-reina · 2014-07-30T21:08:55Z

@cpcloud I just tried it on a fresh IPython session (no terminal embedded):

def create_random_multi_index():
  df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                            'foo', 'flux', 'foo', 'flux'],  
                     'B' : ['one', 'one', 'two', 'three',
                            'two', 'six', 'five', 'three'],
                     'C' : randn(8), 'D' : randn(8), 'E': randint(0,3, size=(8,))})
  df.set_index(['A', 'B'], inplace=True)
  df.sort_index(inplace=True)
  return df

In [8]: df = create_random_multi_index()

In [9]: df
Out[9]: 
                   C         D  E
A    B                           
bar  one    0.113720  1.000103  1
     three  0.778840 -0.546913  0
flux six    1.688509  1.541020  2
     three  2.973229 -0.063245  1
foo  five   0.208310  1.268056  0
     one    0.404996 -1.332461  2
     two   -1.038567  0.376829  1
     two    1.528148  1.386106  0

In [10]: df.groupby(level='A').transform(lambda x: x.iloc[-1])
Out[10]: 
                   C         D  E
A    B                           
bar  one    0.113720  1.000103  1
     three  0.778840 -0.546913  0
flux six    1.688509  1.541020  2
     three  2.973229 -0.063245  1
foo  five   0.208310  1.268056  0
     one    0.404996 -1.332461  2
     two   -1.038567  0.376829  1
     two    1.528148  1.386106  0

cpcloud · 2014-07-30T21:09:29Z

ok let me try

cpcloud · 2014-07-30T21:11:48Z

ok now i can repro

cpcloud · 2014-07-30T21:12:02Z

@ribonoous thx for the example!

cpcloud · 2014-07-30T21:12:43Z

interesting seems to be related to types

cpcloud · 2014-07-30T21:14:07Z

In [12]: df
Out[12]:
                 C       D  E
A    B
bar  one   -0.4580  0.1261  2
     three  0.9299 -1.4628  0
flux six   -0.4497  0.2510  1
     three  0.6542 -0.7250  0
foo  five  -0.2595  1.5363  2
     one    0.3749 -0.3560  0
     two    0.1009  1.5566  1
     two   -0.9102 -1.2865  0

In [13]: df[['C', 'D']].groupby(level='A').transform(lambda x: x.iloc[-1])
Out[13]:
                 C       D
A    B
bar  one    0.9299 -1.4628
     three  0.9299 -1.4628
flux six    0.6542 -0.7250
     three  0.6542 -0.7250
foo  five  -0.9102 -1.2865
     one   -0.9102 -1.2865
     two   -0.9102 -1.2865
     two   -0.9102 -1.2865

In [14]: df[['E']].groupby(level='A').transform(lambda x: x.iloc[-1])
Out[14]:
            E
A    B
bar  one    0
     three  0
flux six    0
     three  0
foo  five   0
     one    0
     two    0
     two    0

In [15]: df[['D', 'E']].groupby(level='A').transform(lambda x: x.iloc[-1])
Out[15]:
                 D  E
A    B
bar  one    0.1261  2
     three -1.4628  0
flux six    0.2510  1
     three -0.7250  0
foo  five   1.5363  2
     one   -0.3560  0
     two    1.5566  1
     two   -1.2865  0

In [16]: df[['C', 'E']].groupby(level='A').transform(lambda x: x.iloc[-1])
Out[16]:
                 C  E
A    B
bar  one   -0.4580  2
     three  0.9299  0
flux six   -0.4497  1
     three  0.6542  0
foo  five  -0.2595  2
     one    0.3749  0
     two    0.1009  1
     two   -0.9102  0

amelio-vazquez-reina · 2014-07-30T21:16:29Z

Thanks @cpcloud More than happy to help. Hopefully I will be able to contribute fixing the code and not just reporting examples soon.

cpcloud · 2014-07-30T21:16:53Z

let us know if you need any help!

jreback · 2014-07-30T21:22:04Z

issue is here: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2799

completely unsafe to do this with multiple dtypes (well it 'works', but no effect).

need to do this block by block and will work

cpcloud · 2014-07-30T21:23:14Z

hm darn, well gladd you asved me from that rabbit hole

cpcloud · 2014-07-30T22:53:43Z

this should at least raise on mixed dtypes

cpcloud · 2014-07-30T22:54:06Z

@ribonoous care to make a pull request, this is pretty straightforward

jreback · 2014-07-30T22:57:58Z

well....not that hard to fix ....

cpcloud · 2014-07-30T22:58:56Z

ok ... in that case @ribonoous offer still stands!

amelio-vazquez-reina · 2014-07-31T12:27:31Z

Thanks @cpcloud I am looking at the line that jreback mentioned. This is my first time looking inside of pandas, as well as my first time contributing to an open source project on GitHub. Would you mind giving me a hint of where to look into?

From the discussion I inferred that the problem occurs when the dataframe has mixed dtypes. What is the problem with the existing block?:

            if isinstance(res, Series):
                if res.index.is_(obj.index):
                    group.T.values[:] = res
                else:
                    group.values[:] = res

                applied.append(group)
            else:

?

jreback · 2014-07-31T12:40:01Z

@ribonoous

the .values takes the pandas object and returns the underlying ndarray and works fine if its a snigle dtype. if its multiple, the this is an upcast COPY, e.g. mxiing floats/object gets you an object dtype. numpy does do welll with this either as assignment is broken for object dtype.

Something like

group.loc[:] = res WILL work, but this is slow, so need to do something like this

# mixed dtypes
if group._is_mixed_type:
    for b in group._data.blocks:
           b.setitem(slice(None),res.iloc[b.mgr_locs.as_array])
else:
    # do what it was doing above (as this is faster)

maybe need to add some timings for this as well. (and maybe a doc warning)

cpcloud · 2014-07-31T12:42:20Z

Sure. I'll outline what I think could be the start of a solution and we can iterate from there.

Let's say you have df

In [21]: A, B = ['bar'] * 2 + ['flux'] * 2 + ['foo'] * 4, ['one', 'three', 'six', 'three', 'five', 'one', 'two', 'two']

In [22]: mi = MultiIndex.from_tuples(list(zip(A, B)), names=['A', 'B'])

In [23]: n = len(mi.values)

In [24]: df = DataFrame({'C': np.random.randn(n), 'D': np.random.randn(n), 'E': np.random.randint(5, size=n)}, index=mi)

In [25]: df
Out[25]:
                 C       D  E
A    B
bar  one    0.4615 -0.7685  2
     three  0.2533  0.3017  3
flux six   -0.1020 -1.3196  0
     three  0.2417  0.3707  2
foo  five   1.8301 -0.1334  1
     one    1.1182  1.2973  0
     two   -0.3256  3.0025  0
     two   -1.2568  1.7894  3

In [26]: df.dtypes
Out[26]:
C    float64
D    float64
E      int64
dtype: object

And you want to execute

df.groupby(level='A').transform(lambda x: x.iloc[-1])

Eventually you'll get to _transform_general and this line:

        for name, group in gen:

is iterating over each group generated by the call to groupby(column).

Ignoring the fast path/slow path stuff (not relevant ATM since the issue lies with the dtype of group), res for the first application of iloc[-1] will be:

In [28]: df.loc[np.s_['bar', :], :].iloc[-1]
Out[28]:
C    0.2533
D    0.3017
E    3.0000
Name: (bar, three), dtype: float64

Notice that the dtype is float64 even for E, which is not the original type. As @jreback says this because of an upcast and said upcast will make a copy. When you assign to that, it isn't reflected in the original frame, so you need to essentially do the operation over blocks (blocks are the blocks of the different dtypes: this is how the underlying data in pandas objects are represented).

cpcloud · 2014-07-31T12:44:26Z

This is actually a simple enough problem that you can gain some understanding about how everything works :)

amelio-vazquez-reina · 2014-07-31T12:57:28Z

Thanks @cpcloud and @jreback I greatly appreciate the help! Today my schedule is tight, but I should have time tomorrow afternoon/evening to look into this.

amelio-vazquez-reina · 2014-08-29T14:21:10Z

I think I understand the problem. I presume I need to iterate through dtypes (blocks of columns) solving the problem for one dtype at a time, correct?

jreback · 2014-08-29T14:27:30Z

yep u need to do this block by block (which by definition are a single dtype)
use the blocks as this is internal (so u avoid lots of creation overhead - blocks are thin wrappers in ndarrays)

jreback · 2014-09-29T14:49:32Z

@ribonoous lmk if you have made progress on this. I think have to address (for 0.15.1 for sure). to include Categorical (so by definition need to do it block-by-block), see #8065 .

One possibility is to have a fast-path where you can do integer-like as the current path (e.g. float/int data). And have a 2nd path for multi-dtypes (which basically accumulates the results in a list then concats at the end). which is slower, but more robust.

amelio-vazquez-reina · 2014-12-02T19:13:14Z

Thanks @jreback I apologize for not having responded earlier. Work has become hectic in recent weeks and haven't had a chance to look into it. Everyone was very helpful in the thread and am very sorry I haven't had a chance to pick it up from where @cpcloud left it. I certainly want to learn to contribute, but can't promise anything at this point :((.

If any of you decides to fix it, I will make sure to see how how do it, since at a minimum it will definitely be educative.

jreback · 2018-07-06T21:20:47Z

looks good in master, needs tests

amelio-vazquez-reina changed the title ~~Unable to use lambda on groupby objects~~ Unable to use retrieve same-index dataframe with the last item per group Jul 30, 2014

cpcloud closed this as completed Jul 30, 2014

cpcloud reopened this Jul 30, 2014

amelio-vazquez-reina changed the title ~~Unable to use retrieve same-index dataframe with the last item per group~~ Unable to retrieve same-index dataframe with the last item per group Jul 30, 2014

cpcloud added the Bug label Jul 30, 2014

cpcloud self-assigned this Jul 30, 2014

cpcloud added this to the 0.15.0 milestone Jul 30, 2014

cpcloud added Dtypes labels Jul 30, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 10, 2014

jreback mentioned this issue Sep 29, 2014

BUG/ENH: categorical returned during a transform #8065

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 6, 2018

jreback added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 6, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018

mroeschke mentioned this issue Dec 30, 2019

TST: Regression testing for fixed issues #30554

Merged

9 tasks

simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Dec 30, 2019

TomAugspurger closed this as completed in #30554 Dec 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to retrieve same-index dataframe with the last item per group #7883

Unable to retrieve same-index dataframe with the last item per group #7883

amelio-vazquez-reina commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

amelio-vazquez-reina commented Jul 30, 2014

cpcloud commented Jul 30, 2014

amelio-vazquez-reina commented Jul 30, 2014

amelio-vazquez-reina commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

amelio-vazquez-reina commented Jul 30, 2014

cpcloud commented Jul 30, 2014

jreback commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

jreback commented Jul 30, 2014

cpcloud commented Jul 30, 2014

amelio-vazquez-reina commented Jul 31, 2014

jreback commented Jul 31, 2014

cpcloud commented Jul 31, 2014

cpcloud commented Jul 31, 2014

amelio-vazquez-reina commented Jul 31, 2014

amelio-vazquez-reina commented Aug 29, 2014

jreback commented Aug 29, 2014

jreback commented Sep 29, 2014

amelio-vazquez-reina commented Dec 2, 2014 •

edited

Loading

jreback commented Jul 6, 2018

Unable to retrieve same-index dataframe with the last item per group #7883

Unable to retrieve same-index dataframe with the last item per group #7883

Comments

amelio-vazquez-reina commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

amelio-vazquez-reina commented Jul 30, 2014

cpcloud commented Jul 30, 2014

amelio-vazquez-reina commented Jul 30, 2014

amelio-vazquez-reina commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

amelio-vazquez-reina commented Jul 30, 2014

cpcloud commented Jul 30, 2014

jreback commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

cpcloud commented Jul 30, 2014

jreback commented Jul 30, 2014

cpcloud commented Jul 30, 2014

amelio-vazquez-reina commented Jul 31, 2014

jreback commented Jul 31, 2014

cpcloud commented Jul 31, 2014

cpcloud commented Jul 31, 2014

amelio-vazquez-reina commented Jul 31, 2014

amelio-vazquez-reina commented Aug 29, 2014

jreback commented Aug 29, 2014

jreback commented Sep 29, 2014

amelio-vazquez-reina commented Dec 2, 2014 • edited Loading

jreback commented Jul 6, 2018

amelio-vazquez-reina commented Dec 2, 2014 •

edited

Loading