Skip to content

Unable to retrieve same-index dataframe with the last item per group #7883

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amelio-vazquez-reina opened this issue Jul 30, 2014 · 29 comments · Fixed by #30554
Closed

Unable to retrieve same-index dataframe with the last item per group #7883

amelio-vazquez-reina opened this issue Jul 30, 2014 · 29 comments · Fixed by #30554
Assignees
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@amelio-vazquez-reina
Copy link
Contributor

The original thread is here.

> df
                   C         D  E
A    B                           
bar  one   -1.350006  0.260339  2
     three -0.236451 -0.056614  0
flux six   -0.515571 -0.155078  1
     three -0.365032  1.055669  2
foo  five   1.145811  1.514191  0
     one   -0.108427 -0.643262  2
     two    0.286897  0.405798  2
     two   -1.452286  1.149264  2

The goal is to get a dataframe with the same index with each entry having the last value in the groups associated with A, i.e.

> df
                   C         D  E
A    B                           
bar  one   -0.236451 -0.056614  2
     three -0.236451 -0.056614  2
flux six   -0.365032  1.055669  2
     three -0.365032  1.055669  2
foo  five  -1.452286  1.149264  2
     one   -1.452286  1.149264  2
     two   -1.452286  1.149264  2
     two   -1.452286  1.149264  2

However, none of the following work:

  • df.groupby(level='A').apply(lambda x: x[-1])
  • df.groupby(level='A').transform(pd.Series.last)
  • df.groupby(level='A').transform(lambda x: x.iloc[-1])
@amelio-vazquez-reina amelio-vazquez-reina changed the title Unable to use lambda on groupby objects Unable to use retrieve same-index dataframe with the last item per group Jul 30, 2014
@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

There's one thing you didn't try:

In [37]: df
Out[37]:
                 C       D       E
A    B
bar  one   -0.9384  0.3450 -0.8715
     three  1.1107  1.9896 -0.4797
flux six    0.5059  0.0494  0.0993
     three  0.3148 -0.7710  0.2409
foo  five   2.3012 -0.3600 -0.9216
     one   -0.3965 -1.0410 -0.0600
     two    1.2348  0.6059 -0.8019
     two    0.0912  0.5812 -0.2724

In [38]: df.groupby(level='A').transform(lambda x: x.iloc[-1])
Out[38]:
                 C       D       E
A    B
bar  one    1.1107  1.9896 -0.4797
     three  1.1107  1.9896 -0.4797
flux six    0.3148 -0.7710  0.2409
     three  0.3148 -0.7710  0.2409
foo  five   0.0912  0.5812 -0.2724
     one    0.0912  0.5812 -0.2724
     two    0.0912  0.5812 -0.2724
     two    0.0912  0.5812 -0.2724

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

These lines from EdChum's answer

last = lambda x: x.iloc[-1]
df.groupby(level='A').transform(last)

are not the same as doing

df.groupby(level='A').transform(last = lambda x: x.iloc[-1])

In the second one you're passing a keyword argument to transform which will of course raise a TypeError because it expects at least two positional arguments and you're only passing a keyword argument and no positionals (except the implicit self that comes with every method)

@cpcloud cpcloud closed this as completed Jul 30, 2014
@amelio-vazquez-reina
Copy link
Contributor Author

Thanks @cpcloud I actually mistyped it at the beginning of this issue. I did try (and just tried again):

last = lambda x: x.iloc[-1]
df.groupby(level='A').transform(last)

but I do not get the desired dataframe. I get the original frame.

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

@ribonoous Can you show a reproducible example? Hard to debug without seeing the data. Can make random data too, if you can't show the data where it's failing, I just need to be able to reproduce it.

@amelio-vazquez-reina
Copy link
Contributor Author

Thanks @cpcloud . The example I included above is from my current session. I understand that your answer should work, but it's literally not working on my machine as I type it:

N.In <121>: df
N.Out<121>: 
                   C         D  E
A    B                           
bar  one   -0.507507 -0.719877  0
     three  0.735206  0.476516  0
flux six    0.347799 -0.201347  2
     three -0.271447 -0.936737  2
foo  five   0.930560 -0.237395  2
     one   -1.759220 -0.564822  1
     two    1.759313  0.277731  2
     two   -0.506662  0.758981  1

N.In <122>: df.groupby(level='A').transform(lambda x: x.iloc[-1])
N.Out<122>: 
                   C         D  E
A    B                           
bar  one   -0.507507 -0.719877  0
     three  0.735206  0.476516  0
flux six    0.347799 -0.201347  2
     three -0.271447 -0.936737  2
foo  five   0.930560 -0.237395  2
     one   -1.759220 -0.564822  1
     two    1.759313  0.277731  2
     two   -0.506662  0.758981  1

I am on Python 3 in an embedded shell in IPython with the following packages:

$ pip freeze
Cython==0.20.1
DataShape==0.2
Flask==0.10.1
Jinja2==2.7.2
MarkupSafe==0.18
Pillow==2.4.0
PyYAML==3.11
Pygments==1.6
SQLAlchemy==0.9.4
Sphinx==1.2.2
Werkzeug==0.9.6
XlsxWriter==0.5.5
appnope==0.0.6
argcomplete==0.6.7
astropy==0.3.2
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.3.1
binstar==0.5.3
bitarray==0.8.1
blaze==0.5
blz==0.6.2
bokeh==0.4.4
cdecimal==2.3
colorama==0.2.7
configobj==5.0.5
docutils==0.11
elpy==1.4.2
flake8==2.2.2
future==0.12.1
greenlet==0.4.2
h5py==2.3.0
ipython==2.1.0
itsdangerous==0.24
jdcal==1.0
llvmpy==0.12.6
lpsolve55==5.5.2.0
lxml==3.3.5
matplotlib==1.3.1
mccabe==0.2.1
mock==1.0.1
multipledispatch==0.4.3
networkx==1.8.1
nose==1.3.3
numba==0.13.2
numexpr==2.3.1
numpy==1.8.1
openpyxl==1.8.5
pandas==0.14.1
patsy==0.2.1
pep8==1.5.7
ply==3.4
psutil==2.1.1
psycopg2==2.5.3
py==1.4.20
py4j==0.8.1
pycosat==0.6.1
pycparser==2.10
pycrypto==2.6.1
pyflakes==0.8.1
pymc==2.3.4
pyparsing==2.0.1
pytest==2.5.2
python-dateutil==2.1
pytz==2014.4
pyzmq==14.3.0
redis==2.9.1
requests==2.3.0
rope-py3k==0.9.4-1
runipy==0.1.0
scikit-image==0.10.0
scikit-learn==0.15.0
scipy==0.14.0
seaborn==0.4.dev
six==1.7.3
statsmodels==0.5.0
sympy==0.7.5
tables==3.1.1
tornado==3.2.1
ujson==1.33
xlrd==0.9.3

@cpcloud cpcloud reopened this Jul 30, 2014
@amelio-vazquez-reina
Copy link
Contributor Author

@cpcloud I just tried it on a fresh IPython session (no terminal embedded):

def create_random_multi_index():
  df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                            'foo', 'flux', 'foo', 'flux'],  
                     'B' : ['one', 'one', 'two', 'three',
                            'two', 'six', 'five', 'three'],
                     'C' : randn(8), 'D' : randn(8), 'E': randint(0,3, size=(8,))})
  df.set_index(['A', 'B'], inplace=True)
  df.sort_index(inplace=True)
  return df
In [8]: df = create_random_multi_index()

In [9]: df
Out[9]: 
                   C         D  E
A    B                           
bar  one    0.113720  1.000103  1
     three  0.778840 -0.546913  0
flux six    1.688509  1.541020  2
     three  2.973229 -0.063245  1
foo  five   0.208310  1.268056  0
     one    0.404996 -1.332461  2
     two   -1.038567  0.376829  1
     two    1.528148  1.386106  0

In [10]: df.groupby(level='A').transform(lambda x: x.iloc[-1])
Out[10]: 
                   C         D  E
A    B                           
bar  one    0.113720  1.000103  1
     three  0.778840 -0.546913  0
flux six    1.688509  1.541020  2
     three  2.973229 -0.063245  1
foo  five   0.208310  1.268056  0
     one    0.404996 -1.332461  2
     two   -1.038567  0.376829  1
     two    1.528148  1.386106  0

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

ok let me try

@amelio-vazquez-reina amelio-vazquez-reina changed the title Unable to use retrieve same-index dataframe with the last item per group Unable to retrieve same-index dataframe with the last item per group Jul 30, 2014
@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

ok now i can repro

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

@ribonoous thx for the example!

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

interesting seems to be related to types

@cpcloud cpcloud added the Bug label Jul 30, 2014
@cpcloud cpcloud self-assigned this Jul 30, 2014
@cpcloud cpcloud added this to the 0.15.0 milestone Jul 30, 2014
@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

In [12]: df
Out[12]:
                 C       D  E
A    B
bar  one   -0.4580  0.1261  2
     three  0.9299 -1.4628  0
flux six   -0.4497  0.2510  1
     three  0.6542 -0.7250  0
foo  five  -0.2595  1.5363  2
     one    0.3749 -0.3560  0
     two    0.1009  1.5566  1
     two   -0.9102 -1.2865  0

In [13]: df[['C', 'D']].groupby(level='A').transform(lambda x: x.iloc[-1])
Out[13]:
                 C       D
A    B
bar  one    0.9299 -1.4628
     three  0.9299 -1.4628
flux six    0.6542 -0.7250
     three  0.6542 -0.7250
foo  five  -0.9102 -1.2865
     one   -0.9102 -1.2865
     two   -0.9102 -1.2865
     two   -0.9102 -1.2865

In [14]: df[['E']].groupby(level='A').transform(lambda x: x.iloc[-1])
Out[14]:
            E
A    B
bar  one    0
     three  0
flux six    0
     three  0
foo  five   0
     one    0
     two    0
     two    0

In [15]: df[['D', 'E']].groupby(level='A').transform(lambda x: x.iloc[-1])
Out[15]:
                 D  E
A    B
bar  one    0.1261  2
     three -1.4628  0
flux six    0.2510  1
     three -0.7250  0
foo  five   1.5363  2
     one   -0.3560  0
     two    1.5566  1
     two   -1.2865  0

In [16]: df[['C', 'E']].groupby(level='A').transform(lambda x: x.iloc[-1])
Out[16]:
                 C  E
A    B
bar  one   -0.4580  2
     three  0.9299  0
flux six   -0.4497  1
     three  0.6542  0
foo  five  -0.2595  2
     one    0.3749  0
     two    0.1009  1
     two   -0.9102  0

@amelio-vazquez-reina
Copy link
Contributor Author

Thanks @cpcloud More than happy to help. Hopefully I will be able to contribute fixing the code and not just reporting examples soon.

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

let us know if you need any help!

@jreback
Copy link
Contributor

jreback commented Jul 30, 2014

issue is here: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2799

completely unsafe to do this with multiple dtypes (well it 'works', but no effect).

need to do this block by block and will work

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

hm darn, well gladd you asved me from that rabbit hole

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

this should at least raise on mixed dtypes

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

@ribonoous care to make a pull request, this is pretty straightforward

@jreback
Copy link
Contributor

jreback commented Jul 30, 2014

well....not that hard to fix ....

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2014

ok ... in that case @ribonoous offer still stands!

@amelio-vazquez-reina
Copy link
Contributor Author

Thanks @cpcloud I am looking at the line that jreback mentioned. This is my first time looking inside of pandas, as well as my first time contributing to an open source project on GitHub. Would you mind giving me a hint of where to look into?

From the discussion I inferred that the problem occurs when the dataframe has mixed dtypes. What is the problem with the existing block?:

            if isinstance(res, Series):
                if res.index.is_(obj.index):
                    group.T.values[:] = res
                else:
                    group.values[:] = res

                applied.append(group)
            else:

?

@jreback
Copy link
Contributor

jreback commented Jul 31, 2014

@ribonoous

the .values takes the pandas object and returns the underlying ndarray and works fine if its a snigle dtype. if its multiple, the this is an upcast COPY, e.g. mxiing floats/object gets you an object dtype. numpy does do welll with this either as assignment is broken for object dtype.

Something like

group.loc[:] = res WILL work, but this is slow, so need to do something like this

# mixed dtypes
if group._is_mixed_type:
    for b in group._data.blocks:
           b.setitem(slice(None),res.iloc[b.mgr_locs.as_array])
else:
    # do what it was doing above (as this is faster)

maybe need to add some timings for this as well. (and maybe a doc warning)

@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2014

Sure. I'll outline what I think could be the start of a solution and we can iterate from there.

Let's say you have df

In [21]: A, B = ['bar'] * 2 + ['flux'] * 2 + ['foo'] * 4, ['one', 'three', 'six', 'three', 'five', 'one', 'two', 'two']

In [22]: mi = MultiIndex.from_tuples(list(zip(A, B)), names=['A', 'B'])

In [23]: n = len(mi.values)

In [24]: df = DataFrame({'C': np.random.randn(n), 'D': np.random.randn(n), 'E': np.random.randint(5, size=n)}, index=mi)

In [25]: df
Out[25]:
                 C       D  E
A    B
bar  one    0.4615 -0.7685  2
     three  0.2533  0.3017  3
flux six   -0.1020 -1.3196  0
     three  0.2417  0.3707  2
foo  five   1.8301 -0.1334  1
     one    1.1182  1.2973  0
     two   -0.3256  3.0025  0
     two   -1.2568  1.7894  3

In [26]: df.dtypes
Out[26]:
C    float64
D    float64
E      int64
dtype: object

And you want to execute

df.groupby(level='A').transform(lambda x: x.iloc[-1])

Eventually you'll get to _transform_general and this line:

        for name, group in gen:

is iterating over each group generated by the call to groupby(column).

Ignoring the fast path/slow path stuff (not relevant ATM since the issue lies with the dtype of group), res for the first application of iloc[-1] will be:

In [28]: df.loc[np.s_['bar', :], :].iloc[-1]
Out[28]:
C    0.2533
D    0.3017
E    3.0000
Name: (bar, three), dtype: float64

Notice that the dtype is float64 even for E, which is not the original type. As @jreback says this because of an upcast and said upcast will make a copy. When you assign to that, it isn't reflected in the original frame, so you need to essentially do the operation over blocks (blocks are the blocks of the different dtypes: this is how the underlying data in pandas objects are represented).

@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2014

This is actually a simple enough problem that you can gain some understanding about how everything works :)

@amelio-vazquez-reina
Copy link
Contributor Author

Thanks @cpcloud and @jreback I greatly appreciate the help! Today my schedule is tight, but I should have time tomorrow afternoon/evening to look into this.

@amelio-vazquez-reina
Copy link
Contributor Author

I think I understand the problem. I presume I need to iterate through dtypes (blocks of columns) solving the problem for one dtype at a time, correct?

@jreback
Copy link
Contributor

jreback commented Aug 29, 2014

yep u need to do this block by block (which by definition are a single dtype)
use the blocks as this is internal (so u avoid lots of creation overhead - blocks are thin wrappers in ndarrays)

@jreback
Copy link
Contributor

jreback commented Sep 29, 2014

@ribonoous lmk if you have made progress on this. I think have to address (for 0.15.1 for sure). to include Categorical (so by definition need to do it block-by-block), see #8065 .

One possibility is to have a fast-path where you can do integer-like as the current path (e.g. float/int data). And have a 2nd path for multi-dtypes (which basically accumulates the results in a list then concats at the end). which is slower, but more robust.

@amelio-vazquez-reina
Copy link
Contributor Author

amelio-vazquez-reina commented Dec 2, 2014

Thanks @jreback I apologize for not having responded earlier. Work has become hectic in recent weeks and haven't had a chance to look into it. Everyone was very helpful in the thread and am very sorry I haven't had a chance to pick it up from where @cpcloud left it. I certainly want to learn to contribute, but can't promise anything at this point :((.

If any of you decides to fix it, I will make sure to see how how do it, since at a minimum it will definitely be educative.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 6, 2018
@jreback jreback added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 6, 2018
@jreback
Copy link
Contributor

jreback commented Jul 6, 2018

looks good in master, needs tests

@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Dec 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants