ENH: Keep series name when merging GroupBy result #6068

bburan-galenea
Contributor

closes #6124
closes #6265

Use case

This will facilitate DataFrame group/apply transformations when using a function that returns a Series. Right now, if we perform the following:

import pandas
df = pandas.DataFrame(
        {'a':  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
         'b':  [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
         'c':  [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
         'd':  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
         })

def count_values(df):
    return pandas.Series({'count': df['b'].sum(), 'mean': df['c'].mean()}, name='metrics')

result = df.groupby('a').apply(count_values)
print(result.stack().reset_index())

We get the following output:

   a level_1    0
0  0   count  2.0
1  0    mean  0.5
2  1   count  2.0
3  1    mean  0.5
4  2   count  2.0
5  2    mean  0.5

[6 rows x 3 columns]

Ideally, the series name should be preserved and propagated through these operations such that we get the following output:

   a metrics    0
0  0   count  2.0
1  0    mean  0.5
2  1   count  2.0
3  1    mean  0.5
4  2   count  2.0
5  2    mean  0.5

[6 rows x 3 columns]

Currently, one way to achieve this is to set the column name after the fact:

result = df.groupby('a').apply(count_values)
result.columns.name = 'metrics'
print(result.stack().reset_index())

However, this has two problems: 1) it adds an extra line of code, and 2) the name of the series created in the applied function may not be known in the calling code (so we can't set the result.columns.name attribute correctly).

The other work-around is to name the index of the series:

def count_values(df):
    series = pandas.Series({'count': df['b'].sum(), 'mean': df['c'].mean()})
    series.index.name = 'metrics'
    return series

During the group/apply operation, this pull request checks whether series.index has its name attribute set. If it is not set, the index.name attribute is set to the name of the series (thus ensuring the name propagates).
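
The check described above can be sketched roughly as follows. This is only an illustration of the rule, not the code in this pull request: `propagate_series_name` is a hypothetical helper name, and the sketch uses `Series.rename_axis`, which is available in current pandas releases.

```python
import pandas

def propagate_series_name(series):
    """Hypothetical helper: give an unnamed index the name of the series."""
    # Only fill in the index name when it is missing and the series has a name;
    # an index name set explicitly by the user is left untouched.
    if series.index.name is None and series.name is not None:
        # rename_axis returns a new Series, so the input is not modified.
        return series.rename_axis(series.name)
    return series

unnamed_index = pandas.Series({'count': 2, 'mean': 0.5}, name='metrics')
propagated = propagate_series_name(unnamed_index)

already_named = unnamed_index.rename_axis('stats')
kept = propagate_series_name(already_named)
```

Here `propagated.index.name` becomes `'metrics'`, while `kept.index.name` stays `'stats'` because an explicitly named index takes precedence.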

@jreback
Contributor

jreback commented Jan 24, 2014

does this have an associated issue?

@bburan-galenea
Contributor Author

I didn't create an issue. Should I?

@jreback
Contributor

jreback commented Jan 24, 2014

you don't have to, but pls put an example of what is wrong, and then what this PR does in the top section

e.g. this feature is missing, this PR makes it do .....

@bburan-galenea
Contributor Author

Updated

@jreback
Contributor

jreback commented Jan 25, 2014

actually...why don't you create an issue for this....that way can tag it with some labels.....
you can put the top part into the issue (e.g. the rationale / code sample) instead

@bburan-galenea
Contributor Author

No problem. Made an issue. Thanks for looking at this.

jreback added a commit that referenced this pull request Jan 29, 2014
Updated docs to reflect a pagination bug that was fixed. Closes: issue #6068
@jreback
Contributor

jreback commented Feb 15, 2014

can you rebase?

@jreback
Contributor

jreback commented Feb 18, 2014

how's this coming?

@bburan-galenea
Contributor Author

@jreback - Second commit should close #6265 and I rebased. Had to push with the --force option. Hope that's OK.

@jreback
Contributor

jreback commented Feb 24, 2014

yes force pushing is correct

@jreback
Contributor

jreback commented Feb 24, 2014

pls add tests for #6265 (and a release note mention) as well..thanks

@bburan-galenea
Contributor Author

Rebased, added tests, made sure tests passed, fixed as per discussion in #6124 and added release note mention.

@bburan-galenea
Contributor Author

FYI, did another rebase/force push.

@jreback
Contributor

jreback commented Mar 5, 2014

this looks really good!

can you add the example (in the top of the PR) to the groupby.rst/examples section (just the fixed version)? If you are really adventurous, you can put a 1-liner for v0.14.0 under API changes with a reference to this. It's not an API change, but useful to know.

@bburan-galenea
Contributor Author

Added the example. There was already a note in the release file about the changes (it's more than one line though). Let me know if that should be shortened.

@jreback
Contributor

jreback commented Mar 5, 2014

minor issue
can u annotate the test where 6124 is tested?

When possible, attempt to preserve the series name when performing groupby
operations.  This facilitates reshaping/indexing operations on the result of the
groupby/apply or groupby/agg operation.  Fixes GH6265 and GH6124.  Added
example to groupby.rst and description to API changes for v0.14.
@bburan-galenea
Contributor Author

Done. Amended previous commit and force-updated.

jreback added a commit that referenced this pull request Mar 5, 2014
…_apply_series_name

ENH: Keep series name when merging GroupBy result
@jreback jreback merged commit 1fca5be into pandas-dev:master Mar 5, 2014
@jreback
Contributor

jreback commented Mar 5, 2014

thanks @bburan-galenea this was a great effort!
