Skip to content

BUG: Preserve subclassing with groupby operations #28573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

alkasm
Copy link

@alkasm alkasm commented Sep 23, 2019

Preserve subclassing with groupby operations

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Looks good overall.

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Groupby labels Sep 23, 2019
@pep8speaks
Copy link

pep8speaks commented Sep 29, 2019

Hello @alkasm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-29 08:12:48 UTC

@@ -134,6 +134,23 @@ def func(dataf):
result = df.groupby("X", squeeze=False).count()
assert isinstance(result, DataFrame)

# https://github.com/pandas-dev/pandas/issues/28330
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally we split these into several tests, but could be a followup

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test change, pls merge master, ping on green.

@@ -134,6 +134,23 @@ def func(dataf):
result = df.groupby("X", squeeze=False).count()
assert isinstance(result, DataFrame)

# https://github.com/pandas-dev/pandas/issues/28330
# Test groupby operations on subclassed dataframes/series

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually these tests are different that the above ones, can you make a new test

cdf = tm.SubclassedDataFrame(
[
{"val1": 1, "val2": 20},
{"val1": 1, "val2": 19},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also pls parameterize over as many functions that you can here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use reduction_func from conftest. Probably worth adding something similar for transforming functions as well

@jreback jreback added this to the 1.0 milestone Oct 2, 2019
@jbrockmendel
Copy link
Member

@alkasm the fixes here look great. can you rebase and address comments about making the tests more thorough

{"val1": 1, "val2": 20},
{"val1": 1, "val2": 19},
{"val1": 2, "val2": 27},
{"val1": 2, "val2": 12},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update to comments here

@jreback
Copy link
Contributor

jreback commented Oct 18, 2019

pls also merge master

@jbrockmendel
Copy link
Member

jbrockmendel commented Oct 31, 2019

@alkasm can you rebase

@alkasm
Copy link
Author

alkasm commented Nov 4, 2019

Sorry for the delay y'all. Should be able to clean up, write new tests, and rebase this week sometime.

@alkasm
Copy link
Author

alkasm commented Nov 14, 2019

Due to the many changes from #28835 I found it easier to just make a new branch. Here's my current branch off master: https://github.com/alkasm/pandas/tree/alkasm/groupby-preserve-subclass (I can force push the new changes to this current branch, or just open a new PR if it's more helpful).

However there's a few issues I'm running into getting this to work in all cases; and I'm having trouble locating the source of the issues.

Borrowing from tests/groupby/test_transform.py and the suggestion from @WillAyd to use conftest.reduction_kernels, I wrote a short script to test the transforms and aggregations to see if the subclasses are returned properly:

import pandas as pd
import numpy as np
from pandas.util.testing import SubclassedSeries, SubclassedDataFrame
from pandas.tests.groupby.conftest import reduction_kernels

objs = [
    SubclassedDataFrame(
        dict(a=[0, 0, 0, 1, 1, 1], b=range(6)), index=["A", "B", "C", "D", "E", "F"]
    ),
    SubclassedSeries([0, 0, 0, 1, 1, 1], index=["A", "B", "C", "D", "E", "F"]),
]

for obj in objs:
    g = obj.groupby(np.repeat([0, 1], 3))
    for k in reduction_kernels:
        if k in ("ngroup", "size"):
            continue
        args = {"nth": [0], "quantile": [0.5]}.get(k, [])
        xform = g.transform(k, *args)
        agg = g.agg(k, *args)
        if not (type(obj) == type(xform) == type(agg)):
            print(f"{k:>10}, {obj.__class__.__name__}, {xform.__class__.__name__}, {agg.__class__.__name__}")

and from running this script, it's seen that three of the aggregation functions don't produce the proper subclasses (i.e. somewhere they are not using self.obj._constructor):

      skew, SubclassedDataFrame, SubclassedDataFrame, DataFrame
   nunique, SubclassedDataFrame, SubclassedDataFrame, DataFrame
       mad, SubclassedDataFrame, SubclassedDataFrame, DataFrame

I can't really find how these functions end up constructing their result as a DataFrame. Can anyone with some more knowledge here point me in the right direction? Hopefully that test is sufficient so that this is otherwise ready to go.

@jreback
Copy link
Contributor

jreback commented Nov 14, 2019

can you merge master

@jbrockmendel
Copy link
Member

Based on a quick look, I think you're best bet is in core.groupby.generic _wrap_applied_output methods.

I also recently noticed that core.apply has a place where it does f(Series([])) instead of f(obj._constructor([])) that might not play well with subclassing.

@jreback
Copy link
Contributor

jreback commented Nov 20, 2019

can you merge master

@WillAyd
Copy link
Member

WillAyd commented Dec 17, 2019

@alkasm can you fix up merge conflicts?

@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

can you merge master and will look again

@TomAugspurger
Copy link
Contributor

@alkasm do you have time to fix the merge conflicts? We're doing a release this week.

@jreback jreback removed this from the 1.0 milestone Jan 3, 2020
@WillAyd
Copy link
Member

WillAyd commented Jan 17, 2020

Nice PR but looks like this has gone stale. Ping if you'd like to pick back up and can address merge conflicts / comments

@WillAyd WillAyd closed this Jan 17, 2020
@JBGreisman
Copy link
Contributor

I recently came across this bug/PR while working on a project that uses a subclassed DataFrame. It seems like this PR has been closed and the issue removed from "Contributions Welcome", but the initial issue #28330 is still open.

I just wanted to check on the current status here -- it seems like this PR was abandoned. If this were to be picked up, would you recommend starting from here or in a fresh PR?

@WillAyd
Copy link
Member

WillAyd commented Apr 27, 2020

@JBGreisman if interested you can merge these changes into your own local branch and push up a new PR

@alkasm
Copy link
Author

alkasm commented May 2, 2020

@JBGreisman thank you for picking this back up!

@JBGreisman
Copy link
Contributor

@alkasm No worries -- thanks for laying the groundwork!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby and resample methods do not preserve subclassed data structures
7 participants