Skip to content

BUG: add reset logic for Grouper if new obj is passed in (#26564) #29800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

alichaudry
Copy link

@alichaudry alichaudry commented Nov 22, 2019

@jbrockmendel
Copy link
Member

needs tests that the relevant bug is fixed

@jreback
Copy link
Contributor

jreback commented Nov 22, 2019

this PR is likely duplicating this one: #29131

@alichaudry
Copy link
Author

this PR is likely duplicating this one: #29131

Maybe I'll add my test here and see if it passes with #29131 .

@alichaudry
Copy link
Author

@jreback I wrote a test here and ran it against #29131 which doesn't pass the test. So this is still an outstanding issue:

=========================================================== test session starts ============================================================
platform linux -- Python 3.7.3, pytest-5.3.0, py-1.8.0, pluggy-0.13.0
rootdir: /home/ali/repo/pydatascience/pandas-dev2, inifile: setup.cfg
plugins: cov-2.8.1, xdist-1.30.0, hypothesis-4.47.1, forked-1.1.2
collected 1 item                                                                                                                           

pandas/tests/resample/test_resampler_grouper.py F                                                                                    [100%]

================================================================= FAILURES =================================================================
__________________________________________________ test_same_grouper_on_different_frames ___________________________________________________

    def test_same_grouper_on_different_frames():
    
        df1 = pd.DataFrame(
            [["a", 1, 2], ["a", 4, 5], ["b", 2, 3]], columns=["type", "num1", "num2"],
        )
        df1["date"] = pd.to_datetime(["05/29/2019", "05/28/2019", "05/27/2019"])
    
        df2 = pd.DataFrame([["c", 6, 7], ["d", 8, 9]], columns=["type", "num1", "num2"],)
        df2["date"] = pd.to_datetime(["02/12/2018", "03/13/2018"])
    
        groupbys = ["type", pd.Grouper(key="date", freq="1D")]
    
        df1.groupby(groupbys).sum()
>       df2.groupby(groupbys).count()

pandas/tests/resample/test_resampler_grouper.py:294: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/groupby/generic.py:1674: in count
    ids, _, ngroups = self.grouper.group_info
pandas/_libs/properties.pyx:34: in pandas._libs.properties.CachedProperty.__get__
    val = self.func(obj)
pandas/core/groupby/ops.py:299: in group_info
    comp_ids, obs_group_ids = self._get_compressed_labels()
pandas/core/groupby/ops.py:315: in _get_compressed_labels
    all_labels = [ping.labels for ping in self.groupings]
pandas/core/groupby/ops.py:315: in <listcomp>
    all_labels = [ping.labels for ping in self.groupings]
pandas/core/groupby/grouper.py:405: in labels
    self._make_labels()
pandas/core/groupby/grouper.py:424: in _make_labels
    labels = self.grouper.label_info
pandas/_libs/properties.pyx:34: in pandas._libs.properties.CachedProperty.__get__
    val = self.func(obj)
pandas/core/groupby/ops.py:310: in label_info
    sorter = np.lexsort((labels, self.indexer))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = ((array([ 0, 29]), array([2, 1, 0])),), kwargs = {}, relevant_args = (array([ 0, 29]), array([2, 1, 0]))

>   ???
E   ValueError: all keys need to be the same shape

<__array_function__ internals>:6: ValueError
============================================================ 1 failed in 0.17s =============================================================

@jreback
Copy link
Contributor

jreback commented Nov 25, 2019

@jreback I wrote a test here and ran it against #29131 which doesn't pass the test. So this is still an outstanding issue:

=========================================================== test session starts ============================================================
platform linux -- Python 3.7.3, pytest-5.3.0, py-1.8.0, pluggy-0.13.0
rootdir: /home/ali/repo/pydatascience/pandas-dev2, inifile: setup.cfg
plugins: cov-2.8.1, xdist-1.30.0, hypothesis-4.47.1, forked-1.1.2
collected 1 item                                                                                                                           

pandas/tests/resample/test_resampler_grouper.py F                                                                                    [100%]

================================================================= FAILURES =================================================================
__________________________________________________ test_same_grouper_on_different_frames ___________________________________________________

    def test_same_grouper_on_different_frames():
    
        df1 = pd.DataFrame(
            [["a", 1, 2], ["a", 4, 5], ["b", 2, 3]], columns=["type", "num1", "num2"],
        )
        df1["date"] = pd.to_datetime(["05/29/2019", "05/28/2019", "05/27/2019"])
    
        df2 = pd.DataFrame([["c", 6, 7], ["d", 8, 9]], columns=["type", "num1", "num2"],)
        df2["date"] = pd.to_datetime(["02/12/2018", "03/13/2018"])
    
        groupbys = ["type", pd.Grouper(key="date", freq="1D")]
    
        df1.groupby(groupbys).sum()
>       df2.groupby(groupbys).count()

pandas/tests/resample/test_resampler_grouper.py:294: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/groupby/generic.py:1674: in count
    ids, _, ngroups = self.grouper.group_info
pandas/_libs/properties.pyx:34: in pandas._libs.properties.CachedProperty.__get__
    val = self.func(obj)
pandas/core/groupby/ops.py:299: in group_info
    comp_ids, obs_group_ids = self._get_compressed_labels()
pandas/core/groupby/ops.py:315: in _get_compressed_labels
    all_labels = [ping.labels for ping in self.groupings]
pandas/core/groupby/ops.py:315: in <listcomp>
    all_labels = [ping.labels for ping in self.groupings]
pandas/core/groupby/grouper.py:405: in labels
    self._make_labels()
pandas/core/groupby/grouper.py:424: in _make_labels
    labels = self.grouper.label_info
pandas/_libs/properties.pyx:34: in pandas._libs.properties.CachedProperty.__get__
    val = self.func(obj)
pandas/core/groupby/ops.py:310: in label_info
    sorter = np.lexsort((labels, self.indexer))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = ((array([ 0, 29]), array([2, 1, 0])),), kwargs = {}, relevant_args = (array([ 0, 29]), array([2, 1, 0]))

>   ???
E   ValueError: all keys need to be the same shape

<__array_function__ internals>:6: ValueError
============================================================ 1 failed in 0.17s =============================================================

@alichaudry that PR is not merged, it may or may not pass the test you created; its not clear if this is a different issue or not. happy for you to investigate.

@alichaudry
Copy link
Author

@alichaudry that PR is not merged, it may or may not pass the test you created; its not clear if this is a different issue or not. happy for you to investigate.

@jreback So what I meant is I took the following steps:

  1. cloned [BUG] Fixed behavior of DataFrameGroupBy.apply to respect _group_selection_context #29131 locally (the PR you mentioned)
  2. built pandas from scratch in a separate conda env (given that PR author's code)
  3. put my test into the local version of [BUG] Fixed behavior of DataFrameGroupBy.apply to respect _group_selection_context #29131 and ran it, which failed

My point is, from what I can tell my test doesn't work on the code/changes in #29131, which means those changes don't address the bug that I found.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2019

@alichaudry well i suspect that that PR solves some of the problem

happy to take a partial or full patch

@jreback
Copy link
Contributor

jreback commented Jan 1, 2020

@alichaudry pls merge master

@WillAyd
Copy link
Member

WillAyd commented Feb 2, 2020

@alichaudry looks like CI is red - can you investigate and try to get passing?

@pep8speaks
Copy link

pep8speaks commented Feb 6, 2020

Hello @alichaudry! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-07 17:41:07 UTC

@alichaudry
Copy link
Author

@WillAyd At this point I'm not sure how to proceed with this PR. I know what the issue is, and I've written a test in this PR which has the correct pass/fail behavior. I also put in a hacky solution for pd.Grouper in this PR which passes my test but fails a number of other tests. I don't have the bandwidth right now to re-work the pd.Grouper class to fix the root cause so I'm not sure where to go with this.

This issue as I described it in #26564 comes from the fact that I use pd.Grouper in a stateless way -- but when I use the grouper in a group-by, say df_first.groupby(my_grouper), and then use the same grouper in another group-by: df_second.groupby(my_grouper) it fails as my_grouper was mutated by the first group-by and the internal state of the grouper object (and its attributes) makes it fail on the latter one.

@jbrockmendel
Copy link
Member

@alichaudry Will may have some more specific thoughts, but worst-case scenario you can make a PR with just the test and xfail it, so we'll know if this gets fixed down the road.

@alichaudry
Copy link
Author

@jbrockmendel followed your advice; updated the PR to include only the test and marked it as xfail. I am seeing some unrelated test errors (they seem to have to do with pyarrow/feather functionality) so I was wondering if you have any insight into this. I will still wait for feedback from @WillAyd.

@jbrockmendel
Copy link
Member

The pyarrow/feather failures are unrelated, you can ignore them for now.

@alichaudry
Copy link
Author

@WillAyd If you get a chance I'd love to hear your thoughts on this PR. It's passed all tests as expected (only failing unrelated pyarrow/feather tests).

@@ -295,3 +296,35 @@ def test_median_duplicate_columns():
result = df.resample("5s").median()
expected.columns = result.columns
tm.assert_frame_equal(result, expected)


@pytest.mark.xfail(reason="marked as xfail for: #26564")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

giving a GH reference is good, but the rest of this doesnt give much information. is there a single-line description that a reader would find informative?

can you add a "GH" in front of the "#" pls

ca

@jbrockmendel
Copy link
Member

The OP says that this closes #26564, is that still accurate?

@WillAyd
Copy link
Member

WillAyd commented Feb 20, 2020

@alichaudry if you can merge master to get CI green can take another look

@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

closing as stale if you want to continue, please open a new PR.

@jreback jreback closed this Jun 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.groupby seems to mutate my pd.Grouper in-place
6 participants