Remove blocks from GroupBy Code #28782
Conversation
Is the idea to operate column-by-column instead of block-by-block?
finally:
    if result is not no_result:
        # see if we can cast the block back to the original dtype
        result = maybe_downcast_numeric(result, block.dtype)
Troubleshooting this chunk of code may be irrelevant if it's getting ripped out, but FWIW: there are cases, particularly via L194, where we get here with a result that is a DataFrame instead of an ndarray/EA. In those cases, the make_block call below raises.
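For context on the snippet above, here is a minimal sketch of what the downcast step does. maybe_downcast_numeric is internal pandas API, so the exact behavior shown is an assumption based on this era of the codebase:

import numpy as np
from pandas.core.dtypes.cast import maybe_downcast_numeric  # internal API

# A float64 result whose values are all integral can be safely cast back
# to the block's original int64 dtype; otherwise the input comes back as-is.
result = np.array([1.0, 2.0, 3.0])
print(maybe_downcast_numeric(result, np.dtype("int64")).dtype)  # int64
print(maybe_downcast_numeric(np.array([1.5]), np.dtype("int64")))  # [1.5], unchanged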
I think that's what it will end up being. There are a lot of code paths in groupby that can be simplified, and I think that iteration over elements is inconsistent: this is block iteration, and we separately have iteration by Series or DataFrame elements (depending on whether or not there are duplicate column labels), which makes the code harder to reason about and buggy. So not going strictly column-by-column in this PR yet; just removing what I think is a one-off code path to make things easier, and will leave the rest for follow-ups.
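To make the block-vs-column distinction concrete, a small illustration (._data is the internal BlockManager in pandas of this era, so treat this as a sketch):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "c": [5, 6]})

# Block-wise: values are grouped by dtype, so "a" and "c" share one int64 block.
for blk in df._data.blocks:
    print(blk.dtype, blk.shape)

# Column-wise: one pass per column, regardless of dtype.
for name, col in df.items():
    print(name, col.dtype)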
Force-pushed from aace087 to 32d1b6b.
OK, I think this is ready for review. I was surprised by some of the seemingly unrelated changes I had to make to get this working, but I have added commentary where appropriate to explain.
The net benefit here is getting block management out of GroupBy and reducing the number of code paths. I think this should make it easier to address a few of the follow-up issues mentioned in comments, and hopefully make the output of GroupBy more predictable.
@@ -692,7 +692,7 @@ def __iter__(self):
         Generator yielding sequence of (name, subsetted object)
         for each group
         """
-        return self.grouper.get_iterator(self.obj, axis=self.axis)
+        return self.grouper.get_iterator(self._selected_obj, axis=self.axis)
It was rather surprising that I had to change this here, but getting rid of the block code seemed to change the behavior of the following snippet:

In [14]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["g", "a", "a"])

In [15]: df.groupby("g").transform("first")
Out[15]:
   a  a
0  2  3
1  5  6

Without referencing self._selected_obj, this would still include the grouping as part of the output (this change might close some open issues; will check later).
That is surprising. Is it possible that it is wrong now but untested? (i.e. could be fixed independently)
@@ -300,7 +300,7 @@ def test_observed(observed):
     exp_index = CategoricalIndex(
         list("ab"), name="cat", categories=list("abc"), ordered=True
     )
-    expected = DataFrame({"ints": [1.5, 1.5], "val": [20.0, 30]}, index=exp_index)
+    expected = DataFrame({"ints": [1.5, 1.5], "val": [20, 30]}, index=exp_index)
This was coercing to float with the block code, but I don't think that was necessarily desired; the values fit in an int with the new implementation.
Though we might not want this either. Here is code to reproduce:

In [7]: d = {
   ...:     "ints": [1, 1, 2, 2],
   ...:     "val": [10, 20, 30, 40],
   ...: }
   ...: df = pd.DataFrame(d)

In [8]: df.groupby(list("abab")).mean()
Out[8]:
   ints  val
a   1.5   20
b   1.5   30

A typical call to mean on a Series doesn't preserve the int type.
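For comparison, the plain Series behavior (a quick check, not from the PR itself):

>>> s = pd.Series([10, 20, 30, 40])
>>> s.mean()  # comes back as a float even for int64 input
25.0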
My intuition is that small changes like this are inevitable when changing from block-wise to column-wise. This definitely happened with the arithmetic changeover.
@@ -179,7 +179,13 @@ def test_arg_passthru():
     tm.assert_index_equal(result.columns, expected_columns_numeric)

+    result = f(numeric_only=False)
+    tm.assert_frame_equal(result.reindex_like(expected), expected)
+
+    # TODO: median isn't implemented for DTI but was working blockwise before?
Surprised by this as well. I guess the by-block iteration would bypass some of the checks that were put into a DTA, so while this doesn't work on master:
>>> df = pd.DataFrame([pd.Timestamp("2000-01-01"), pd.Timestamp("2000-01-01")])
>>> df[0].median()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-c894fad0307b> in <module>
----> 1 df[0].median()
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
11616 return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
11617 return self._reduce(
> 11618 f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
11619 )
11620
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
4097 numeric_only=numeric_only,
4098 filter_type=filter_type,
-> 4099 **kwds
4100 )
4101
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/base.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
1218 raise TypeError(
1219 "{klass} cannot perform the operation {op}".format(
-> 1220 klass=self.__class__.__name__, op=name
1221 )
1222 )
TypeError: DatetimeIndex cannot perform the operation median
This somehow did work:
>>> df.groupby([1, 1]).median(numeric_only=False)
0
1 2000-01-01
Whether or not that is intentional, I'm not sure, so cc @jbrockmendel
Not sure off the top of my head what is going on here, but I can say that DTA/DTI should have median. #28165 implements reductions for TDA; waiting to get that merged before doing DTA and other reductions.
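For what it's worth, a manual equivalent of the datetime median (a sketch going through the int64 nanosecond view; not the eventual DTA implementation from #28165):

>>> import numpy as np
>>> import pandas as pd
>>> ser = pd.Series(pd.to_datetime(["2000-01-01", "2000-01-03"]))
>>> median_ns = np.median(ser.astype("int64"))  # median of the ns-since-epoch values
>>> pd.Timestamp(int(median_ns))                # converted back to a Timestamp
Timestamp('2000-01-02 00:00:00')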
Yeah, I just saw the update to that come through. I'll try to give it another look review-wise tomorrow. Might be the precursor we need for this.
> Might be the precursor we need for this
Any thoughts on how this interacts with the ongoing catch-less project? I'm pretty sure the block-wise method this rips out is involved in a lot of that catching.
Tough to say, but I think it can only help. Generally, a lot of the blanket Exception catching and state management in GroupBy makes it one of the harder parts of the code base to digest, so limiting the ways in which we construct results here should make things easier in the future (there is a lot more to be done).
-        counter = partial(lib.count_level_2d, labels=ids, max_bin=ngroups, axis=1)
-        blk = map(make_block, map(counter, val), loc)
+        for index, (name, obj) in enumerate(iter_obj.items()):
This isn't the ideal way to do this, but to keep the scope of the PR minimal and get tests passing, I put this iteration directly in this method. This really should be shared by a few aggregation functions, but I believe it is currently blocked by #25610 and #21668, amongst possibly others. Would prefer to address it comprehensively in a follow-up.
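As a rough illustration of the column-by-column pattern the loop above adopts (the function name and structure here are hypothetical, not the actual pandas internals):

import pandas as pd

def column_wise_count(df: pd.DataFrame, key: str) -> pd.DataFrame:
    # Reduce each non-key column independently instead of per dtype block,
    # so mixed dtypes never have to share a code path.
    grouped = df.groupby(key)
    pieces = [grouped[name].count() for name in df.columns if name != key]
    return pd.concat(pieces, axis=1)

df = pd.DataFrame({"g": [1, 1, 2], "a": [1.0, None, 3.0], "b": ["x", None, "z"]})
print(column_wise_count(df, "g"))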
Any idea of the perf impact? Presumably wide-block cases will be the hardest hit, right?
I think generally so, yes, but unfortunately it will depend on the function (some don't go through the block management code). Once I get green in CI I will run some tests, and maybe add a few if coverage is bare, so I will have something more tangible for you then.
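A rough probe of the wide-block case mentioned above (illustrative only; the shape, group count, and reduction are arbitrary choices):

In [1]: import numpy as np; import pandas as pd

In [2]: wide = pd.DataFrame(np.random.randn(1000, 1000))  # one wide float64 block

In [3]: key = np.random.randint(0, 100, size=1000)

In [4]: %timeit wide.groupby(key).count()  # compare master vs. this branch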
Notes
-----
Ideally output should always have integers as a key and the col_labels
Added this as a note for now because I didn't want duplicate labels to complicate the diff, but this should be bundled with a more comprehensive handling of iteration, which I think will need to go column-by-column.

Here's an example where we iterate by "slice" (pandas/core/groupby/groupby.py, line 2263 in 558f754):

for name, obj in self._iterate_slices():

But a slice can still be a DataFrame in the case of duplicated column labels, and especially if those duplicate labels don't have a contiguous dtype, that can make things messy. I think we need to comprehensively change the iteration in these modules and ensure consistency.
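The duplicate-label wrinkle is easy to see directly (a quick check, not from the PR):

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3]], columns=["g", "a", "a"])
>>> type(df["g"])   # a unique label slices to a Series
<class 'pandas.core.series.Series'>
>>> type(df["a"])   # a duplicated label slices to a DataFrame
<class 'pandas.core.frame.DataFrame'>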
This was a good exercise, but I think we need a few precursors before coming back to this. Depending on how hard it is to rebase, this may ultimately end up as a new PR, but we will see.
Definitely +1 on the concept. Pending the performance hit, I'm looking forward to seeing this make a comeback.
The more I work in/around this method, the more I think it needs to DIAF
Just trying to simplify things...
Currently fails (locally at least) with 5 tests, which interestingly enough seem mostly related to mean. Need to debug further, but submitting now for feedback.