REF: Move Categorical logic from Grouping (#46203) #46207

NickCrews · 2022-03-03T00:04:28Z

closes REF: Refactor Grouping class to use delegates/polymorphism #46203
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

pep8speaks · 2022-03-03T05:17:12Z

Hello @NickCrews! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-03-03 20:48:50 UTC

NickCrews · 2022-03-03T22:43:23Z

The two failed CI runs are unrelated to this change:

Building docs failed because of a network error trying to download an intersphinx inventory
ASV failed because of Benchmark for indexing with .loc for sorted/unsorted DatetimeIndex #46193

So this is passing all the relevant checks.

NickCrews · 2022-03-03T22:46:05Z

pandas/core/groupby/categorical.py

+        self, cat: Categorical
+    ) -> tuple[npt.NDArray[np.signedinteger], Categorical]:
+        """
+        This does a version of `algorithms.factorize()` that works on categoricals.


Perhaps should just rename this (and the grouping.codes_and_uniques()) to factorize() to be more consistent, since really these are just wrappers for that.

seems reasonable

Addressed this in #46258
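
As a rough illustration of why "factorize" is a fitting name here, the sketch below uses only the public `pd.factorize` on a `Categorical`. The sample values are made up, and this is not the code from the PR. It shows that a categorical's own `.codes` index into its categories, while factorizing yields codes that index into the uniques in order of first appearance.

```python
import pandas as pd

# Illustrative data: "d" is a declared category that never occurs.
cat = pd.Categorical(["b", "a", "b", "c"], categories=["a", "b", "c", "d"])

# Codes stored on the categorical: positions within cat.categories.
cat_codes = cat.codes

# Public factorize: codes index into `uniques`, which follows
# order of first appearance because sort defaults to False.
codes, uniques = pd.factorize(cat)

print(list(cat_codes))  # positions in ["a", "b", "c", "d"]
print(list(codes))      # positions in the uniques
print(list(uniques))
```

The two code arrays differ even though they describe the same values, which is the distinction the wrapper methods have to manage.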

NickCrews · 2022-03-03T22:48:44Z

pandas/core/groupby/categorical.py

+            sort=sort,
+        )
+
+    def codes_and_uniques(


This whole function overlaps a ton with recode_for_groupby, maybe even doing the exact same thing. Maybe they should get unified? In this refactor, though, I just moved code; I didn't want to change any content yet.

Also, perhaps recode_for_groupby and recode_from_groupby should just get absorbed into this class as methods?
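
For readers less familiar with what this categorical handling affects, here is a small user-level sketch of the `observed` keyword in `DataFrame.groupby`. The frame and column names are invented for illustration, and the example says nothing about how `recode_for_groupby` is implemented internally.

```python
import pandas as pd

# Illustrative frame: category "b" is declared but never observed.
df = pd.DataFrame(
    {
        "key": pd.Categorical(["a", "c", "a"], categories=["a", "b", "c"]),
        "val": [1, 2, 3],
    }
)

# observed=False keeps every declared category in the result index.
all_cats = df.groupby("key", observed=False)["val"].sum()

# observed=True keeps only the categories that actually occur.
seen_only = df.groupby("key", observed=True)["val"].sum()

print(all_cats)
print(seen_only)
```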

NickCrews · 2022-03-03T22:49:18Z

pandas/core/groupby/categorical.py

+        )
+        return cls(
+            original_grouping_vector=original_grouping_vector,
+            new_grouping_vector=new_grouping_vector,


Not totally happy with these names

NickCrews · 2022-03-03T22:50:35Z

pandas/core/groupby/grouper.py

@@ -461,8 +459,7 @@ class Grouping:

    _codes: npt.NDArray[np.signedinteger] | None = None
    _group_index: Index | None = None
-    _passed_categorical: bool
-    _all_grouper: Categorical | None
+    _cat_grouper: CategoricalGrouper | None = None


This is the big benefit of this PR: It condenses all of the other instance variables such as _all_grouper into this one container object, since those variables are irrelevant if we aren't dealing with categoricals.

NickCrews · 2022-03-03T22:54:07Z

pandas/core/groupby/grouper.py

-        elif self._all_grouper is not None:
+        elif (
+            self._cat_grouper is not None
+            and self._cat_grouper.original_grouping_vector is not None


not sure I like how we are reaching inside the _cat_grouper to ask it about original_grouping_vector. Maybe should make it more opaque like in result_index():

if self._cat_grouper is not None: return self._cat_grouper.group_arraylike(?????)
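
A minimal sketch of the "more opaque" shape suggested above, in plain Python. Every name here (`CatDelegateSketch`, `GroupingSketch`, the `fallback` parameter) is hypothetical and chosen only for illustration; the comment above leaves the real arguments of `group_arraylike` open, and this does not fill them in.

```python
from __future__ import annotations

from typing import Any, Optional


class CatDelegateSketch:
    """Hypothetical stand-in for the categorical container."""

    def __init__(self, original: Optional[Any]) -> None:
        self._original = original

    def group_arraylike(self, fallback: Any) -> Any:
        # The delegate decides what to return; callers never inspect
        # its attributes directly. `fallback` is an assumed parameter.
        return self._original if self._original is not None else fallback


class GroupingSketch:
    """Hypothetical stand-in for Grouping."""

    def __init__(self, values: Any, cat: Optional[CatDelegateSketch] = None) -> None:
        self._values = values
        self._cat = cat

    def group_arraylike(self) -> Any:
        # One None check, then delegate, instead of reaching inside.
        if self._cat is not None:
            return self._cat.group_arraylike(self._values)
        return self._values


plain = GroupingSketch(["x", "y"])
with_cat = GroupingSketch(["x", "y"], CatDelegateSketch(original=["a", "b", "c"]))
empty_cat = GroupingSketch(["x", "y"], CatDelegateSketch(original=None))
```

The trade-off is one extra method on the delegate in exchange for `GroupingSketch` no longer needing to know which attributes the delegate stores.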

NickCrews · 2022-03-03T22:59:43Z

pandas/core/groupby/grouper.py

-            )
-            return cat.codes, uniques
-
+    def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], AnyArrayLike]:


Notice the type annotation change here.

I'm not sure why mypy was happy before, since algorithms.factorize returns np.ndarray | Index for uniques, and an Index does not satisfy ArrayLike.

If uniques is typed as merely ArrayLike, mypy gets mad with the result from algorithms.factorize:
error: Incompatible types in assignment (expression has type "Union[ndarray[Any, Any], Index]", variable has type "Union[ExtensionArray, ndarray[Any, Any]]") [assignment]

Not totally sure this is the best solution. Alternatively, we could keep the return type as ArrayLike and cast the result from algorithms.factorize.

Looks like the factorize docstring/annotation is inaccurate. It can return ArrayLike, and I'm pretty sure it does in the case(s) here.
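
To make the typing question concrete, the sketch below uses the public `pd.factorize` (not the internal `algorithms.factorize` call from the diff) to show that the runtime type of `uniques` depends on the input, and what the cast alternative could look like. Casting to `np.ndarray` is an assumption made for illustration; the thread itself talks about pandas' internal `ArrayLike` alias.

```python
from typing import cast

import numpy as np
import pandas as pd

# An ndarray input gives ndarray uniques.
codes_a, uniques_a = pd.factorize(np.array(["x", "y", "x"], dtype=object))

# An Index input gives Index uniques.
codes_i, uniques_i = pd.factorize(pd.Index(["x", "y", "x"]))

print(type(uniques_a).__name__, type(uniques_i).__name__)

# The cast alternative: tell the type checker which branch applies.
# typing.cast performs no runtime conversion or validation.
uniques_narrow = cast(np.ndarray, uniques_a)
```

A cast keeps the narrower return annotation but moves the burden of being right onto the author, whereas widening the annotation pushes the union out to every caller.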

NickCrews · 2022-03-03T23:02:22Z

pandas/core/groupby/categorical.py

+
+    @classmethod
+    def make(cls, grouping_vector, sort: bool, observed: bool) -> CategoricalGrouper:
+        new_grouping_vector, original_grouping_vector = recode_for_groupby(


instead of saving new_grouping_vector as an instance variable, could just return it from this method, since it isn't ever needed by this class.

why 'make' instead of just __init__?

NickCrews · 2022-03-03T23:03:29Z

pandas/core/groupby/grouper.py

                self.grouping_vector, sort, observed
            )
-
+            self.grouping_vector = self._cat_grouper.new_grouping_vector


See comment above, could change this to just self._cat_grouper, self.grouping_vector = CategoricalGrouper.make(...)
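
The tuple-returning factory proposed here could look roughly like the sketch below. `CatGrouperSketch` is a hypothetical name, and the public `Categorical.remove_unused_categories()` is swapped in as a simple stand-in for `recode_for_groupby`, so the behavior is only loosely analogous to the real function.

```python
from __future__ import annotations

from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class CatGrouperSketch:
    """Hypothetical container that keeps only the original categorical."""

    original: pd.Categorical

    @classmethod
    def make(
        cls, values: pd.Categorical, observed: bool
    ) -> tuple[CatGrouperSketch, pd.Categorical]:
        # Stand-in for recode_for_groupby: drop unobserved categories
        # when observed=True, otherwise leave the values untouched.
        recoded = values.remove_unused_categories() if observed else values
        # The recoded vector is returned rather than stored, since the
        # container itself never needs it.
        return cls(original=values), recoded


cat = pd.Categorical(["a", "c", "a"], categories=["a", "b", "c"])
grouper, grouping_vector = CatGrouperSketch.make(cat, observed=True)
print(list(grouper.original.categories), list(grouping_vector.categories))
```

Returning a pair is also one reason to use a classmethod here: `__init__` cannot return anything besides the instance.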

NickCrews · 2022-03-04T23:06:36Z

@jbrockmendel, if you get the chance for a review/initial glance, that would be awesome! Thanks.

jreback

this is a partial kludge. CategoricalGrouper should be on the same level (and inherit from) BaseGrouper. Then this would make much more sense.

jreback · 2022-03-05T17:09:29Z

pandas/core/groupby/grouper.py

                self.grouping_vector, sort, observed
            )
-
+            self.grouping_vector = self._cat_grouper.new_grouping_vector


why is this the way you would do it? it seems much more natural to have self.grouping_vector itself be a class that can be handled, similar to BaseGrouper (in fact you should inherit from BaseGrouper)

Thank you for the review @jreback. I'll admit this felt like a kludge. I tried to grok how all the groupby logic interacts, but I mostly failed. There's a BaseGrouper as well as Grouper, but Grouper doesn't inherit from BaseGrouper? What's the difference between a Grouping and a Grouper? A lot of outdated docstrings and lack of typing annotations (and just plain loose typing, eg grouping_vector is sometimes not at all like a vector?) didn't help either.

Even though my choice of name might make it appear otherwise, I wasn't really trying to fit it into any type hierarchy. It is just a container for a few bits of data with some helper methods.

I would love to clean this so it makes more sense, but I've been pounding my head on the wall long enough that I think I will need to give up without a bit of guidance. If you're willing to give a summary of how all the classes are intended to interact, then I can try to fix this up. But, I understand you're very busy so if you don't have time to onboard someone new then I get it. Thank you!

> There's a BaseGrouper as well as Grouper, but Grouper doesn't inherit from BaseGrouper? What's the difference between a Grouping and a Grouper?

You're right that the hierarchy/naming is confusing. If you can come up with better names (for the not-public-API items) please go for it.

jbrockmendel · 2022-03-06T00:33:03Z

I expect this was a useful exercise for getting to know this piece of the code, but it isn't clear that this refactor is an improvement.

Grouping._codes_and_uniques and BaseGrouper._get_compressed_codes were both semantically doing the same thing as core.algorithms.factorize(), so rename them so that it's just a bit easier to follow. Per a review comment in pandas-dev#46207

NickCrews · 2022-03-08T00:52:47Z

> I expect this was a useful exercise for getting to know this piece of the code, but it isn't clear that this refactor is an improvement.

Exactly what I thought :) Sounds good, closing this.

NickCrews force-pushed the grouping-refactor branch 2 times, most recently from 5481324 to 7a3ec4b Compare March 3, 2022 05:17

NickCrews force-pushed the grouping-refactor branch 2 times, most recently from 5f2009a to e7f6b0f Compare March 3, 2022 15:01

REF: Move Categorical logic from Grouping (pandas-dev#46203)

23aafc7

NickCrews force-pushed the grouping-refactor branch from e7f6b0f to 23aafc7 Compare March 3, 2022 20:48

NickCrews marked this pull request as ready for review March 3, 2022 22:43

NickCrews mentioned this pull request Mar 3, 2022

REF: Refactor Grouping class to use delegates/polymorphism #46203

Closed

jreback requested changes Mar 5, 2022

jreback added Categorical Categorical Data Type Groupby Refactor Internal refactoring of code labels Mar 5, 2022

NickCrews closed this Mar 8, 2022
