-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: Move Categorical logic from Grouping (#46203) #46207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5481324
to
7a3ec4b
Compare
Hello @NickCrews! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-03-03 20:48:50 UTC |
5f2009a
to
e7f6b0f
Compare
e7f6b0f
to
23aafc7
Compare
The two failed CI runs are unrelated to this change:
So this is passing all the relevant checks. |
self, cat: Categorical | ||
) -> tuple[npt.NDArray[np.signedinteger], Categorical]: | ||
""" | ||
This does a version of `algorithms.factorize()` that works on categoricals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps should just rename this (and the grouping.codes_and_uniques()
) to factorize()
to be more consistent, since really these are just wrappers for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems reasonable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed this in #46258
sort=sort, | ||
) | ||
|
||
def codes_and_uniques( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole function overlaps a ton with recode_for_groupby
, maybe even doing the exact same thing. Maybe should get unified? In this refactor though I just moved code though, didn't want to change any content yet.
Also, perhaps recode_for_groupby
and recode_from_groupby
should just get absorbed into this class as methods?
) | ||
return cls( | ||
original_grouping_vector=original_grouping_vector, | ||
new_grouping_vector=new_grouping_vector, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not totally happy with these names
@@ -461,8 +459,7 @@ class Grouping: | |||
|
|||
_codes: npt.NDArray[np.signedinteger] | None = None | |||
_group_index: Index | None = None | |||
_passed_categorical: bool | |||
_all_grouper: Categorical | None | |||
_cat_grouper: CategoricalGrouper | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the big benefit of this PR: It condenses all of the other instance variables such as _all_grouper into this one container object, since those variables are irrelevant if we aren't dealing with categoricals.
elif self._all_grouper is not None: | ||
elif ( | ||
self._cat_grouper is not None | ||
and self._cat_grouper.original_grouping_vector is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I like how we are reaching inside the _cat_grouper to ask it about original_grouping_vector. Maybe should make it more opaque like in result_index()
:
if self._cat_grouper is not None:
return self._cat_grouper.group_arraylike(?????)
) | ||
return cat.codes, uniques | ||
|
||
def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], AnyArrayLike]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice the type annotation change here.
I'm not sure why mypy was happy before, since algorithms.factorize returns np.ndarray | Index
for uniques
, and an Index
does not satisfy ArrayLike
.
uniques is typed as merely ArrayLike, mypy gets mad with the result from algorithms.factorize:
error: Incompatible types in assignment (expression has type "Union[ndarray[Any, Any], Index]", variable has type "Union[ExtensionArray, ndarray[Any, Any]]") [assignment]
Not totally sure this is the best solution, alternatively keep the return type as ArrayLike and do a cast with the result from algorithms.factorize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the factorize docstring/annotation is inaccurate. it can return ArrayLike, and im pretty sure does in the case(s) here
|
||
@classmethod | ||
def make(cls, grouping_vector, sort: bool, observed: bool) -> CategoricalGrouper: | ||
new_grouping_vector, original_grouping_vector = recode_for_groupby( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of saving new_grouping_vector as an instance variable, could just return it from this method, since it isn't ever needed by this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 'make' instead of just __init__
?
self.grouping_vector, sort, observed | ||
) | ||
|
||
self.grouping_vector = self._cat_grouper.new_grouping_vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above, could change this to just self._cat_grouper, self.grouping_vector = CategoricalGrouper.make(...)
@jbrockmendel, if you get the chance for a review/initial glance, that would be awesome! Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a partial kludge. CategoricalGrouper should be on the same level (and inherit from) BaseGrouper. Then this would make much more sense.
self.grouping_vector, sort, observed | ||
) | ||
|
||
self.grouping_vector = self._cat_grouper.new_grouping_vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this the way you would do it? it seems much more natural to have self.groupbyin_vector
itself returr be a class that can be handled, similar to BaseGrouper (in fact you should inherit from BaseGrouper)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review @jreback. I'll admit this felt like a kludge. I tried to grok how all the groupby logic interacts, but I mostly failed. There's a BaseGrouper as well as Grouper, but Grouper doesn't inherit from BaseGrouper? What's the difference between a Grouping and a Grouper? A lot of outdated docstrings and lack of typing annotations (and just plain loose typing, eg grouping_vector is sometimes not at all like a vector?) didn't help either.
Even though my choice of name might make it appear otherwise, I wasn't really trying to fit it into any type hierarchy. It is just a container for a few bits of data with some help methods.
I would love to clean this so it makes more sense, but I've been pounding my head on the wall long enough that I think I will need to give up without a bit of guidance. If you're willing to give a summary of how all the classes are intended to interact, then I can try to fix this up. But, I understand you're very busy so if you don't have time to onboard someone new then I get it. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a BaseGrouper as well as Grouper, but Grouper doesn't inherit from BaseGrouper? What's the difference between a Grouping and a Grouper?
You're right that the hierarchy/naming is confusing. If you can come up with better names (for the not-public-API items) please go for it.
I expect this was a useful exercise for getting to know this piece of the code, but it isn't clear that this refactor is an improvement. |
Grouping._codes_and_uniques and BaseGrouper._get_compressed_codes were both semantically doing the same thing as core.algorithms.factorize(), so rename them so that it's just a bit easier to follow. Per a review comment in pandas-dev#46207
Exactly what I thought :) Sounds good, closing this. |
Grouping
class to use delegates/polymorphism #46203 (Replace xxxx with the Github issue number)doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.