Skip to content

REF: Move Categorical logic from Grouping (#46203) #46207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Mar 3, 2022

@NickCrews NickCrews force-pushed the grouping-refactor branch 2 times, most recently from 5481324 to 7a3ec4b Compare March 3, 2022 05:17
@pep8speaks
Copy link

pep8speaks commented Mar 3, 2022

Hello @NickCrews! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-03-03 20:48:50 UTC

@NickCrews NickCrews force-pushed the grouping-refactor branch 2 times, most recently from 5f2009a to e7f6b0f Compare March 3, 2022 15:01
@NickCrews NickCrews force-pushed the grouping-refactor branch from e7f6b0f to 23aafc7 Compare March 3, 2022 20:48
@NickCrews
Copy link
Contributor Author

The two failed CI runs are unrelated to this change:

So this is passing all the relevant checks.

@NickCrews NickCrews marked this pull request as ready for review March 3, 2022 22:43
self, cat: Categorical
) -> tuple[npt.NDArray[np.signedinteger], Categorical]:
"""
This does a version of `algorithms.factorize()` that works on categoricals.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps should just rename this (and the grouping.codes_and_uniques()) to factorize() to be more consistent, since really these are just wrappers for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems reasonable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in #46258

sort=sort,
)

def codes_and_uniques(
Copy link
Contributor Author

@NickCrews NickCrews Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole function overlaps a ton with recode_for_groupby, maybe even doing the exact same thing. Maybe should get unified? In this refactor though I just moved code though, didn't want to change any content yet.

Also, perhaps recode_for_groupby and recode_from_groupby should just get absorbed into this class as methods?

)
return cls(
original_grouping_vector=original_grouping_vector,
new_grouping_vector=new_grouping_vector,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not totally happy with these names

@@ -461,8 +459,7 @@ class Grouping:

_codes: npt.NDArray[np.signedinteger] | None = None
_group_index: Index | None = None
_passed_categorical: bool
_all_grouper: Categorical | None
_cat_grouper: CategoricalGrouper | None = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the big benefit of this PR: It condenses all of the other instance variables such as _all_grouper into this one container object, since those variables are irrelevant if we aren't dealing with categoricals.

elif self._all_grouper is not None:
elif (
self._cat_grouper is not None
and self._cat_grouper.original_grouping_vector is not None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I like how we are reaching inside the _cat_grouper to ask it about original_grouping_vector. Maybe should make it more opaque like in result_index():

if self._cat_grouper is not None:
            return self._cat_grouper.group_arraylike(?????)

)
return cat.codes, uniques

def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], AnyArrayLike]:
Copy link
Contributor Author

@NickCrews NickCrews Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice the type annotation change here.

I'm not sure why mypy was happy before, since algorithms.factorize returns np.ndarray | Index for uniques, and an Index does not satisfy ArrayLike.

uniques is typed as merely ArrayLike, mypy gets mad with the result from algorithms.factorize:
error: Incompatible types in assignment (expression has type "Union[ndarray[Any, Any], Index]", variable has type "Union[ExtensionArray, ndarray[Any, Any]]") [assignment]

Not totally sure this is the best solution, alternatively keep the return type as ArrayLike and do a cast with the result from algorithms.factorize.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the factorize docstring/annotation is inaccurate. it can return ArrayLike, and im pretty sure does in the case(s) here


@classmethod
def make(cls, grouping_vector, sort: bool, observed: bool) -> CategoricalGrouper:
new_grouping_vector, original_grouping_vector = recode_for_groupby(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of saving new_grouping_vector as an instance variable, could just return it from this method, since it isn't ever needed by this class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 'make' instead of just __init__?

self.grouping_vector, sort, observed
)

self.grouping_vector = self._cat_grouper.new_grouping_vector
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above, could change this to just self._cat_grouper, self.grouping_vector = CategoricalGrouper.make(...)

@NickCrews
Copy link
Contributor Author

@jbrockmendel, if you get the chance for a review/initial glance, that would be awesome! Thanks.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a partial kludge. CategoricalGrouper should be on the same level (and inherit from) BaseGrouper. Then this would make much more sense.

self.grouping_vector, sort, observed
)

self.grouping_vector = self._cat_grouper.new_grouping_vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this the way you would do it? it seems much more natural to have self.groupbyin_vector itself returr be a class that can be handled, similar to BaseGrouper (in fact you should inherit from BaseGrouper)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review @jreback. I'll admit this felt like a kludge. I tried to grok how all the groupby logic interacts, but I mostly failed. There's a BaseGrouper as well as Grouper, but Grouper doesn't inherit from BaseGrouper? What's the difference between a Grouping and a Grouper? A lot of outdated docstrings and lack of typing annotations (and just plain loose typing, eg grouping_vector is sometimes not at all like a vector?) didn't help either.

Even though my choice of name might make it appear otherwise, I wasn't really trying to fit it into any type hierarchy. It is just a container for a few bits of data with some help methods.

I would love to clean this so it makes more sense, but I've been pounding my head on the wall long enough that I think I will need to give up without a bit of guidance. If you're willing to give a summary of how all the classes are intended to interact, then I can try to fix this up. But, I understand you're very busy so if you don't have time to onboard someone new then I get it. Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a BaseGrouper as well as Grouper, but Grouper doesn't inherit from BaseGrouper? What's the difference between a Grouping and a Grouper?

You're right that the hierarchy/naming is confusing. If you can come up with better names (for the not-public-API items) please go for it.

@jreback jreback added Categorical Categorical Data Type Groupby Refactor Internal refactoring of code labels Mar 5, 2022
@jbrockmendel
Copy link
Member

I expect this was a useful exercise for getting to know this piece of the code, but it isn't clear that this refactor is an improvement.

NickCrews added a commit to NickCrews/pandas that referenced this pull request Mar 8, 2022
Grouping._codes_and_uniques and BaseGrouper._get_compressed_codes
were both semantically doing the same thing as
core.algorithms.factorize(), so rename them so that it's just a bit
easier to follow.

Per a review comment in pandas-dev#46207
@NickCrews
Copy link
Contributor Author

I expect this was a useful exercise for getting to know this piece of the code, but it isn't clear that this refactor is an improvement.

Exactly what I thought :) Sounds good, closing this.

@NickCrews NickCrews closed this Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REF: Refactor Grouping class to use delegates/polymorphism
4 participants