ENH: Implement groupby.sample #34069
Implementation looks good. Just some doc comments.
cc @TomAugspurger @jorisvandenbossche if you'd have a look
pandas/core/groupby/groupby.py
Outdated
else:
    ws = [None] * self.ngroups

if random_state:
I don't think this is enough; you always need a random_state here that is consistent across the entire groupby.
I think either is fine. Either we get a random state from NumPy's global random state initially and re-use it, or we have each group draw from the global random state pool. It's similar to these two calls:
.sample(random_state=0)
# each call uses the seed 0
.sample(random_state=np.random.RandomState(0))
# each call makes an independent draw
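To make the difference between those two calls concrete, here is a small sketch using `DataFrame.sample`. The frame `df` and the variable names are made up for illustration; the point is only that an integer seed restarts the stream on every call, while a shared `RandomState` object advances between calls.

```python
import numpy as np
import pandas as pd

# Illustrative frame; not taken from the PR.
df = pd.DataFrame({"x": range(100)})

# Integer seed: every call starts from the same seed, so the draws repeat.
a = df.sample(n=5, random_state=0)
b = df.sample(n=5, random_state=0)
same_with_seed = a.index.equals(b.index)

# Shared RandomState: the object's internal state advances after each call,
# so the second call makes an independent draw.
rs = np.random.RandomState(0)
c = df.sample(n=5, random_state=rs)
d = df.sample(n=5, random_state=rs)
same_with_shared = c.index.equals(d.index)
```

Applied to a groupby, the first style would give every group the same stream, while the second would let each group draw independently from one reproducible stream.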
I actually meant to make this `random_state is not None` (didn't consider other "falsey" values).
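A quick sketch of why the truthiness check is a problem: the seed `0` is a valid `random_state` but is falsy, so `if random_state:` would silently ignore it. Both helper functions below are hypothetical and exist only to contrast the two checks.

```python
def seed_is_used_truthy(random_state=None):
    # Hypothetical: mirrors `if random_state:` from the outdated diff.
    return bool(random_state)


def seed_is_used_explicit(random_state=None):
    # Hypothetical: mirrors the intended `random_state is not None`.
    return random_state is not None


truthy_zero = seed_is_used_truthy(0)      # False: the seed 0 is skipped
explicit_zero = seed_is_used_explicit(0)  # True: the seed 0 is honored
explicit_none = seed_is_used_explicit()   # False: nothing was passed
```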
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -275,6 +275,7 @@ Other enhancements
such as ``dict`` and ``list``, mirroring the behavior of :meth:`DataFrame.update` (:issue:`33215`)
- :meth:`~pandas.core.groupby.GroupBy.transform` and :meth:`~pandas.core.groupby.GroupBy.aggregate` has gained ``engine`` and ``engine_kwargs`` arguments that supports executing functions with ``Numba`` (:issue:`32854`, :issue:`33388`)
- :meth:`~pandas.core.resample.Resampler.interpolate` now supports SciPy interpolation method :class:`scipy.interpolate.CubicSpline` as method ``cubicspline`` (:issue:`33670`)
- :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` now implement the ``sample`` method for doing random sampling within groups (:issue:`31775`)
Need the full path to these classes in the docs.
pandas/core/groupby/groupby.py
Outdated
the underlying object and will be used as sampling probabilities
after normalization within each group.
random_state : int, array-like, BitGenerator, np.random.RandomState, optional
    If int, array-like, or BitGenerator (NumPy>=1.17), seed for
If it is a BitGenerator, do you use a Generator to produce the random samples or a RandomState? Best practice is to use a Generator since RandomState is effectively frozen in time. If an int, is it used as a seed for np.random.default_rng() or for RandomState if NumPy >= 1.17?
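For readers unfamiliar with the distinction raised here, the sketch below shows the NumPy objects involved (NumPy >= 1.17 assumed). It only illustrates that the same kind of BitGenerator can back either the legacy `RandomState` or the newer `Generator`; it makes no claim about which one the PR ends up using.

```python
import numpy as np

# A BitGenerator can be wrapped by the legacy RandomState ...
legacy = np.random.RandomState(np.random.MT19937(0))

# ... or by the newer Generator interface.
modern = np.random.Generator(np.random.MT19937(0))

# An int passed to default_rng() seeds a Generator (not a RandomState).
from_int = np.random.default_rng(0)

# Both interfaces offer `choice`; their algorithms are not guaranteed to
# produce the same values even when seeded identically.
legacy_draw = legacy.choice(10, size=3, replace=False)
modern_draw = modern.choice(10, size=3, replace=False)
```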
This is following a pattern similar to the one used in pandas.core.generic.sample of processing the random_state according to pandas.core.common.random_state:
Line 394 in c71bfc3
def random_state(state=None):
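As a rough illustration of the normalization pattern being referred to, here is a simplified sketch. `normalize_random_state` is a hypothetical name and this is not the pandas source; in particular, returning a fresh unseeded `RandomState` for `None` is an assumption made to keep the sketch self-contained.

```python
import numpy as np


def normalize_random_state(state=None):
    """Hypothetical helper: coerce the accepted inputs to a RandomState."""
    if isinstance(state, np.random.RandomState):
        # Already a RandomState: pass it through so its stream is shared.
        return state
    if state is None:
        # Assumption for this sketch: fall back to a fresh, unseeded state.
        return np.random.RandomState()
    if isinstance(state, (int, np.integer, list, tuple, np.ndarray)):
        # int or array-like: use it as a seed.
        return np.random.RandomState(state)
    if isinstance(state, np.random.BitGenerator):
        # BitGenerator (NumPy >= 1.17): wrap it in the legacy interface.
        return np.random.RandomState(state)
    raise ValueError(
        "random_state must be an int, array-like, BitGenerator, "
        "RandomState, or None"
    )


shared = np.random.RandomState(1)
passthrough = normalize_random_state(shared)
first = normalize_random_state(0).randint(0, 100, 5).tolist()
second = normalize_random_state(0).randint(0, 100, 5).tolist()
```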
looks fine, can you add a reference in doc/source/reference/groupby.rst
also a mention / small example in user_guide/groupby.rst if appropriate
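A small example of the kind that might suit the user guide, assuming a pandas version that includes `groupby(...).sample` (1.1.0 or later). The frame and column names are invented for illustration.

```python
import pandas as pd

# Illustrative data: two groups of four rows each.
df = pd.DataFrame(
    {
        "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "value": range(8),
    }
)

# Draw one row from each group; random_state makes the draw reproducible.
one_per_group = df.groupby("group").sample(n=1, random_state=1)

# Draw half of each group instead of a fixed count.
half_per_group = df.groupby("group").sample(frac=0.5, random_state=1)
```

Each result keeps the original index labels, so the sampled rows can be traced back to `df`.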
thanks @dsaxton very nice!
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff