Use np.random's RandomState when seed is None #13161

ariddell · 2016-05-12T19:05:29Z

closes sample not using numpy's random state #13143
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

The handle for numpy's current random state is
np.random.mtrand._rand.

Compare https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L573
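
To illustrate what "handle" means here, the sketch below checks that the public module-level functions and np.random.mtrand._rand draw from the same stream. It relies on the private attribute named above, so treat it as a demonstration only, not a recommended pattern; the seed value 42 is arbitrary.

```python
import numpy as np

# Seed the global state, then draw through the public module-level function.
np.random.seed(42)
via_module = np.random.rand(3)

# Reseed identically and draw through the private handle instead.
np.random.seed(42)
via_handle = np.random.mtrand._rand.rand(3)

# Both paths consume the same underlying RandomState, so the draws match.
print(np.array_equal(via_module, via_handle))
```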

shoyer · 2016-05-12T21:19:53Z

pandas/core/common.py

@@ -2072,7 +2072,7 @@ def _random_state(state=None):
    elif isinstance(state, np.random.RandomState):
        return state
    elif state is None:
-        return np.random.RandomState()
+        return np.random.mtrand._rand


Can we use something like this instead?

    seed = np.random.randint(2 ** 32)
    return np.random.RandomState(seed)

That way we don't need to use NumPy's private API.

jorisvandenbossche · 2016-05-12

I don't think that will do the same thing?

The point is to pick up a previous call to np.random.seed(), so feeding it another seed will not give the desired result, I think?

shoyer · 2016-05-12

We want results to be reproducible after calling np.random.seed(). But we don't need to reuse exactly the same seed -- it's OK to also use a seed derived from NumPy's seed.

Using the exact same seed would only be necessary if we wanted to promise that sample makes the exact same choice as np.random.choice. But that's really an implementation detail.

(Clearly a comment in the code would also be in order if we use my suggested approach.)

jorisvandenbossche · 2016-05-12

Aha, I see.
For me it does not really matter which approach we take (sklearn also uses _rand, so although it is private it seems OK).
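
The derived-seed idea discussed in this review thread can be sketched as follows. The helper name derived_random_state is hypothetical, and the bound 2 ** 31 - 1 is an assumption swapped in for the 2 ** 32 from the comment, so that the call also works where NumPy's default integer is 32-bit.

```python
import numpy as np


def derived_random_state():
    """Return a fresh RandomState seeded from NumPy's global state.

    Hypothetical helper illustrating the suggestion in the thread.
    """
    # Assumption: 2 ** 31 - 1 instead of 2 ** 32, so the exclusive upper
    # bound fits a 32-bit default integer on platforms where that applies.
    seed = np.random.randint(2 ** 31 - 1)
    return np.random.RandomState(seed)


# Seeding the global state makes the derived generator reproducible too.
np.random.seed(0)
first = derived_random_state().rand(3)
np.random.seed(0)
second = derived_random_state().rand(3)
print(np.array_equal(first, second))

# The derived stream is a different stream from the global one, though.
np.random.seed(0)
direct = np.random.rand(3)
print(np.array_equal(first, direct))
```

This shows the trade-off raised above: results stay reproducible after np.random.seed(), but they are not the values that drawing from the global state directly would give.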

jreback · 2016-05-13T00:23:11Z

@ariddell pls also add a whatsnew entry. I would do it in API changes section.

ariddell · 2016-05-16T17:22:33Z

@jreback added. Thanks!

shoyer · 2016-05-16T17:53:25Z

@ariddell could you please fix the use of NumPy's private state as noted above? This is something simple that could avoid significant pain down the road.

ariddell · 2016-05-16T21:34:03Z

Fair enough. What should the seed be? Did you mean 2 ** 32 above?

shoyer · 2016-05-16T22:04:07Z

@ariddell Yes, I meant 2 ** 32 :).

rkern · 2016-05-17T07:11:42Z

Using np.random.mtrand._rand is the right approach. It's not going to disappear. This is an authorized use.

shoyer · 2016-05-17T08:11:38Z

Let's just return np.random from _random_state instead of np.random.mtrand._rand, using @rkern's suggestion of duck typing for RandomState objects from the related thread on numpy-discussion: https://mail.scipy.org/pipermail/numpy-discussion/2016-May/075489.html
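
A minimal sketch of the duck-typing approach, assuming a helper shaped like the one in the diff. The name random_state_sketch, the integer branch, and the error message are assumptions for illustration; only the RandomState and None branches appear in this thread.

```python
import numpy as np


def random_state_sketch(state=None):
    """Hypothetical stand-in for the helper touched by this PR."""
    if isinstance(state, (int, np.integer)):
        # Assumed branch: an integer is treated as a seed.
        return np.random.RandomState(state)
    elif isinstance(state, np.random.RandomState):
        return state
    elif state is None:
        # Duck typing: the np.random module exposes the same sampling
        # functions (choice, rand, ...) as a RandomState instance.
        return np.random
    raise ValueError("state must be an int, a RandomState, or None")


# With state=None the helper follows the global seed.
np.random.seed(7)
picked = random_state_sketch().choice(10, size=2, replace=False)
np.random.seed(7)
expected = np.random.choice(10, size=2, replace=False)
print(np.array_equal(picked, expected))
```

Callers only invoke methods such as choice() on the returned object, so it does not matter that np.random is a module and not a RandomState instance.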

ariddell · 2016-05-17T20:48:48Z

@shoyer duck typing it is.

shoyer · 2016-05-17T21:00:30Z

pandas/core/common.py

@@ -2072,7 +2072,7 @@ def _random_state(state=None):
    elif isinstance(state, np.random.RandomState):
        return state
    elif state is None:
-        return np.random.RandomState()
+        return np.random


can you please update the docstring, too?

ariddell · 2016-05-18

Yes. Fixed.


jreback · 2016-05-18T13:10:36Z

you have a failing test

jreback · 2016-05-20T14:07:47Z

@ariddell can you rebase / update

The handle for numpy's current random state is ``np.random.mtrand._rand``. Rather than use the private API, return np.random, as the module makes available the same functions as an instance of RandomState. Compare https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L573

ariddell · 2016-05-20T14:57:00Z

OK. I think I've got it.

jreback · 2016-05-21T14:38:25Z

thanks @ariddell

dolan-a · 2017-02-09T18:25:58Z

I'm not getting the behavior I'd expect from this fix in Pandas 0.19.2. Example:

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(12345678)
>>> df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> df.sample(n=2)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
0  1.281455 -0.423852  0.011236 -0.945194 -0.088124
>>> df.sample(n=2)
          a         b         c         d         e
2  0.341751 -1.158737  0.313814  1.827552 -0.351045
9  1.355003  0.532651  0.250463 -0.281751 -0.342741
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
6 -0.401077 -0.047115 -0.410951 -1.608354  0.290594
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
6 -0.401077 -0.047115 -0.410951 -1.608354  0.290594
>>> pd.__version__
u'0.19.2'
>>> np.__version__
'1.12.0'

Any thoughts on why df.sample(n=2) doesn't match df.sample(n=2, random_state=12345678)?

shoyer · 2017-02-09T18:27:47Z

@pydolan is there a missing np.random.seed() call somewhere in your example?

dolan-a · 2017-02-09T18:29:14Z

@shoyer -- sorry about that; will update example with my actual code (i.e., with the one missing line)

shoyer · 2017-02-09T18:42:37Z

@pydolan It looks like you are calling np.random.randn() after setting the seed but before df.sample().
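
The effect of an intervening call can be reproduced with NumPy alone. This is a sketch under the assumption that rand(3) stands in for whatever sample() draws internally; it is not the pandas implementation.

```python
import numpy as np

seed = 12345678

# Seed, then consume numbers from the global stream (as building the
# DataFrame with randn does) before the draw of interest.
np.random.seed(seed)
_ = np.random.randn(10, 5)
after_randn = np.random.rand(3)

# A generator freshly created from the same seed starts at the beginning
# of the stream, which is what passing random_state=seed amounts to.
fresh = np.random.RandomState(seed).rand(3)

# Seeding and drawing immediately, with nothing in between.
np.random.seed(seed)
immediately = np.random.rand(3)

print(np.array_equal(after_randn, fresh))
print(np.array_equal(immediately, fresh))
```

Only the draw made immediately after seeding lines up with the freshly seeded generator; the randn call in between moves the global stream forward.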

dolan-a · 2017-02-09T20:42:15Z

If I move the seed() call after instantiating the dataframe, I still get inconsistent behavior with calls to sample(), except when I provide the random_state arg. Example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> np.random.seed(12345678)
>>> df.sample(n=2)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679
>>> df.sample(n=2)
          a         b         c         d         e
4  0.895497  0.609853 -1.548664 -1.238415 -1.058904
5  0.196420  0.472877 -0.918205  1.019862 -0.631993
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679

Note that this time, the first call to sample() uses the seed, but the second call does not. Is it expected that seed() needs to be called before every sample() call? I thought it was supposed to be "set once", with all future randomization-related calls using it (including my original example, where randn() is called after seed() and before sample()).

(Also note that I did verify that calling seed() before every call to sample() does indeed produce the same sampled rows.)

shoyer · 2017-02-09T21:00:40Z

> Is it expected that seed() needs to be called before every sample() call? I thought it was supposed to be "set once", with all future randomization-related calls using it (including my original example, where randn() is called after seed() and before sample()).

This is working as intended. Try just sampling random numbers from numpy -- it does the exact same thing.
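
The suggestion to try plain NumPy can be sketched like this. The helper name two_draws is hypothetical, and rand(3) is just a convenient stand-in for any sampling call.

```python
import numpy as np


def two_draws(seed):
    """Seed the global state once, then draw twice (hypothetical helper)."""
    np.random.seed(seed)
    return np.random.rand(3), np.random.rand(3)


a1, a2 = two_draws(12345678)
b1, b2 = two_draws(12345678)

# Within one run, the second draw continues the stream, so it differs
# from the first draw.
print(np.array_equal(a1, a2))

# Across runs, the whole sequence repeats: first matches first, second
# matches second.
print(np.array_equal(a1, b1), np.array_equal(a2, b2))
```

Seeding once fixes the sequence of results, not each individual result, which is why consecutive unseeded sample() calls return different rows.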

dolan-a · 2017-02-09T21:16:38Z

Good point; I completely spaced on the nature of random sampling. Sorry for the waste of time!

shoyer reviewed May 12, 2016

jreback added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and Compat (pandas objects compatibility with Numpy or Python functions) labels May 13, 2016

shoyer reviewed May 17, 2016

jreback added this to the 0.18.2 milestone May 18, 2016

jreback closed this in 6f90340 May 21, 2016

ariddell deleted the feature/sample-numpy-random-seed branch February 18, 2017 16:53
