Skip to content

Use np.random's RandomState when seed is None #13161

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Use np.random's RandomState when seed is None #13161

wants to merge 1 commit into from

Conversation

ariddell
Copy link

@ariddell ariddell commented May 12, 2016

The handle for numpy's current random state is
np.random.mtrand._rand.

Compare https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L573

@@ -2072,7 +2072,7 @@ def _random_state(state=None):
elif isinstance(state, np.random.RandomState):
return state
elif state is None:
return np.random.RandomState()
return np.random.mtrand._rand
Copy link
Member

@shoyer shoyer May 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use something like this instead?

seed = np.random.randint(2 ** 32)
return np.random.RandomState(seed)

That way we don't need to use NumPy's private API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that will do the same?

As this is to catch a previously called np.random.seed(), so feeding it another seed will not give the desired result I think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want results to be reproducible after calling np.random.seed(). But we don't need to reuse exactly the same seed -- it's OK to also use a seed derived from NumPy's seed.

Using the exact same seed would only be necessary if we want to promise the sample makes the exact same choice as np.random.choice. But that's really an implementation detail.

(Clearly a comment in the code would also be in order if we use my suggested approach.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I see.
For me it does not really matter which approach we take (sklearn also uses _rand, so although it is private it seems ok)

@jreback jreback added Numeric Operations Arithmetic, Comparison, and Logical operations Compat pandas objects compatability with Numpy or Python functions labels May 13, 2016
@jreback
Copy link
Contributor

jreback commented May 13, 2016

@ariddell pls also add a whatsnew entry. I would do it in API changes section.

@ariddell
Copy link
Author

@jreback added. Thanks!

@shoyer
Copy link
Member

shoyer commented May 16, 2016

@ariddell could you please fix the use of NumPy's private state as noted above? This is something simple that could avoid significant pain down the road.

@ariddell
Copy link
Author

Fair enough. What should the seed be? Did you mean 2 ** 32 above?

@shoyer
Copy link
Member

shoyer commented May 16, 2016

@ariddell Yes, I meant 2 ** 32 :).

@rkern
Copy link
Contributor

rkern commented May 17, 2016

Using np.random.mtrand._rand is the right approach. It's not going to disappear. This is an authorized use.

@shoyer
Copy link
Member

shoyer commented May 17, 2016

Let's just return np.random from _random_state instead of np.random.mtrand._rand, using @rkern's suggestion of duck typing for RandomState objects from the related thread on numpy-discussion: https://mail.scipy.org/pipermail/numpy-discussion/2016-May/075489.html

@ariddell
Copy link
Author

@shoyer duck typing it is.

@@ -2072,7 +2072,7 @@ def _random_state(state=None):
elif isinstance(state, np.random.RandomState):
return state
elif state is None:
return np.random.RandomState()
return np.random
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please update the docstring, too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Fixed.

On 05/17, Stephan Hoyer wrote:

@@ -2072,7 +2072,7 @@ def _random_state(state=None):
elif isinstance(state, np.random.RandomState):
return state
elif state is None:

  •    return np.random.RandomState()
    
  •    return np.random
    

can you please update the docstring, too?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/pydata/pandas/pull/13161/files/c8a38f99637ad44dad0db9118b17fd4e3c8643f3#r63603099

@jreback jreback added this to the 0.18.2 milestone May 18, 2016
@jreback
Copy link
Contributor

jreback commented May 18, 2016

you have a failing test

@jreback
Copy link
Contributor

jreback commented May 20, 2016

@ariddell can you rebase / update

The handle for numpy's current random state is
``np.random.mtrand._rand``. Rather than use the private API, return
np.random, as the module makes available the same functions as an
instance of RandomState.

Compare https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L573
@ariddell
Copy link
Author

OK. I think I've got it.

@jreback jreback closed this in 6f90340 May 21, 2016
@jreback
Copy link
Contributor

jreback commented May 21, 2016

thanks @ariddell

@dolan-a
Copy link

dolan-a commented Feb 9, 2017

I'm not getting the behavior I'd expect from this fix in Pandas 0.19.2. Example:

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(12345678)
>>> df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> df.sample(n=2)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
0  1.281455 -0.423852  0.011236 -0.945194 -0.088124
>>> df.sample(n=2)
          a         b         c         d         e
2  0.341751 -1.158737  0.313814  1.827552 -0.351045
9  1.355003  0.532651  0.250463 -0.281751 -0.342741
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
6 -0.401077 -0.047115 -0.410951 -1.608354  0.290594
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
6 -0.401077 -0.047115 -0.410951 -1.608354  0.290594
>>> pd.__version__
u'0.19.2'
>>> np.__version__
'1.12.0'

Any thoughts on why df.sample(n=2) doesn't match df.sample(n=2, random_state=12345678)?

@shoyer
Copy link
Member

shoyer commented Feb 9, 2017

@pydolan is there a missing np.random.seed() call somewhere in your example?

@dolan-a
Copy link

dolan-a commented Feb 9, 2017

@shoyer -- sorry about that; will update example with my actual code (i.e., with the one missing line)

@shoyer
Copy link
Member

shoyer commented Feb 9, 2017

@pydolan It looks like you are calling np.random.randn() after setting the seed before df.sample()

@dolan-a
Copy link

dolan-a commented Feb 9, 2017

If I move the seed() call after instantiating the dataframe, I still get inconsistent behavior with calls to sample(), except when I provide the random_state arg. Example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> np.random.seed(12345678)
>>> df.sample(n=2)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679
>>> df.sample(n=2)
          a         b         c         d         e
4  0.895497  0.609853 -1.548664 -1.238415 -1.058904
5  0.196420  0.472877 -0.918205  1.019862 -0.631993
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679

Note that this time, the first call to sample() uses the seed, but the second call does not use the seed. Is it expected that seed() needs to be called before every sample call? I thought it is supposed to be, "set once", and all future randomization-related calls should use (including my original example, where randn() is called after seed() and before sample()).

(Also note that I did verify that calling seed() before every call to sample() does indeed produce the same sampled rows.)

@shoyer
Copy link
Member

shoyer commented Feb 9, 2017

Is it expected that seed() needs to be called before every sample call? I thought it is supposed to be, "set once", and all future randomization-related calls should use (including my original example, where randn() is called after seed() and before sample()).

This is working as intended. Try just sampling random numbers from numpy -- it does the exact same thing.

@dolan-a
Copy link

dolan-a commented Feb 9, 2017

Good point; I completely spaced on the nature of random sampling. Sorry for the waste of time!

@ariddell ariddell deleted the feature/sample-numpy-random-seed branch February 18, 2017 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sample not using numpy's random state
6 participants