sample not using numpy's random state #13143

ariddell · 2016-05-11T17:19:34Z

After fixing a random seed with numpy.random.seed, I expect sample to yield the same results on every call.

I expected the same behavior as numpy.random.choice, but found something different. Here is pandas:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame(np.arange(1000))

In [12]: np.random.seed(5); df.sample(2)
Out[12]: 
       0
824  824
225  225

In [13]: np.random.seed(5); df.sample(2)
Out[13]: 
       0
182  182
586  586

Whereas numpy.random.choice is consistent:

In [6]: np.random.seed(5); np.random.choice(1000)
Out[6]: 867

In [7]: np.random.seed(5); np.random.choice(1000)
Out[7]: 867

Output of `pd.show_versions()`:

In [8]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-67-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.1
pip: 8.1.1
setuptools: 18.4
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None


jreback · 2016-05-11T22:08:45Z

You have to pass the state in via `random_state`. IIRC it was designed this way on purpose.

In [10]: df = pd.DataFrame(np.arange(1000))

In [12]: df.sample(2, random_state=2)
Out[12]: 
       0
37    37
726  726

In [13]: df.sample(2, random_state=2)
Out[13]: 
       0
37    37
726  726

@nickeubank @jorisvandenbossche
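
For reference, a minimal sketch of two ways to pass the state in. It assumes `random_state` accepts either an integer seed or a `numpy.random.RandomState` instance; the variable names are illustrative and not from the thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1000))

# Integer seed: every call starts from the same state, so the rows repeat.
first = df.sample(2, random_state=2)
second = df.sample(2, random_state=2)
print(first.equals(second))

# Explicit RandomState object: the caller owns the state, so consecutive
# calls advance it and will generally draw different rows.
rng = np.random.RandomState(2)
third = df.sample(2, random_state=rng)
fourth = df.sample(2, random_state=rng)
print(third.index.tolist(), fourth.index.tolist())
```

Passing an integer is the simplest route to a repeatable draw; passing a `RandomState` object is useful when several sampling calls should share one reproducible stream.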

jorisvandenbossche · 2016-05-11T23:49:03Z

I think we should provide the proposed behaviour (besides numpy, this is also how e.g. sklearn's train_test_split behaves).

It would be a change to this line: https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L2075. Following sklearn, when no seed is given we should return np.random.mtrand._rand (numpy's global RandomState) instead (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L573)
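
A rough sketch of what such a helper could look like, loosely modeled on sklearn's approach. The name `resolve_random_state` and its exact branches are hypothetical, not the actual pandas code at the linked line.

```python
import numpy as np


def resolve_random_state(state=None):
    """Hypothetical helper: turn a seed argument into a RandomState."""
    if state is None:
        # No seed given: fall back to numpy's global RandomState singleton,
        # which is the object that np.random.seed() reseeds.
        return np.random.mtrand._rand
    if isinstance(state, (int, np.integer)):
        # Integer seed: build a fresh, independent generator.
        return np.random.RandomState(state)
    if isinstance(state, np.random.RandomState):
        # Already a generator: use it as-is.
        return state
    raise ValueError("state must be None, an int, or a RandomState")


# With no seed, draws follow the global state, just like np.random.choice.
np.random.seed(5)
a = resolve_random_state().choice(1000)
np.random.seed(5)
b = np.random.choice(1000)
print(a == b)
```

The point of the `None` branch is that callers who already seed numpy globally get reproducible results without having to thread a `random_state` argument through every call.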

@ariddell interested in doing a PR?

jreback · 2016-05-11T23:52:35Z

Ahh I see, so that will then use the global state. Makes sense.
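
A short check of what using the global state means for the original report. This is a sketch that assumes a pandas version in which `sample` falls back to numpy's global state when no `random_state` is given; on 0.18.1 the first comparison would be expected to fail, as shown in the report.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1000))

# Reseed the global state before each call.
np.random.seed(5)
first = df.sample(2)
np.random.seed(5)
second = df.sample(2)

# If sample() draws from the global state, the two draws are identical.
print(first.equals(second))

# Without reseeding, the global state has advanced, so this draw will
# generally differ from the previous two.
third = df.sample(2)
print(third.index.tolist())
```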

ariddell · 2016-05-12T11:13:29Z

Yes, I'll do the PR. Thanks for the pointer to the relevant line.


jreback closed this as completed May 11, 2016

jreback added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and Compat (pandas objects compatibility with Numpy or Python functions) labels May 11, 2016

jorisvandenbossche reopened this May 11, 2016

jreback added this to the 0.18.2 milestone May 11, 2016

jreback added the Difficulty Novice label May 11, 2016

ariddell mentioned this issue May 12, 2016

Use np.random's RandomState when seed is None #13161

Closed

4 tasks

jreback closed this as completed in 6f90340 May 21, 2016
