
Commit 8f0f417

nickeubank authored and jreback committed
ENH: Add sample function with tests and docs (GH2419)
1 parent 5e994b6 commit 8f0f417

7 files changed, +439 -2 lines changed

doc/source/api.rst

+3
@@ -390,6 +390,7 @@ Reindexing / Selection / Label manipulation
    Series.reindex_like
    Series.rename
    Series.reset_index
+   Series.sample
    Series.select
    Series.take
    Series.tail
@@ -824,6 +825,7 @@ Reindexing / Selection / Label manipulation
    DataFrame.reindex_like
    DataFrame.rename
    DataFrame.reset_index
+   DataFrame.sample
    DataFrame.select
    DataFrame.set_index
    DataFrame.tail
@@ -1072,6 +1074,7 @@ Reindexing / Selection / Label manipulation
    Panel.reindex_axis
    Panel.reindex_like
    Panel.rename
+   Panel.sample
    Panel.select
    Panel.take
    Panel.truncate

doc/source/indexing.rst

+75
@@ -508,6 +508,81 @@ A list of indexers where any element is out of bounds will raise an

 .. _indexing.basics.partial_setting:

+Selecting Random Samples
+------------------------
+.. versionadded:: 0.16.1
+
+A random selection of rows or columns from a Series, DataFrame, or Panel can be drawn with the :meth:`~DataFrame.sample` method. By default the method samples rows, and it accepts either a specific number of rows/columns to return or a fraction of rows.
+
+.. ipython :: python
+
+   s = Series([0,1,2,3,4,5])
+
+   # When no arguments are passed, returns 1 row.
+   s.sample()
+
+   # One may specify either a number of rows:
+   s.sample(n=3)
+
+   # Or a fraction of the rows:
+   s.sample(frac=0.5)
+
+By default, ``sample`` will return each row at most once, but one can also sample with replacement
+using the ``replace`` option:
+
+.. ipython :: python
+
+   s = Series([0,1,2,3,4,5])
+
+   # Without replacement (default):
+   s.sample(n=6, replace=False)
+
+   # With replacement:
+   s.sample(n=6, replace=True)
+
+
+By default, each row has an equal probability of being selected, but if you want rows
+to have different probabilities, you can pass the ``sample`` function sampling weights as
+``weights``. These weights can be a list, a numpy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example:
+
+.. ipython :: python
+
+   s = Series([0,1,2,3,4,5])
+   example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
+   s.sample(n=3, weights=example_weights)
+
+   # Weights will be re-normalized automatically
+   example_weights2 = [0.5, 0, 0, 0, 0, 0]
+   s.sample(n=1, weights=example_weights2)
+
+When applied to a DataFrame, you can use a column of the DataFrame as sampling weights
+(provided you are sampling rows and not columns) by simply passing the name of the column
+as a string.
+
+.. ipython :: python
+
+   df2 = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
+   df2.sample(n = 3, weights = 'weight_column')
+
+``sample`` also allows users to sample columns instead of rows using the ``axis`` argument.
+
+.. ipython :: python
+
+   df3 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
+   df3.sample(n=1, axis=1)
+
+Finally, one can also set a seed for ``sample``'s random number generator using the ``random_state`` argument, which will accept either an integer (as a seed) or a numpy RandomState object.
+
+.. ipython :: python
+
+   df4 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
+
+   # With a given seed, the sample will always draw the same rows.
+   df4.sample(n=2, random_state=2)
+   df4.sample(n=2, random_state=2)
+
+
+
 Setting With Enlargement
 ------------------------
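
The weight rules described in the indexing.rst text above (missing values count as zero, infinite or negative weights are rejected, and the remaining weights are rescaled to sum to 1) can be sketched in plain Python. This is an illustration only, not the pandas implementation: the name `normalize_weights` is hypothetical, the sketch uses the standard library instead of pandas/numpy, and rejecting an all-zero vector is an assumption that the diff itself does not make.

```python
import math


def normalize_weights(weights, axis_length):
    """Validate sampling weights and rescale them to sum to 1.

    Hypothetical helper mirroring the documented rules; not part of pandas.
    """
    # Missing values (None or NaN) are treated as a weight of zero.
    cleaned = [
        0.0 if w is None or (isinstance(w, float) and math.isnan(w)) else float(w)
        for w in weights
    ]
    if len(cleaned) != axis_length:
        raise ValueError("weights and axis to be sampled must be of same length")
    if any(math.isinf(w) for w in cleaned):
        raise ValueError("weight vector may not include `inf` values")
    if any(w < 0 for w in cleaned):
        raise ValueError("weight vector may not include negative values")
    total = sum(cleaned)
    # Assumption: an all-zero vector is rejected here to avoid dividing by zero.
    if total == 0:
        raise ValueError("weights must not all be zero")
    return [w / total for w in cleaned]
```

With the weights from the whatsnew example, `normalize_weights([0.5, 0, 0, 0, None, float('nan')], 6)` puts all of the probability on the first item.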

doc/source/whatsnew/v0.16.1.txt

+44 -1
@@ -12,11 +12,12 @@ Highlights include:
 - Support for a ``CategoricalIndex``, a category based index, see :ref:`here <whatsnew_0161.enhancements.categoricalindex>`
 - New section on how-to-contribute to *pandas*, see :ref`here <contributing>`

+- New method ``sample`` for drawing random samples from Series, DataFrames and Panels. See :ref:`here <whatsnew_0161.enhancements.sample>`
+
 .. contents:: What's new in v0.16.1
     :local:
     :backlinks: none

-
 .. _whatsnew_0161.enhancements:

 Enhancements
@@ -138,6 +139,48 @@ values NOT in the categories, similarly to how you can reindex ANY pandas index.

 See the :ref:`documentation <advanced.categoricalindex>` for more. (:issue:`7629`)

+.. _whatsnew_0161.enhancements.sample:
+
+Sample
+^^^^^^
+
+Series, DataFrames, and Panels now have a new method: :meth:`~pandas.DataFrame.sample`.
+The method accepts a specific number of rows or columns to return, or a fraction of the
+total number of rows or columns. It also has options for sampling with or without replacement,
+for passing in a column for weights for non-uniform sampling, and for setting seed values to
+facilitate replication. (:issue:`2419`)
+
+.. ipython :: python
+
+   example_series = Series([0,1,2,3,4,5])
+
+   # When no arguments are passed, returns 1 row.
+   example_series.sample()
+
+   # One may specify either a number of rows:
+   example_series.sample(n=3)
+
+   # Or a fraction of the rows:
+   example_series.sample(frac=0.5)
+
+   # weights are accepted.
+   example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
+   example_series.sample(n=3, weights=example_weights)
+
+   # weights will also be normalized if they do not sum to one,
+   # and missing values will be treated as zeros.
+   example_weights2 = [0.5, 0, 0, 0, None, np.nan]
+   example_series.sample(n=1, weights=example_weights2)
+
+
+When applied to a DataFrame, one may pass the name of a column to specify sampling weights
+when sampling from rows.
+
+.. ipython :: python
+
+   df = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
+   df.sample(n=3, weights='weight_column')
+
 .. _whatsnew_0161.api:

 API changes
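
The whatsnew entry above mentions sampling with or without replacement and seeding for replication. As a rough standard-library analogue (an assumption-laden sketch, not pandas code and not the method added by this commit), `random.Random.sample` draws without replacement and `random.Random.choices` draws with replacement:

```python
import random

items = [0, 1, 2, 3, 4, 5]

# Without replacement: each item can appear at most once.
rng = random.Random(2)
without_replacement = rng.sample(items, k=6)

# With replacement: the same item may be drawn several times.
with_replacement = rng.choices(items, k=6)

# Re-seeding with the same value reproduces the same draw.
first = random.Random(2).sample(items, k=2)
second = random.Random(2).sample(items, k=2)
```

Here `first` and `second` hold the same two items, which is the same idea as passing a fixed `random_state` to `sample`.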

pandas/core/common.py

+27
@@ -3319,3 +3319,30 @@ def _maybe_match_name(a, b):
     if a_name == b_name:
         return a_name
     return None
+
+def _random_state(state=None):
+    """
+    Helper function for processing random_state arguments.
+
+    Parameters
+    ----------
+    state : int, np.random.RandomState, None.
+        If receives an int, passes to np.random.RandomState() as seed.
+        If receives an np.random.RandomState object, just returns object.
+        If receives `None`, returns an np.random.RandomState object.
+        If receives anything else, raises an informative ValueError.
+        Default None.
+
+    Returns
+    -------
+    np.random.RandomState
+    """
+
+    if is_integer(state):
+        return np.random.RandomState(state)
+    elif isinstance(state, np.random.RandomState):
+        return state
+    elif state is None:
+        return np.random.RandomState()
+    else:
+        raise ValueError("random_state must be an integer, a numpy RandomState, or None")
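
The dispatch in `_random_state` above can be tried without numpy. The following is a dependency-free sketch for illustration only: it swaps numpy's `RandomState` for the standard library's `random.Random`, and the name `resolve_random_state` is hypothetical rather than anything in pandas.

```python
import random


def resolve_random_state(state=None):
    """Return a random.Random built from an int seed, an existing Random, or None."""
    if isinstance(state, int):
        # An integer is used as the seed of a fresh generator.
        return random.Random(state)
    if isinstance(state, random.Random):
        # An existing generator is returned unchanged.
        return state
    if state is None:
        # No argument: a new generator with an unpredictable seed.
        return random.Random()
    raise ValueError("random_state must be an integer, a Random instance, or None")
```

Two generators built from the same seed, such as `resolve_random_state(5)` called twice, produce the same sequence of draws.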

pandas/core/generic.py

+98 -1
@@ -1948,6 +1948,103 @@ def tail(self, n=5):
             return self
         return self.iloc[-n:]

+
+    def sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None):
+        """
+        Returns a random sample of items from an axis of object.
+
+        Parameters
+        ----------
+        n : int, optional
+            Number of items from axis to return. Cannot be used with `frac`.
+            Default = 1 if `frac` = None.
+        frac : float, optional
+            Fraction of axis items to return. Cannot be used with `n`.
+        replace : boolean, optional
+            Sample with or without replacement. Default = False.
+        weights : str or ndarray-like, optional
+            Default 'None' results in equal probability weighting.
+            If called on a DataFrame, will accept the name of a column
+            when axis = 0.
+            Weights must be same length as axis being sampled.
+            If weights do not sum to 1, they will be normalized to sum to 1.
+            Missing values in the weights column will be treated as zero.
+            inf and -inf values not allowed.
+        random_state : int or numpy.random.RandomState, optional
+            Seed for the random number generator (if int), or numpy RandomState
+            object.
+        axis : int or string, optional
+            Axis to sample. Accepts axis number or name. Default is stat axis
+            for given data type (0 for Series and DataFrames, 1 for Panels).
+
+        Returns
+        -------
+        Same type as caller.
+        """
+
+        if axis is None:
+            axis = self._stat_axis_number
+
+        axis = self._get_axis_number(axis)
+        axis_length = self.shape[axis]
+
+        # Process random_state argument
+        rs = com._random_state(random_state)
+
+        # Check weights for compliance
+        if weights is not None:
+
+            # Strings acceptable if a dataframe and axis = 0
+            if isinstance(weights, string_types):
+                if isinstance(self, pd.DataFrame):
+                    if axis == 0:
+                        try:
+                            weights = self[weights]
+                        except KeyError:
+                            raise KeyError("String passed to weights not a valid column")
+                    else:
+                        raise ValueError("Strings can only be passed to weights when sampling from rows on a DataFrame")
+                else:
+                    raise ValueError("Strings cannot be passed as weights when sampling from a Series or Panel.")
+
+            weights = pd.Series(weights, dtype='float64')
+
+            if len(weights) != axis_length:
+                raise ValueError("Weights and axis to be sampled must be of same length")
+
+            if (weights == np.inf).any() or (weights == -np.inf).any():
+                raise ValueError("weight vector may not include `inf` values")
+
+            if (weights < 0).any():
+                raise ValueError("weight vector may not include negative values")
+
+            # If has nan, set to zero.
+            weights = weights.fillna(0)
+
+            # Renormalize if don't sum to 1
+            if weights.sum() != 1:
+                weights = weights / weights.sum()
+
+            weights = weights.values
+
+        # If no frac or n, default to n=1.
+        if n is None and frac is None:
+            n = 1
+        elif n is not None and frac is None and n % 1 != 0:
+            raise ValueError("Only integers accepted as `n` values")
+        elif n is None and frac is not None:
+            n = int(round(frac * axis_length))
+        elif n is not None and frac is not None:
+            raise ValueError('Please enter a value for `frac` OR `n`, not both')
+
+        # Check for negative sizes
+        if n < 0:
+            raise ValueError("A negative number of rows requested. Please provide positive value.")
+
+        locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
+        return self.take(locs, axis=axis)
+
+
     #----------------------------------------------------------------------
     # Attribute access

@@ -3395,7 +3492,7 @@ def where(self, cond, other=np.nan, inplace=False, axis=None, level=None,

 matches = (new_other == np.array(other))
 if matches is False or not matches.all():
-
+
 # coerce other to a common dtype if we can
 if com.needs_i8_conversion(self.dtype):
 try:
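
The branch in `sample` (pandas/core/generic.py, above) that turns `n` and `frac` into a single count can be read in isolation. A minimal standalone sketch follows, assuming the same rules as the diff; `resolve_sample_size` is a hypothetical name, not a pandas function.

```python
def resolve_sample_size(axis_length, n=None, frac=None):
    """Return the number of items to draw, given `n` or `frac` (not both)."""
    if n is not None and frac is not None:
        raise ValueError("Please enter a value for `frac` OR `n`, not both")
    if n is None and frac is None:
        # Neither argument given: draw a single item.
        n = 1
    elif n is None:
        # Only `frac` given: convert the fraction to a rounded count.
        n = int(round(frac * axis_length))
    elif n % 1 != 0:
        raise ValueError("Only integers accepted as `n` values")
    if n < 0:
        raise ValueError("A negative number of rows requested.")
    return int(n)
```

One detail worth knowing when reading the `frac` branch: under Python 3, `round` rounds halves to the nearest even integer, so a fraction of 0.25 applied to 10 items yields 2 rather than 3.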

pandas/tests/test_common.py

+20
@@ -524,6 +524,26 @@ def test_is_recompilable():
     for f in fails:
         assert not com.is_re_compilable(f)

+def test_random_state():
+    import numpy.random as npr
+    # Check with seed
+    state = com._random_state(5)
+    assert_equal(state.uniform(), npr.RandomState(5).uniform())
+
+    # Check with random state object
+    state2 = npr.RandomState(10)
+    assert_equal(com._random_state(state2).uniform(), npr.RandomState(10).uniform())
+
+    # check with no arg random state
+    assert isinstance(com._random_state(), npr.RandomState)
+
+    # Error for floats or strings
+    with tm.assertRaises(ValueError):
+        com._random_state('test')
+
+    with tm.assertRaises(ValueError):
+        com._random_state(5.5)
+

 class TestTake(tm.TestCase):
     # standard incompatible fill error
