Commit 8b506a3

nickeubank (Nick Eubank) authored and Nick Eubank committed
Add sample function with tests and docs
1 parent 76571d0 commit 8b506a3

6 files changed

+410
-0
lines changed

doc/source/api.rst

+3
@@ -390,6 +390,7 @@ Reindexing / Selection / Label manipulation
390390
Series.reindex_like
391391
Series.rename
392392
Series.reset_index
393+
Series.sample
393394
Series.select
394395
Series.take
395396
Series.tail
@@ -713,6 +714,7 @@ Indexing, iteration
713714
DataFrame.where
714715
DataFrame.mask
715716
DataFrame.query
717+
DataFrame.sample
716718

717719
For more information on ``.at``, ``.iat``, ``.ix``, ``.loc``, and
718720
``.iloc``, see the :ref:`indexing documentation <indexing>`.
@@ -823,6 +825,7 @@ Reindexing / Selection / Label manipulation
823825
DataFrame.reindex_like
824826
DataFrame.rename
825827
DataFrame.reset_index
828+
DataFrame.sample
826829
DataFrame.select
827830
DataFrame.set_index
828831
DataFrame.tail

doc/source/indexing.rst

+75
@@ -508,6 +508,81 @@ A list of indexers where any element is out of bounds will raise an
508508
509509
.. _indexing.basics.partial_setting:
510510

511+
Selecting Random Samples
512+
------------------------
513+
.. versionadded:: 0.16.1
514+
515+
A random selection of rows or columns from a Series, DataFrame, or Panel can be obtained with the ``.sample()`` method. The method samples rows by default, and accepts either a specific number of rows/columns to return or a fraction of the rows/columns.
516+
517+
.. ipython :: python
518+
519+
s = Series([0,1,2,3,4,5])
520+
521+
# When no arguments are passed, returns 1 row.
522+
s.sample()
523+
524+
# One may specify either a number of rows:
525+
s.sample(n = 3)
526+
527+
# Or a fraction of the rows:
528+
s.sample(frac = 0.5)
529+
530+
By default, ``sample`` will return each row at most once, but one can also sample with replacement
531+
using the ``replace`` option:
532+
533+
.. ipython :: python
534+
535+
s = Series([0,1,2,3,4,5])
536+
537+
# Without replacement (default):
538+
s.sample(n = 6, replace = False)
539+
540+
# With replacement:
541+
s.sample(n = 6, replace = True)
542+
543+
544+
By default, each row has an equal probability of being selected, but if you want rows
545+
to have different probabilities, you can pass the ``sample`` function sampling weights as
546+
``weights``. These weights can be a list, a numpy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example:
547+
548+
.. ipython :: python
549+
550+
s = Series([0,1,2,3,4,5])
551+
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
552+
s.sample(n=3, weights = example_weights)
553+
554+
# Weights will be re-normalized automatically
555+
example_weights2 = [0.5, 0, 0, 0, 0, 0]
556+
s.sample(n=1, weights= example_weights2)
557+
558+
When applied to a DataFrame, you can use a column of the DataFrame as sampling weights
559+
(provided you are sampling rows and not columns) by simply passing the name of the column
560+
as a string.
561+
562+
.. ipython :: python
563+
564+
df2 = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
565+
df2.sample(n = 3, weights = 'weight_column')
566+
567+
``sample`` also allows users to sample columns instead of rows using the ``axis`` argument.
568+
569+
.. ipython :: python
570+
571+
df3 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
572+
df3.sample(n=1, axis = 1)
573+
574+
Finally, one can also set a seed for ``sample``'s random number generator using the ``random_state`` argument, which will accept either an integer (as a seed) or a numpy RandomState object.
575+
576+
.. ipython :: python
577+
578+
df4 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
579+
580+
# With a given seed, the sample will always draw the same rows.
581+
df4.sample(n=2, random_state = 2)
582+
df4.sample(n=2, random_state = 2)
583+
584+
585+
511586
Setting With Enlargement
512587
------------------------
513588
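
The weight handling described in the new "Selecting Random Samples" section can be sketched without pandas. The helper below, `normalize_weights`, is a hypothetical name and not part of this commit. It uses only the standard library and assumes the rules stated in the docs: weights must match the length of the sampled axis, infinite and negative values are rejected, missing values count as zero, and the result is rescaled to sum to 1.

```python
import math


def normalize_weights(weights, axis_length):
    """Hypothetical helper mirroring the documented weight rules."""
    if len(weights) != axis_length:
        raise ValueError("Weights and axis to be sampled must be of same length")

    cleaned = []
    for w in weights:
        # Missing values (None or NaN) are treated as a weight of zero.
        if w is None or math.isnan(w):
            cleaned.append(0.0)
            continue
        if math.isinf(w):
            raise ValueError("weight vector may not include `inf` values")
        if w < 0:
            raise ValueError("weight vector may not include negative values")
        cleaned.append(float(w))

    total = sum(cleaned)
    if total == 0:
        # Assumption of this sketch: all-zero weights are rejected here,
        # since they cannot be rescaled into probabilities.
        raise ValueError("weights sum to zero")
    return [w / total for w in cleaned]


print(normalize_weights([0.5, 0, 0, 0, None, float("nan")], 6))
print(normalize_weights([1, 1, 2], 3))
```

The zero-sum check is an addition of this sketch; the diff as shown divides by `weights.sum()` without it.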

doc/source/whatsnew/v0.16.1.txt

+40
@@ -20,6 +20,7 @@ Highlights include:
2020

2121
Enhancements
2222
~~~~~~~~~~~~
23+
.. _whatsnew_0161.enhancements.sample:
2324

2425
- Added ``StringMethods.capitalize()`` and ``swapcase`` which behave as the same as standard ``str`` (:issue:`9766`)
2526
- Added ``StringMethods`` (.str accessor) to ``Index`` (:issue:`9068`)
@@ -135,6 +136,45 @@ values NOT in the categories, similarly to how you can reindex ANY pandas index.
135136

136137
See the :ref:`documentation <advanced.categoricalindex>` for more. (:issue:`7629`)
137138

139+
Sample
140+
^^^^^^^^^^^^^^^^
141+
142+
Series, DataFrames, and Panels now have a new method: :meth:`~pandas.DataFrame.sample`.
143+
The method accepts a specific number of rows or columns to return, or a fraction of the
144+
total number of rows or columns. It also has options for sampling with or without replacement,
145+
for passing in a column for weights for non-uniform sampling, and for setting seed values to facilitate replication.
146+
147+
.. ipython :: python
148+
149+
example_series = Series([0,1,2,3,4,5])
150+
151+
# When no arguments are passed, returns 1 row.
152+
example_series.sample()
153+
154+
# One may specify either a number of rows:
155+
example_series.sample(n = 3)
156+
157+
# Or a fraction of the rows:
158+
example_series.sample(frac = 0.5)
159+
160+
# weights are accepted.
161+
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
162+
example_series.sample(n=3, weights = example_weights)
163+
164+
# weights will also be normalized if they do not sum to one,
165+
# and missing values will be treated as zeros.
166+
example_weights2 = [0.5, 0, 0, 0, None, np.nan]
167+
example_series.sample(n=1, weights = example_weights2)
168+
169+
170+
When applied to a DataFrame, one may pass the name of a column to specify sampling weights.
171+
If the values in the weights column do not sum to one, they will be re-normalized.
172+
173+
.. ipython :: python
174+
175+
df = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
176+
df.sample(n = 3, weights = 'weight_column')
177+
138178
.. _whatsnew_0161.api:
139179

140180
API changes

pandas/core/common.py

+11
@@ -3319,3 +3319,14 @@ def _maybe_match_name(a, b):
33193319
if a_name == b_name:
33203320
return a_name
33213321
return None
3322+
3323+
def _random_state(state):
3324+
    if isinstance(state, int):
3325+
        return np.random.RandomState(state)
3326+
    elif isinstance(state, np.random.RandomState):
3327+
        return state
3328+
    elif state is None:
3329+
        return np.random.RandomState()
3330+
    else:
3331+
        raise ValueError("random_state must be either an integer or numpy RandomState")
3332+
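
The dispatch in `_random_state` (integer seed, existing generator, or `None`) can be illustrated with a standard-library analogue. This is only a sketch: `make_rng` is a hypothetical name, and it swaps `np.random.RandomState` for `random.Random` so that it runs without numpy. It is not the code added by this commit.

```python
import random


def make_rng(state=None):
    """Stdlib analogue of the `_random_state` dispatch (hypothetical name)."""
    if isinstance(state, int):
        # Integer: use it as a seed for a new generator.
        return random.Random(state)
    elif isinstance(state, random.Random):
        # Existing generator: pass it through unchanged.
        return state
    elif state is None:
        # No seed given: return a fresh generator seeded by the system.
        return random.Random()
    else:
        raise ValueError("state must be an integer, a random.Random, or None")


# Two generators built from the same seed draw the same positions.
first = make_rng(2).sample(range(10), 3)
second = make_rng(2).sample(range(10), 3)
print(first == second)
```

Passing the same seed twice yields identical draws, which is the property the ``random_state`` argument relies on for reproducible samples.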

pandas/core/generic.py

+112
@@ -1948,6 +1948,118 @@ def tail(self, n=5):
19481948
return self
19491949
return self.iloc[-n:]
19501950

1951+
1952+
    def sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=0):
1953+
        """
1954+
        Returns a random sample of items from an axis of object.
1955+
1956+
        Parameters
1957+
        ----------
1958+
        n : int, optional
1959+
            Number of items from axis to return. Cannot be used with `frac`.
1960+
            Default = 1 if `frac` = None.
1961+
        frac : float, optional
1962+
            Fraction of axis items to return. Cannot be used with `n`.
1963+
        replace : boolean, optional
1964+
            Sample with or without replacement. Default = False.
1965+
        weights : str or ndarray-like, optional
1966+
            Default 'None' results in equal probability weighting.
1967+
            If called on a DataFrame or Panel, will also accept the name of a
1968+
            column as a string. Must be same length as axis being sampled.
1969+
            If weights do not sum to 1, they will be normalized to sum to 1.
1970+
            Missing values in the weights column will be treated as zero.
1971+
            inf and -inf values not allowed.
1972+
        random_state : int or numpy.random.RandomState, optional
1973+
            Seed for the random number generator (if int), or numpy RandomState
1974+
            object.
1975+
        axis : int or string, optional
1976+
            Axis to sample. Accepts axis number or name. Default = 0.
1977+
1978+
        Returns
1979+
        -------
1980+
        Same type as caller.
1981+
        """
1982+
1983+
        ###
1984+
        # Processing axis argument
1985+
        ###
1986+
1987+
        # Check validity of axis argument.
1988+
        axis = self._get_axis_number(axis)
1989+
1990+
        # Store length of relevant axis of object.
1991+
        axis_length = self.shape[axis]
1992+
1993+
        ###
1994+
        # Clean / process random_state argument
1995+
        ###
1996+
1997+
        rs = com._random_state(random_state)
1998+
1999+
        ###
2000+
        # Process weight argument
2001+
        ###
2002+
2003+
        # Check weights for compliance
2004+
        if weights is not None:
2005+
2006+
            # Strings are acceptable only when the caller is not a Series
2007+
            if isinstance(weights, string_types):
2008+
2009+
                if self.ndim > 1:
2010+
                    try:
2011+
                        weights = self[weights]
2012+
                    except KeyError:
2013+
                        raise KeyError("String passed to weights not a valid column name")
2014+
2015+
                else:
2016+
                    raise ValueError("Strings cannot be passed as weights when sampling from a Series.")
2017+
2018+
            # Normalize format of weights to a float64 Series.
2019+
            weights = pd.Series(weights, dtype='float64')
2020+
2021+
            # Check length (numpy does this, but has confusing errors with different argument labels.)
2022+
            if len(weights) != axis_length:
2023+
                raise ValueError("Weights and axis to be sampled must be of same length")
2024+
2025+
            # No infs allowed. They cannot be normalized into valid
2026+
            # probabilities (inf / inf is NaN).
2027+
            if (weights == np.inf).any() or (weights == -np.inf).any():
2028+
                raise ValueError("weight vector may not include `inf` values")
2029+
2030+
            if (weights < 0).any():
2031+
                raise ValueError("weight vector may not include negative values")
2032+
2033+
            # If has nan, set to zero. Already know there are no infs.
2034+
            weights = weights.fillna(0)
2035+
2036+
2037+
            # Check that weights sum to 1. If not, renormalize.
2038+
            if weights.sum() != 1:
2039+
                weights = weights / weights.sum()
2040+
2041+
        ###
2042+
        # Process n and frac arguments
2043+
        ###
2044+
2045+
        # Check whether frac or N is passed. If neither, default to N=1.
2046+
        if n is None and frac is None:
2047+
            n = 1
2048+
        elif n is not None and frac is None and n % 1 != 0:
2049+
            raise ValueError("Only integers accepted as `n` values")
2050+
        elif n is None and frac is not None:
2051+
            n = int(round(frac * axis_length))
2052+
        elif n is not None and frac is not None:
2053+
            raise ValueError('Please enter a value for `frac` OR `n`, not both')
2054+
2055+
        # Check for negative sizes
2056+
        if n < 0:
2057+
            raise ValueError("A negative number of rows requested. Please provide positive value.")
2058+
2059+
        locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
2060+
        return self.take(locs, axis=axis)
2061+
2062+
19512063
#----------------------------------------------------------------------
19522064
# Attribute access
19532065
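
The way `sample` reconciles `n` and `frac` can be restated as a small standalone function. This is a sketch for illustration only: `resolve_sample_size` is a hypothetical name that does not appear in the commit, and it simply repeats the branch logic shown in the diff above so the rules can be tried in isolation.

```python
def resolve_sample_size(n=None, frac=None, axis_length=0):
    """Hypothetical helper restating the `n` / `frac` checks in `sample`."""
    if n is None and frac is None:
        # Neither given: default to a single item.
        n = 1
    elif n is not None and frac is None and n % 1 != 0:
        raise ValueError("Only integers accepted as `n` values")
    elif n is None and frac is not None:
        # Convert the fraction into a count along the sampled axis.
        n = int(round(frac * axis_length))
    elif n is not None and frac is not None:
        raise ValueError("Please enter a value for `frac` OR `n`, not both")

    if n < 0:
        raise ValueError("A negative number of rows requested.")
    return n


print(resolve_sample_size(axis_length=6))            # 1
print(resolve_sample_size(n=3, axis_length=6))       # 3
print(resolve_sample_size(frac=0.5, axis_length=6))  # 3
```

Note that on Python 3 the built-in `round` rounds halves to the nearest even integer, so under this sketch `frac=0.5` on a 5-row object resolves to 2 rather than 3.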
