Commit ff4f442

Author: Nick Eubank
fixed small issues jorvisvandenbossche noted
1 parent 4389271 commit ff4f442

4 files changed, +69 -68 lines changed

doc/source/indexing.rst

+28-28
@@ -516,29 +516,29 @@ A random selection of rows or columns from a Series, DataFrame, or Panel with th

 .. ipython :: python

-s = Series([0,1,2,3,4,5])
+s = Series([0,1,2,3,4,5])

-# When no arguments are passed, returns 1 row.
-s.sample()
-
-# One may specify either a number of rows:
-s.sample(n = 3)
+# When no arguments are passed, returns 1 row.
+s.sample()
+
+# One may specify either a number of rows:
+s.sample(n=3)

-# Or a fraction of the rows:
-s.sample(frac = 0.5)
+# Or a fraction of the rows:
+s.sample(frac=0.5)

 By default, ``sample`` will return each row at most once, but one can also sample with replacement
 using the ``replace`` option:

 .. ipython :: python

 s = Series([0,1,2,3,4,5])
-
-# Without replacement (default):
-s.sample(n = 6, replace = False)
-
-# With replacement:
-s.sample(n = 6, replace = True)
+
+# Without replacement (default):
+s.sample(n=6, replace=False)
+
+# With replacement:
+s.sample(n=6, replace=True)


 By default, each row has an equal probability of being selected, but if you want rows
@@ -549,37 +549,37 @@ to have different probabilities, you can pass the ``sample`` function sampling w

 s = Series([0,1,2,3,4,5])
 example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
-s.sample(n=3, weights = example_weights)
-
-# Weights will be re-normalized automatically
-example_weights2 = [0.5, 0, 0, 0, 0, 0]
-s.sample(n=1, weights= example_weights2)
+s.sample(n=3, weights=example_weights)
+
+# Weights will be re-normalized automatically
+example_weights2 = [0.5, 0, 0, 0, 0, 0]
+s.sample(n=1, weights=example_weights2)

 When applied to a DataFrame, you can use a column of the DataFrame as sampling weights
 (provided you are sampling rows and not columns) by simply passing the name of the column
 as a string.
-
+
 .. ipython :: python

-df2 = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
-df2.sample(n = 3, weights = 'weight_column')
+df2 = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
+df2.sample(n = 3, weights = 'weight_column')

 ``sample`` also allows users to sample columns instead of rows using the ``axis`` argument.

 .. ipython :: python

-df3 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
-df3.sample(n=1, axis = 1)
+df3 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
+df3.sample(n=1, axis=1)

 Finally, one can also set a seed for ``sample``'s random number generator using the ``random_state`` argument, which will accept either an integer (as a seed) or a numpy RandomState object.

 .. ipython :: python

-df4 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
+df4 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

-# With a given seed, the sample will always draw the same rows.
-df4.sample(n=2, random_state = 2)
-df4.sample(n=2, random_state = 2)
+# With a given seed, the sample will always draw the same rows.
+df4.sample(n=2, random_state=2)
+df4.sample(n=2, random_state=2)


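
Reviewer sketch (not part of the commit): the documentation above describes the ``replace``, ``random_state`` and ``axis`` options. A minimal way to check those claims might look like the following. It assumes a pandas release where ``sample`` is available and uses the ``pd.`` prefix rather than the bare ``Series``/``DataFrame`` names in the docs; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# Without replacement, drawing all six rows yields each value exactly once.
without = s.sample(n=6, replace=False, random_state=0)

# With replacement, the same value may be drawn more than once.
with_repl = s.sample(n=6, replace=True, random_state=0)

# The same integer seed reproduces the same draw.
first = s.sample(n=2, random_state=2)
second = s.sample(n=2, random_state=2)

# axis=1 samples columns instead of rows.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
one_column = df.sample(n=1, axis=1, random_state=0)

print(sorted(without.tolist()))
print(first.equals(second))
print(one_column.shape)
```

The seed values are arbitrary; any fixed integer should give repeatable draws.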

doc/source/whatsnew/v0.16.1.txt

+11-10
@@ -20,7 +20,6 @@ Highlights include:

 Enhancements
 ~~~~~~~~~~~~
-.. _whatsnew_0161.enhancements.sample:

 - Added ``StringMethods.capitalize()`` and ``swapcase`` which behave as the same as standard ``str`` (:issue:`9766`)
 - Added ``StringMethods`` (.str accessor) to ``Index`` (:issue:`9068`)
@@ -136,10 +135,12 @@ values NOT in the categories, similarly to how you can reindex ANY pandas index.

 See the :ref:`documentation <advanced.categoricalindex>` for more. (:issue:`7629`)

+.. _whatsnew_0161.enhancements.sample:
+
 Sample
 ^^^^^^^^^^^^^^^^

-Series, DataFrames, and Panels now have a new method: :meth:`~pandas.core.sample`.
+Series, DataFrames, and Panels now have a new method: :meth:`~pandas.DataFrame.sample`.
 The method accepts a specific number of rows or columns to return, or a fraction of the
 total number or rows or columns. It also has options for sampling with or without replacement,
 for passing in a column for weights for non-uniform sampling, and for setting seed values to facilitate replication.
@@ -148,32 +149,32 @@ for passing in a column for weights for non-uniform sampling, and for setting se

 example_series = Series([0,1,2,3,4,5])

-# When no arguments are passed, returns 5 rows like .head() or .tail()
+# When no arguments are passed, returns 1
 example_series.sample()

 # One may specify either a number of rows:
-example_series.sample(n = 3)
+example_series.sample(n=3)

 # Or a fraction of the rows:
-example_series.sample(frac = 0.5)
+example_series.sample(frac=0.5)

 # weights are accepted.
 example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
-example_series.sample(n=3, weights = example_weights)
+example_series.sample(n=3, weights=example_weights)

 # weights will also be normalized if they do not sum to one,
 # and missing values will be treated as zeros.
 example_weights2 = [0.5, 0, 0, 0, None, np.nan]
-example_series.sample(n=1, weights = example_weights2)
+example_series.sample(n=1, weights=example_weights2)


-When applied to a DataFrame, one may pass the name of a column to specify sampling weights,
-although note that the value of the weights column must sum to one.
+When applied to a DataFrame, one may pass the name of a column to specify sampling weights
+when sampling from rows.

 .. ipython :: python

 df = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
-df.sample(n = 3, weights = 'weight_column')
+df.sample(n=3, weights='weight_column')

 .. _whatsnew_0161.api:
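
Reviewer sketch (not part of the commit): the release note above says that weights are re-normalized, that missing weights count as zero, and that a DataFrame column can be named as the weights when sampling rows. The snippet below is one way to exercise those statements. It assumes a pandas release where ``sample`` is available and uses the modern ``pd.``/``np.`` prefixes; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# Only the first weight is positive. None and NaN are treated as zero and
# the remaining weight is rescaled to sum to one, so row 0 is always drawn.
weights = [0.5, 0, 0, 0, None, np.nan]
drawn = s.sample(n=1, weights=weights)

df = pd.DataFrame({"col1": [9, 8, 7, 6],
                   "weight_column": [0.5, 0.4, 0.1, 0]})

# The last row has weight zero, so a three-row sample without replacement
# can only contain the first three rows.
rows = df.sample(n=3, weights="weight_column", random_state=0)

print(drawn.index.tolist())
print(sorted(rows.index.tolist()))
```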

pandas/core/generic.py

+15-14
@@ -1959,13 +1959,14 @@ def sample(self, n=None, frac=None, replace=False, weights=None, random_state=No
             Number of rows to return. Cannot be used with `frac`.
             Default = 1 if `frac` = None.
         frac : float, optional
-            Share of rows to return. Cannot be used with `n`.
+            Fraction of rows to return. Cannot be used with `n`.
         replace : boolean, optional
             Sample with or without replacement. Default = False.
         weights : str or ndarray-like, optional
             Default 'None' results in equal probability weighting.
-            If called on a DataFrame or Panel, will also accept the name of a
-            column as a string. Must be same length as index.
+            If called on a DataFrame, will accept the name of a column
+            when axis = 0.
+            Weights must be same length as axis being sampled.
             If weights do not sum to 1, they will be normalized to sum to 1.
             Missing values in the weights column will be treated as zero.
             inf and -inf values not allowed.
@@ -2003,17 +2004,18 @@ def sample(self, n=None, frac=None, replace=False, weights=None, random_state=No
         # Check weights for compliance
         if weights is not None:

-            # Strings acceptable if not a series
+            # Strings acceptable if a dataframe and axis = 0
             if isinstance(weights, string_types):
-
-                if self.ndim > 1 :
-                    try:
-                        weights = self[weights]
-                    except KeyError:
-                        raise KeyError("String passed to weights not a valid column name")
-
+                if isinstance(self, pd.DataFrame):
+                    if axis == 0:
+                        try:
+                            weights = self[weights]
+                        except KeyError:
+                            raise KeyError("String passed to weights not a valid column")
+                    else:
+                        raise ValueError("Strings can only be passed to weights when sampling from rows on a DataFrame")
                 else:
-                    raise ValueError("Strings cannot be passed as weights when sampling from a Series.")
+                    raise ValueError("Strings cannot be passed as weights when sampling from a Series or Panel.")

             #normalize format of weights to ndarray.
             weights = pd.Series(weights, dtype = 'float64')
@@ -2022,8 +2024,7 @@ def sample(self, n=None, frac=None, replace=False, weights=None, random_state=No
             if len(weights) != axis_length:
                 raise ValueError("Weights and axis to be sampled must be of same length")

-            # No infs allowed. The np.nan_to_num() command below would make these large values
-            # which is pretty unintuitive.
+            # No infs allowed.
             if (weights == np.inf).any() or (weights == -np.inf).any():
                 raise ValueError("weight vector may not include `inf` values")
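
Reviewer sketch (not part of the commit): the docstring and checks above describe how a weight vector is validated and normalized. The standalone function below restates that logic in plain Python to make the order of checks easier to follow. ``normalize_weights`` is a hypothetical helper, not a pandas function, and the zero-sum check is an assumption that does not appear in this diff.

```python
import math


def normalize_weights(weights, axis_length):
    """Hypothetical helper mirroring the checks described in the docstring."""
    if len(weights) != axis_length:
        raise ValueError("Weights and axis to be sampled must be of same length")

    # Missing values (None or NaN) are treated as zero.
    cleaned = [
        0.0 if w is None or math.isnan(w) else float(w)
        for w in weights
    ]

    # inf and -inf are rejected rather than silently converted.
    if any(math.isinf(w) for w in cleaned):
        raise ValueError("weight vector may not include `inf` values")

    total = sum(cleaned)
    if total == 0:
        # Assumption: this guard is not shown in the diff.
        raise ValueError("weights sum to zero")

    # Rescale so the weights sum to one.
    return [w / total for w in cleaned]


print(normalize_weights([0.5, 0, 0, 0, None, float("nan")], 6))
```

The real implementation works on a float64 Series rather than a list, so this only illustrates the rules, not the code path.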

pandas/tests/test_generic.py

+15-16
@@ -431,10 +431,6 @@ def test_sample(self):
             weights_with_ninf[0] = -np.inf
             o.sample(n=3, weights=weights_with_ninf)

-        # Ensure proper error if string given as weight for Series
-        s = Series(range(10))
-        with tm.assertRaises(ValueError):
-            s.sample(n=3, weights='weight_column')

         # A few dataframe test with degenerate weights.
         easy_weight_list = [0]*10
@@ -447,29 +443,32 @@ def test_sample(self):
         sample1 = df.sample(n=1, weights='easyweights')
         assert_frame_equal(sample1, df.iloc[5:6])

+        # Ensure proper error if string given as weight for Series, panel, or
+        # DataFrame with axis = 1.
+        s = Series(range(10))
+        with tm.assertRaises(ValueError):
+            s.sample(n=3, weights='weight_column')
+
+        panel = pd.Panel(items = [0,1,2], major_axis = [2,3,4], minor_axis = [3,4,5])
+        with tm.assertRaises(ValueError):
+            panel.sample(n=1, weights='weight_column')
+
+        with tm.assertRaises(ValueError):
+            df.sample(n=1, weights='weight_column', axis = 1)
+
         # Check weighting key error
         with tm.assertRaises(KeyError):
             df.sample(n=3, weights='not_a_real_column_name')

         # Check np.nan are replaced by zeros.
         weights_with_nan = [np.nan]*10
         weights_with_nan[5] = 0.5
-
-        sampled_df = df.sample(n=1, weights = weights_with_nan)
-        tm.assert_frame_equal(sampled_df, df.iloc[5:6])
-
-        sampled_s = s.sample(n=1, weights = weights_with_nan)
-        tm.assert_series_equal(sampled_s, s.iloc[5:6])
+        self._compare(o.sample(n=1, weights=weights_with_nan), o.iloc[5:6])

         # Check None are also replaced by zeros.
         weights_with_None = [None]*10
         weights_with_None[5] = 0.5
-
-        sampled_df2 = df.sample(n=1, weights = weights_with_None)
-        tm.assert_frame_equal(sampled_df2, df.iloc[5:6])
-
-        sampled_s2 = s.sample(n=1, weights = weights_with_None)
-        tm.assert_series_equal(sampled_s2, s.iloc[5:6])
+        self._compare(o.sample(n=1, weights=weights_with_None), o.iloc[5:6])

         # Check that re-normalizes weights that don't sum to one.
         weights_less_than_1 = [0]*10
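
Reviewer sketch (not part of the commit): the new tests assert which exception each misuse of string weights raises. The same expectations can be restated outside the test harness roughly as below. This assumes a recent pandas release; ``Panel`` was removed from pandas in later versions, so that case is omitted. ``error_name`` is a hypothetical helper, and the frame contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": range(10),
                   "easyweights": [0] * 5 + [1] + [0] * 4})
s = pd.Series(range(10))


def error_name(call):
    """Return the name of the exception raised by call(), or None."""
    try:
        call()
    except Exception as exc:
        return type(exc).__name__
    return None


# A string is not accepted as weights on a Series.
series_error = error_name(lambda: s.sample(n=3, weights="weight_column"))

# A column name is only accepted when sampling rows (axis=0).
axis_error = error_name(lambda: df.sample(n=1, weights="easyweights", axis=1))

# A string that is not a column name raises KeyError.
missing_error = error_name(lambda: df.sample(n=3, weights="not_a_real_column_name"))

# With all weight on row 5, that row is always the one drawn.
sample1 = df.sample(n=1, weights="easyweights")

print(series_error, axis_error, missing_error, sample1.index.tolist())
```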
