New function to sample from frames (Issue #2419 ) #9666

nickeubank · 2015-03-16T21:05:39Z

closes #2419
Creates a function .sample() in generic.py to sample from pandas objects. Returns either a fixed number of rows or a share of rows. Also includes a number of tests in test_sample() -- I am open to additional suggestions.

Input from users with panel experience appreciated!

nickeubank · 2015-03-16T21:10:34Z

Please excuse the many (meaningless) commits -- was learning to use git from the command line.

shoyer · 2015-03-16T21:14:08Z

pandas/core/generic.py

+
+        Parameters
+        ----------
+            n: Number of rows to return. Cannot be used with frac.


please take another look at the numpy docstring standard:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

In particular, you are indenting too much here and mixing up the lines for types/descriptions.

shoyer · 2015-03-16T21:28:56Z

This will also need documentation updates:

An addition to "what's new" for v0.16.1 (not sure if this file has been created yet; best to wait until 0.16 is released)
New documentation in an appropriate section of the docs
Addition to the API docs (see api.rst).

nickeubank · 2015-03-16T21:54:55Z

Thanks @shoyer. Adding changes and testing.

shoyer · 2015-03-16T22:58:42Z

pandas/core/generic.py

+
+
+        # Check whether frac or N
+        if n == None and frac == None:


again, you want to be using is None here instead of == None

ah right, sorry! Updating and testing before pushing.
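
A minimal sketch of why `is None` is the safer check. `==` dispatches to an object's `__eq__`, which any class can override (array-like arguments such as weights typically compare elementwise), while `is` tests identity and cannot be overridden. The `AlwaysEqual` class is hypothetical and exists only for illustration.

```python
class AlwaysEqual:
    """Hypothetical object whose __eq__ claims equality with anything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

# Equality goes through the overridden __eq__, so this is True
# even though obj is clearly not None.
equal_to_none = (obj == None)  # noqa: E711

# Identity cannot be overridden, so this is False as expected.
is_none = (obj is None)
```

With the `== None` form, a check such as `if n == None and frac == None` depends on how the argument's type defines equality; the `is None` form does not.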

nickeubank · 2015-03-17T22:35:34Z

Doc additions soon -- wanted to pin down the function first, but may need to do some "real job" things before finishing.

shoyer · 2015-03-17T22:38:49Z

pandas/core/generic.py

+
+        Parameters
+        ----------
+        n: in, optional, Default = 5 if frac = None. 


this should be int not in.

Also, the numpy docstring convention is usually not to list default values after the type. Rather, they should be described in the description part below.
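
For reference, a sketch of a docstring laid out as the numpy standard describes: `name : type` on one line, the description indented on the following lines, and defaults stated in the description rather than after the type. The helper `sample_size` and its `length` argument are hypothetical, not code from this PR; the default of 5 only mirrors the docstring line under review.

```python
def sample_size(length, n=None, frac=None):
    """
    Return the number of items to draw from an axis.

    Parameters
    ----------
    length : int
        Number of items on the axis being sampled.
    n : int, optional
        Number of items to return. Cannot be used with `frac`.
        Defaults to 5 if `frac` is None.
    frac : float, optional
        Fraction of items to return. Cannot be used with `n`.

    Returns
    -------
    int
        Number of items to draw.
    """
    if n is not None and frac is not None:
        raise ValueError("Please enter a value for `frac` OR `n`, not both")
    if n is None and frac is None:
        return 5
    if frac is not None:
        return int(round(frac * length))
    return n
```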

jorisvandenbossche · 2015-04-25T06:52:26Z

pandas/core/generic.py

+
+                if self.ndim > 1 :
+                    try:
+                        weights = self[weights]


this should access the 'weights' column in the correct axis. Eg if you sample a dataframe's columns (axis=1), should it then be possible to give a row index name?
In any case, now this can trigger the "Weights and axis to be sampled must be of same length" error when the dataframe is not square (and would give faulty results if it were square)

jorisvandenbossche · 2015-04-25T07:01:43Z

Some other comments / questions:

When passing a series as weights, should the values of the series be aligned on the index? (this is not the case at the moment)
Should the default n=1 case for DataFrames return a 1-row DataFrame, or that row as a Series (and for sampling a Series -> return 1-element Series or a scalar)? As a comparison np.random.choice reduces the dimensionality for n=1, but in the case of pandas you then lose the info of the index. So maybe keep it as is (DataFrame -> 1-row DataFrame)?
I know nothing about Panels (never used it), but should the default axis be 0? (for DataFrames the info axis is 1, so by default it samples from axis 0. But for panels the info axis is 0). Are there similar functions to look at to do this consistently?
The weights don't work for Panels at the moment (and I also don't see how it could work?). But if this is by design, it should get an appropriate error message, testing and docs (the docstring now says: "If called on a DataFrame or Panel, will also accept the name of a column as a string" but this does not work)
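
A short sketch of the return-type question, using `DataFrame.sample` and `Series.sample` as they exist in released pandas (the PR's intermediate revisions may have behaved differently). The column names `value` and `w` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3], "w": [0.0, 0.0, 1.0]})

# n=1 and n=2 both return a DataFrame; the index label is preserved.
one_row = df.sample(n=1, random_state=0)
two_rows = df.sample(n=2, random_state=0)

# Sampling a Series with n=1 returns a 1-element Series, not a scalar.
one_elem = df["value"].sample(n=1, random_state=0)

# Weights given as a column name when sampling rows of a DataFrame.
# Only the last row has non-zero weight, so it is always the one drawn.
weighted = df.sample(n=1, weights="w", random_state=0)
```

Keeping the container type stable means downstream code does not need a special case for single-row samples.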

nickeubank · 2015-04-25T22:53:20Z

Thanks @jorisvandenbossche!

"should one row of df be returned as df or series?" I really dislike the idea that different arguments (n=1 versus n=2) return different types of objects, so I strongly prefer it returning a one-row DataFrame if that's the item being sampled.
I think it makes sense to just allow for strings to be passed as weights for DataFrames when sampling from rows. I started writing it in a more general form to allow for rows as weights, but it gets very hairy fast, especially because rows may be of mixed types. Since sampling from rows is likely the most common use case, and one can just explicitly pass a row as a weight if needed, I think this is sufficient. Changing documentation accordingly, and tests added for appropriate errors for panels.
I also know nothing about panels -- I was assuming that axis 0 and 1 are same as DataFrame and axis=2 is just time, but sounds like that's incorrect? Input by someone who uses panels would be appreciated.
the weights = weights.values line does negate much of the value of using Series, but I've found that np.random.choice() is not consistent in its ability to work with Series as weights. When I tried passing a Series as weights, I found that it worked on my computer, but 3 of the 5 Travis builds kept failing. The weights = weights.values is the only way I could fix the problem.
should the values of series be aligned: given np.random.choice() won't take a Series, how would I go about this? If you can suggest a way to do this, I'd be happy to do so.
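
One possible way to align Series weights on the index before handing a plain array to `np.random.choice`, sketched under assumptions: the data and names are illustrative, missing labels are treated as weight 0, and this is not necessarily what the PR ended up doing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])

# Weights arrive in a different order and have no entry for "y".
weights = pd.Series({"z": 3.0, "x": 1.0})

# Align on the index being sampled; labels without a weight get 0.
aligned = weights.reindex(df.index).fillna(0.0)

# np.random.choice wants an array of probabilities that sums to 1.
probs = aligned.values / aligned.values.sum()

rng = np.random.RandomState(0)
positions = rng.choice(len(df), size=2, replace=False, p=probs)
sampled = df.iloc[positions]
```

The alignment happens while the weights are still a Series, so the later conversion to `.values` no longer loses the label information.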

Small things fixed:

tabs were causing issues with format. All spaces now.
PEP-8 in function signatures fixed. Sorry about that.

sinhrks · 2015-04-26T01:31:04Z

doc/source/api.rst

@@ -713,6 +714,7 @@ Indexing, iteration
   DataFrame.where
   DataFrame.mask
   DataFrame.query
+   DataFrame.sample


I understand the duplication is intentional, but only adding to "... Selection ... " section may be better? I understand the function doesn't do indexing or iteration.

And pls add Panel.sample.

jorisvandenbossche · 2015-04-27T21:40:41Z

doc/source/indexing.rst

+------------------------
+.. versionadded::0.16.1
+
+A random selection of rows or columns from a Series, DataFrame, or Panel with the ``.sample()`` method. The method will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows. 


can you replace ``.sample()`` with :meth:`~DataFrame.sample` (or :meth:`Series.sample`)? The result will be almost the same, but it will also be a link to the api docstring page

jorisvandenbossche · 2015-04-27T21:48:38Z

@jreback could you shed some light on the panel issues?

@nickeubank

agree on the n=1 and n>1 giving the same output type
OK for me on limiting specifying weights as a string for now to DataFrame columns
on the weights = weights.values thing, it just seems to me that if you want to end up with an array, the initial weights = pd.Series(weights, dtype = 'float64') is not needed? EDIT ah, I see you use fillna, for which you need it as a series

jreback · 2015-04-28T10:37:44Z

doc/source/whatsnew/v0.16.1.txt

+
+Sample
+^^^^^^^^^^^^^^^^
+


the underline needs to be exactly the length of the title text

jreback · 2015-04-28T10:49:58Z

some minor code comments.

if an axis is not provided, it is usual to use axis = self._stat_axis, this will yield:
Series->0, DataFrame->0, Panel->1. This type of code is used all over generic.py

pls squash.
otherwise lgtm. once you have made changes, ping when green.

nickeubank · 2015-04-30T03:27:38Z

@jreback : I think we're good to go.

Added axis = self._stat_axis and associated tests; added is_integer() to _random_state(); thinned comments; fixed format in "what's new"; updated docs on only supporting column names for dataframes.

@jreback @shoyer @jorisvandenbossche @TomAugspurger @sinhrks : thank you all for your input on this -- it's been a great learning experience, and hopefully a useful new feature!

shoyer · 2015-04-30T05:04:57Z

pandas/core/generic.py

+                    raise ValueError("Strings cannot be passed as weights when sampling from a Series or Panel.")
+
+            #normalize format of weights to Series. 
+            weights = pd.Series(weights, dtype = 'float64')


PEP8 would be weights = pd.Series(weights, dtype='float64')

jorisvandenbossche · 2015-05-01T10:47:51Z

@nickeubank Looks good to me!
Certainly a very useful feature. We have been commenting a lot here (but I think that is good for delivering good quality, which this is!)

sinhrks · 2015-05-01T11:38:28Z

Thanks for the change, great job :)

jreback · 2015-05-01T12:04:53Z

merged via 8f0f417

@nickeubank thanks for the effort and for responding to comments! all of this just makes pandas better!

shoyer · 2015-05-01T17:39:18Z

woohoo, thanks!

nickeubank changed the title ~~New function to sample from frames~~ New function to sample from frames (Issue 2419 ) Mar 16, 2015

nickeubank changed the title ~~New function to sample from frames (Issue 2419 )~~ New function to sample from frames (Issue #2419 ) Mar 16, 2015

nickeubank mentioned this pull request Mar 16, 2015

Series/DataFrame sample method with/without replacement #2419

Closed

jreback added Enhancement Stats labels Mar 17, 2015

jreback added this to the 0.16.1 milestone Mar 17, 2015

nickeubank force-pushed the sample_func branch from 2a9c91d to ff4f442 Compare April 25, 2015 22:57

nickeubank force-pushed the sample_func branch 2 times, most recently from dcaa84b to 87ac130 Compare April 27, 2015 03:54

nickeubank force-pushed the sample_func branch 2 times, most recently from f49489d to 9689dd1 Compare April 29, 2015 18:09

Add sample function with tests and docs

b08c1e0

nickeubank force-pushed the sample_func branch from 9689dd1 to b08c1e0 Compare April 30, 2015 15:40

jorisvandenbossche closed this May 1, 2015

shoyer mentioned this pull request May 3, 2015

weighted mean #10030

Open
