Skip to content

New function to sample from frames (Issue #2419 ) #9666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

nickeubank
Copy link
Contributor

closes #2419
Creates a function .sample() in generic.py to sample from pandas objects. Returns either a fixed number of rows or a share. Also includes a number of tests in test_sample() -- I am open to added suggestions.

Input from users with panel experience appreciated!

@nickeubank nickeubank changed the title New function to sample from frames New function to sample from frames (Issue 2419 ) Mar 16, 2015
@nickeubank nickeubank changed the title New function to sample from frames (Issue 2419 ) New function to sample from frames (Issue #2419 ) Mar 16, 2015
@nickeubank
Copy link
Contributor Author

Please excuse the many (meaningless) commits -- was learning to use git from commandline.


Parameters
----------
n: Number of rows to return. Cannot be used with frac.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please take another look at the numpy docstring standard:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

In particular, you are indenting too much here and mixing up the lines for types/descriptions.

@shoyer
Copy link
Member

shoyer commented Mar 16, 2015

This will also need documentation updates:

  • An addition to "what's new" for v0.16.1 (not sure if this file has been created yet; best to wait until 0.16 is released)
  • New documentation in an appropriate section of the docs
  • Addition to the API docs (see api.rst).

@nickeubank
Copy link
Contributor Author

Thanks @shoyer. Adding changes and and testing.



# Check whether frac or N
if n == None and frac == None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, you want to be using is None here instead of == None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right, sorry! Updating and testing before pushing.

@jreback jreback added this to the 0.16.1 milestone Mar 17, 2015
@nickeubank
Copy link
Contributor Author

Doc additions soon -- wanted to pin down the function first, but may need to do some "real job" things before finishing.


Parameters
----------
n: in, optional, Default = 5 if frac = None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be int not in.

Also, the numpy docstring convention is usually not to list default values after the type. Rather, they should be described in the description part below.


if self.ndim > 1 :
try:
weights = self[weights]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should access the 'weights' column in the correct axis. Eg if you sample a dataframe it's columns (axis=1), should it then be possible to give a row index name?
In any case, now this can trigger the "Weights and axis to be sampled must be of same length" error when the dataframe is not square (and would give faulty results if it would be square)

@jorisvandenbossche
Copy link
Member

Some other comments / questions:

  • When passing a series as weights, should the values of the series be aligned on the index? (this is not the case at the moment)
  • Should the default n=1 case for DataFrames return a 1-row DataFrame, or that row as a Series (and for sampling a Series -> return 1-element Series or a scalar)? As a comparison np.random.choice reduces the dimensionality for n=1, but in the case of pandas you then loose the info of the index. So maybe keep it as is (DataFrame -> 1-row DataFrame)?
  • I know nothing about Panels (never used it), but should the default axis be 0? (for DataFrame's the info axis is 1, so by default samples from axis 0. But for panels the info axis is 0). Are there similar functions to look at to do this consistently?
  • The weights don't work for Panels at the moment (and I also don't see how it could work?). But if this is by design, it should get an appropriate error message, testing and docs (the docstring now says: "If called on a DataFrame or Panel, will also accept the name of a column as a string" but this does not work)

@nickeubank
Copy link
Contributor Author

Thanks @jorvisvandenbossche!

  • "should one row of df be returned as df or series?" I really dislike the idea that different arguments (n=1 versus n=2) return different types of objects, so I strongly prefer it returning a one row DataFrame if that's the item being sampled.
  • I think it makes sense to just allow for strings to be passed as weights for DataFrames when sampling from rows. I starting writing it in a more general form to allow for rows as weights, but it gets very hairy fast, especially because rows may be of mixed types. Since sampling from rows is likely the most common use case, and one can just explicitly pass a row as a weight if needed, I think this is sufficient. Changing documentation accordingly, and tests added for appropriate errors added for panels.
  • I also know nothing about panels -- I was assuming that axis 0 and 1 are same as DataFrame and axis=2 is just time, but sounds like that's incorrect? Input by someone who uses panels would be appreciated.
  • the weights = weights.values line does negate much of the value of using Series, but I've found that np.choice() is not consistent in its ability to work with Series as weights. When I tried passing a Series as weights, I found that it worked on my computer, but 3 of the 5 Travis builds kept failing. The weights = weights.values is the only way I could fix the problem.
    • should the values of series be aligned: given np.choice won't take a Series, how would I go about this? If you can suggest a way to do this, I'd be happy to do so.

Small things fixed:

  • tabs were causing issues with format. All spaces now.
  • PEP-8 in function signatures fixed. Sorry about that.

@@ -713,6 +714,7 @@ Indexing, iteration
DataFrame.where
DataFrame.mask
DataFrame.query
DataFrame.sample
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the duplication is intentional, but only adding to "... Selection ... " section may be better? I understand the function doesn't do indexing or iteration.

And pls add Panel.sample.

@nickeubank nickeubank force-pushed the sample_func branch 2 times, most recently from dcaa84b to 87ac130 Compare April 27, 2015 03:54
------------------------
.. versionadded::0.16.1

A random selection of rows or columns from a Series, DataFrame, or Panel with the ``.sample()`` method. The method will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you replace .sample() with :meth:~DataFrame.sample (of `:meth:`Series.sample), the result will be almost the same, but also be a link to the api docstring page

@jorisvandenbossche
Copy link
Member

@jreback could you shed some light on the panel issues?

@nickeubank

  • agree on the n=1 and n>1 giving the same output type
  • OK for me on limiting specifying weights as a string for now to DataFrame columns
  • on the weights = weights.values thing, it just seems to me that if you want to end up with an array, the initial weights = pd.Series(weights, dtype = 'float64') is not needed? EDIT ah, I see you use fillna, for which you need it as a series


Sample
^^^^^^^^^^^^^^^^

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be exactly length of the text

@jreback
Copy link
Contributor

jreback commented Apr 28, 2015

some minor code comments.

if an axis is not provided, it is usual to use axis = self._stat_axis, this will yield:
Series->0, DataFrame->0, Panel->1. This type of code is used all over generic.py

pls squash.
otherwise lgtm. once you have made changes, ping when green.

@nickeubank nickeubank force-pushed the sample_func branch 2 times, most recently from f49489d to 9689dd1 Compare April 29, 2015 18:09
@nickeubank
Copy link
Contributor Author

@jreback : I think we're good to go.

Added axis = self._stat_axis and associated tests; added is_integer() to _random_state(); thinned comments; fixed format in "what's new"; updated docs on only supporting column names for dataframes.

@jreback @shoyer @jorisvandenbossche @TomAugspurger @sinhrks : thank you all for your input on this -- it's been a great learning experience, and hopefully a useful new feature!

raise ValueError("Strings cannot be passed as weights when sampling from a Series or Panel.")

#normalize format of weights to Series.
weights = pd.Series(weights, dtype = 'float64')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 would be weights = pd.Series(weights, dtype='float64')

@jorisvandenbossche
Copy link
Member

@nickeubank Looks good to me!
Certainly a very usefull feature. We have been very 'commenting' here (but I think that is good, to deliver good quality, which it is!)

@sinhrks
Copy link
Member

sinhrks commented May 1, 2015

Thanks for the change, great job :)

@jreback
Copy link
Contributor

jreback commented May 1, 2015

merged via 8f0f417

@nickeubank thanks for effort! and responding to comments. all of this just makes pandas better!

@shoyer
Copy link
Member

shoyer commented May 1, 2015

woohoo, thanks!

@shoyer shoyer mentioned this pull request May 3, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series/DataFrame sample method with/without replacement
7 participants