-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New function to sample from frames (Issue #2419 ) #9666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please excuse the many (meaningless) commits -- was learning to use git from commandline. |
|
||
Parameters | ||
---------- | ||
n: Number of rows to return. Cannot be used with frac. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please take another look at the numpy docstring standard:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
In particular, you are indenting too much here and mixing up the lines for types/descriptions.
This will also need documentation updates:
|
Thanks @shoyer. Adding changes and and testing. |
|
||
|
||
# Check whether frac or N | ||
if n == None and frac == None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, you want to be using is None
here instead of == None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right, sorry! Updating and testing before pushing.
Doc additions soon -- wanted to pin down the function first, but may need to do some "real job" things before finishing. |
|
||
Parameters | ||
---------- | ||
n: in, optional, Default = 5 if frac = None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be int
not in
.
Also, the numpy docstring convention is usually not to list default values after the type. Rather, they should be described in the description part below.
|
||
if self.ndim > 1 : | ||
try: | ||
weights = self[weights] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should access the 'weights' column in the correct axis. Eg if you sample a dataframe it's columns (axis=1
), should it then be possible to give a row index name?
In any case, now this can trigger the "Weights and axis to be sampled must be of same length" error when the dataframe is not square (and would give faulty results if it would be square)
Some other comments / questions:
|
Thanks @jorvisvandenbossche!
Small things fixed:
|
@@ -713,6 +714,7 @@ Indexing, iteration | |||
DataFrame.where | |||
DataFrame.mask | |||
DataFrame.query | |||
DataFrame.sample |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the duplication is intentional, but only adding to "... Selection ... " section may be better? I understand the function doesn't do indexing or iteration.
And pls add Panel.sample
.
dcaa84b
to
87ac130
Compare
------------------------ | ||
.. versionadded::0.16.1 | ||
|
||
A random selection of rows or columns from a Series, DataFrame, or Panel with the ``.sample()`` method. The method will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you replace .sample()
with :meth:
~DataFrame.sample (of `:meth:`Series.sample
), the result will be almost the same, but also be a link to the api docstring page
@jreback could you shed some light on the panel issues?
|
|
||
Sample | ||
^^^^^^^^^^^^^^^^ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs to be exactly length of the text
some minor code comments. if an axis is not provided, it is usual to use pls squash. |
f49489d
to
9689dd1
Compare
@jreback : I think we're good to go. Added axis = self._stat_axis and associated tests; added is_integer() to _random_state(); thinned comments; fixed format in "what's new"; updated docs on only supporting column names for dataframes. @jreback @shoyer @jorisvandenbossche @TomAugspurger @sinhrks : thank you all for your input on this -- it's been a great learning experience, and hopefully a useful new feature! |
raise ValueError("Strings cannot be passed as weights when sampling from a Series or Panel.") | ||
|
||
#normalize format of weights to Series. | ||
weights = pd.Series(weights, dtype = 'float64') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8 would be weights = pd.Series(weights, dtype='float64')
@nickeubank Looks good to me! |
Thanks for the change, great job :) |
merged via 8f0f417 @nickeubank thanks for effort! and responding to comments. all of this just makes pandas better! |
woohoo, thanks! |
closes #2419
Creates a function
.sample()
in generic.py to sample from pandas objects. Returns either a fixed number of rows or a share. Also includes a number of tests intest_sample()
-- I am open to added suggestions.Input from users with panel experience appreciated!