added random_split in generic.py, for DataFrames etc. #11253
A similar earlier PR, #6687, was closed as out of scope for pandas. @lukovnikov, you can go through the discussion there and see whether you have arguments for including this; we can always reconsider.
@jorisvandenbossche this is more general than a train/test split: you can make as many random splits as you want (something you also mentioned in #6687). Although indexing yourself or using groupby as in #6687 gets you the same result, I find this method much easier. It's a no-brainer for anyone who doesn't want to depend on sklearn or other libraries for simple splitting, or to work it out themselves. The .sample() method is likewise a no-brainer for selecting random samples; users could have done that themselves too, so by the same argument it could also have been excluded. I can also make a general .split() method with a randomness option but the same notion of relative split sizes (without randomness, the splits are simply taken in order from the beginning).
Edit: I'm not aware of what statsmodels provides; maybe they also have something. In any case, I think an optionally random .split() in pandas would still be worthwhile, even if it only avoids extra dependencies.
Edit: generalized it to an optionally random .split() method.
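To make the idea concrete, here is a minimal sketch of an optionally random split by relative weights. It is not the code from this PR: the helper name split_by_weights and its parameters (weights, random, seed) are hypothetical, and it only handles splitting along the rows.

```python
import numpy as np
import pandas as pd


def split_by_weights(df, weights=(1, 1), random=False, seed=None):
    """Hypothetical helper (not the PR's code): split rows by relative weights."""
    weights = np.asarray(weights, dtype=float)
    n = len(df)
    # Row boundaries proportional to the cumulative weights; the last
    # boundary (== n) is dropped because np.split adds the tail itself.
    bounds = np.floor(np.cumsum(weights) * n / weights.sum()).astype(int)[:-1]
    positions = np.arange(n)
    if random:
        # Shuffle the row positions; without this, splits are taken in order.
        positions = np.random.default_rng(seed).permutation(n)
    return [df.iloc[idx] for idx in np.split(positions, bounds)]


df = pd.DataFrame({"x": range(10)})
train, test = split_by_weights(df, weights=(7, 3), random=True, seed=0)
first, second = split_by_weights(df, weights=(1, 1))  # ordered halves
```

With random=False the pieces are contiguous blocks taken from the beginning, which matches the non-random behaviour described in the comment.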
From an API design perspective, I think it's much cleaner to have On the whole, I think this could be a reasonable thing to add to pandas as long as it's called If we do want to go forward with this, it's worth looking at the
#7387 is the master issue for splitting. Please read over the many use cases and features needed for general splitting/partitioning. Random splitting is a specialization of this. Further, this API needs to start out similar to np.split
could be a decent start.
@jreback yes, but np.split() is a function on np, not a method on an array, so here the first argument is already the pandas object we're calling it on. The second argument of np.split() is an int or a list of ints, specifying either the number of equal splits or the indices at which to split. In the generalized .split() method here, I aim to accept weights proportional to the desired relative sizes of the output splits. Accepting a list of indices seems a little superfluous to me: if you already have those indices, why not just use them? Still, they could be added as yet another option. I understand your concern about overcrowding the API, but I'm more with @shoyer on having a separate random_split() method. It's easier to find a dedicated function with a limited number of arguments/switches than to read through the long argument descriptions of over-argumented functions.
@shoyer at first it was random_split(), but in the second commit I made a split() out of it. In my opinion, a dedicated method is also better. We could make a collection of splitting methods:
OR, similar to what @jreback mentions for #7387 (if I understood that discussion correctly) we could have
In my opinion, the first option is easier for beginners and is a no-brainer (and thus better for wider adoption of pandas), while the second option seems like a cleaner implementation but is less of a no-brainer.
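For reference, the two forms of np.split()'s second argument discussed above, and how relative weights can be reduced to the index form, can be sketched as follows. The weights-to-indices conversion is an illustration, not something np.split() does itself.

```python
import numpy as np

a = np.arange(10)

# An int asks for that many equal sections (the length must divide evenly).
equal = np.split(a, 2)

# A list of ints gives the positions at which to cut.
at_idx = np.split(a, [3, 7])

# Relative weights can be turned into cut positions with a cumulative sum.
weights = np.array([3, 4, 3])
indices = (np.cumsum(weights) * len(a) // weights.sum())[:-1]
by_weight = np.split(a, indices)
```

The last step suggests that a weights-based API and an index-based API are interchangeable internally, so the choice between them is mostly about which is more convenient to call.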
This is the point of the buffer interface, so it is in fact useful; there is no need to repeat things except where pandas can add value. This is where a we already have this can accomplish all of the objectives with a minimum of keywords. Adding a plethora of methods is just plain confusing and bloats the API. This is a very similar method in nature to
@jreback do you mean a Edit: the
how can random splits be realized if So my question is: can we put a general
it becomes trivial actually (at least for random, not partitioned)
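The snippet originally posted here is not preserved. As an illustration only, and not the commenter's code, one way to get random (non-partitioned) groups with the existing groupby is to group on randomly drawn labels; the 0.7/0.3 probabilities and the seed are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Draw one label per row; group sizes are only approximately 70/30
# because each row is assigned independently.
rng = np.random.default_rng(0)
labels = rng.choice([0, 1], size=len(df), p=[0.7, 0.3])

# Grouping on an array of labels yields a GroupBy, not a list of frames.
parts = {key: group for key, group in df.groupby(labels)}
```

Note that the result of df.groupby(labels) is a GroupBy object, so the pieces have to be pulled out by iterating over it.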
I think though it does have a different return type (e.g.
and what I would do to support 'kinds' of splitting is make things like:
which are sub-classes of
yes, this seems like a cleaner implementation but maybe a
I could be convinced of that as well
…split() on a GroupBy still need to test
So, I hope this implementation is good (it would be helpful if someone could quickly review the code).
class OrderedGrouper(Grouper):

    def __init__(self, proportions=(1,1), axis=None):
        self._proportions = proportions
obviously add a doc-string :)
I am personally not a fan of adding extra objects like (apart from that, a
I agree with @jorisvandenbossche that I would rather not add extra objects to the API exclusively for use with groupby. I think a better approach is to create alternative methods (other than However, it is not clear to me why a groupby object is an intuitive result from a method that produces random partitions or splits. What is the actual use case here? Would you really want to write something like
That would be train, test = df.groupby(RandomPartition((0.7, 0.3))).split(). In resample, e.g. df.resample('24h') defaults to .mean() (if it was None); then this would work.
I agree that there's no easy way to change resample in a backwards compatible way. Perhaps
We could include both approaches (general What do you think?
still need to write tests
closing, but if you want to propose a nice API we can re-open |
Added a method random_split() for NDFrames, to split the frame into several frames along one axis.
Basic use: for train/test or train/validation/test splitting of a dataframe.
Note: I'm not sure this feature fits well in NDFrame. If you think this feature can be added, I'll add more tests and docs.
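For comparison, the train/test and train/validation/test use cases can already be covered with the existing DataFrame.sample() and DataFrame.drop() methods. The sketch below shows that workaround, not the proposed random_split(); the fractions and random_state values are arbitrary, and it assumes the index labels are unique.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Two-way split: draw 70% of the rows; the remainder is the test set.
train = df.sample(frac=0.7, random_state=0)
test = df.drop(train.index)  # relies on the index labels being unique

# Three-way split: carve a validation set out of the held-out rows.
train3 = df.sample(frac=0.6, random_state=0)
rest = df.drop(train3.index)
valid3 = rest.sample(frac=0.5, random_state=0)
test3 = rest.drop(valid3.index)
```

Each extra split needs another sample/drop pair, and the fractions after the first are relative to what is left, which is the bookkeeping a dedicated method would hide.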