added random_split in generic.py, for DataFrames etc. #11253


Closed · wants to merge 5 commits

Conversation

lukovnikov

Added a random_split() method for NDFrames, to split the Frame into several frames along one axis.

Basic use: for train/test or train/validation/test splitting of a dataframe.

Note: I'm not sure this feature fits well in NDFrame. If you think this feature can be added, I'll add more tests and docs.
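The proposed method was never merged, but its intended behavior can be sketched with existing pandas/NumPy operations. In this sketch the helper name, signature, and weighted-boundary logic are assumptions, not the PR's actual implementation:

```python
import numpy as np
import pandas as pd

def random_split(df, weights, random_state=None):
    """Hypothetical helper: randomly partition ``df`` along axis 0 into
    len(weights) frames whose sizes are proportional to ``weights``."""
    rng = np.random.RandomState(random_state)
    # Shuffle the positional index, then cut it at the weighted boundaries.
    order = rng.permutation(len(df))
    fractions = np.asarray(weights, dtype=float)
    fractions = fractions / fractions.sum()
    bounds = (np.cumsum(fractions)[:-1] * len(df)).round().astype(int)
    return [df.iloc[idx] for idx in np.split(order, bounds)]

df = pd.DataFrame({"a": range(10)})
train, test = random_split(df, [0.7, 0.3], random_state=0)
```

Every row lands in exactly one output frame, and the relative sizes follow the weights.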

@jorisvandenbossche
Member

A previous similar PR was closed as out of scope for pandas: #6687

@lukovnikov you can go through the discussion there, and see if you have arguments that this should be included, then we can always reconsider.

@lukovnikov
Copy link
Author

@jorisvandenbossche this is more general than a train/test split: you can make as many random splits as you want (and this is something you also mentioned for #6687).

And although indexing yourself or using groupby as in #6687 can achieve the same, I find having this method much easier: it's a no-brainer that seems worthwhile if you don't want to depend on sklearn or other libraries for simple splitting, or to think about it yourself. The .sample() method is likewise a no-brainer for selecting random samples that users could have written themselves, and by that logic it could also have been excluded.

I can also make a general .split() method, with a randomness option but with the same notion of relative sizes of the splits (which, without randomness, are just taken from the beginning).

Edit: I'm not aware what statsmodels provides, maybe they also have something. Anyway, in my opinion an optionally random .split() in pandas would still be interesting, even if it only prevents other dependencies.

Edit: generalized it to an optionally random .split() method.
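The non-random variant described here, with the same relative-size semantics but pieces simply taken from the beginning, could look like this (a hypothetical sketch, not the PR's code):

```python
import numpy as np
import pandas as pd

def split(df, weights):
    """Hypothetical sketch: cut df along axis 0 into consecutive pieces
    whose sizes are proportional to ``weights`` (no shuffling)."""
    w = np.asarray(weights, dtype=float)
    # Cumulative fractions give the cut points in row coordinates.
    bounds = (np.cumsum(w / w.sum())[:-1] * len(df)).round().astype(int)
    return [df.iloc[a:b] for a, b in zip([0, *bounds], [*bounds, len(df)])]

df = pd.DataFrame({"a": range(10)})
first, second, third = split(df, [2, 3, 5])
```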

@lukovnikov lukovnikov closed this Oct 6, 2015
@lukovnikov lukovnikov reopened this Oct 6, 2015
@shoyer
Member

shoyer commented Oct 7, 2015

From an API design perspective, I think it's much cleaner to have .random_split() rather than a completely general function that lets flags change every bit of its behavior. Pandas has too many of those already.

On the whole, I think this could be a reasonable thing to add to pandas as long as it's called random_split and not the hopelessly ambiguous split. Pandas shouldn't have detailed stats or machine learning routines, but randomized splits really are data science 101.

If we do want to go forward with this, it's worth looking at the DataFrame.random_split() method that was added to dask.dataframe to verify the consistency of the API: dask/dask#276

@jreback
Contributor

jreback commented Oct 7, 2015

#7387 is the master issue for splitting. Please have a read over the many use cases / features that are needed for general splitting/partitioning.

Random splitting is a specialization of this .split and should not be a separate function, merely an argument here. You already have a number of arguments; I don't find one more to be an issue. It is more confusing IMHO to have to look for random (pun intended) functions.

Further, this API should start out similarly to np.split:

def split(self, n=None, weights=None, random=None, axis=0)

could be a decent start.

@lukovnikov
Author

@jreback yeah, but np.split() is a function on np, not on an array, so the first argument is already the pandas object we're calling it on. The second argument of np.split() is an int or a list of ints, which specifies the number of equal splits or the indices along which to split. In the generalized .split() method here, I aim to accept weights, which are proportional to the desired relative sizes of the output splits. Accepting a list of indices as an argument is, in my opinion, a little superfluous: if you already have those indices, why don't you just use them? Anyway, they could be added as yet another option.
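For reference, the np.split() semantics mentioned here look like this:

```python
import numpy as np

arr = np.arange(9)

# An int asks for that many equal parts.
equal = np.split(arr, 3)        # [0,1,2], [3,4,5], [6,7,8]

# A list of ints gives the indices at which to cut.
at_idx = np.split(arr, [2, 5])  # [0,1], [2,3,4], [5,6,7,8]
```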

I understand your concern about not overcrowding the API, but I'm more with @shoyer on the issue of a separate random_split() method. It's easier to find a dedicated function with a limited number of arguments/switches than to read through the long argument descriptions of over-parameterized functions.

@shoyer It started out as random_split(), but in the second commit I made a split() out of it. In my opinion, a dedicated method is also better.

We could make a collection of splitting methods:

  • isplit() that accepts indices (like numpy.split())
  • rsplit() that accepts weights for relative split sizes
  • random_split() that accepts weights but behaves randomly
  • split(), a method with switching arguments for randomness and indices/weights

OR, similar to what @jreback mentions for #7387 (if I understood that discussion correctly) we could have

  • a split() on GroupBy
  • a shuffle() method on a Frame that returns a GroupBy or a Shuffler Grouper that can be passed to .groupby()
  • but then we might still want at least a simple split() method on a Frame, for non-random splitting (maybe following the numpy semantics), OR a Partitioner Grouper as an argument to .groupby(), like @jreback suggested in the Split/Partition master issue #7387.

In my opinion, the first option is easier for beginners and is a no-brainer (and thus better for wider adoption of pandas) while the second option seems like a cleaner implementation but is less of a no-brainer.

@jreback
Contributor

jreback commented Oct 7, 2015

but the np.split() is a function on np, not on an array so the first argument is already the pandas object we're calling it on.

this is the point of the buffer interface, so it is in fact usable directly; no need to repeat things, except where pandas can add value. This is where a .split() method comes in.

we already have .sample(), so no need for shuffle.
I think the only API expansion needed here is .split()

this can accomplish all of the objectives with a minimum of keywords. Adding a plethora of methods is just plain confusing and bloats the API.

This is a very similar method in nature to .sample so it should follow a similar pattern.
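The kinship with .sample mentioned here is concrete: a random split can already be expressed today by shuffling the whole frame with .sample and slicing it (a sketch of the idiom, not the PR's implementation; the 70/30 cut is an arbitrary example):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# frac=1 returns all rows in random order; then slice at the boundary.
shuffled = df.sample(frac=1, random_state=42)
cut = int(len(shuffled) * 0.7)
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]
```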

@lukovnikov
Author

@jreback do you mean a .split() on a GroupBy or a .split() on a Frame?

Edit: the .split() I implemented on the Frame (see the second commit) is already optionally random. So if I add an optional indexes argument (which cannot be used together with the weights argument), do you think that's good? The indexes argument would have the same semantics as the indices argument of np.split(), whereas the weights argument lets you specify the proportions you want (maybe it should be renamed to proportions).

@jreback
Contributor

jreback commented Oct 7, 2015

.split could exist on a groupby, though I'm not really sure of the semantics, as you can trivially write

[grp for _, grp in df.groupby(....)]

which is basically what you want, though df.groupby(....).split() might make sense for some applications here

@lukovnikov
Author

How can random splits be realized if .split() is attached to a GroupBy? Then you'd also need some RandomGrouper to feed to .groupby()?

So my question is: can we put a general .split() on a Frame, in which case I can quickly extend the functionality to support numpy.split()-like indexes, or do you not want .split() to become a part of a Frame's API?

@jreback
Contributor

jreback commented Oct 7, 2015

it becomes trivial actually (at least for random, not partitioned)

df.groupby(np.random.randint(0, len(df), size=len(df))).split()
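Spelled out without the proposed .split() method (and with size=len(df), since randint's size must be an int, not a frame), the same idiom yields the pieces directly today; here with just two random groups for clarity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# One random group label per row; groupby then partitions the frame.
labels = np.random.RandomState(0).randint(0, 2, size=len(df))
parts = [grp for _, grp in df.groupby(labels)]
```

Note that unlike weight-based splitting, the group sizes here are themselves random.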

@jreback
Contributor

jreback commented Oct 7, 2015

I think .split() on the frame is ok, but since we already have .groupby which handles partitioned, column splitting and such, it makes a lot more sense on groupby.

though it does have a different return type (.groupby(...) returns a pandas object, whereas this would return a list of them, but I think that is OK here)

@jreback
Contributor

jreback commented Oct 7, 2015

and what I would do to support 'kinds' of splitting is make things like:

  • Random
  • Partition

which are sub-classes of pd.Grouper to control exactly how splitting is done. Then you don't have lots of arguments, and it is pretty semantic:

df.groupby(Random(1234)).split()
df.groupby(Partition([1,2,3])).split()
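The proposed Random grouper never existed as a pd.Grouper subclass, but its spirit can be emulated today with a plain object that produces random labels for .groupby (the class below is a hypothetical stand-in, not pandas API):

```python
import numpy as np
import pandas as pd

class Random:
    """Hypothetical stand-in for the proposed Random grouper:
    produces one random group label per row of a frame."""
    def __init__(self, seed):
        self._rng = np.random.RandomState(seed)

    def labels(self, frame, n_groups=2):
        return self._rng.randint(0, n_groups, size=len(frame))

df = pd.DataFrame({"x": range(8)})
# Equivalent in spirit to df.groupby(Random(1234)).split()
groups = [g for _, g in df.groupby(Random(1234).labels(df))]
```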

@lukovnikov
Author

yes, this seems like a cleaner implementation but maybe a .split() on a Frame that uses .groupby() internally would also be interesting as a convenience method?

@jreback
Contributor

jreback commented Oct 7, 2015

I could be convinced of that as well

@lukovnikov
Author

so, I hope this implementation is good (if someone could quickly review the code, that would be helpful)
I still need to write proper tests and test it, though

class OrderedGrouper(Grouper):

    def __init__(self, proportions=(1, 1), axis=None):
        self._proportions = proportions

obviously add a doc-string :)

@jorisvandenbossche
Member

I am personally not a fan of adding extra objects like Random and Partition to use with groupby. Using objects for this puts more cognitive load on users than a method does, and further I also find it a bit unintuitive to have to use groupby when I want to split randomly.

(Apart from that, a split method on groupby can be useful, but I find that a bit of a separate issue.)

@shoyer
Member

shoyer commented Oct 7, 2015

I agree with @jorisvandenbossche that I would rather not add extra objects to the API exclusively for use with groupby.

I think a better approach is to create alternative methods (other than .groupby) that return GroupBy-like objects. For example, it would make a lot of sense to make this the case for resample (removing the need to use TimeGrouper), e.g., df.resample('24H').sum() instead of df.resample('24H', how='sum') (this is a whole different issue, of course).

However, it is not clear to me why a groupby object is an intuitive result from a method that produces random partitions or splits. What is the actual use case here? Would you really want to write something like df.groupby(RandomPartition({'train': 0.7, 'test': 0.3})).apply(model.predict)? That doesn't look very readable to me. I would rather write:

train, test = df.random_split([0.7, 0.3])
test_predictions = model.predict(test)
train_predictions = model.predict(train)

@jreback
Contributor

jreback commented Oct 7, 2015

would be

train, test = df.groupby(RandomPartition((0.7, 0.3))).split()

in resample it's quite easy to return the GroupBy, but I think it's impossible to change while maintaining backward compatibility

e.g., if df.resample('24h') defaulted to .mean() (when how is None), then this would work

@shoyer
Member

shoyer commented Oct 7, 2015

I agree that there's no easy way to change resample in a backwards compatible way. Perhaps how='grouped' would be a reasonable way to expose this.

@lukovnikov
Author

We could include both approaches (a general .split() on a Frame and the new Groupers with .split() on a GroupBy) in the API. The .split() on the GroupBy could prove useful (and is trivial anyway), as it explicitly provides semantics natural for a grouped Frame. Probably most people will just use .split() on the Frame, though, because it's easier to find. However, if someone wants some advanced (random) splitting functionality (I wouldn't know what, though), one could easily subclass RandomGrouper or OrderedGrouper and use it with .groupby().

What do you think?

still need to write tests
@jreback
Contributor

jreback commented Nov 23, 2015

closing, but if you want to propose a nice API we can re-open

@jreback jreback closed this Nov 23, 2015