added random_split in generic.py, for DataFrames etc. #11253


Closed · wants to merge 5 commits

Conversation

lukovnikov

Added a random_split() method for NDFrames, to split the Frame into several frames along one axis.

Basic use: for train/test or train/validation/test splitting of a dataframe.

Note: I'm not sure this feature fits well in NDFrame. If you think this feature can be added, I'll add more tests and docs.
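The proposed method was never merged, but its intended behavior can be sketched with existing pandas/NumPy operations. In this sketch the helper name, signature, and weighted-boundary logic are assumptions, not the PR's actual implementation:

```python
import numpy as np
import pandas as pd

def random_split(df, weights, random_state=None):
    """Hypothetical helper: randomly partition ``df`` along axis 0 into
    len(weights) frames whose sizes are proportional to ``weights``."""
    rng = np.random.RandomState(random_state)
    # Shuffle the positional index, then cut it at the weighted boundaries.
    order = rng.permutation(len(df))
    fractions = np.asarray(weights, dtype=float)
    fractions = fractions / fractions.sum()
    bounds = (np.cumsum(fractions)[:-1] * len(df)).round().astype(int)
    return [df.iloc[idx] for idx in np.split(order, bounds)]

df = pd.DataFrame({"a": range(10)})
train, test = random_split(df, [0.7, 0.3], random_state=0)
```

Every row lands in exactly one output frame, and the relative sizes follow the weights.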

@jorisvandenbossche
Member

A previous similar PR was closed as out of scope for pandas: #6687

@lukovnikov you can go through the discussion there, and see if you have arguments that this should be included, then we can always reconsider.

@lukovnikov
Copy link
Author

@jorisvandenbossche this is more general than a train/test split: you can make as many random splits as you want (and this is something you also mentioned for #6687).

And although indexing yourself or using groupby as in #6687 can achieve the same, I find having this method much easier: it's a no-brainer that seems worthwhile if you don't want to depend on sklearn or other libraries for simple splitting, or to think about it yourself. The .sample() method is likewise a no-brainer for selecting random samples that users could have written themselves, and by that logic it could also have been excluded.

I can also make a general .split() method, with a randomness option but with the same notion of relative sizes of the splits (which, without randomness, are just taken from the beginning).

Edit: I'm not aware what statsmodels provides, maybe they also have something. Anyway, in my opinion an optionally random .split() in pandas would still be interesting, even if it only prevents other dependencies.

Edit: generalized it to an optionally random .split() method.
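The non-random variant described here, with the same relative-size semantics but pieces simply taken from the beginning, could look like this (a hypothetical sketch, not the PR's code):

```python
import numpy as np
import pandas as pd

def split(df, weights):
    """Hypothetical sketch: cut df along axis 0 into consecutive pieces
    whose sizes are proportional to ``weights`` (no shuffling)."""
    w = np.asarray(weights, dtype=float)
    # Cumulative fractions give the cut points in row coordinates.
    bounds = (np.cumsum(w / w.sum())[:-1] * len(df)).round().astype(int)
    return [df.iloc[a:b] for a, b in zip([0, *bounds], [*bounds, len(df)])]

df = pd.DataFrame({"a": range(10)})
first, second, third = split(df, [2, 3, 5])
```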

@lukovnikov lukovnikov closed this Oct 6, 2015
@lukovnikov lukovnikov reopened this Oct 6, 2015
@shoyer
Member

shoyer commented Oct 7, 2015

From an API design perspective, I think it's much cleaner to have .random_split() rather than a completely general function that lets flags change every bit of its behavior. Pandas has too many of those already.

On the whole, I think this could be a reasonable thing to add to pandas as long as it's called random_split and not the hopelessly ambiguous split. Pandas shouldn't have detailed stats or machine learning routines, but randomized splits really are data science 101.

If we do want to go forward with this, it's worth looking at the DataFrame.random_split() method that was added to dask.dataframe to verify the consistency of the API: dask/dask#276

@jreback
Contributor

jreback commented Oct 7, 2015

#7387 is the master issue for splitting. Please have a read over the many use cases / features that are needed for general splitting/partitioning.

Random splitting is a specialization of this .split and should not be a separate function, merely an argument here. You already have a number of arguments; I don't find one more to be an issue. It is more confusing IMHO to have to look for random (pun intended) functions.

Further, this API should start out similarly to np.split:

def split(self, n=None, weights=None, random=None, axis=0)

could be a decent start.

@lukovnikov
Author

@jreback yeah, but np.split() is a function on np, not on an array, so the first argument is already the pandas object we're calling it on. The second argument of np.split() is an int or a list of ints, which specifies the number of equal splits or the indices along which to split. In the generalized .split() method here, I aim to accept weights, which are proportional to the desired relative sizes of the output splits. Accepting a list of indices as an argument is, in my opinion, a little superfluous: if you already have those indices, why don't you just use them? Anyway, they could be added as yet another option.
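For reference, the np.split() semantics mentioned here look like this:

```python
import numpy as np

arr = np.arange(9)

# An int asks for that many equal parts.
equal = np.split(arr, 3)        # [0,1,2], [3,4,5], [6,7,8]

# A list of ints gives the indices at which to cut.
at_idx = np.split(arr, [2, 5])  # [0,1], [2,3,4], [5,6,7,8]
```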

I understand your concern about not overcrowding the API, but I'm more with @shoyer on the issue of a separate random_split() method. It's easier to find a dedicated function with a limited number of arguments/switches than to read through the long argument descriptions of over-parameterized functions.

@shoyer It started out as random_split(), but in the second commit I made a split() out of it. In my opinion, a dedicated method is also better.

We could make a collection of splitting methods:

  • isplit() that accepts indices (like numpy.split())
  • rsplit() that accepts weights for relative split sizes
  • random_split() that accepts weights but behaves randomly
  • split(), a method with switching arguments for randomness and indices/weights

OR, similar to what @jreback mentions for #7387 (if I understood that discussion correctly) we could have

  • a split() on GroupBy
  • a shuffle() method on a Frame that returns a GroupBy or a Shuffler Grouper that can be passed to .groupby()
  • but then we might still want at least a simple split() method on a Frame, for non-random splitting (maybe following the numpy semantics), OR a Partitioner Grouper as an argument to .groupby(), like @jreback suggested in the Split/Partition master issue #7387.

In my opinion, the first option is easier for beginners and is a no-brainer (and thus better for wider adoption of pandas) while the second option seems like a cleaner implementation but is less of a no-brainer.

@jreback
Contributor

jreback commented Oct 7, 2015

but the np.split() is a function on np, not on an array so the first argument is already the pandas object we're calling it on.

this is the point of the buffer interface, so it is in fact usable directly; no need to repeat things, except where pandas can add value. This is where a .split() method comes in.

we already have .sample(), so no need for shuffle.
I think the only API expansion needed here is .split()

this can accomplish all of the objectives with a minimum of keywords. Adding a plethora of methods is just plain confusing and bloats the API.

This is a very similar method in nature to .sample so it should follow a similar pattern.
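The kinship with .sample mentioned here is concrete: a random split can already be expressed today by shuffling the whole frame with .sample and slicing it (a sketch of the idiom, not the PR's implementation; the 70/30 cut is an arbitrary example):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# frac=1 returns all rows in random order; then slice at the boundary.
shuffled = df.sample(frac=1, random_state=42)
cut = int(len(shuffled) * 0.7)
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]
```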

@lukovnikov
Author

@jreback do you mean a .split() on a GroupBy or a .split() on a Frame?

Edit: the .split() I implemented on the Frame (see the second commit) is already optionally random. So if I add an optional indexes argument (which cannot be used together with the weights argument), do you think that's good? The indexes argument would have the same semantics as the indices argument of np.split(), whereas the weights argument lets you specify the proportions you want (maybe it should be renamed to proportions).

@jreback
Contributor

jreback commented Oct 7, 2015

.split could exist on a groupby, though I'm not really sure of the semantics, as you can trivially write

[grp for _, grp in df.groupby(....)]

which is basically what you want, though df.groupby(....).split() might make sense for some applications here

@lukovnikov
Author

How can random splits be realized if .split() is attached to a GroupBy? Then you'd also need some RandomGrouper to feed to .groupby()?

So my question is: can we put a general .split() on a Frame, in which case I can quickly extend the functionality to support numpy.split()-like indexes, or do you not want .split() to become a part of a Frame's API?

@jreback
Contributor

jreback commented Oct 7, 2015

it becomes trivial actually (at least for random, not partitioned)

df.groupby(np.random.randint(0, len(df), size=len(df))).split()
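Spelled out without the proposed .split() method (and with size=len(df), since randint's size must be an int, not a frame), the same idiom yields the pieces directly today; here with just two random groups for clarity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# One random group label per row; groupby then partitions the frame.
labels = np.random.RandomState(0).randint(0, 2, size=len(df))
parts = [grp for _, grp in df.groupby(labels)]
```

Note that unlike weight-based splitting, the group sizes here are themselves random.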

@jreback
Contributor

jreback commented Oct 7, 2015

I think .split() on the frame is ok, but since we already have .groupby which handles partitioned, column splitting and such, it makes a lot more sense on groupby.

though it does have a different return type (.groupby(...) returns a pandas object, whereas this would return a list of them, but I think that is OK here)

@jreback
Contributor

jreback commented Oct 7, 2015

and what I would do to support 'kinds' of splitting is make things like:

  • Random
  • Partition

which are sub-classes of pd.Grouper to control exactly how splitting is done. Then you don't have lots of arguments, and it is pretty semantic:

df.groupby(Random(1234)).split()
df.groupby(Partition([1,2,3])).split()
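The proposed Random grouper never existed as a pd.Grouper subclass, but its spirit can be emulated today with a plain object that produces random labels for .groupby (the class below is a hypothetical stand-in, not pandas API):

```python
import numpy as np
import pandas as pd

class Random:
    """Hypothetical stand-in for the proposed Random grouper:
    produces one random group label per row of a frame."""
    def __init__(self, seed):
        self._rng = np.random.RandomState(seed)

    def labels(self, frame, n_groups=2):
        return self._rng.randint(0, n_groups, size=len(frame))

df = pd.DataFrame({"x": range(8)})
# Equivalent in spirit to df.groupby(Random(1234)).split()
groups = [g for _, g in df.groupby(Random(1234).labels(df))]
```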

@lukovnikov
Author

yes, this seems like a cleaner implementation but maybe a .split() on a Frame that uses .groupby() internally would also be interesting as a convenience method?

@jreback
Contributor

jreback commented Oct 7, 2015

I could be convinced of that as well

@lukovnikov
Author

so, I hope this implementation is good (if someone could quickly review the code, that would be helpful)
I still need to write proper tests and test it, though

class OrderedGrouper(Grouper):

    def __init__(self, proportions=(1, 1), axis=None):
        self._proportions = proportions

obviously add a doc-string :)

@jorisvandenbossche
Member

I am personally not a fan of adding extra objects like Random and Partition to use with groupby. Using objects for this puts more cognitive load on users than a method does, and further I also find it a bit unintuitive to have to use groupby when I want to split randomly.

(Apart from that, a split method on groupby can be useful, but I find that a bit of a separate issue.)

@shoyer
Member

shoyer commented Oct 7, 2015

I agree with @jorisvandenbossche that I would rather not add extra objects to the API exclusively for use with groupby.

I think a better approach is to create alternative methods (other than .groupby) that return GroupBy-like objects. For example, it would make a lot of sense to make this the case for resample (removing the need to use TimeGrouper), e.g., df.resample('24H').sum() instead of df.resample('24H', how='sum') (this is a whole different issue, of course).

However, it is not clear to me why a groupby object is an intuitive result from a method that produces random partitions or splits. What is the actual use case here? Would you really want to write something like df.groupby(RandomPartition({'train': 0.7, 'test': 0.3})).apply(model.predict)? That doesn't look very readable to me. I would rather write:

train, test = df.random_split([0.7, 0.3])
test_predictions = model.predict(test)
train_predictions = model.predict(train)

@jreback
Contributor

jreback commented Oct 7, 2015

would be

train, test = df.groupby(RandomPartition((0.7, 0.3))).split()

in resample it's quite easy to return the GroupBy, but I think it's impossible to change while maintaining backward compatibility

e.g., if df.resample('24h') defaulted to .mean() (when how is None), then this would work

@shoyer
Member

shoyer commented Oct 7, 2015

I agree that there's no easy way to change resample in a backwards compatible way. Perhaps how='grouped' would be a reasonable way to expose this.

@lukovnikov
Author

We could include both approaches (a general .split() on a Frame and the new Groupers with .split() on a GroupBy) in the API. The .split() on the GroupBy could prove useful (and is trivial anyway), as it explicitly provides semantics natural for a grouped Frame. Probably most people will just use .split() on the Frame, though, because it's easier to find. However, if someone wants some advanced (random) splitting functionality (I wouldn't know what, though), one could easily subclass RandomGrouper or OrderedGrouper and use it with .groupby().

What do you think?

still need to write tests
@jreback
Contributor

jreback commented Nov 23, 2015

closing, but if you want to propose a nice API we can re-open

@jreback jreback closed this Nov 23, 2015