adds splitBy extension method on scala.collection.Iterator #4

jeantil · 2018-05-18T16:26:05Z

Iterator#splitBy constructs an iterator where consecutive
elements of the original iterator are accumulated as long as the output
of a key function for each element doesn't change. They are emitted as
an Iterable as soon as the key function changes.

This operation makes sense as soon as you are trying to process an
iterator where you know the elements will be sorted in a certain way and
you need to group them without loading all the data in memory.

For instance:

* processing a file where the ordering is guaranteed but the file doesn't fit in the heap,
* processing a streaming result set where the underlying database guarantees the ordering because of a sort clause.

The same operation is added on Iterables with the difference that the
specific container type of the input is preserved for both collection
levels of the output, thus

* Set(1,2,3).splitBy(identity) returns Set(Set(1), Set(2), Set(3))
* Vector(1,2,3).splitBy(identity) returns Vector(Vector(1), Vector(2), Vector(3))
* etc.
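
To make the described behaviour concrete, here is a minimal sketch of the iterator variant. It is not the code from this PR: the object name `SplitBySketch` and the standalone `splitBy` helper are hypothetical, and the only library call assumed is the standard `Iterator#buffered`.

```scala
object SplitBySketch {
  // Hypothetical standalone helper illustrating the semantics described above.
  // It is NOT the implementation merged in this PR.
  def splitBy[A, K](it: Iterator[A])(f: A => K): Iterator[Seq[A]] =
    new Iterator[Seq[A]] {
      // A buffered iterator lets us peek at the next element without consuming it.
      private val src = it.buffered

      def hasNext: Boolean = src.hasNext

      def next(): Seq[A] = {
        if (!src.hasNext) throw new NoSuchElementException("next on empty iterator")
        val first = src.next()
        val key = f(first) // key of the current group, computed once
        val group = Vector.newBuilder[A]
        group += first
        // Accumulate consecutive elements while the key stays the same.
        while (src.hasNext && f(src.head) == key) group += src.next()
        group.result()
      }
    }

  def main(args: Array[String]): Unit = {
    val rows = Iterator("a" -> 1, "a" -> 2, "b" -> 3, "a" -> 4)
    // Only one group is held in memory at a time.
    splitBy(rows)(_._1).foreach(println)
  }
}
```

Because groups are built on demand, the sketch also works on an unbounded source such as `Iterator.from(0)` as long as only a finite number of groups is requested.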

thebignet

Seems like a good method to add, and the use cases are common enough to justify adding it to this lib.

src/main/scala/scala/collection/decorators/IteratorDecorator.scala

jeantil · 2018-05-20T20:18:29Z

I made cosmetic changes after the first review:

* inlined the local val key, as it was used only once
* replaced the custom-defined readHead with Iterator#nextOption, since the behaviour is exactly the same.

jeantil · 2018-06-29T15:53:23Z

Hello

@julienrf I don't mind if this isn't a priority, but just want to make sure this PR was submitted at the right place (not to mention I would love feedback on the actual implementation :) )

cheers

julienrf · 2018-06-29T16:00:55Z

Hello Jean! Sorry for being silent. Yes, the PR looks great! We are still unsure about what will happen to scala-collection-contrib, though. I will come back to you soon. Is that ok for you?

jeantil · 2018-06-29T16:03:34Z

Hi Julien, that's perfectly OK for me. I'll keep monitoring the thread here and at contributors.scala-lang.org

cheers :)

src/main/scala/scala/collection/decorators/IteratorDecorator.scala

joshlemer · 2019-05-02T15:05:15Z

If we're gonna add this splitBy method on Iterators, it probably makes sense to add it on the Iterables as well, which delegates to basically `this`.iterator.splitBy(f).map(`this`.fromSpecific) or something like that. See for instance, how this PR adds both implementations https://github.com/scala/scala-collection-contrib/pull/18/files#diff-ea5808f2cdf7aad91c2879f38e526320R34

jeantil · 2019-05-03T07:20:11Z

@joshlemer I gave a try to adding this to IterableDecorator.
I am not entirely sure what the resulting signature should be. I went for Iterator[C[A]] according to your hint so that Vector(1,1,2,2,3,3).splitBy(identity) would return Iterator[Vector[A]] but I don't know if rewrapping the inner collections in a Vector actually brings much value.

julienrf · 2019-07-30T13:44:34Z

@jeantil I think it should return at least Iterable[CC[A]], but maybe CC[CC[A]] is a more obvious default?

julienrf

What do you think of having the following signature as a Seq[A] decorator?

def splitBy[B](f: A => B): CC[CC[A]]

And, similarly, on Iterator[A]:

def splitBy[B](f: A => B): Iterator[Iterator[A]]

joshlemer · 2019-07-30T15:42:38Z

@julienrf
for Iterators:

We could aim for consistency with the most closely related method on iterator, grouped, it would then return an Iterator[immutable.Seq[A]]

for Seq's:

Since it's not changing the underlying element type, I think it could return CC[C]:
def splitBy[B](f: A => B): CC[C]

This would also be more consistent with def splitAt(n: Int): (C, C)
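
For reference, the two standard-library precedents mentioned above can be checked directly. This is only an illustration of the existing `grouped` and `splitAt` return types; the object name `ReturnTypePrecedents` is made up for the example.

```scala
object ReturnTypePrecedents {
  // Iterator#grouped yields sequences, not nested iterators
  // (in Scala 2.13, Seq is an alias for immutable.Seq).
  def groups: Iterator[Seq[Int]] = Iterator(1, 2, 3, 4, 5).grouped(2)

  // splitAt keeps the concrete collection type C on both sides.
  val (left, right): (Vector[Int], Vector[Int]) = Vector(1, 2, 3, 4, 5).splitAt(2)

  def main(args: Array[String]): Unit = {
    println(groups.toList)
    println((left, right))
  }
}
```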

jeantil · 2019-07-30T15:46:35Z

Here is what I came up with after a quick shake-up:

def splitBy[K, That, CC](f: seq.A => K)(implicit bf: BuildFrom[C, seq.A, That], bff: BuildFrom[C, That, CC]): CC =
    bff.fromSpecific(coll)(seq(coll).iterator.splitBy(f).map(bf.fromSpecific(coll)))

This compiles, but then I get compilation errors in the tests because of ambiguous implicits:

[error] Note that implicit conversions are not applicable because they are ambiguous:
[error]  both method IterableDecorator in package decorators of type [C](coll: C)(implicit it: scala.collection.generic.IsIterable[C])scala.collection.decorators.IterableDecorator[C,it.type]
[error]  and method SeqDecorator in package decorators of type [C](coll: C)(implicit seq: scala.collection.generic.IsSeq[C])scala.collection.decorators.SeqDecorator[C,seq.type]
[error]  are possible conversion functions from scala.collection.immutable.Vector[A] to ?{def splitBy: ?}
[error]     val groupedVector:Vector[Vector[Nothing]] = Vector.empty.splitBy(identity)
[error]                                                        ^

But that won't work with CC[C] since there is no guarantee that C is preserved (as per @joshlemer's proposal)

I tried getting it to work on Iterable instead, but I run into other kinds of errors and I'm short on time to dig deeper :(

julienrf · 2019-07-30T15:51:00Z

@jeantil is splitBy defined on both IterableDecorator and SeqDecorator? It should be defined only on one of those.

julienrf · 2019-07-30T15:52:06Z

@joshlemer You raised good points ;)

> We could aim for consistency with the most closely related method on iterator, grouped, it would then return an Iterator[immutable.Seq[A]]

I don’t know what is the reason for grouped to have this return type, though.

> for Seq's:
>
> Since it's not changing the underlying element type, I think it could return CC[C]

👍

jeantil · 2019-07-30T15:54:06Z

I realize I may not have pushed everything. I had started adding it on IterableDecorator and it seems I never pushed the commit (it was in response to #4 (comment))
(sorry about the account juggling)

def splitBy[K, That](f: it.A => K)(implicit bf: BuildFrom[C, it.A, That]):Iterator[That]

is as far as I managed to compile for IterableDecorator.

jeantil · 2019-08-17T12:48:41Z

Hi,
I have squashed the iterations on IteratorDecorator#splitBy, narrowed its signature to Iterator[immutable.Seq[A]], and changed IterableDecorator#splitBy to return CC[CC[A]] when it is given a C[A].

Doing the latter, I switched from delegating to iterator to a specific implementation which
I found easier to work with to achieve the expected type signature.

The corresponding tests have been updated too and I applied a formatter for
consistency.

julienrf

Thanks a lot for your patience, Jean!

> Doing the latter, I switched from delegating to iterator to a specific implementation which
> I found easier to work with to achieve the expected type signature.

I understand, however the drawback is that your implementation is strict by default. I think it would be nice if it was lazy on LazyList.
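
To illustrate what "lazy on LazyList" could mean here, the following is a rough sketch assuming Scala 2.13. The object `LazySplitBySketch` and the helper `splitByLazy` are hypothetical names, not part of this PR; the sketch relies only on the standard `LazyList.unfold`, `takeWhile` and `dropWhile`, and it re-evaluates `f` on group elements for simplicity.

```scala
object LazySplitBySketch {
  // Hypothetical helper: groups are only computed when the outer LazyList is forced.
  def splitByLazy[A, K](xs: LazyList[A])(f: A => K): LazyList[LazyList[A]] =
    LazyList.unfold(xs) { rest =>
      if (rest.isEmpty) None
      else {
        val key = f(rest.head)
        // takeWhile and dropWhile are both lazy on LazyList.
        val group = rest.takeWhile(a => f(a) == key)
        val tail = rest.dropWhile(a => f(a) == key)
        Some((group, tail))
      }
    }

  def main(args: Array[String]): Unit = {
    // Works on an infinite input because only two groups are ever forced.
    val firstTwo = splitByLazy(LazyList.from(0))(_ / 3).take(2).map(_.toList).toList
    println(firstTwo)
  }
}
```

A strict implementation based on builders would not terminate on the infinite input used in `main`, which is the drawback pointed out above.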

src/main/scala/scala/collection/decorators/IterableDecorator.scala

src/main/scala/scala/collection/decorators/IteratorDecorator.scala

jeantil · 2019-08-19T12:58:00Z

Hi Julien, it's alright. I'm doing this on the side and I really appreciate the time you spend on reviews. I may not be able to address your latest comments before next week.

On Mon, 19 Aug 2019 at 10:32, Julien Richard-Foy wrote:

In src/main/scala/scala/collection/decorators/IterableDecorator.scala:

    /**
      * Constructs an iterator where consecutive elements are accumulated as

We should replace "iterator" with "collection"

      * <pre>
      * Vector(1,2,2,3,3,3,2,2)
      *   .splitBy(identity)
      * </pre>
      * produces
      * <pre>
      * Iterator(Vector(1),

It should be Vector(Vector(1), instead of Iterator(Vector(1),

    def splitBy[K, That, CC](f: it.A => K)(implicit bf: BuildFrom[C, it.A, That], bff: BuildFrom[C, That, CC]): CC = {

What about naming the type parameters That1, That2? Or maybe C1 and C2?

      * @return an iterator of sequences of the consecutive elements with the
      *         same key in the original iterator

What about the following description?

      * @return a sequence of sequences of consecutive elements having the same key

    def splitBy[K, That, CC](f: it.A => K)(implicit bf: BuildFrom[C, it.A, That], bff: BuildFrom[C, That, CC]): CC = {
      val builder = bff.newBuilder(coll)

      val iterator = it(coll).iterator
      var init = bf.newBuilder(coll)
      if (iterator.hasNext) {
        var ref = iterator.next();
        init += ref
        while (iterator.hasNext) {
          val el = iterator.next();
          if (f(el) != f(ref)) {

Should we put f(ref) in a val to compute it only once?

In src/main/scala/scala/collection/decorators/IteratorDecorator.scala:

    override def next(): immutable.Seq[A] = {
      if (hasNext) {
        val seq = Vector.newBuilder[A]
        if (hdDefined) {
          seq += hd
        } else {
          hd = `this`.next()
          hdDefined = true
          seq += hd
        }
        var hadSameKey = true
        while (`this`.hasNext && hadSameKey) {
          val el = `this`.next()
          hdDefined = true
          if (f(el) == f(hd)) {

Same comment here about f(hd)

jeantil · 2019-08-20T07:18:28Z

I had some free time this morning thanks to kids sleeping in :)

julienrf · 2019-08-20T08:17:21Z

Looks good to me, thanks! Do you mind squashing the commits into a single one?

jeantil · 2019-08-20T13:39:52Z

No problem, I made separate commits to facilitate the review.

jeantil · 2019-08-21T07:03:17Z

Done, I squashed everything into a single commit.

julienrf · 2019-08-21T07:09:10Z

🚀

nafg · 2019-09-12T15:48:24Z

How is this different than span?

jeantil · 2019-09-12T16:23:38Z

The simplified type signatures of the two operations hint at the difference:

def span(p: A => Boolean): (Iterable[A], Iterable[A])

vs

def splitBy[K](f: A => K): Iterable[Iterable[A]]

The differences are in the details:

* span accumulates elements as long as the predicate is satisfied; it assumes the accumulation condition can be derived from a single element.
* splitBy accumulates consecutive elements that resolve to the same key. Writing a predicate for this requires access to the previous element to know whether the key changed, which isn't available in span.
* groupBy accumulates all elements that resolve to the same key, consuming the whole input in the process.

I think of splitBy as closer to a lazy groupBy that avoids consuming the whole input than to span.
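
A small side-by-side comparison may help. `span` and `groupBy` below are the standard-library methods; the `splitBy` helper is a hypothetical strict stand-in written for this example, not the method added by the PR. It shows that a single `span` call only yields the first group, so it has to be repeated with a fresh key for each group.

```scala
object SpanVsSplitBy {
  // Hypothetical strict stand-in for splitBy, built from repeated span calls.
  def splitBy[A, K](xs: List[A])(f: A => K): List[List[A]] =
    xs match {
      case Nil => Nil
      case head :: _ =>
        val key = f(head) // key of the current run
        val (group, rest) = xs.span(a => f(a) == key)
        group :: splitBy(rest)(f)
    }

  val xs = List(1, 1, 2, 2, 1)

  val spanned = xs.span(_ == 1)       // (List(1, 1), List(2, 2, 1))
  val split = splitBy(xs)(identity)   // List(List(1, 1), List(2, 2), List(1))
  val grouped = xs.groupBy(identity)  // Map(1 -> List(1, 1, 1), 2 -> List(2, 2))

  def main(args: Array[String]): Unit = {
    println(spanned)
    println(split)
    println(grouped)
  }
}
```

Note how the trailing `1` forms its own group with `splitBy`, while `groupBy` merges it with the leading ones after reading the whole input.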

nafg · 2019-09-12T16:45:17Z

Oh I see. I assumed from the name that it was comparable to splitAt but with a predicate. What about a name that is more associated with groupBy, like chunkBy or groupOnBoundaries or something?


jeantil · 2019-09-12T17:06:09Z

It was initially called groupUntilChanged and was renamed to splitBy following comments during the review process.

I think chunkBy is an interesting proposal though. I don't mind making another PR to rename it, but I would prefer to have @julienrf's and @joshlemer's thoughts on this before I do :)

Edit: I found the discussion that led to the renaming. It seems the name splitBy is consistent with the naming in other tools, including Mathematica.

nafg · 2019-09-12T17:10:40Z

I don't really care, just thinking out loud ;)


thebignet reviewed May 19, 2018

thebignet approved these changes May 20, 2018

jeantil force-pushed the iterator-group_until_changed branch 2 times, most recently from 7a3f8ec to 7e5521a Compare May 20, 2018 20:15

joshlemer suggested changes May 1, 2019

jeantil force-pushed the iterator-group_until_changed branch from 7e5521a to d659dd1 Compare May 2, 2019 15:44

julienrf mentioned this pull request Jul 30, 2019

Import splitWith from strawman-contrib #6

Closed

julienrf requested changes Jul 30, 2019

jeantil changed the title ~~adds groupUntilChanged extension method on scala.collection.Iterator~~ adds splitBy extension method on scala.collection.Iterator Jul 30, 2019

jeantil force-pushed the iterator-group_until_changed branch from 8a007ef to 226accb Compare August 17, 2019 12:39

jeantil force-pushed the iterator-group_until_changed branch from 226accb to b732778 Compare August 17, 2019 12:52

julienrf reviewed Aug 19, 2019

julienrf approved these changes Aug 20, 2019

jeantil force-pushed the iterator-group_until_changed branch from 718abf1 to c57a652 Compare August 21, 2019 07:02

julienrf merged commit aa605f7 into scala:master Aug 21, 2019
