QuoteUnpicker.unpickleExpr and QuoteUnpicker.unpickleType are slow #12228


Closed · deusaquilus opened this issue Apr 26, 2021 · 23 comments · Fixed by #12242

@deusaquilus
Contributor

deusaquilus commented Apr 26, 2021

Compiler version

3.0.0-RC3

I am seeing bottlenecks in the ~700ms range from unpickleExpr and unpickleType (blue and light blue in the chart below) for the Proto-Quill parser and expr-accumulation (for a simple query with 20 parameters). I can't accommodate large nested queries with this kind of latency.

Are there any tricks I can use to optimize unpickling?
Are there any compiler args I can use to make unpickling faster or to disable it for certain parts of the code?

Also, if you want more detailed information in the flame graph, let me know which packages you'd like me to instrument and filter for (since I can't instrument the entire Dotty).

[flame graph screenshots]

@nicolasstucki
Contributor

Is it possible to get a wider flame graph or an interactive one? It is impossible to see the method names.

Can you create a small self-contained example that shows the same issue?

@nicolasstucki
Contributor

It would also be quite useful to see the timings of the code within unpickleTerm: https://github.com/lampepfl/dotty/blob/9632f6b68ea726cf2b219f55995e0bd80abd66c1/compiler/src/dotty/tools/dotc/quoted/PickledQuotes.scala#L56-L63

We should know what the proportions of unpickle, spliceTypes, and spliceTerms are for this method.

@deusaquilus
Contributor Author

I'll try to get more detailed information about what is happening inside unpickleExpr/unpickleType.
It's hard to make a self-contained example because this flame graph covers essentially the entire ProtoQuill quotation subsystem. I could publish the ProtoQuill library to Maven Central, along with a small code sample that uses it, if that helps.

@deusaquilus
Contributor Author

Also, do you have JProfiler? If you do, I can send you the saved JProfile session that you can interactively explore.
If not, you can get a trial version. They also offer academic licenses.

@deusaquilus
Contributor Author

deusaquilus commented Apr 26, 2021

Here is a bit more info. I have colorized unpickle (pink), unpickleTerm (red), and spliceTerms (purple). I didn't find spliceTypes anywhere in the chart. It seems QuoteUnpickler.unpickle is the one that takes the most time.
Color legend is below.

[colorized flame graph screenshots with color legend]

I can send you the profiling session separately if you'd like.

@deusaquilus
Contributor Author

deusaquilus commented Apr 26, 2021

Here is even more info with additional colorizations. You can see DottyUnpickler.tree (light green) slowing things down in many places, along with RefTree.isTerm() and NamedType.denot() (both in different shades of light blue).

[flame graph screenshots]

@deusaquilus
Contributor Author

Also, it is worth noting that TreeAccumulator.foldOverTree appears in many places (in yellow). It might also be worth optimizing.

[flame graph screenshots]

@deusaquilus
Contributor Author

Hopefully, this provides some additional useful information. Please let me know if you need something else.

@deusaquilus
Contributor Author

Here's one more chart with the info about "hot-spots" in the dotty.tools.dotc call stack.

[hot-spots screenshot]

Here is the call-tree of one instance where I use unpickleTerm.

[call-tree screenshot]

Here's the flame-graph of that unpickleTerm call.

[flame graph screenshot]

Hope this helps.

@odersky
Contributor

odersky commented Apr 27, 2021

It seems to read a very large number of names. We should analyze which names are serialized when we pickle quotes and whether we can cut the number somehow.

@nicolasstucki
Contributor

One thing we should be able to do is cache the result of PickleQuotes.unpickle.
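Such a cache could, in principle, be as simple as memoizing the unpickled result keyed by the pickled payload. A minimal sketch of the idea (purely illustrative; UnpickleCache and its API are hypothetical, not Dotty's actual implementation):

```scala
import scala.collection.mutable

// Hypothetical sketch: memoize an expensive computation (e.g. unpickling)
// keyed by its input, so repeated requests reuse the first result.
final class UnpickleCache[K, V] {
  private val cache = mutable.HashMap.empty[K, V]

  // Returns the cached value for `key`, computing and storing it on first use.
  def getOrCompute(key: K)(compute: => V): V =
    cache.getOrElseUpdate(key, compute)
}
```

With something like this, a second expansion of the same quote would skip the expensive unpickling step entirely.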

@nicolasstucki
Contributor

@deusaquilus we will need a small self-contained example that is representative of the kind of code you have and that exhibits the latency issue. This will be used in our tests and benchmarks to ensure we properly fix the issue and do not regress in the future.

@deusaquilus
Contributor Author

deusaquilus commented Apr 27, 2021

One thing we should be able to do is cache the result of PickleQuotes.unpickle.

Yes! I think that would be really good.

In order to make a self-contained example, I need to implement at least a Quill AST and an Unlifter, and maybe also a Quoter and a Parser. It would basically be a mini-Quill implementation, so it will have to be more of a "medium-sized" self-contained example.
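For a sense of scale, the core of such a mini-Quill would be a small tree of case classes plus the machinery to lift/unlift them through quotes. A rough, purely hypothetical sketch of what the AST portion might look like (illustrative names, not Quill's actual types):

```scala
// Hypothetical mini-Quill AST sketch (illustrative only, not Quill's real AST).
sealed trait Ast
final case class Entity(name: String) extends Ast                         // a table/collection
final case class Ident(name: String) extends Ast                          // a bound variable
final case class Property(of: Ast, name: String) extends Ast              // field access
final case class Filter(query: Ast, alias: String, body: Ast) extends Ast // .filter(alias => body)
final case class BinaryOp(lhs: Ast, op: String, rhs: Ast) extends Ast     // e.g. ==, &&
final case class Const(value: Any) extends Ast                            // a lifted constant
```

An Unlifter would then walk a quoted expression and rebuild one of these trees at compile time, which is exactly the part that exercises unpickling.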

@nicolasstucki
Contributor

Otherwise, do you know which quote is taking the longest time to unpickle in your profile?

@nicolasstucki
Contributor

After doing some tests, I found the following results for unpickling '{ println($e + "abc") } where e = Expr("abc").
These are quick-and-dirty benchmarks using some System.nanoTime calls in unpickleTerm. Nevertheless, they show the potential improvement.

Warm compiler

unpickle> 15.41μs
spliceTypes> 0.041μs
spliceTerms> 2.464μs

Warm compiler (cached unpickle)

unpickle> 0.047μs
spliceTypes> 0.029μs
spliceTerms> 1.508μs

Cold compiler

// first unpickling
unpickle> 3888.469μs
spliceTypes> 3.571μs
spliceTerms> 1831.784μs

// second unpickling
unpickle> 473.927μs
spliceTypes> 0.401μs
spliceTerms> 97.896μs

// third unpickling
unpickle> 310.548μs
spliceTypes> 0.25μs
spliceTerms> 74.405μs

Cold compiler (cached unpickle)

// first unpickling
unpickle> 3576.882μs
spliceTypes> 3.56μs
spliceTerms> 1736.942μs

// second unpickling
unpickle> 11.535μs
spliceTypes> 0.545μs
spliceTerms> 133.605μs

// third unpickling
unpickle> 2.536μs
spliceTypes> 0.27μs
spliceTerms> 81.706μs
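The quick-and-dirty timing described above can be approximated with a small helper; this is a sketch of the general technique, not the actual instrumentation used in unpickleTerm:

```scala
// Hypothetical sketch: time a block with System.nanoTime and report the
// elapsed time in microseconds, in the same "label> N μs" style as above.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  val micros = (System.nanoTime() - start) / 1000.0
  println(f"$label%s> $micros%.3fμs")
  result
}
```

Each phase would then be wrapped individually, e.g. something like `timed("unpickle") { ... }`, so the proportions of unpickle, spliceTypes, and spliceTerms can be compared.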

@nicolasstucki
Contributor

There is still the question of the granularity of the cache:

  • Per macro expansion?
  • Per compilation unit?
  • Per run?

Should we have a maximum size for the cache? Keys should be cheap, as they contain strings that already exist in the bytecode. On the other hand, the cached trees might cause some memory pressure.
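If a maximum size were desired, one conventional option is an LRU cache; here is a minimal sketch on top of java.util.LinkedHashMap (hypothetical, not Dotty's actual design):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Hypothetical sketch: a size-bounded LRU cache. Passing accessOrder = true
// makes LinkedHashMap move entries to the back on access; removeEldestEntry
// then evicts the least recently used entry once maxSize is exceeded.
final class BoundedCache[K, V](maxSize: Int) {
  private val underlying = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > maxSize
  }

  def getOrCompute(key: K)(compute: => V): V = {
    val cached = underlying.get(key)
    if (cached != null) cached
    else { val v = compute; underlying.put(key, v); v }
  }

  def currentSize: Int = underlying.size()
}
```

A per-run cache with such a bound would trade a little recomputation for a predictable memory ceiling.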

@deusaquilus
Contributor Author

deusaquilus commented Apr 27, 2021

The quotation that this was benchmarked from was the following:

val result = testContext.run(liftQuery(insertValues).foreach(e => insert(e)))

This was from PostgresAsyncEncodingSpec.scala which is in quill-jasync-postgres. I can create a branch where this is the only thing under quill-sql if that helps.

I think the granularity of the cache should be at least per macro expansion, if not higher; in Quill, the results of the LiftMacro and the QuoteMacro are frequently reused. Any narrower granularity would, I think, be useless for Quill's use case.

In terms of memory, I think there should be no limit by default; memory is fairly cheap these days. Maybe there could be a compiler argument to set a limit if needed, which might be useful for builds in CI environments where resources are more constrained.

@nicolasstucki
Contributor

@deusaquilus I think I can now create a benchmark from scratch. It will be simpler to tune and diagnose.

@deusaquilus
Contributor Author

deusaquilus commented Apr 28, 2021

@nicolasstucki Please have a look at this. I have created a miniature Quill implementation of ~270 lines of code that represents the majority of Quill's use case:
https://github.com/deusaquilus/miniquill/blob/master/src/main/scala/org/stuff/MiniQuill.scala

I have also created a file that represents some large entities and join clauses characteristic of the kinds of things I see in the corporate world (or at least as much as possible with such a small implementation). This macro takes ~1 second to compile on a warm JVM and 5+ seconds on a cold JVM. It spends the vast majority of that time inside unpickleTerm.
https://github.com/deusaquilus/miniquill/blob/master/src/main/scala/org/stuff/UseMiniQuillLarge.scala

[flame graph screenshot]

I also have a smaller query that is not a good benchmarking example, but it is a good reference for what the most basic kind of query looks like:
https://github.com/deusaquilus/miniquill/blob/master/src/main/scala/org/stuff/UseMiniQuill.scala

Please have a look at MiniQuill.scala and UseMiniQuillLarge.scala and use them in your benchmarking tests. I think they are a good example of how not only Quill but many other macro-based frameworks will use macros.

@deusaquilus
Contributor Author

Also, please keep in mind that the queries in the MiniQuillLarge.scala file's Queries object can either be in the same file or spread across multiple files. Both situations occur in Quill codebases.

@deusaquilus
Contributor Author

deusaquilus commented Apr 28, 2021

Also, feel free to play around with the number of parameters in the LargeModel object if that helps. If this is a benchmark you want to run hundreds of times, it might be useful to decrease them. That will, however, most likely make the unpickler look like a smaller part of the latency.

@nicolasstucki nicolasstucki linked a pull request Apr 28, 2021 that will close this issue
@nicolasstucki
Contributor

Speed improvements after #12242 can be seen here https://dotty-bench2.epfl.ch/12242/

@deusaquilus
Contributor Author

deusaquilus commented May 3, 2021

This is awesome! In my MiniQuill example, the unpickle cache brings it down from 13ms to 6.5ms.

Before:
[screenshot: before]

After:
[screenshot: after]
