[JIT] Implement non-streaming JIT with polars
#1093
base: dev
Conversation
- merge operations repo delphi_python Dockerfile into delphi_web_python
- copy Python requirements file to this directory
- copy setup.sh to this directory
Co-authored-by: melange396 <[email protected]>
* sorted requirements.txt files
* removed duplicated requirements from ./dev/docker/python/requirements.txt
* reduce runs of "pip install" when creating "delphi_web_python" docker image
* renamed requirements.txt to requirements.api.txt
* merge dev/docker/python/requirements.txt with requirements.dev.txt
* deduplicate packages in requirements.api.txt and requirements.dev.txt
* pinned packages in requirements.dev.txt and removed unused

Co-authored-by: Dmitry Shemetov <[email protected]>
* add smooth_diff
* add model updates
* add /trend endpoint
* add /trendseries endpoint
* add /csv endpoint
* params with utility functions
* update date utility functions
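The `smooth_diff` transform added above is not shown in this excerpt; a minimal sketch of what a smoothed-difference derived signal typically computes (the function name and signature here are illustrative, not the PR's actual code) might look like:

```python
import pandas as pd

def smooth_diff(values: pd.Series, window: int = 7) -> pd.Series:
    """Illustrative sketch: day-over-day change of a cumulative signal,
    averaged over a trailing window."""
    return values.diff().rolling(window, min_periods=1).mean()

# A toy cumulative series; diff() gives [NaN, 1, 2, 3, 4], then a
# 2-wide trailing mean smooths it.
cumulative = pd.Series([0, 1, 3, 6, 10], dtype=float)
print(smooth_diff(cumulative, window=2).tolist())
```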
- bypass run_query, use as_pandas
- bypass APrinter
- write hacky APrinter-style logic in handle "/" and pass tests
- make multiple db requests for multi-signal queries
- base signals pass through without jit code
- derived signals load a smaller Pandas df this time
- queries for only base signals bypass JIT
- queries for mixed base and derived signals split the two in a new handler
- avoid contiguous indexing by using TimeOffset in rolling
- accept time gaps for diff
- pull DataFrame construction into the groupby for loop
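The TimeOffset idea above can be sketched as follows (a toy example, not the PR's code): a time-based rolling window keyed on the date index tolerates gaps in the data, whereas a fixed integer window silently assumes contiguous rows.

```python
import pandas as pd

# Three observations with a gap between Jan 2 and Jan 10.
dates = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-10"])
s = pd.Series([1.0, 1.0, 1.0], index=dates)

# A "7D" offset window looks back 7 calendar days, so the 2021-01-10
# window contains only its own row despite being the third element.
print(s.rolling("7D").sum().tolist())
```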
* remove an extra set_df_dtypes call at the end of JIT
* remove an extra df.copy() in set_df_dtypes
* use a single .astype() call in set_df_dtypes
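The single-`.astype()` change above is roughly this pattern (a hypothetical sketch; the real `set_df_dtypes` signature may differ): one dictionary-valued `.astype()` call casts all columns at once instead of reassigning column by column.

```python
import pandas as pd

def set_df_dtypes(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Illustrative sketch: cast all requested columns in one .astype()
    call, skipping columns that are not present in the frame."""
    return df.astype({col: dt for col, dt in dtypes.items() if col in df.columns})

df = pd.DataFrame({"geo_value": ["pa"], "value": ["1.5"]})
out = set_df_dtypes(df, {"value": "float64", "missing_col": "int64"})
print(out.dtypes["value"])
```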
[JIT Variant] Improve Pandas performance by loading fewer columns into the main DataFrame
* fix issue and lag handling
* handle base signals in the JIT code together
* load different data for base signals vs derived
Update on the derived signals benchmark:
This PR is still marked as a draft, but FYI several of the commits in it seem to be rebased versions of commits dev already has. That makes it difficult to find the new functionality in the diff so that folks can provide feedback. Could you build a link to the files pane that highlights the commits you need feedback on most urgently? I made an attempt, but this one was too broad and this one was too narrow.
Sorry yea, this is definitely not ready for viewing rn. I'm going to rebase soon.
A variant of #1014 using Polars instead of Pandas.
Prerequisites:
- `dev` branch
Summary
A reimplementation of the internals of the JIT system using Polars. My local machine benchmarking showed that the Polars implementation might be 2x as fast as the Pandas one. We will see how it performs in Locust.
In general, our JIT approaches are bottlenecked by two main components: data loading and data dumping. Data loading refers to moving data from the database into a fast in-memory format (such as NumPy arrays, a Pandas DataFrame, or a Polars DataFrame). This is slow because the database result is first handled by Python, which traverses each row of the result (in a slow Python loop) and loads it into the new format. Data dumping requires us to convert back from the fast format into something we can send to the client. This is extremely slow if we need to convert back to a list of Python rows, and it is made slower still by the overhead of our API server's APrinter, which handles writing the results to JSON or CSV.
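The loading side of the bottleneck can be illustrated with a toy example (illustrative only; the column names are stand-ins, not the actual schema): a DB-API cursor yields Python row tuples, so every row passes through Python before the data lands in a columnar format.

```python
import pandas as pd

# Stand-in for cursor.fetchall(): a list of Python row tuples, which is
# what the database driver hands back before any fast format is involved.
rows = [("pa", 20200401, 1.5), ("ny", 20200401, 2.5)]

# from_records still walks every row in Python to build the DataFrame;
# this per-row traversal is the slow step described above.
df = pd.DataFrame.from_records(rows, columns=["geo_value", "time_value", "value"])
print(df.shape)
```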
Fortunately, both Pandas and Polars can serialize directly to JSON using a fast internal method (e.g. `pd.DataFrame.to_json(orient="records")` and `pl.DataFrame.write_json(row_oriented=True)`). This is what made #1014 competitive performance-wise. From my local benchmarking, it appears that most of the gains in this branch over the Pandas branch come from improvements to the speed of this step.

The drawback, however, is that because I bypassed APrinter, I rewrote a lot of the logic it handles. That part of the code will need to be addressed and double-checked again before release into production.
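As a small demonstration of the fast dump path (using the Pandas side here; the Polars call would be the `write_json(row_oriented=True)` analogue mentioned above), the DataFrame goes straight to row-oriented JSON without a Python-level row loop:

```python
import pandas as pd

df = pd.DataFrame({"geo_value": ["pa", "ny"], "value": [1.5, 2.5]})

# Row-oriented JSON produced by pandas internally, skipping the slow
# convert-back-to-Python-rows step that APrinter would otherwise incur.
print(df.to_json(orient="records"))
```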
TODO: benchmark a separate branch of `dev` that does not include JIT, but loads the query results into Polars and uses `write_json` to bypass APrinter. I suspect that this alone might be a substantial performance improvement.