[JIT Variant] Non-streaming Pandas approach #1014
Draft: dshemetov wants to merge 45 commits into dev from ds/jit-pandas
Docker image tags can't have forward slashes in them, so use the tag name
6659b74 to b9ea7a3
melange396 reviewed on Nov 10, 2022
bd40689 to 04df084
a41db9d to 9bc035a
- merge operations repo delphi_python Dockerfile into delphi_web_python
- copy Python requirements file to this directory
- copy setup.sh to this directory
Co-authored-by: melange396 <[email protected]>
* sorted requirements.txt files
* removed duplicated requirements from ./dev/docker/python/requirements.txt
* reduce runs of "pip install" when creating "delphi_web_python" docker image
* renamed requirements.txt to requirements.api.txt
* merge dev/docker/python/requirements.txt with requirements.dev.txt
* deduplicate packages in requirements.api.txt and requirements.dev.txt
* pinned packages in requirements.dev.txt and removed unused
Co-authored-by: Dmitry Shemetov <[email protected]>
* add smooth_diff
* add model updates
* add /trend endpoint
* add /trendseries endpoint
* add /csv endpoint
* params with utility functions
* update date utility functions
4e3c261 to 8f2bdaf
9bc035a to de9c56e
de9c56e to 87a3a60
- bypass run_query, use as_pandas
- bypass APrinter
- write hacky APrintery logic in handle "/" and pass tests
8f2bdaf to 2391d10
- make multiple db requests for multi-signal queries
- base signals pass through without jit code
- derived signals load a smaller Pandas df this time
- queries for only base signals bypass JIT
- queries mixing base and derived signals are split between the two in a new handler
- avoid contiguous indexing by using TimeOffset in rolling
- accept time gaps for diff
- pull DataFrame construction into the groupby for loop
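The first and third bullets above can be illustrated with a minimal sketch. It assumes a long-format frame with hypothetical `geo_value`, `time_value`, and `value` columns and a hypothetical helper name `smooth_by_offset`; none of this is taken from the PR's actual code. Passing an offset string such as `"2D"` to `rolling` makes the window span calendar time rather than a fixed number of rows, so a missing day does not get silently bridged.

```python
import pandas as pd

# Hypothetical data: "ca" is missing 2022-01-03, "ny" has no gaps.
df = pd.DataFrame(
    {
        "geo_value": ["ca", "ca", "ca", "ny", "ny", "ny"],
        "time_value": pd.to_datetime(
            ["2022-01-01", "2022-01-02", "2022-01-04",
             "2022-01-01", "2022-01-02", "2022-01-03"]
        ),
        "value": [1.0, 2.0, 4.0, 10.0, 20.0, 30.0],
    }
)


def smooth_by_offset(df, window="2D"):
    """Rolling mean per geo_value over a time-based window (illustrative only)."""
    pieces = []
    for geo, group in df.groupby("geo_value", sort=False):
        # A DatetimeIndex lets rolling() accept an offset such as "2D".
        series = group.set_index("time_value")["value"].sort_index()
        smoothed = series.rolling(window).mean()
        # Build the output frame inside the loop, one group at a time.
        pieces.append(
            pd.DataFrame(
                {
                    "geo_value": geo,
                    "time_value": smoothed.index,
                    "value": smoothed.to_numpy(),
                }
            )
        )
    return pd.concat(pieces, ignore_index=True)


result = smooth_by_offset(df)
```

With the offset window, the 2022-01-04 row for "ca" averages only itself, because 2022-01-02 falls outside the two-day span. An integer window of 2 would instead average it with the 2022-01-02 row, straddling the gap.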
* remove an extra set_df_dtypes call at the end of JIT
* remove an extra df.copy() in set_df_dtypes
* use a single .astype() call in set_df_dtypes
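As a rough sketch of the last bullet, `DataFrame.astype` accepts a column-to-dtype mapping, so all conversions can happen in one call that already returns a new frame. The function body, the `DTYPES` mapping, and the column names below are assumptions for illustration; the PR's actual `set_df_dtypes` may look different.

```python
import pandas as pd

# Hypothetical column-to-dtype mapping, not the PR's real schema.
DTYPES = {"time_value": "int64", "value": "float64"}


def set_df_dtypes(df, dtypes=DTYPES):
    """Cast all known columns in one astype() call (illustrative only)."""
    # Only cast columns that are actually present in this frame.
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    # astype() returns a new DataFrame, so no explicit df.copy() is needed.
    return df.astype(present)


raw = pd.DataFrame(
    {
        "geo_value": ["ca", "ny"],
        "time_value": ["20220101", "20220102"],
        "value": ["1.5", "2"],
    }
)
typed = set_df_dtypes(raw)
```

One mapping-based call avoids a separate pass and an intermediate frame for each column, which is the kind of overhead the bullets above remove.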
[JIT Variant] Improve Pandas performance by loading fewer columns into the main DataFrame
* fix issue, lag handling
* handle base signals in the JIT code together
* load different data for base signals vs derived
A version of #646 without streaming.
Summary
The Python iterator-based approach to JIT has proven to be 3-5x slower than reading those signals from the database. Two well-known factors contribute: a) Python for-loops are slow, and b) nested iterators incur stack overhead, which slows computations down further. This PR attempts to work around both by using Pandas to do the time series computations on a per-signal basis: