[JIT] Implement non-streaming JIT with polars
#1093
base: dev
Conversation
- merge operations repo delphi_python Dockerfile into delphi_web_python
- copy Python requirements file to this directory
- copy setup.sh to this directory
Co-authored-by: melange396 <[email protected]>
* sorted requirements.txt files
* removed duplicated requirements from ./dev/docker/python/requirements.txt
* reduce runs of "pip install" when creating "delphi_web_python" docker image
* renamed requirements.txt to requirements.api.txt
* merge dev/docker/python/requirements.txt with requirements.dev.txt
* deduplicate packages in requirements.api.txt and requirements.dev.txt
* pinned packages in requirements.dev.txt and removed unused

Co-authored-by: Dmitry Shemetov <[email protected]>
* add smooth_diff
* add model updates
* add /trend endpoint
* add /trendseries endpoint
* add /csv endpoint
* params with utility functions
* update date utility functions
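The `smooth_diff` transform added above is not shown in this excerpt; a minimal sketch of what a smoothed-difference derived signal typically computes (the function name and signature here are illustrative, not the PR's actual code) might look like:

```python
import pandas as pd

def smooth_diff(values: pd.Series, window: int = 7) -> pd.Series:
    """Illustrative sketch: day-over-day change of a cumulative signal,
    averaged over a trailing window."""
    return values.diff().rolling(window, min_periods=1).mean()

# A toy cumulative series; diff() gives [NaN, 1, 2, 3, 4], then a
# 2-wide trailing mean smooths it.
cumulative = pd.Series([0, 1, 3, 6, 10], dtype=float)
print(smooth_diff(cumulative, window=2).tolist())
```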
- bypass run_query, use as_pandas
- bypass APrinter
- write hacky APrinter-style logic in handle "/" and pass tests
- make multiple db requests for multi-signal queries
- base signals pass through without jit code
- derived signals load a smaller Pandas df this time
- queries for only base signals bypass JIT
- queries for mixed base and derived signals split the two in a new handler
- avoid contiguous indexing by using TimeOffset in rolling
- accept time gaps for diff
- pull DataFrame construction into the groupby for loop
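The TimeOffset idea above can be sketched as follows (a toy example, not the PR's code): a time-based rolling window keyed on the date index tolerates gaps in the data, whereas a fixed integer window silently assumes contiguous rows.

```python
import pandas as pd

# Three observations with a gap between Jan 2 and Jan 10.
dates = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-10"])
s = pd.Series([1.0, 1.0, 1.0], index=dates)

# A "7D" offset window looks back 7 calendar days, so the 2021-01-10
# window contains only its own row despite being the third element.
print(s.rolling("7D").sum().tolist())
```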
* remove an extra set_df_dtypes call at the end of JIT
* remove an extra df.copy() in set_df_dtypes
* use a single .astype() call in set_df_dtypes
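The single-`.astype()` change above is roughly this pattern (a hypothetical sketch; the real `set_df_dtypes` signature may differ): one dictionary-valued `.astype()` call casts all columns at once instead of reassigning column by column.

```python
import pandas as pd

def set_df_dtypes(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Illustrative sketch: cast all requested columns in one .astype()
    call, skipping columns that are not present in the frame."""
    return df.astype({col: dt for col, dt in dtypes.items() if col in df.columns})

df = pd.DataFrame({"geo_value": ["pa"], "value": ["1.5"]})
out = set_df_dtypes(df, {"value": "float64", "missing_col": "int64"})
print(out.dtypes["value"])
```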
[JIT Variant] Improve Pandas performance by loading fewer columns into the main DataFrame
* fix issue and lag handling
* handle base signals in the JIT code together
* load different data for base signals vs derived
Update on the derived signals benchmark:
This PR is still marked as a draft, but FYI several of the commits in it seem to be rebased versions of commits dev already has. That makes it difficult to find the new functionality in the diff so that folks can provide feedback. Could you build a link to the files pane that highlights the commits you need feedback on most urgently? I made an attempt, but this one was too broad and this one was too narrow.
Sorry yea, this is definitely not ready for viewing rn. I'm going to rebase soon.
A variant of #1014 using Polars instead of Pandas.
Prerequisites:
- `dev` branch
Summary
A reimplementation of the internals of the JIT system using Polars. My local machine benchmarking showed that the Polars implementation might be 2x as fast as the Pandas one. We will see how it performs in Locust.
In general, our JIT approaches are bottlenecked by two main components: data loading and data dumping. Data loading refers to moving data from the database into a fast in-memory format (such as NumPy arrays, a Pandas DataFrame, or a Polars DataFrame). This is slow because the database result is first handled by Python, which traverses each row of the result (in a slow Python loop) and loads it into the new format. Data dumping requires us to convert back from the fast format into something we can send to the client. This is extremely slow if we need to convert back to a list of Python rows, and it is made slower still by the overhead of our API server's APrinter, which handles writing the results to JSON or CSV.
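The loading side of the bottleneck can be illustrated with a toy example (illustrative only; the column names are stand-ins, not the actual schema): a DB-API cursor yields Python row tuples, so every row passes through Python before the data lands in a columnar format.

```python
import pandas as pd

# Stand-in for cursor.fetchall(): a list of Python row tuples, which is
# what the database driver hands back before any fast format is involved.
rows = [("pa", 20200401, 1.5), ("ny", 20200401, 2.5)]

# from_records still walks every row in Python to build the DataFrame;
# this per-row traversal is the slow step described above.
df = pd.DataFrame.from_records(rows, columns=["geo_value", "time_value", "value"])
print(df.shape)
```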
Fortunately, both Pandas and Polars can serialize directly to JSON using a fast internal method (e.g. `pd.DataFrame.to_json(orient="records")` and `pl.DataFrame.write_json(row_oriented=True)`). This is what made #1014 competitive performance-wise. From my local benchmarking, it appears that most of the gains in this branch over the Pandas branch come from improvements to the speed of this step.

The drawback, however, is that because I bypassed APrinter, I rewrote a lot of the logic it handles. That part of the code will need to be addressed and double-checked again before release into production.
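As a small demonstration of the fast dump path (using the Pandas side here; the Polars call would be the `write_json(row_oriented=True)` analogue mentioned above), the DataFrame goes straight to row-oriented JSON without a Python-level row loop:

```python
import pandas as pd

df = pd.DataFrame({"geo_value": ["pa", "ny"], "value": [1.5, 2.5]})

# Row-oriented JSON produced by pandas internally, skipping the slow
# convert-back-to-Python-rows step that APrinter would otherwise incur.
print(df.to_json(orient="records"))
```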
TODO: benchmark a separate branch of `dev` that does not include JIT, but loads the query results into Polars and uses `write_json` to bypass APrinter. I suspect that this alone might be a substantial performance improvement.