[JIT Variant] Non-streaming Pandas approach #1014

dshemetov · 2022-10-31T20:41:54Z

A version of #646 without streaming.

Prerequisites:

- Unless it is a documentation hotfix, it should be merged against the dev branch
- Branch is up-to-date with the branch to be merged with, i.e. dev
- Build is successful
- Code is cleaned up and formatted

Summary

The Python iterator-based approach to JIT has proven to be 3-5x slower than reading those signals from the database. It is well known that a) Python for-loops are slow and b) nested iterators incur stack overhead, which slows computation down. This PR attempts to work around both by using Pandas to do the time series computations on a per-signal basis:

- Pandas DataFrame computations call fast C code instead of Python for-loops
- loading the data into a DataFrame incurs a memory cost, but removes the nested iterator stack overhead
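
A minimal sketch of the contrast described above, assuming a hypothetical long-format frame with `geo_value`, `time_value`, and `value` columns. The column names, the function names, and the 7-row trailing mean are illustrative assumptions, not code from this PR:

```python
import pandas as pd


def smooth_loop(rows, window=7):
    """Pure-Python trailing mean per geo, one row at a time.

    `rows` is an iterable of (geo, day, value) tuples, already ordered by day
    within each geo. This mimics the per-row work an iterator pipeline does.
    """
    out = []
    history = {}
    for geo, day, value in rows:
        buf = history.setdefault(geo, [])
        buf.append(value)
        if len(buf) > window:
            buf.pop(0)
        out.append((geo, day, sum(buf) / len(buf)))
    return out


def smooth_pandas(df, window=7):
    """Same trailing mean, delegated to pandas' compiled rolling code."""
    df = df.sort_values(["geo_value", "time_value"])
    smoothed = (
        df.groupby("geo_value")["value"]
        .rolling(window, min_periods=1)
        .mean()
        .reset_index(level=0, drop=True)  # drop the geo level, keep the row index
    )
    return df.assign(value=smoothed)
```

The pandas version pays for materializing the whole frame in memory, which is the trade-off noted in the second bullet.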

dshemetov · 2022-10-31T21:41:28Z

Docker image tags can't have forward slashes in them, so use the tag name jit-pandas for this one.
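
For illustration, a branch name such as `ds/jit-pandas` is not a valid tag as-is. The helpers below are a hypothetical sketch, not the CI logic used in this repository: one checks a candidate against Docker's tag grammar, the other derives a slash-free tag from a branch name.

```python
import re

# Docker tag grammar: up to 128 characters from [A-Za-z0-9_.-],
# where the first character may not be '.' or '-'.
_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def is_valid_docker_tag(tag: str) -> bool:
    """Return True if `tag` matches the Docker tag grammar."""
    return _TAG_RE.fullmatch(tag) is not None


def tag_from_branch(branch: str) -> str:
    """Hypothetical helper: keep only the part after the last '/'."""
    return branch.rsplit("/", 1)[-1]
```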

* sorted requirements.txt files
* removed duplicated requirements from ./dev/docker/python/requirements.txt
* reduce runs of "pip install" when creating "delphi_web_python" docker image
* renamed requirements.txt to requirements.api.txt
* merge dev/docker/python/requirements.txt with requirements.dev.txt
* deduplicate packages in requirements.api.txt and requirements.dev.txt
* pinned packages in requirements.dev.txt and removed unused

Co-authored-by: Dmitry Shemetov <[email protected]>

dshemetov requested a review from melange396 October 31, 2022 20:42

dshemetov changed the base branch from dev to jit_computations October 31, 2022 20:44

dshemetov force-pushed the ds/jit-pandas branch from 6659b74 to b9ea7a3 Compare November 9, 2022 23:21

dshemetov mentioned this pull request Nov 9, 2022

[JIT Variant] Split multi-signal queries into many single-signal queries #1026

Closed

4 tasks

melange396 reviewed Nov 10, 2022

integrations/server/test_covidcast.py

dshemetov marked this pull request as draft November 11, 2022 23:39

dshemetov force-pushed the jit_computations branch 2 times, most recently from bd40689 to 04df084 Compare December 5, 2022 21:41

dshemetov force-pushed the ds/jit-pandas branch from a41db9d to 9bc035a Compare December 5, 2022 21:53

dshemetov and others added 20 commits December 5, 2022 15:37

System: update delphi_web_python Docker image

6d82286

- merge operations repo delphi_python Dockerfile into delphi_web_python
- copy Python requirements file to this directory
- copy setup.sh to this directory

devtools: remove unused Docker images from Makefile

33037dc

CI: remove unused Docker images from build

9a3c07d

System: remove stale comment from Dockerfile

96153db

Co-authored-by: melange396 <[email protected]>

Server: add CovidcastRow helper class for testing

869ff21

Server: update csv_to_database to use CovidcastRow

db3405c

Server: update test_db to use CovidcastRow

f46b7a2

Server: update test_delete_batch to use CovidcastRow

9a609e9

Server: update test_delphi_epidata to use CovidcastRow

1032e5b

Server: update test_covidcast_endpoints to use CovidcastRow

c106430

Server: update test_covidcast to use CovidcastRow

c2bcbb0

Server: update test_utils to use CovidcastRow

357b3af

Server: update TimePair to auto-sort tuples

7e5cc0f

Server: minor model.py data_source_by_id name update

d23d599

Server: update csv issue none handling

b707a66

Server: add type hints to _query

ca4e50d

Acquisition: update test_csv_uploading to remove Pandas warning

98e45b7

Server: add PANDAS_DTYPES to model.py

bc40d33

Docker: add more_itertools==8.4.0 to Python and API images

c391a28

dshemetov and others added 2 commits December 5, 2022 17:58

JIT: major feature commit

828836d

* add smooth_diff
* add model updates
* add /trend endpoint
* add /trendseries endpoint
* add /csv endpoint
* params with utility functions
* update date utility functions

CI: Build a container image from this branch

8f2bdaf

dshemetov force-pushed the jit_computations branch from 4e3c261 to 8f2bdaf Compare December 6, 2022 01:58

dshemetov force-pushed the ds/jit-pandas branch from 9bc035a to de9c56e Compare December 6, 2022 02:07

dshemetov added 2 commits December 6, 2022 15:31

JIT: add Pandas JIT approach with optimizations

c1d7ca2

CI: Update to build a JIT Pandas image

87a3a60

dshemetov force-pushed the ds/jit-pandas branch from de9c56e to 87a3a60 Compare December 6, 2022 23:31

dshemetov added 2 commits December 7, 2022 15:47

JIT: Push new approach

dc171a9

- bypass run_query, use as_pandas
- bypass APrinter
- write hacky APrintery logic in handle "/" and pass tests

JIT: update to not groupby geo again

2d9f84f

dshemetov force-pushed the jit_computations branch from 8f2bdaf to 2391d10 Compare December 8, 2022 22:11

dshemetov added 7 commits December 9, 2022 14:14

JIT: improve reindex_iterable

69ccf51

Do concatenation all together

e5d4bf2

Add reindex_iterable2

23f608a

Do the row assignment step all in one

67a106b

Remove old code

dd5ddc4

JIT: try new Pandas approach

59f7e49

- make multiple db requests for multi-signal queries
- base signals pass through without jit code
- derived signals load a smaller Pandas df this time

CI: update to build new Docker image

11680ed

dshemetov mentioned this pull request Jan 28, 2023

[JIT Variant] Improve Pandas performance by loading fewer columns into the main DataFrame #1073

Merged

4 tasks

dshemetov added 9 commits January 27, 2023 17:48

CI: update to account for typo in branch name

eaabc19

Partial code cleanup, derived_signals_map bugfix

4d2967a

fix bug

d286f3a

Better handle JIT routing

7f9c8e8

- queries for only base signals bypass JIT
- queries that mix base and derived signals are split between the two in a new handler
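
A rough sketch of the routing idea in this commit message. The mapping, signal names, and route labels below are hypothetical placeholders; the PR's actual handler and its derived-signal map are not shown on this page:

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from a derived signal name to the base signal it is
# computed from. The real mapping lives in the server code and may differ.
DERIVED_TO_BASE: Dict[str, str] = {
    "cases_7d_avg": "cases",
    "cases_diff": "cases",
}


def split_signals(requested: List[str]) -> Tuple[List[str], List[str]]:
    """Partition requested signal names into (base, derived)."""
    base = [s for s in requested if s not in DERIVED_TO_BASE]
    derived = [s for s in requested if s in DERIVED_TO_BASE]
    return base, derived


def route(requested: List[str]) -> str:
    """Pick a handling path for a query (labels are illustrative)."""
    base, derived = split_signals(requested)
    if not derived:
        return "direct"  # base-only query: read rows as-is, skip JIT
    if not base:
        return "jit"  # derived-only query: compute everything with JIT
    return "mixed"  # both kinds: handle each part separately, then combine
```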

New JIT optimization:

2b5d546

- avoid contiguous indexing by using TimeOffset in rolling
- accept time gaps for diff
- pull DataFrame construction into the groupby for loop
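
One possible reading of the first two items, sketched with made-up data: an offset-based rolling window works directly on a `DatetimeIndex` with missing days, so the series does not have to be reindexed to a contiguous date range first. The series, window length, and variable names are assumptions for illustration only:

```python
import pandas as pd

# Daily series with a gap: 2022-01-04 and 2022-01-05 are missing.
idx = pd.to_datetime(
    ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-06", "2022-01-07"]
)
s = pd.Series([1.0, 2.0, 3.0, 6.0, 7.0], index=idx)

# Offset-based window: each window covers the trailing 3 calendar days
# ending at (and including) the current timestamp, regardless of how many
# rows fall inside it.
mean_3d = s.rolling("3D").mean()

# Row-wise diff that tolerates gaps: each value minus the previous
# available row, whatever its date.
diff_rows = s.diff()
```

With a fixed integer window, the value at 2022-01-06 would average in rows from before the gap. The offset window only sees rows that fall within the calendar span.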

More optimizations:

205636f

* remove an extra set_df_dtypes call at the end of JIT
* remove an extra df.copy() in set_df_dtypes
* use a single .astype() call in set_df_dtypes
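
A sketch of what casting with a single `.astype()` call can look like. The dtype mapping and function below are hypothetical stand-ins; the real `PANDAS_DTYPES` and `set_df_dtypes` in this PR may differ:

```python
import pandas as pd

# Hypothetical column-to-dtype mapping, for illustration only.
EXAMPLE_DTYPES = {
    "geo_value": "string",
    "time_value": "int64",
    "value": "float64",
}


def set_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast all known columns in one astype call.

    astype returns a new DataFrame rather than modifying the input, so no
    explicit df.copy() is needed before casting.
    """
    present = {col: dtype for col, dtype in EXAMPLE_DTYPES.items() if col in df.columns}
    return df.astype(present)
```

Passing a dict to `astype` casts each listed column in one pass and leaves other columns untouched, instead of assigning column by column.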

Merge pull request #1073 from cmu-delphi/ds/jit-pandas-mutli-sql

7bc24d3

[JIT Variant] Improve Pandas performance by loading fewer columns into the main DataFrame

JIT: clean up optimizations

3aa2252

* fix issue, lag handling
* handle base signals in the JIT code together
* load different data for base signals vs. derived signals

Update the image name in ci.yaml

b431b9e

dshemetov changed the base branch from jit_computations to dev February 23, 2023 01:00

dshemetov mentioned this pull request Feb 23, 2023

[JIT] Implement non-streaming JIT with polars #1093

Draft

4 tasks

Remove TODO comment

f95432d
