[Backfill corrections] Use `dplyr` joins in `add_7dav` fn #1806

nmdefries · 2023-03-17T15:21:51Z

Description

Switch to dplyr join functions in place of base::merge, for speed.
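As a rough illustration of the swap (this is not the actual `add_7dav` code): a chain of `merge` calls over a list of data frames can be replaced with a chain of `dplyr` joins. The data frames, column names, and the choice of `full_join` below are assumptions made up for the example.

```r
library(dplyr)
library(purrr)

# Hypothetical inputs: three small data frames sharing the key columns.
keys <- c("time_value", "lag")
df_a <- data.frame(time_value = c(1, 1, 2), lag = c(0, 1, 0), value_a = c(10, 11, 12))
df_b <- data.frame(time_value = c(1, 2, 2), lag = c(0, 0, 1), value_b = c(20, 21, 22))
df_c <- data.frame(time_value = c(1, 2), lag = c(1, 0), value_c = c(30, 31))
dfs <- list(df_a, df_b, df_c)

# Old style: fold the list with base::merge (outer join on the keys).
merged_base <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), dfs)

# New style: fold the list with a dplyr join instead.
merged_dplyr <- purrr::reduce(dfs, function(x, y) dplyr::full_join(x, y, by = keys))
```

Both results contain the same rows and columns; only the row order may differ.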

Changelog

preprocessing.R
NAMESPACE

Note: Row order seems to affect model fit. This PR changes the final row order relative to the merge-based version, and as a result some model coefficients differ from earlier versions. Most changes are small (coefficients are approximately equal, e.g. within 10e-8). Some models also have the same coef value but with the associated predictor switched (e.g. from Wed_issue_coef = 0 and Sat_ref_coef = 0.683, to Wed_issue_coef = 0.683 and Sat_ref_coef = 0).

As a result, some predicted values are different. Avg WIS is slightly different as well.

That said, all this seems inconsequential; the original merge result was not sorted in any particular way. Technically, merge sorts by the `by` columns, but the sort algorithm it uses does not appear to be stable, so the rows end up unordered with respect to non-`by` fields like lag and time_value.
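The ordering difference can be seen on a toy example. This is a sketch with made-up data, not the pipeline's data: `merge()` sorts its result on the `by` column by default (`sort = TRUE`), while `dplyr::left_join()` keeps the row order of its first argument.

```r
library(dplyr)

# Hypothetical data with keys in different, unsorted orders.
x <- data.frame(key = c("b", "a", "c"), x_val = 1:3)
y <- data.frame(key = c("c", "b", "a"), y_val = 4:6)

# base::merge reorders the result by the key column.
merge_keys <- merge(x, y, by = "key")$key

# dplyr::left_join follows the row order of x.
join_keys <- dplyr::left_join(x, y, by = "key")$key

print(merge_keys)
print(join_keys)
```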

nmdefries · 2023-03-21T15:52:42Z

I verified that using the old Reduce-merge approach also results in slightly different model fits/predictions if the sort order of the data is changed. The intermediate dfs of the old and new merge step are also identical once sorted, so it's only the sorting that changes the final result.
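A check along these lines can be sketched as follows. The helper name `same_after_sorting` and the example columns are hypothetical; this is one way to compare two data frames while ignoring row order, not necessarily how the check was actually run.

```r
library(dplyr)

# Hypothetical helper: TRUE if two data frames hold the same rows,
# ignoring row order and row names.
same_after_sorting <- function(df_old, df_new) {
  cols <- sort(names(df_old))
  if (!identical(cols, sort(names(df_new)))) {
    return(FALSE)
  }
  # Put columns in a common order, then sort rows by every column.
  old_sorted <- dplyr::arrange(df_old[, cols, drop = FALSE], dplyr::across(dplyr::everything()))
  new_sorted <- dplyr::arrange(df_new[, cols, drop = FALSE], dplyr::across(dplyr::everything()))
  isTRUE(all.equal(old_sorted, new_sorted, check.attributes = FALSE))
}

# Made-up example: same rows, different order.
old_df <- data.frame(time_value = c(1, 1, 2), lag = c(0, 1, 0), value = c(5, 6, 7))
new_df <- old_df[c(3, 1, 2), ]

same_after_sorting(old_df, new_df)
```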

nmdefries · 2023-03-21T16:02:00Z

Options on what to do with this code chunk:

1. Leave implementation as-is.
   - + no change in prediction/model output
   - - slowest option
   - - sort order is pseudo-random and arbitrary. Any other code change that changes the sort order will result in the output changing.
2. Use new join-based implementation
   - + fast(est)
   - + df is not arbitrarily sorted
   - - results don't exactly match merge-based implementation
3. Use new join-based implementation, but sort it to match old implementation
   - + faster than old code
   - + results exactly match merge-based implementation
   - - not entirely sure how to recreate sort order from merge implementation. I guess I can take the sort code they're using?
   - - sort slows code down and is arbitrary. Any other code change that changes the sort order will result in the output changing.

nmdefries · 2023-03-21T16:03:50Z

My vote is for the second option. I don't really see any reason to try to make sure a new implementation maintains the same sort order and thus model/prediction output. The original sort order was arbitrary and unintentional. The results are very similar using different sort orders.

jingjtang · 2023-03-22T19:17:56Z

> My vote is for the second option. I don't really see any reason to try to make sure a new implementation maintains the same sort order and thus model/prediction output. The original sort order was arbitrary and unintentional. The results are very similar using different sort orders.

Agree. Vote for the second one as well!

nmdefries · 2023-03-22T22:33:27Z

Attached is the output from before and after the included code changes. The R script loads the files, then finds and compares the columns that differ.

20230320_new_reciving.tar.gz
20230320_old_reciving.tar.gz
comparing_differing_output.R.txt (This is an R script, but GH wouldn't accept it with that file extension.)
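For readers without the attachments, the "find differing columns" step might look roughly like the sketch below. This is not the attached script; the helper `differing_columns`, its `tolerance` argument, and the example column names are all hypothetical.

```r
# Hypothetical helper: names of the columns whose values differ between two
# data frames that have the same columns and the same row order.
differing_columns <- function(df_old, df_new, tolerance = 0) {
  stopifnot(
    identical(names(df_old), names(df_new)),
    nrow(df_old) == nrow(df_new)
  )
  differs <- vapply(names(df_old), function(col) {
    old <- df_old[[col]]
    new <- df_new[[col]]
    if (is.numeric(old) && is.numeric(new)) {
      # Numeric columns differ if any value moves by more than the tolerance,
      # or if the missing-value pattern changes.
      any(abs(old - new) > tolerance, na.rm = TRUE) || any(is.na(old) != is.na(new))
    } else {
      !identical(old, new)
    }
  }, logical(1))
  names(df_old)[differs]
}

# Made-up outputs from an "old" and a "new" run.
old_out <- data.frame(geo_value = c("pa", "ny"), predicted = c(1.00, 2.00), wis = c(0.50, 0.70))
new_out <- data.frame(geo_value = c("pa", "ny"), predicted = c(1.00, 2.01), wis = c(0.50, 0.70))

differing_columns(old_out, new_out)
```

Passing a nonzero `tolerance` would let small numeric differences, like the coefficient changes described earlier, be ignored.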

nmdefries · 2023-03-22T22:33:47Z

@jingjtang This is ready for review.

jingjtang

LGTM

nmdefries · 2023-03-27T11:50:23Z

@krivard This is ready to merge

nmdefries added 5 commits March 2, 2023 11:17

use joins instead of merge

0dd2c9d

remove test code

849834e

Merge branch 'ndefries/backfill/speed2' into ndefries/backfill/speed-…

678a7e4

…join-v-merge-order-matters

use purrr::reduce instead of base::Reduce

c1a9e10

Merge branch 'ndefries/backfill/speed2' into ndefries/backfill/speed-…

802d11b

…join-v-merge-order-matters

nmdefries requested a review from jingjtang March 21, 2023 15:52

nmdefries marked this pull request as ready for review March 22, 2023 22:33

jingjtang approved these changes Mar 27, 2023

krivard merged commit cad3600 into ndefries/backfill/speed2 Mar 27, 2023

krivard deleted the ndefries/backfill/speed-join-v-merge-order-matters branch March 27, 2023 13:51

nmdefries mentioned this pull request Mar 27, 2023

[Backfill corrections] Incorporate changes from 1806 and 1808 #1814

Merged

krivard mentioned this pull request Mar 29, 2023

Release covidcast-indicators 0.3.34 #1816

Merged
