-
Notifications
You must be signed in to change notification settings - Fork 16
[Backfill corrections] Use dplyr
joins in add_7dav
fn
#1806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Backfill corrections] Use dplyr
joins in add_7dav
fn
#1806
Conversation
…join-v-merge-order-matters
…join-v-merge-order-matters
I verified that using the old |
Options on what to do with this code chunk:
|
My vote is for the second option. I don't really see any reason to try to make sure a new implementation maintains the same sort order and thus model/prediction output. The original sort order was arbitrary and unintentional. The results are very similar using different sort orders. |
Agree. Vote for the second one as well! |
Output before and after the included code changes. The R script loads the files, and finds and compares differing columns. 20230320_new_reciving.tar.gz |
@jingjtang This is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@krivard This is ready to merge |
Description
Switch to
dplyr
join functions in place ofbase::merge
, for speed.Changelog
preprocessing.R
Note: Row order seems to impact model fit. This PR changes the final row order compared to using
merge
, and it has differences in some model coefficients versus earlier versions. Most changes are small (coefficients are approximately equal, e.g. within 10e-8). Some models also have the same coef value but the associated predictor switched (e.g. fromWed_issue_coef = 0
andSat_ref_coef = 0.683
, toWed_issue_coef = 0.683
andSat_ref_coef = 0
).As a result, some predicted values are different. Avg WIS is slightly different as well.
That said, all this seems inconsequential; the original
merge
result was not sorted in any particular way. Technicallymerge
sorts by theby
columns, but the sort algorithm it uses appears to not be stable, so the rows get un-ordered WRT other non-by
fields likelag
andtime_value
.