Skip to content

[Backfill corrections] Use dplyr joins in add_7dav fn #1806

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

nmdefries
Copy link
Contributor

@nmdefries nmdefries commented Mar 17, 2023

Description

Switch to dplyr join functions in place of base::merge, for speed.

Changelog

  • preprocessing.R
  • NAMESPACE

Note: Row order seems to impact model fit. This PR changes the final row order compared to using merge, and it has differences in some model coefficients versus earlier versions. Most changes are small (coefficients are approximately equal, e.g. within 10e-8). Some models also have the same coef value but the associated predictor switched (e.g. from Wed_issue_coef = 0 and Sat_ref_coef = 0.683, to Wed_issue_coef = 0.683 and Sat_ref_coef = 0).

As a result, some predicted values are different. Avg WIS is slightly different as well.

That said, all this seems inconsequential; the original merge result was not sorted in any particular way. Technically merge sorts by the by columns, but the sort algorithm it uses appears to not be stable, so the rows get un-ordered WRT other non-by fields like lag and time_value.

@nmdefries
Copy link
Contributor Author

nmdefries commented Mar 21, 2023

I verified that using the old Reduce-merge approach also results in slightly different model fits/predictions if the sort order of the data is changed. The intermediate dfs of the old and new merge step are also identical once sorted, so it's only the sorting that changes the final result.

@nmdefries nmdefries requested a review from jingjtang March 21, 2023 15:52
@nmdefries
Copy link
Contributor Author

nmdefries commented Mar 21, 2023

Options on what to do with this code chunk:

  • Leave implementation as-is.
    • + no change in prediction/model output
    • - slowest option
    • - sort order is pseudo-random and arbitrary. Any other code change that changes the sort order will result in the output changing.
  • Use new join-based implementation
    • + fast(est)
    • + df is not arbitrarily sorted
    • - results don't exactly match merge-based implementation
  • Use new join-based implementation, but sort it to match old implementation
    • + faster than old code
    • + results exactly match merge-based implementation
    • - not entirely sure how to recreate sort order from merge implementation. I guess I can take the sort code they're using?
    • - sort slows code down and is arbitrary. Any other code change that changes the sort order will result in the output changing.

@nmdefries
Copy link
Contributor Author

nmdefries commented Mar 21, 2023

My vote is for the second option. I don't really see any reason to try to make sure a new implementation maintains the same sort order and thus model/prediction output. The original sort order was arbitrary and unintentional. The results are very similar using different sort orders.

@jingjtang
Copy link
Contributor

My vote is for the second option. I don't really see any reason to try to make sure a new implementation maintains the same sort order and thus model/prediction output. The original sort order was arbitrary and unintentional. The results are very similar using different sort orders.

Agree. Vote for the second one as well!

@nmdefries
Copy link
Contributor Author

Output before and after the included code changes. The R script loads the files, and finds and compares differing columns.

20230320_new_reciving.tar.gz
20230320_old_reciving.tar.gz
comparing_differing_output.R.txt (This is an R script, but GH wouldn't accept it with that file extension.)

@nmdefries nmdefries marked this pull request as ready for review March 22, 2023 22:33
@nmdefries
Copy link
Contributor Author

@jingjtang This is ready for review.

Copy link
Contributor

@jingjtang jingjtang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nmdefries
Copy link
Contributor Author

@krivard This is ready to merge

@krivard krivard merged commit cad3600 into ndefries/backfill/speed2 Mar 27, 2023
@krivard krivard deleted the ndefries/backfill/speed-join-v-merge-order-matters branch March 27, 2023 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants