Find the sources of the conflicts with most recent version of epiprocess #104

rachlobay · 2022-07-19T00:49:01Z

In DESCRIPTION, had to change the version of epiprocess used to to older version dajmcdon/epiprocess else many errors like these from this R CMD Check are encountered. Possibly due to [.epi_df that was introduced in epiprocess. Should try to find the sources of the conflicts with the most recent version of epiprocess that is available on Github.

The text was updated successfully, but these errors were encountered:

mgyliu · 2022-07-19T21:29:53Z

This comment addresses the following error in the R CMD check:

Error in `dplyr::mutate()`:
! Problem while computing `time_value = time_value + shift_val`.
Caused by error in `time_value + shift_val`:
! non-numeric argument to binary operator

Rachel and I had a bit of a debugging session to try and figure this out.
TLDR; we think the issue lies in how epi_df handles indexing, specifically with column selection.

Removing [.epi_df from the epiprocess NAMESPACE file makes that error go away
Selecting columns from an epi_df using dplyr::select changes the ordering of the column names only, without changing the ordering of the column values, resulting in a misalignment (see below examples)
In epipredict, there is only one spot from which the error message originates - in epi_shift_single.
epi_shift_single gets called in step_epi_lag and step_epi_ahead.
When there are calls to fit where the recipe has those lag and ahead steps, I think it hits the epi_shift_single function and errors out due to the epi_df column selection issues.

I put a browser() in the first line of epi_shift.R:epi_shift_single and ran devtools::test(). One of the tests that breaks is the last test in tests/testthat/test-step_epi_shift.R. The following console stuff is from that test.

After selecting columns from an epi_df, the time_value column contains geo_value values and vice versa!!!

Browse[1]> x 
An `epi_df` object, 20,496 x 4 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 20,496 x 4
   geo_value time_value case_rate death_rate
 * <chr>     <date>         <dbl>      <dbl>
 1 ak        2020-12-31      35.9      0.158
 2 al        2020-12-31      65.1      0.438
 3 ar        2020-12-31      66.0      1.27 
 4 as        2020-12-31       0        0    
 5 az        2020-12-31      76.8      1.10 
 6 ca        2020-12-31      96.0      0.751
 7 co        2020-12-31      35.8      0.649
 8 ct        2020-12-31      52.1      0.819
 9 dc        2020-12-31      31.0      0.601
10 de        2020-12-31      65.2      0.807
# ... with 20,486 more rows
# i Use `print(n = ...)` to see more rows
Browse[1]> x %>%
+     dplyr::select(tidyselect::all_of(c(key_cols, col)))
An `epi_df` object, 20,496 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 20,496 x 3
   time_value geo_value  death_rate 
 * <chr>      <date>          <dbl>
 1 ak         2020-12-31      0.158
 2 al         2020-12-31      0.438
 3 ar         2020-12-31      1.27 
 4 as         2020-12-31      0    
 5 az         2020-12-31      1.10 
 6 ca         2020-12-31      0.751
 7 co         2020-12-31      0.649
 8 ct         2020-12-31      0.819
 9 dc         2020-12-31      0.601
10 de         2020-12-31      0.807
# ... with 20,486 more rows
# i Use `print(n = ...)` to see more rows
Browse[1]>

The same thing happens when you select time_value and geo_value manually, in that order. But if you select them in the order in which they appear in the original epi_df, there are no isses.

Browse[1]> x %>% dplyr::select(c(time_value, geo_value))
An `epi_df` object, 20,496 x 2 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 20,496 x 2
   time_value geo_value 
 * <chr>      <date>    
 1 ak         2020-12-31
 2 al         2020-12-31
 3 ar         2020-12-31
 4 as         2020-12-31
 5 az         2020-12-31
 6 ca         2020-12-31
 7 co         2020-12-31
 8 ct         2020-12-31
 9 dc         2020-12-31
10 de         2020-12-31
# ... with 20,486 more rows
# i Use `print(n = ...)` to see more rows
Browse[1]>

The issue is that x, which is an epi_df, doesn't handle dplyr functions the way that we expect. Notice that the time_value and geo_value columns get swapped unexpectedly.

In epi_shift_single, dplyr::mutate gets called after the select. It tries to add time_value and a numeric, but since time_value now contains characters like "ak" and "al", R throws that "non-numeric argument to binary operator" error.

mgyliu · 2022-07-19T22:30:08Z

This comment addresses another error message in the R CMD check:

Failure (test-epi_recipe.R:52:3): epi_recipe formula works

This is an easier fix. The test uses expect_identical on two dataframes that are the same, except their rows are in different orders. The test fails because of the different row ordering. Example in this line, but this file has multiple calls to expect_identical.

Original test code:

expect_identical(r$var_info, ref_var_info)

Proposed changed test code:

expect_identical(dplyr::arrange(r$var_info,variable), dplyr::arrange(ref_var_info, variable))

The dplyr::arrange orders both dataframes by the same column to ensure the row ordering is correct.

I also propose looking into other places where expect_identical gets called. I suspect this isn't the only place the failure could happen. IMO a more robust way of testing this is to make assertions about specific rows or columns rather than entire dataframes. But the dplyr::arrange seems okay for small dataframes.

dajmcdon · 2022-07-20T01:08:07Z

Wow! Nice catch. What a strange behaviour.

rachlobay · 2022-07-20T16:10:32Z

Great! Thanks for posting the above comments. They definitely help! But it is probably best to address this from the root, if possible.

To look into the R CMD Check meltdown from using the most recent edition of epiprocess, Ken, Maggie, and I went through the merged pull requests from the past ~ two weeks (time frame’s based on when the Github versions of epipredict and epiprocess were not having issues) and found that the most troublesome change is[.epi_df); the other pull requests that were merged seemed to be mostly updating tests or documentation in epiprocess (so unlikely to cause such a problem). Based on how epirocess behaves when [.epi_df is present vs absent, I think that[.epi_df is causing these problems (& I expect there’ll be more trouble from it in the future, so we'll refine as necessary).

I was looking into a couple of ways to fix the two types of error messages shown in the R CMD check by going through the [.epi_df code line by line & running the tests that Maggie and I were looking at yesterday… The simplest soln to fix both issues (without messing with the new_epi_df code) looks to be to go from new_epi_df(tibble::as_tibble(res),… --> as_epi_df(res,… in there. So basically we gotta remove the call to construct a new epi_df from scratch… And going for this change makes sense because sometimes res (the epi_df subset we’re getting out of [.epi_df) already has the epi_df class to begin with, and as_epi_df will call new_epi_df if it is necessary, else just return as is. When I make this change all looks to be well between the most recent versions of epipredict and epiprocess (if someone else wants to check they get they intended effects, please do).

Now, if we want to keep that & not change it to as_epi_df, I briefly looked into what we’d likely need to do to new_epi_df... Removing the three lines in there where column reordering occurs seems to fix the errors involvingparsnip::fit() & that makes sense with what Maggie and I were seeing (the weird behaviour with var./col. ordering). Then, probably all we’d have to do to fix the other type of error messages is some manual column rearrangement in the tests, as Maggie was suggesting (because the columns of a new epi_df now wouldn't have a strict ordering anymore)... But I’d be more hesitant to go for this approach because I didn’t write that code that we’d be removing here (note that new_epi_df code is the original as_epi_df code) and maybe that ordering is there for a good reason that I am not considering.

mgyliu · 2022-07-21T22:37:06Z

Adding some notes from my discussion with Rachel for papertrail:

[.epi_df gets called multiple times through parsnip::fit
These lines in epi_df.R:new_dpi_df seem to do some column reordering, which could be part of the problem. I ran a debugger in here and those lines of code seem to reorder columns correctly. But good to keep this part of the code in mind for a future fix.
Current fix is to change new_epi_df(tibble::as_tibble(res),… --> as_epi_df(res,… which Rachel and I agree is a good solution for now.
We are still not sure what exactly is causing the column misnaming. The above fix is good for now. If we need to look into the exact root cause later, it might be good to look into the implementation of parsnip::fit to determine how it's making calls to [.epi_df.

For future reference, steps to reproduce:

This is the example we were working with

library(dplyr)
library(recipes)
library(parsnip)

jhu <- case_death_rate_subset %>%
  filter(time_value > "2021-11-01", geo_value %in% c("ak", "ca", "ny"))

r <- epi_recipe(jhu) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) 
#   %>%
#   step_epi_ahead(death_rate, ahead = 7) %>%
#   step_epi_lag(case_rate, lag = c(0, 7, 14)) %>%
#   step_naomit(all_predictors()) %>%
#   step_naomit(all_outcomes(), skip = TRUE)

wf <- epi_workflow(r, linear_reg()) 
fit <- wf %>% parsnip::fit(jhu) 
library(dplyr)
library(recipes)
library(parsnip)

jhu <- case_death_rate_subset %>%
  filter(time_value > "2021-11-01", geo_value %in% c("ak", "ca", "ny"))

r <- epi_recipe(jhu) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) 
#   %>%
#   step_epi_ahead(death_rate, ahead = 7) %>%
#   step_epi_lag(case_rate, lag = c(0, 7, 14)) %>%
#   step_naomit(all_predictors()) %>%
#   step_naomit(all_outcomes(), skip = TRUE)

wf <- epi_workflow(r, linear_reg()) 
fit <- wf %>% parsnip::fit(jhu) 
# wf

Place a debugger in [.epi_df
Load epiprocess locally with load_all("../epiprocess") (or replace with your relative path to local epiprocess directory)
Run the example code above. Notice that the debugger does not get hit until the parsnip::fit line. Notice that the debugger is hit multiple times in that line.
Can also place a debugger in epi_shift_single to see the column reordering issue from my first comment in this thread.

rachlobay · 2022-07-26T07:46:19Z

As Logan and I were talking about yesterday, it may also be good to look into what lines in parsnip::fit() are conflicting with the couple of problematic lines of code in new_epi_df(), because it is definitely possible that there could be problems cropping up from this in the future... I'll post an issue, but it is probably not urgent to get to as I think that we have more pressing problems to deal with right now.

dajmcdon · 2022-08-01T17:26:23Z

I messed around with this a bit yesterday. The (first?) specific issue (using the example in my check) is:

In epi_shift_single(), line 35 in my code: dplyr::select(tidyselect::all_of(c(key_cols, col))).
The result of this line switches the NAMES time_value and geo_value for some reason. The result being that time_value then contains <chr> data with states while geo_value contains the dates.
Trying to add an integer shift_val to the character bombs.

dajmcdon · 2022-08-03T16:35:43Z

Can @rachlobay or @mgyliu open a PR that makes the simple change?

mgyliu · 2022-08-03T18:35:47Z

To confirm, the simple change is just going from new_epi_df(tibble::as_tibble(res),... to as_epi_df(res... in [.epi_df?
@rachlobay did you already have a branch with the change implemented? If not I'm happy to take it

rachlobay · 2022-08-03T21:44:51Z

Yeah. I got a branch with that change... I also want to simplify the [.epi_df code in this PR a bit as per feedback from Logan, so I'll field this one. Expect a PR to be up for this within the next few hours (over on epiprocess as that's where the changes'll be)

dajmcdon added bug Something isn't working P0 do this immediately labels Jul 20, 2022

rachlobay mentioned this issue Jul 26, 2022

Look into parsnip::fit() conflict with new_epi_df()? #111

Closed

dajmcdon assigned mgyliu and rachlobay Aug 3, 2022

rachlobay mentioned this issue Aug 3, 2022

Addressed the sources of the conflicts between epipredict and epiprocess (see Issue 104 in epipredict) cmu-delphi/epiprocess#185

Merged

rachlobay closed this as completed in cmu-delphi/epiprocess#185 Aug 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Find the sources of the conflicts with most recent version of epiprocess #104

Find the sources of the conflicts with most recent version of epiprocess #104

rachlobay commented Jul 19, 2022

mgyliu commented Jul 19, 2022 •

edited

Loading

mgyliu commented Jul 19, 2022 •

edited

Loading

dajmcdon commented Jul 20, 2022

rachlobay commented Jul 20, 2022 •

edited

Loading

mgyliu commented Jul 21, 2022 •

edited

Loading

rachlobay commented Jul 26, 2022

dajmcdon commented Aug 1, 2022

dajmcdon commented Aug 3, 2022

mgyliu commented Aug 3, 2022

rachlobay commented Aug 3, 2022 •

edited

Loading

Find the sources of the conflicts with most recent version of epiprocess #104

Find the sources of the conflicts with most recent version of epiprocess #104

Comments

rachlobay commented Jul 19, 2022

mgyliu commented Jul 19, 2022 • edited Loading

mgyliu commented Jul 19, 2022 • edited Loading

dajmcdon commented Jul 20, 2022

rachlobay commented Jul 20, 2022 • edited Loading

mgyliu commented Jul 21, 2022 • edited Loading

rachlobay commented Jul 26, 2022

dajmcdon commented Aug 1, 2022

dajmcdon commented Aug 3, 2022

mgyliu commented Aug 3, 2022

rachlobay commented Aug 3, 2022 • edited Loading

mgyliu commented Jul 19, 2022 •

edited

Loading

mgyliu commented Jul 19, 2022 •

edited

Loading

rachlobay commented Jul 20, 2022 •

edited

Loading

mgyliu commented Jul 21, 2022 •

edited

Loading

rachlobay commented Aug 3, 2022 •

edited

Loading