Skip to content

Find the sources of the conflicts with most recent version of epiprocess #104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rachlobay opened this issue Jul 19, 2022 · 10 comments · Fixed by cmu-delphi/epiprocess#185
Closed
Assignees
Labels
bug Something isn't working P0 do this immediately

Comments

@rachlobay
Copy link
Contributor

In DESCRIPTION, had to change the version of epiprocess used to to older version dajmcdon/epiprocess else many errors like these from this R CMD Check are encountered. Possibly due to [.epi_df that was introduced in epiprocess. Should try to find the sources of the conflicts with the most recent version of epiprocess that is available on Github.

@mgyliu
Copy link
Contributor

mgyliu commented Jul 19, 2022

This comment addresses the following error in the R CMD check:

Error in `dplyr::mutate()`:
! Problem while computing `time_value = time_value + shift_val`.
Caused by error in `time_value + shift_val`:
! non-numeric argument to binary operator

Rachel and I had a bit of a debugging session to try and figure this out.
TLDR; we think the issue lies in how epi_df handles indexing, specifically with column selection.

  • Removing [.epi_df from the epiprocess NAMESPACE file makes that error go away
  • Selecting columns from an epi_df using dplyr::select changes the ordering of the column names only, without changing the ordering of the column values, resulting in a misalignment (see below examples)
  • In epipredict, there is only one spot from which the error message originates - in epi_shift_single.
  • epi_shift_single gets called in step_epi_lag and step_epi_ahead.
  • When there are calls to fit where the recipe has those lag and ahead steps, I think it hits the epi_shift_single function and errors out due to the epi_df column selection issues.

I put a browser() in the first line of epi_shift.R:epi_shift_single and ran devtools::test(). One of the tests that breaks is the last test in tests/testthat/test-step_epi_shift.R. The following console stuff is from that test.

After selecting columns from an epi_df, the time_value column contains geo_value values and vice versa!!!

Browse[1]> x 
An `epi_df` object, 20,496 x 4 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 20,496 x 4
   geo_value time_value case_rate death_rate
 * <chr>     <date>         <dbl>      <dbl>
 1 ak        2020-12-31      35.9      0.158
 2 al        2020-12-31      65.1      0.438
 3 ar        2020-12-31      66.0      1.27 
 4 as        2020-12-31       0        0    
 5 az        2020-12-31      76.8      1.10 
 6 ca        2020-12-31      96.0      0.751
 7 co        2020-12-31      35.8      0.649
 8 ct        2020-12-31      52.1      0.819
 9 dc        2020-12-31      31.0      0.601
10 de        2020-12-31      65.2      0.807
# ... with 20,486 more rows
# i Use `print(n = ...)` to see more rows
Browse[1]> x %>%
+     dplyr::select(tidyselect::all_of(c(key_cols, col)))
An `epi_df` object, 20,496 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 20,496 x 3
   time_value geo_value  death_rate 
 * <chr>      <date>          <dbl>
 1 ak         2020-12-31      0.158
 2 al         2020-12-31      0.438
 3 ar         2020-12-31      1.27 
 4 as         2020-12-31      0    
 5 az         2020-12-31      1.10 
 6 ca         2020-12-31      0.751
 7 co         2020-12-31      0.649
 8 ct         2020-12-31      0.819
 9 dc         2020-12-31      0.601
10 de         2020-12-31      0.807
# ... with 20,486 more rows
# i Use `print(n = ...)` to see more rows
Browse[1]> 

The same thing happens when you select time_value and geo_value manually, in that order. But if you select them in the order in which they appear in the original epi_df, there are no isses.

Browse[1]> x %>% dplyr::select(c(time_value, geo_value))
An `epi_df` object, 20,496 x 2 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 20,496 x 2
   time_value geo_value 
 * <chr>      <date>    
 1 ak         2020-12-31
 2 al         2020-12-31
 3 ar         2020-12-31
 4 as         2020-12-31
 5 az         2020-12-31
 6 ca         2020-12-31
 7 co         2020-12-31
 8 ct         2020-12-31
 9 dc         2020-12-31
10 de         2020-12-31
# ... with 20,486 more rows
# i Use `print(n = ...)` to see more rows
Browse[1]> 

The issue is that x, which is an epi_df, doesn't handle dplyr functions the way that we expect. Notice that the time_value and geo_value columns get swapped unexpectedly.

In epi_shift_single, dplyr::mutate gets called after the select. It tries to add time_value and a numeric, but since time_value now contains characters like "ak" and "al", R throws that "non-numeric argument to binary operator" error.

@mgyliu
Copy link
Contributor

mgyliu commented Jul 19, 2022

This comment addresses another error message in the R CMD check:

Failure (test-epi_recipe.R:52:3): epi_recipe formula works

This is an easier fix. The test uses expect_identical on two dataframes that are the same, except their rows are in different orders. The test fails because of the different row ordering. Example in this line, but this file has multiple calls to expect_identical.

Original test code:

expect_identical(r$var_info, ref_var_info)

Proposed changed test code:

expect_identical(dplyr::arrange(r$var_info,variable), dplyr::arrange(ref_var_info, variable))

The dplyr::arrange orders both dataframes by the same column to ensure the row ordering is correct.

I also propose looking into other places where expect_identical gets called. I suspect this isn't the only place the failure could happen. IMO a more robust way of testing this is to make assertions about specific rows or columns rather than entire dataframes. But the dplyr::arrange seems okay for small dataframes.

@dajmcdon
Copy link
Contributor

Wow! Nice catch. What a strange behaviour.

@dajmcdon dajmcdon added bug Something isn't working P0 do this immediately labels Jul 20, 2022
@rachlobay
Copy link
Contributor Author

rachlobay commented Jul 20, 2022

Great! Thanks for posting the above comments. They definitely help! But it is probably best to address this from the root, if possible.

To look into the R CMD Check meltdown from using the most recent edition of epiprocess, Ken, Maggie, and I went through the merged pull requests from the past ~ two weeks (time frame’s based on when the Github versions of epipredict and epiprocess were not having issues) and found that the most troublesome change is[.epi_df); the other pull requests that were merged seemed to be mostly updating tests or documentation in epiprocess (so unlikely to cause such a problem). Based on how epirocess behaves when [.epi_df is present vs absent, I think that[.epi_df is causing these problems (& I expect there’ll be more trouble from it in the future, so we'll refine as necessary).

I was looking into a couple of ways to fix the two types of error messages shown in the R CMD check by going through the [.epi_df code line by line & running the tests that Maggie and I were looking at yesterday… The simplest soln to fix both issues (without messing with the new_epi_df code) looks to be to go from new_epi_df(tibble::as_tibble(res),… --> as_epi_df(res,… in there. So basically we gotta remove the call to construct a new epi_df from scratch… And going for this change makes sense because sometimes res (the epi_df subset we’re getting out of [.epi_df) already has the epi_df class to begin with, and as_epi_df will call new_epi_df if it is necessary, else just return as is. When I make this change all looks to be well between the most recent versions of epipredict and epiprocess (if someone else wants to check they get they intended effects, please do).

Now, if we want to keep that & not change it to as_epi_df, I briefly looked into what we’d likely need to do to new_epi_df... Removing the three lines in there where column reordering occurs seems to fix the errors involvingparsnip::fit() & that makes sense with what Maggie and I were seeing (the weird behaviour with var./col. ordering). Then, probably all we’d have to do to fix the other type of error messages is some manual column rearrangement in the tests, as Maggie was suggesting (because the columns of a new epi_df now wouldn't have a strict ordering anymore)... But I’d be more hesitant to go for this approach because I didn’t write that code that we’d be removing here (note that new_epi_df code is the original as_epi_df code) and maybe that ordering is there for a good reason that I am not considering.

@mgyliu
Copy link
Contributor

mgyliu commented Jul 21, 2022

Adding some notes from my discussion with Rachel for papertrail:

  • [.epi_df gets called multiple times through parsnip::fit
  • These lines in epi_df.R:new_dpi_df seem to do some column reordering, which could be part of the problem. I ran a debugger in here and those lines of code seem to reorder columns correctly. But good to keep this part of the code in mind for a future fix.
  • Current fix is to change new_epi_df(tibble::as_tibble(res),… --> as_epi_df(res,… which Rachel and I agree is a good solution for now.
  • We are still not sure what exactly is causing the column misnaming. The above fix is good for now. If we need to look into the exact root cause later, it might be good to look into the implementation of parsnip::fit to determine how it's making calls to [.epi_df.

For future reference, steps to reproduce:

This is the example we were working with

library(dplyr)
library(recipes)
library(parsnip)

jhu <- case_death_rate_subset %>%
  filter(time_value > "2021-11-01", geo_value %in% c("ak", "ca", "ny"))

r <- epi_recipe(jhu) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) 
#   %>%
#   step_epi_ahead(death_rate, ahead = 7) %>%
#   step_epi_lag(case_rate, lag = c(0, 7, 14)) %>%
#   step_naomit(all_predictors()) %>%
#   step_naomit(all_outcomes(), skip = TRUE)

wf <- epi_workflow(r, linear_reg()) 
fit <- wf %>% parsnip::fit(jhu) 
library(dplyr)
library(recipes)
library(parsnip)

jhu <- case_death_rate_subset %>%
  filter(time_value > "2021-11-01", geo_value %in% c("ak", "ca", "ny"))

r <- epi_recipe(jhu) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) 
#   %>%
#   step_epi_ahead(death_rate, ahead = 7) %>%
#   step_epi_lag(case_rate, lag = c(0, 7, 14)) %>%
#   step_naomit(all_predictors()) %>%
#   step_naomit(all_outcomes(), skip = TRUE)

wf <- epi_workflow(r, linear_reg()) 
fit <- wf %>% parsnip::fit(jhu) 
# wf
  • Place a debugger in [.epi_df
  • Load epiprocess locally with load_all("../epiprocess") (or replace with your relative path to local epiprocess directory)
  • Run the example code above. Notice that the debugger does not get hit until the parsnip::fit line. Notice that the debugger is hit multiple times in that line.
  • Can also place a debugger in epi_shift_single to see the column reordering issue from my first comment in this thread.

@rachlobay
Copy link
Contributor Author

As Logan and I were talking about yesterday, it may also be good to look into what lines in parsnip::fit() are conflicting with the couple of problematic lines of code in new_epi_df(), because it is definitely possible that there could be problems cropping up from this in the future... I'll post an issue, but it is probably not urgent to get to as I think that we have more pressing problems to deal with right now.

@dajmcdon
Copy link
Contributor

dajmcdon commented Aug 1, 2022

I messed around with this a bit yesterday. The (first?) specific issue (using the example in my check) is:

  1. In epi_shift_single(), line 35 in my code: dplyr::select(tidyselect::all_of(c(key_cols, col))).
  2. The result of this line switches the NAMES time_value and geo_value for some reason. The result being that time_value then contains <chr> data with states while geo_value contains the dates.
  3. Trying to add an integer shift_val to the character bombs.

@dajmcdon
Copy link
Contributor

dajmcdon commented Aug 3, 2022

Can @rachlobay or @mgyliu open a PR that makes the simple change?

@mgyliu
Copy link
Contributor

mgyliu commented Aug 3, 2022

To confirm, the simple change is just going from new_epi_df(tibble::as_tibble(res),... to as_epi_df(res... in [.epi_df?
@rachlobay did you already have a branch with the change implemented? If not I'm happy to take it

@rachlobay
Copy link
Contributor Author

rachlobay commented Aug 3, 2022

Yeah. I got a branch with that change... I also want to simplify the [.epi_df code in this PR a bit as per feedback from Logan, so I'll field this one. Expect a PR to be up for this within the next few hours (over on epiprocess as that's where the changes'll be)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P0 do this immediately
Projects
None yet
3 participants