Create geo-aggregation extension of tsibble and demonstrate its use in aggregation vignette #1

Open
jacobbien opened this issue Nov 9, 2021 · 8 comments
Assignees
Labels
question Further information is requested

Comments

@jacobbien

jacobbien commented Nov 9, 2021

The basic idea here is to do for geography what tsibble does for time. I will elaborate on this below. In particular, a key outcome of this issue would be to demonstrate geographic aggregation functionality in the second half of the aggregate.Rmd vignette. But unlike in cmu-delphi/epiprocess#24, where the only task was to write the demo, in this case one would first need to actually develop the geographic aggregation functionality itself. I sketch out here a particular approach to this, which would involve writing a new separate R package that would inherit from tsibble, as described below.

@jacobbien
Author

jacobbien commented Nov 9, 2021

For the full discussion that this issue is based on, see cmu-delphi/epiprocess#7

@jacobbien
Author

jacobbien commented Nov 9, 2021

Here is some context for what we are interested in achieving. The idea sketched out here would be to create a class that inherits from tsibble, perhaps called tsibble_us, and that enforces the key to be one of national, state, hrr, county, msa. Performing spatial aggregation is a common yet non-trivial task, so building in the ability to do it easily could be very useful.

Consider this example from the tsibble documentation showing how index_by() + summarize() can be used to aggregate (in time) to a coarser index:

tourism %>%
  index_by(Year = ~ year(.)) %>%
  group_by(Region, State) %>%
  summarise(Total = sum(Trips))

If one wanted to aggregate to the Year-Region level instead, couldn't one just remove State from the group_by? There's a problem with that: what if some states are missing? The advantage of tsibble_us is that it would know if some states are missing (and likewise, if you try to aggregate counties to the state level, it would know if there are missing counties).
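
To make the concern concrete, here is a minimal base R sketch on hypothetical toy data (the data frame `trips` and its values are invented for illustration and are not the `tourism` dataset). It shows that a plain sum over states runs without complaint even when one state is absent for a year:

```r
# Hypothetical toy data: the "West" region should contain states A, B and C,
# but state C has no rows for 2020.
trips <- data.frame(
  Year   = c(2020, 2020, 2021, 2021, 2021),
  Region = "West",
  State  = c("A", "B", "A", "B", "C"),
  Trips  = c(10, 20, 11, 21, 31)
)

# Aggregating to the Year-Region level succeeds with no warning ...
totals <- aggregate(Trips ~ Year + Region, data = trips, FUN = sum)

# ... even though the 2020 total is built from only two of the three states.
states_per_year <- tapply(trips$State, trips$Year,
                          function(s) length(unique(s)))

print(totals)
print(states_per_year)
```

Nothing in the aggregated result signals that the 2020 figure is incomplete, which is the gap a geography-aware class would be meant to close.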

This is directly analogous to how tsibble worries a lot (so that the user doesn't have to!) about missing time values. In "Handle implicit missingness with tsibble", Earo Wang describes four functions that have to do with handling missing time values: has_gaps, scan_gaps, count_gaps, and fill_gaps. Our class tsibble_us could do the same for missing locations. Imagine four corresponding functions for missing locations (let's call a missing location a "hole" for now for lack of a better term):

  • has_holes - are there missing locations? (E.g., if the key is at the county level, are all US counties there?) One could also have an argument that specifies a limited scope, e.g., if the scope is defined as CA, then %>% has_holes(scope = state("CA")) would only check whether any CA counties were missing from the data object.
  • scan_holes - which counties are missing?
  • count_holes - how many counties are missing?
  • fill_holes - this function could at minimum create NA time series for the missing counties. This means that if we aggregate from the county to the state level, it will be apparent that CA was missing some counties. One could also imagine cases where filling in 0s makes sense. In fact, tsibble_us could even offer imputation based on the geo hierarchy (e.g., fill in with the average of all counties within the same state) or based on spatial proximity.
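
A rough base R sketch of how these four functions might behave. Everything here is an assumption for illustration: the signatures, the explicit `expected` argument (standing in for the geography information the package would carry), and the toy county codes. It works on a plain data frame rather than a real tsibble.

```r
# Hypothetical reference list of the locations the scope should contain.
expected_counties <- c("06001", "06003", "06005")

dat <- data.frame(
  Date   = as.Date(c("2021-11-01", "2021-11-02", "2021-11-01", "2021-11-02")),
  County = c("06001", "06001", "06005", "06005"),
  Cases  = c(12.5, 13.0, 4.2, 3.9)
)

# Which expected locations never appear in the data?
scan_holes <- function(x, expected) setdiff(expected, unique(x$County))

# How many are missing, and is anything missing at all?
count_holes <- function(x, expected) length(scan_holes(x, expected))
has_holes   <- function(x, expected) count_holes(x, expected) > 0

# Add an all-NA series (one row per observed date) for each missing location.
fill_holes <- function(x, expected) {
  absent <- scan_holes(x, expected)
  if (length(absent) == 0) return(x)
  dates <- unique(x$Date)
  filler <- data.frame(
    Date   = rep(dates, times = length(absent)),
    County = rep(absent, each = length(dates)),
    Cases  = NA_real_
  )
  rbind(x, filler)
}

filled <- fill_holes(dat, expected_counties)
print(scan_holes(dat, expected_counties))
print(filled)
```

Because the filled rows carry NA, a later sum or mean without `na.rm = TRUE` returns NA for the affected state, so the incompleteness stays visible after aggregation.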

tsibble_us could also have population size information for each geo value, so that weighted averages could be easy to do in aggregation.
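
As a small illustration of the weighted-average idea, here is a sketch using base R's `weighted.mean()`. The explicit `population` argument is an assumption made for this example; in the proposal the class would supply population sizes itself. The numbers are invented.

```r
# Hypothetical helper: average per-capita rates, weighting by population.
population_weighted_mean <- function(x, population, na.rm = FALSE) {
  weighted.mean(x, w = population, na.rm = na.rm)
}

# Two made-up counties: incidence rates per 100k and their populations.
rates <- c(10, 30)
pop   <- c(300000, 100000)

state_rate <- population_weighted_mean(rates, pop)
print(state_rate)  # (10 * 300000 + 30 * 100000) / 400000 = 15
```

An unweighted mean of the same rates would give 20, overstating the contribution of the smaller county.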

An example: Suppose dat is a tsibble_us object with daily-county cases (three columns: Date, the index; County, the geo-key; and Cases being a column with incidence rate, per 100k). We could aggregate to state-level epiweek data with something like the following:

dat %>%
  fill_gaps() %>% 
  fill_holes() %>%
  index_by(Epiweek = ~ epiweek(.)) %>%
  geokey_by(State = ~ state(.)) %>%
  summarise(Aggregated_cases = population_weighted_mean(Cases))

Here geokey_by, state, and population_weighted_mean would be tsibble_us functions that rely on the information about US geography contained in the package.

Obviously, instead of tsibble_us, one could write a more general tsibble_geo and then there could be location specific classes that inherit from it, like tsibble_us, tsibble_europe, tsibble_world, etc.
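
One possible shape for that hierarchy, sketched with S3 classes in base R. The constructor names, attributes, and arguments are all hypothetical, and for brevity the sketch builds on a plain data frame rather than inheriting from tsibble as the proposal intends.

```r
# Geo levels the US class would accept, as listed earlier in this issue.
us_levels <- c("national", "state", "hrr", "county", "msa")

# Hypothetical general constructor: records the geo key and its level,
# and rejects levels outside the allowed set.
new_tsibble_geo <- function(x, geo_key, level, allowed_levels,
                            subclass = character()) {
  stopifnot(is.data.frame(x), geo_key %in% names(x))
  if (!level %in% allowed_levels) {
    stop("`level` must be one of: ", paste(allowed_levels, collapse = ", "))
  }
  structure(x, geo_key = geo_key, geo_level = level,
            class = c(subclass, "tsibble_geo", class(x)))
}

# Hypothetical location-specific constructor layered on the general one.
new_tsibble_us <- function(x, geo_key, level) {
  new_tsibble_geo(x, geo_key, level,
                  allowed_levels = us_levels, subclass = "tsibble_us")
}

dat <- data.frame(Date = as.Date("2021-11-09"), County = "06037", Cases = 12.5)
us  <- new_tsibble_us(dat, geo_key = "County", level = "county")
print(class(us))
```

Methods written for `tsibble_geo` would then apply to every location-specific subclass, while each subclass supplies its own set of valid levels.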

@jacobbien jacobbien changed the title Add geo-aggregation functionality Create geo-aggregation extension of tsibble and demonstrate its use in aggregation vignette Nov 9, 2021
@qpmnguyen
Collaborator

@ryantibs @jacobbien I'm happy to take this task if it's still open. Also happy to help with any issues I've created with my previous PR as well.

@ryantibs
Member

ryantibs commented Feb 9, 2022

@qpmnguyen Sounds good! Let's discuss on Slack what the best approach is, because there is a lot of geo aggregation functionality already written in Python (for our indicators pipeline), which is in the covidcast_indicators repo. Can you please post a message on the #epi-tooling channel recapping what the basic issue here is and what some proposed plans of attack are, and tag all the "usual" folks for discussion?

There are also some smaller issues I'm about to open, in case that Slack discussion takes a while to settle on the best strategy. I'll point you to them when I open them.

Re your time aggregation PR: I'm just finishing going through it now, should be able to merge it soon.

@ryantibs
Member

ryantibs commented Feb 9, 2022

As a follow-up, I only opened one tiny issue, cmu-delphi/epiprocess#39; the other one I managed to figure out and fix already. Your PR is merged. Thanks again!

@qpmnguyen
Collaborator

Sounds good! I'll take a look at the indicators repo and draft up some discussion points in Slack.

@dshemetov dshemetov added the question Further information is requested label Mar 9, 2022
@ryantibs
Member

@qpmnguyen Do we have a repo yet for gtsibble or whatever we're calling it? If so, can you transfer this issue over to there (do you have permissions?)

@qpmnguyen qpmnguyen transferred this issue from cmu-delphi/epiprocess Apr 15, 2022
@qpmnguyen
Collaborator

@ryantibs I just transferred the issue! I have a branch working locally on my fork and will initiate a pull request once more things have been added.

4 participants