# What signature should a forecaster object have? #8
Thanks for writing this out! My thoughts are as follows.

**Variables versus data frames**

I think it makes sense to specify variables directly, and I will make an argument for that. I would say the signature should be:
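The signature itself was lost in extraction; based on the later references to passing x, y, time value, and key vars, it was presumably along these lines (a sketch; the exact argument names and order are assumptions):

```r
forecaster <- function(x, y, key_vars, time_value, args = list()) {
  # x: training features, specified directly as variables
  # y: training response
  # key_vars: key variables (e.g., geo_value) identifying each time series
  # time_value: the time variable
  # args: control arguments for everything else
}
```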
where `x` and `y` are the training data, specified directly as variables, and `key_vars` and `time_value` identify the forecast task.
This is fairly easy (intuitive) for a user to understand. Having a model fitting function start with `x` and `y` is familiar from standard modeling functions in R. Meanwhile, the next two arguments in the signature are required to specify the forecast task. The task is to produce one forecast for the latest time value for each unique combination of key variables. So that's how we explain their need in general. We can allow a user to omit `key_vars` when there are no key variables.

**Why not allow a single data frame to be passed?**

If we allowed the user to specify just a single data frame, then the forecaster would have to infer which columns are the features, the response, the time variable, and the key variables; as I argue below, there is no way to do that safely in general.

**An example**

An example of how I could get this to work is as follows. Suppose I have a data frame of training data and pass each of the four inputs as a variable name; see the sketch below.
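The example data didn't survive extraction; a minimal sketch of the simple case, assuming an `epi_df`-style tibble (all column names here are assumptions):

```r
library(dplyr)

edf <- tibble(
  geo_value   = rep(c("ca", "ny"), each = 4),
  time_value  = rep(as.Date("2021-06-01") + 0:3, times = 2),
  percent_cli = rnorm(8, mean = 1),
  case_rate   = rnorm(8, mean = 10)
)

# All four inputs given as variable names; the forecaster "sees" plain vectors:
# forecaster(x = percent_cli, y = case_rate,
#            key_vars = geo_value, time_value = time_value)
```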
More complicated: let's suppose I had another column to use as a feature. Then `x` selects multiple columns, and the forecaster would receive a tibble for `x` rather than a vector; see the sketch below.
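A hypothetical stand-in forecaster showing how both cases could be handled (`toy_forecaster` is illustrative only, not a proposed implementation):

```r
# x may arrive as a vector (one feature) or a tibble (several features);
# unify, fit a linear model, and forecast from the latest row.
toy_forecaster <- function(x, y, key_vars, time_value, args = list()) {
  x <- as.data.frame(x)  # vector -> one-column data frame
  fit <- lm(y ~ ., data = cbind(x, y = y))
  predict(fit, newdata = x[nrow(x), , drop = FALSE])
}
```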
**Do forecasters need to be aware of the time variable (and key variables, if there are any)?**

Yes, as supported by my logic above. Otherwise there is no way to safely/generally infer the forecast task. The only other way I could imagine doing it is to have the user provide an object representing the task separately, which seems more cumbersome.

**Does everything except data go into the control arg?**

Yes. More precisely, in my proposal, everything except `x`, `y`, `key_vars`, and `time_value` goes into the control arguments.

**map/get_predictions**

I'm confused about this because we would always just be using `epi_slide` anyway.

**How much input validation to do?**

My understanding was that, based on one of the recent zoom brainstorming sessions, we decided to modularize any forecaster internally into four parts: pre-fit, fit, predict, and post-predict.
So certainly input validation should be provided as part of the functionality for part 1 here. Meaning, checking for time gaps, filling time gaps, etc., should all be offerings. In my view a "forecaster" is a particular combination of 4 functions (one for each part). In any forecaster we provide, we should give the user flexibility through the forecaster args. A user could decide to turn off the pre-fitting (basically, just use the identity map for part 1) if they wanted to, by signaling this in the control args. A sketch of this composition follows below.
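A minimal sketch of "a forecaster is a combination of 4 functions"; `make_forecaster` and its argument names are hypothetical, not an agreed API:

```r
# Compose a forecaster from its four parts; identity pre/post steps let a
# user switch those stages off via the control args.
make_forecaster <- function(prefit = identity, fit, predict_fn, postpredict = identity) {
  function(x, y, key_vars, time_value, args = list()) {
    x <- prefit(x)                       # part 1: validation / preprocessing
    model <- fit(x, y, args)             # part 2: model fitting
    preds <- predict_fn(model, x, args)  # part 3: prediction
    postpredict(preds, args)             # part 4: postprocessing
  }
}
```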
---

Defining the task gets a little messy when you think about variable transformations and combining different components. Consider a count -> rate transform (sketched below):
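The original example was lost; a hypothetical sketch of what such a transform pair involves (the `population` argument and per-100k scaling are assumptions):

```r
# Forecast on the rate scale, then invert back to counts: the pre-fit
# transform and the post-prediction inverse must stay paired.
count_to_rate <- function(count, population) count / population * 1e5
rate_to_count <- function(rate, population) rate * population / 1e5
```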
Wanting a transform-invtransform toggle also shows that pre-fitting and post-prediction elements can be entangled; this might mean we don't want to express this as just a pipe chain from prefit to fit to predict to postpredict, but as various wrappers on top of a fitter & predictor.
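A hedged sketch of the wrapper idea, reusing the hypothetical transform pair above:

```r
# A transform expressed as a wrapper around a whole fitter & predictor,
# keeping the transform and its inverse together rather than in separate
# pipeline stages.
with_rate_transform <- function(forecaster, population) {
  function(x, y, key_vars, time_value, args = list()) {
    preds <- forecaster(x, count_to_rate(y, population), key_vars, time_value, args)
    rate_to_count(preds, population)
  }
}
```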
---

Thanks for the comments! I think you raised two separate issues.

On the df versus variable names: sure. If you wanted to say that they could pass a formula and a data frame, and key vars and time value, then that would be easy to accommodate. It would be equivalent to passing x, y, time value, and key vars. I personally don't like the combination of formulas and data frames at all, but I know they're common in R. So the signature could still be:
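This signature was also lost in extraction; presumably it was the formula-based variant, along these lines (a sketch, argument names assumed):

```r
forecaster <- function(formula, df, key_vars, time_value, args = list()) {
  # formula and df together play the role of x and y above,
  # e.g. case_rate ~ percent_cli (columns of df; names are assumptions)
}
```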
where `formula` and `df` together determine `x` and `y`, and you could leave `key_vars` and `time_value` as variable names, as before.

On the other point about transformations: I didn't see this as an argument for or against any particular way of doing things (variables vs data frames), just a general point to be aware of---is that right? I think it's a decent point, but I think I see a way to handle it. Let me try to get a working example and see if you like it.
---

The transformation thing might matter because it impacts whether or not someone writing a new forecaster HAS to use a formula interface or some other manipulable way of representing a forecasting task --- this might rule out the `forecaster(x, y, ...)` approach, unless `x` and `y` are turned into quosures. Do forecasters have to have an object representing their task?
---

**Forecaster function signatures**

EDIT: Updating this section to explain the confusion I had, for others that might have a similar misconception.

I do agree @ryantibs that a signature taking just a bare data frame is too ambiguous and too permissive. I was confused by the apparent contradiction between the statement that `x`, `y`, etc. are vectors, but below they appear to be either variable names or something more ambiguous. To clarify, this is an instance of a tidy evaluation idiom common in the tidyverse: the arguments are written as bare column names, which are quoted and then evaluated in the context of the data (a sketch follows below).
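A sketch of the tidy evaluation idiom in question (hypothetical; the real proposal may differ, and `edf` reuses the hypothetical data from the earlier sketch):

```r
library(rlang)

# The caller writes bare column names; enquo() captures them as quosures,
# and eval_tidy() resolves them to vectors within the data frame.
forecaster <- function(df, x, y) {
  x <- eval_tidy(enquo(x), df)
  y <- eval_tidy(enquo(y), df)
  mean(y)  # stand-in for actual model fitting
}

forecaster(edf, percent_cli, case_rate)
```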
**Is sliding universal for all forecasters?**

A follow-up comment, relating to the question of why we need a general `get_predictions`-style function when we could just slide: some forecasters may need more than a fixed trailing window of data (e.g., a Kalman-filter-style model that tracks backfill across past computations).
All that said, I do agree that sliding is a key computation pattern and covers AR-like models (which is all (most?) of our hospitalization forecasters). The points I want to make are: (a) a general `get_predictions` function may be inevitable, (b) our design should recognize the strengths of `epi_slide` and not try to cover cases it can't cover.
---

**Function signatures**

We probably need to investigate this further.

**Gaps in epi_slide**

Interesting points. The current `get_predictions` doesn't seem to handle this either? Maybe it inspires an enhancement to `epi_slide`. Kalman + backfill seems too complicated to me to think about for current development.
---

Quick note on the `epi_slide` gaps: the lack of knowledge of "what happened" in past slide computations is real, though. That said, I believe it should be possible to fix this.
---

Also, just for my own notes: I think we cleared up Dmitry's confusion on the first point in his last message on our call. (It was about tidy evaluation, which is nonstandard and takes some getting used to.) And yes, I think sliding is fairly universal, as per my last message, but we need to allow access to past computations to make it truly universal. So at this point I don't see many unanswered/unresolved questions about the design; I think we've arrived at a design.
---

@ryantibs In your "An example" above, what is the internal behavior of the slide function? I'm guessing that in the first case (the simple one where all the 4 inputs are variable names), the forecaster actually "sees" vectors (maybe named?), while in the more complicated version, it "sees" tibbles for `x`.
---

What the forecaster sees: yes, it should be all vectors in the first case. And in the second case, tibbles for `x`, since it spans multiple columns.
---

Decision: sticking with the variables-based signature, `forecaster(x, y, key_vars, time_value, args)`.
---

Making this thread to address the broader question opened in #3 about the function signature of a forecaster object. cc @dajmcdon @brookslogan @ryantibs
Current proposals:

1. `forecaster(x, y, ...)`: the forecaster receives training/testing matrices directly.
2. `forecaster(df, ...)`: the forecaster receives a single data frame and parses it itself.

Two dimensions of variation here: whether the training data are passed as matrices or as a single data frame, and whether the indexing variables (time and keys) need to be explicit arguments.
One possibility (setting aside the question of whether indexing variables need to be arguments) is that the forecaster signature in (1) could be a function inside the forecaster in signature (2): (2) would do additional parsing of the dataframe `df`, produce training/testing matrices `x`, `y`, and hand them off to something that looks like (1).

A question to pump our design intuition: what do we want a forecaster call pattern to look like? For example, following along with the vignettes in epiprocess, we can imagine having a call that looks something like this:
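The call itself did not survive extraction; given the description below ("pipe the result of the group_by into map"), it presumably resembled this sketch (the exact pipeline, the `group_map` usage, and the date are assumptions; `edf` is the hypothetical data from earlier):

```r
library(dplyr)

preds <- edf %>%
  group_by(geo_value) %>%
  group_map(~ some_forecaster(list(.x), forecast_date = as.Date("2021-06-04")))
```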
This assumes `some_forecaster` to have a similar signature as when we are using `evalcast`, namely `forecaster(df_list, forecast_date)` (I'm not entirely sure if I can pipe the result of the group_by into map like that, but let's suppose that's correct).

That is one possibility. Another possibility is to have an `evalcast::get_predictions`-like function (see the sketch at the end of this post). This function would incorporate a lot of the grouping and indexing logic, do argument validation, etc. There is logic here that we likely won't want to have a first-time user do on their own. Ideally, once they've put their data in the `epi_df` or `epi_archive` format, we can be confident the data has the columns and features we need to do model training and forecasting.

An implicit question here is: how much complexity do we place in the forecaster object versus in the `get_predictions` routine? Does the forecaster assume a clean, indexed time series, so it can focus on just doing the math/statistics, or does it also need to intersect with the domain of validating data, etc.? Do we like the structure we used in `evalcast::get_predictions`, or does that bake in too much information in one spot and assume a one-size-fits-all model for forecasters?
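A hedged sketch of what such a `get_predictions`-like function might do; the validation and grouping shown are illustrative assumptions, not `evalcast`'s actual implementation:

```r
library(dplyr)

# Wrapper owning the grouping/validation logic, so the forecaster itself
# can assume a clean, indexed time series.
get_predictions <- function(forecaster, edf, forecast_date) {
  stopifnot(all(c("geo_value", "time_value") %in% names(edf)))  # validate index columns
  edf %>%
    filter(time_value <= forecast_date) %>%  # only data available by forecast_date
    group_by(geo_value) %>%
    group_map(~ forecaster(list(.x), forecast_date))
}
```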