[fb-package] Create script to combine contingency tables by group over all dates #1039
Conversation
Force-pushed from cf6ca12 to fb73bc5, then from fb73bc5 to 4b445bc.
This looks good to me overall; my comments aren't blockers. I am a bit nervous, though, since the merging logic is a bit tricky and it's hard to test because it does the IO directly in `combine_and_save_tables`. I wonder if it's worth splitting that into `combine_tables` and `save_tables`, so that `combine_tables` just does the data frame merge step, and then you can easily test that step to ensure it always combines correctly.
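A sketch of the suggested split, with `combine_tables` kept pure so it can be unit-tested without touching the filesystem (the merge logic shown is illustrative, not the PR's actual code):

```r
library(readr)
library(dplyr)

# Pure function: merge an existing rollup table with a new input table.
# bind_rows() matches columns by name, filling any mismatch with NA.
combine_tables <- function(existing, new_table) {
  bind_rows(existing, new_table) %>% distinct()
}

# IO-only wrapper: all reading and writing lives here.
combine_and_save_tables <- function(input_path, output_path) {
  new_table <- read_csv(input_path)
  existing <- if (file.exists(output_path)) read_csv(output_path) else NULL
  write_csv(combine_tables(existing, new_table), output_path)
}
```

With this split, tests can feed `combine_tables` small in-memory data frames and assert on the result directly.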
Generalize code that selects columns to use in finding unique rows. Now uses all columns up to but not including the first column that starts with "val_", "se_", etc. (any column name that we do or can report in the contingency tables). Discuss column ordering/naming behavior in contingency pipeline comments.
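A minimal sketch of that selection rule (the helper name and the exact prefix list are illustrative, not the pipeline's actual code):

```r
# Grouping columns are all columns before the first reported-value column,
# i.e. the first name starting with "val_", "se_", etc. The prefix list
# here is an assumption; the real pipeline may report other prefixes.
# Assumes at least one value column is present.
get_grouping_cols <- function(df) {
  value_prefixes <- "^(val|se|sample_size|represented)_"
  first_value_col <- min(grep(value_prefixes, names(df)))
  names(df)[seq_len(first_value_col - 1)]
}
```

This relies on the column-ordering convention discussed in the pipeline comments: grouping columns always come before value columns.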
Force-pushed from 7140c42 to f4a8159.
@capnrefsmmat Could you take another look at this? The biggest change is the removal of the functionality to directly append to the output file if an input table has not been seen before. It complicated the logic a lot, and I realized it's not going to be used that much. Even if an input table hasn't been seen before, it can still have a column mismatch with the existing output, for example if a new indicator was added or an old indicator was discontinued. So most of the time we still want to read in the output so that the input-output merge is done correctly (by name rather than position). Because of this change, the …
By default, readr's column-guessing procedure only uses the first 1000 lines of a file to guess each column's type. If a column is completely missing for the first 1000 lines, it is read in as a logical, which causes parsing failures if the column contains non-boolean values later, outside the type-guessing range. This happens when reading in output files if an indicator was newly added. To specify these columns correctly, use the column specification from the input file(s). All columns included in input files are at least partially non-missing and sorted alphabetically (independent of missingness), so we should always see non-missing values in the first 1000 lines.
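A sketch of that fix using `readr::spec()` to reuse the input file's guessed column types (file paths here are illustrative):

```r
library(readr)

# Guess types from an input file (safe: its columns are never all-missing
# in the first 1000 lines), then impose that spec on the output file so
# all-NA columns are not mis-guessed as logical.
input  <- read_csv("20210418_20210424_weekly_nation_age_edulevel.csv")
output <- read_csv("weekly_nation_age_edulevel.csv.gz",
                   col_types = spec(input))
```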
Looks good. Nice commit messages too
Description
Create script to combine the contingency tables that use the same set of grouping variables, but cover different time periods, into a single file. These rollup files will be posted on the AWPS website so users can more easily browse through the files and download all dates for a given grouping.
The user calls the script with the path to the input files and the path to the output files. Input files are grouped according to which output file they will go in (based on grouping variables used + geo level + aggregation time range). Generally this will be used to add a set of weekly or monthly tables to the output files, so most of the time we expect a one-to-one mapping of input and output names.
The output data is always read in so that we can merge it with the input data and deduplicate if necessary. For each unique group, we keep the row with the most recent issue date.
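A minimal sketch of that deduplication step, assuming an `issue` column holds the issue date and `group_cols` names the grouping columns (both names are assumptions):

```r
library(dplyr)

# One row per unique group: sort newest issue first, then keep the first
# occurrence of each grouping-column combination.
dedup_tables <- function(df, group_cols) {
  df %>%
    arrange(desc(issue)) %>%
    distinct(across(all_of(group_cols)), .keep_all = TRUE)
}
```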
Output names have a similar format to the individual tables (e.g. `20210418_20210424_weekly_nation_age_edulevel.csv`), but drop the initial dates and are compressed using gzip (e.g. `weekly_nation_age_edulevel.csv.gz`). Example input and output.
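A sketch of that name mapping (the helper name and regex are assumptions, not the script's actual code):

```r
# Drop the leading "YYYYMMDD_YYYYMMDD_" date range from an input filename
# and add ".gz" to get the rollup file it belongs to.
get_rollup_name <- function(input_file) {
  paste0(sub("^[0-9]{8}_[0-9]{8}_", "", basename(input_file)), ".gz")
}

get_rollup_name("20210418_20210424_weekly_nation_age_edulevel.csv")
#> [1] "weekly_nation_age_edulevel.csv.gz"
```

Grouping input files by this derived name also gives the input-to-output mapping described above.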
Changelog
`contingency-combine.R` script as a command-line tool.

Fixes
Asana task