
[fb-package] Create script to combine contingency tables by group over all dates #1039


Merged: 9 commits into main, May 10, 2021

Conversation


@nmdefries nmdefries commented May 4, 2021

Description

Create a script to combine into a single file the contingency tables that use the same set of grouping variables but cover different time periods. These rollup files will be posted on the AWPS website so users can more easily browse the files and download all dates for a given grouping.

The user calls the script with the path to the input files and the path to the output files. Input files are grouped according to which output file they belong to (based on the grouping variables used + geo level + aggregation time range). Generally this will be used to add a set of weekly or monthly tables to the output files, so most of the time we expect a one-to-one mapping of input and output names.

The output data is always read in so that it can be merged with the input data (by column name rather than position) and deduplicated if necessary: for each unique group, we keep the row with the most recent issue date. A list of already-processed input files is kept in seen.txt, and newly-added filenames are appended to that list so there is a record of which tables have been combined.
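The deduplication rule can be sketched in R (an illustrative sketch, not the script's actual code; the column names `issue_date` and `val_pct` are hypothetical examples):

```r
library(dplyr)

# Toy contingency rows: two issues of the same (geo_id, age) group plus
# one row from a different group. Column names are hypothetical.
tables <- tibble(
  geo_id     = c("us", "us", "ca"),
  age        = c("18-24", "18-24", "18-24"),
  issue_date = as.Date(c("2021-05-01", "2021-05-08", "2021-05-01")),
  val_pct    = c(10.1, 10.3, 9.8)
)

# For each unique group, keep only the row with the most recent issue date.
deduped <- tables %>%
  group_by(geo_id, age) %>%
  slice_max(issue_date, n = 1) %>%
  ungroup()
```

Here the older `us`/`18-24` row (issued 2021-05-01) is dropped in favor of the 2021-05-08 issue.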

Output names have a similar format to the individual tables (e.g. 20210418_20210424_weekly_nation_age_edulevel.csv), but drop the initial dates and are compressed using gzip (e.g., weekly_nation_age_edulevel.csv.gz).
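The renaming step can be sketched as follows (`get_rollup_name` is a hypothetical helper, assuming the date-prefixed naming pattern shown above):

```r
# Map an individual table's filename to its rollup filename by dropping
# the two leading date fields and appending the .gz extension.
get_rollup_name <- function(input_filename) {
  parts <- strsplit(input_filename, "_")[[1]]
  paste0(paste(parts[-(1:2)], collapse = "_"), ".gz")
}

get_rollup_name("20210418_20210424_weekly_nation_age_edulevel.csv")
# "weekly_nation_age_edulevel.csv.gz"
```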

Example input and output.

Changelog

  • Create contingency-combine.R script as a command-line tool.

Fixes

Asana task

@nmdefries nmdefries force-pushed the fb-package-rollup-tables branch from cf6ca12 to fb73bc5 Compare May 5, 2021 16:02
@nmdefries nmdefries force-pushed the fb-package-rollup-tables branch from fb73bc5 to 4b445bc Compare May 5, 2021 16:12
@nmdefries nmdefries requested a review from capnrefsmmat May 5, 2021 16:13
@capnrefsmmat capnrefsmmat left a comment


This looks good to me overall; my comments aren't blockers. I am a bit nervous, though, since the merging logic is a bit tricky and it's hard to test because it does the IO directly in combine_and_save_tables. I wonder if it's worth splitting that into combine_tables and save_tables so that combine_tables just does the data frame merge step, and then you can easily test that step to ensure it always combines correctly.

nmdefries added 5 commits May 5, 2021 16:42
Generalize code that selects columns to use in finding unique rows. Now
uses all columns up to but not including the first column that starts
with "val_", "se_", etc (any column name that we do or can report in the
contingency tables). Discuss column ordering/naming behavior in
contingency pipeline comments.
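The selection rule this commit describes might look roughly like this (the prefix list is illustrative; the actual script may recognize more reported-value prefixes):

```r
# Prefixes that mark reported-value columns (illustrative subset).
value_prefixes <- c("val_", "se_", "sample_size_")
cols <- c("geo_id", "age", "edulevel", "val_pct", "se_pct")

# A column is a value column if it starts with any known prefix.
is_value_col <- vapply(
  cols,
  function(col) any(startsWith(col, value_prefixes)),
  logical(1)
)

# Grouping columns are everything before the first value column.
group_cols <- cols[seq_len(which(is_value_col)[1] - 1)]
# group_cols: "geo_id" "age" "edulevel"
```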
@nmdefries nmdefries force-pushed the fb-package-rollup-tables branch from 7140c42 to f4a8159 Compare May 6, 2021 15:47
@nmdefries

@capnrefsmmat Could you take another look at this?

The biggest change is the removal of the functionality to directly append to the output file if an input table has not been seen before. It complicated the logic a lot and I realized it's not going to be used that much. Even if an input table hasn't been seen before, it can still have a column mismatch with existing output, for example, if a new indicator was added or if an old indicator was discontinued. So most of the time we still want to read in the output so the input-output merge is done correctly (by name rather than position).

Because of this change, the seen.txt file that tracks tables we've already input isn't strictly necessary, but I'm thinking of keeping it for now so there's a record of which tables have been processed. Removing the seen-file would make the code shorter/simpler, though. Thoughts?
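The by-name merge behavior can be illustrated with `dplyr::bind_rows` (a sketch with made-up indicator columns, not the script's code):

```r
library(dplyr)

# bind_rows matches columns by name and fills missing ones with NA, so an
# indicator added or discontinued between issues doesn't misalign the data.
old_output <- tibble(geo_id = "us", val_old_ind = 1.0)
new_input  <- tibble(geo_id = "us", val_new_ind = 2.0)

combined <- bind_rows(old_output, new_input)
# combined has columns geo_id, val_old_ind, val_new_ind, with NAs filled in.
```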

readr's column guessing procedure only uses the first 1000 lines, by
default, of a file to guess variable type for each column. If a column
is completely missing for the first 1000 lines, it is read in as a
logical which causes parsing failures if the column contains non-boolean
values later, outside the type guessing range.

This happens when reading in output files if an indicator was newly
added. To correctly specify these, use the column specification from the
input file/s. All columns included in input files are at least partially
non-missing and sorted alphabetically (independent of missingness), so
we should always see non-missing values in the first 1000 lines.
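The workaround this commit describes can be sketched with readr's `spec()` (the file contents here are synthetic; the real script reads the actual input and output tables):

```r
library(readr)

# Reproduce the problem: column `b` is all-NA for the first 1000 rows, so
# read_csv's default guessing (first 1000 lines) types it as logical and
# the later numeric values fail to parse.
out_file <- tempfile(fileext = ".csv")
write_csv(data.frame(a = 1:1100, b = c(rep(NA, 1000), 1:100)), out_file)

# Fix: take the column spec from an input file where `b` is populated,
# and pass it as col_types when reading the large output file.
in_file <- tempfile(fileext = ".csv")
write_csv(data.frame(a = 1:5, b = 6:10), in_file)

input  <- read_csv(in_file, show_col_types = FALSE)
output <- read_csv(out_file, col_types = spec(input), show_col_types = FALSE)
# output$b is numeric, with the last 100 rows parsed correctly.
```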

@capnrefsmmat capnrefsmmat left a comment


Looks good. Nice commit messages too

@capnrefsmmat capnrefsmmat merged commit d17c94b into main May 10, 2021
@nmdefries nmdefries deleted the fb-package-rollup-tables branch May 10, 2021 13:32