[fb-package] Create script to combine contingency tables by group over all dates #1039
Conversation
Force-pushed from cf6ca12 to fb73bc5, then from fb73bc5 to 4b445bc.
This looks good to me overall; my comments aren't blockers. I am a bit nervous, though, since the merging logic is a bit tricky and it's hard to test because it does the IO directly in `combine_and_save_tables`. I wonder if it's worth splitting that into `combine_tables` and `save_tables`, so that `combine_tables` just does the data frame merge step, and then you can easily test that step to ensure it always combines correctly.
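A sketch of the suggested split, with `combine_tables` kept pure so it can be unit-tested without touching the filesystem (the merge logic shown is illustrative, not the PR's actual code):

```r
library(readr)
library(dplyr)

# Pure function: merge an existing rollup table with a new input table.
# bind_rows() matches columns by name, filling any mismatch with NA.
combine_tables <- function(existing, new_table) {
  bind_rows(existing, new_table) %>% distinct()
}

# IO-only wrapper: all reading and writing lives here.
combine_and_save_tables <- function(input_path, output_path) {
  new_table <- read_csv(input_path)
  existing <- if (file.exists(output_path)) read_csv(output_path) else NULL
  write_csv(combine_tables(existing, new_table), output_path)
}
```

With this split, tests can feed `combine_tables` small in-memory data frames and assert on the result directly.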
Generalize code that selects columns to use in finding unique rows. Now uses all columns up to but not including the first column that starts with "val_", "se_", etc. (any column name that we do or can report in the contingency tables). Discuss column ordering/naming behavior in contingency pipeline comments.
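A minimal sketch of that selection rule (the helper name and the exact prefix list are illustrative, not the pipeline's actual code):

```r
# Grouping columns are all columns before the first reported-value column,
# i.e. the first name starting with "val_", "se_", etc. The prefix list
# here is an assumption; the real pipeline may report other prefixes.
# Assumes at least one value column is present.
get_grouping_cols <- function(df) {
  value_prefixes <- "^(val|se|sample_size|represented)_"
  first_value_col <- min(grep(value_prefixes, names(df)))
  names(df)[seq_len(first_value_col - 1)]
}
```

This relies on the column-ordering convention discussed in the pipeline comments: grouping columns always come before value columns.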
Force-pushed from 7140c42 to f4a8159.
@capnrefsmmat Could you take another look at this? The biggest change is the removal of the functionality to directly append to the output file if an input table has not been seen before. It complicated the logic a lot, and I realized it's not going to be used that much. Even if an input table hasn't been seen before, it can still have a column mismatch with the existing output, for example if a new indicator was added or an old indicator was discontinued. So most of the time we still want to read in the output so that the input-output merge is done correctly (by name rather than position). Because of this change, the …
By default, readr's column-guessing procedure only uses the first 1000 lines of a file to guess each column's type. If a column is completely missing for the first 1000 lines, it is read in as a logical, which causes parsing failures if the column contains non-boolean values later, outside the type-guessing range. This happens when reading in output files if an indicator was newly added. To specify these columns correctly, use the column specification from the input file(s). All columns included in input files are at least partially non-missing and sorted alphabetically (independent of missingness), so we should always see non-missing values in the first 1000 lines.
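A sketch of that fix using `readr::spec()` to reuse the input file's guessed column types (file paths here are illustrative):

```r
library(readr)

# Guess types from an input file (safe: its columns are never all-missing
# in the first 1000 lines), then impose that spec on the output file so
# all-NA columns are not mis-guessed as logical.
input  <- read_csv("20210418_20210424_weekly_nation_age_edulevel.csv")
output <- read_csv("weekly_nation_age_edulevel.csv.gz",
                   col_types = spec(input))
```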
Looks good. Nice commit messages too
Description
Create script to combine the contingency tables that use the same set of grouping variables, but cover different time periods, into a single file. These rollup files will be posted on the AWPS website so users can more easily browse through the files and download all dates for a given grouping.
The user calls the script with the path to the input files and the path to the output files. Input files are grouped according to which output file they will go in (based on grouping variables used + geo level + aggregation time range). Generally this will be used to add a set of weekly or monthly tables to the output files, so most of the time we expect a one-to-one mapping of input and output names.
The output data is always read in so that we can merge it with the input data and deduplicate if necessary. For each unique group, we keep the row with the most recent issue date.
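A minimal sketch of that deduplication step, assuming an `issue` column holds the issue date and `group_cols` names the grouping columns (both names are assumptions):

```r
library(dplyr)

# One row per unique group: sort newest issue first, then keep the first
# occurrence of each grouping-column combination.
dedup_tables <- function(df, group_cols) {
  df %>%
    arrange(desc(issue)) %>%
    distinct(across(all_of(group_cols)), .keep_all = TRUE)
}
```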
Output names have a similar format to the individual tables (e.g. `20210418_20210424_weekly_nation_age_edulevel.csv`), but drop the initial dates and are compressed using gzip (e.g. `weekly_nation_age_edulevel.csv.gz`). Example input and output.
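A sketch of that name mapping (the helper name and regex are assumptions, not the script's actual code):

```r
# Drop the leading "YYYYMMDD_YYYYMMDD_" date range from an input filename
# and add ".gz" to get the rollup file it belongs to.
get_rollup_name <- function(input_file) {
  paste0(sub("^[0-9]{8}_[0-9]{8}_", "", basename(input_file)), ".gz")
}

get_rollup_name("20210418_20210424_weekly_nation_age_edulevel.csv")
#> [1] "weekly_nation_age_edulevel.csv.gz"
```

Grouping input files by this derived name also gives the input-to-output mapping described above.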
Changelog
`contingency-combine.R` script as a command-line tool.

Fixes
Asana task