Skip to content

[Backfill corrections] Align daily and rollup file formats; make dates portable #1760

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 23, 2023

Conversation

nmdefries
Copy link
Contributor

@nmdefries nmdefries commented Jan 12, 2023

Description

Make sure lag and issue_date fields are included in both daily and combined files. Store dates and location info as strings for better portability. These were previously datetime64 (causing the timezone issue) and object types.

claims_hosp backfill file generation was never merged, so changes to those functions are not included here.

Changelog

  • quidel_covidtest's backfill.py
  • changehc's backfill.py

Fixes

The original time_values are meant to be plain dates ("2020-01-01") with no timestamp or timezone info.

The parquet format uses a schema with types. So if Python writes the time_values as either pure dates (no timestamp info) or strings, R will read them in using the same types. The implicit timezone conversion (from no timezone, which R's arrow::read_parquet interprets as UTC, to the local timezone) only happens when the time_values are saved into parquet form as datetime64s.

Python doesn't seem to have a "pure" date class (no time/timezone info). The datetime64 type assumes the time is 00:00 even if none is given. To drop the time info, we can convert to a pure date but this changes the type to object. The object class can be a little dangerous to use. In this case, it's not clear what type parquet will assign to such a column or how R will read it in, and within Python objects can behave in unusual ways. Saving the dates to string is safer.

R will have to do an extra step to convert from string to a date, but it should avoid any weird time/timezone issues.

Copy link
Contributor

@jingjtang jingjtang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nmdefries
Copy link
Contributor Author

@krivard This is ready to merge

@krivard krivard merged commit 4b2e7c5 into main Jan 23, 2023
@krivard krivard deleted the ndefries/backcorr-input-format branch January 23, 2023 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants