[Backfill corrections] Align daily and rollup file formats; make dates portable #1760
Description
Make sure `lag` and `issue_date` fields are included in both daily and combined files. Store dates and location info as strings for better portability. These were previously `datetime64` (causing the timezone issue) and `object` types.

`claims_hosp` backfill file generation was never merged, so changes to those functions are not included here.

Changelog
- `quidel_covidtest`'s `backfill.py`
- `changehc`'s `backfill.py`
Fixes
The original `time_value`s are meant to be plain dates ("2020-01-01") with no timestamp or timezone info.

The `parquet` format uses a schema with types. So if Python writes the `time_value`s as either pure dates (no timestamp info) or strings, R will read them in using the same types. The implicit timezone conversion (from no timezone, which R's `arrow::read_parquet` interprets as UTC, to the local timezone) only happens when the `time_value`s are saved into `parquet` form as `datetime64`s.

pandas doesn't seem to have a native "pure" date dtype (no time/timezone info). The `datetime64` type assumes the time is 00:00 even if none is given. To drop the time info, we can convert to a pure date, but this changes the column's dtype to `object`. The `object` dtype can be a little dangerous to use. In this case, it's not clear what type `parquet` will assign to such a column or how R will read it in, and within Python `object`s can behave in unusual ways. Saving the dates as strings is safer.

R will have to do an extra step to convert from string to a date, but it should avoid any weird time/timezone issues.
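
As a rough illustration of the problem described above, the following standard-library sketch mimics a reader that treats a timezone-free midnight timestamp as UTC and then displays it in a local zone. It is not the actual `arrow` or `backfill.py` code path; `LOCAL`, `read_as_local`, and `to_date_string` are hypothetical names, and the fixed UTC-5 offset is an assumption chosen only to make the shift visible.

```python
from datetime import date, datetime, timedelta, timezone

# Hypothetical local zone for illustration only: a fixed UTC-5 offset.
LOCAL = timezone(timedelta(hours=-5))


def read_as_local(naive_midnight: datetime) -> date:
    """Treat a naive timestamp as UTC, convert to LOCAL, and keep the date part."""
    as_utc = naive_midnight.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(LOCAL).date()


def to_date_string(d: date) -> str:
    """Serialize a date as a plain ISO string such as "2020-01-01"."""
    return d.isoformat()


# A date stored as a timestamp implicitly carries a 00:00 time component.
stored_timestamp = datetime(2020, 1, 1)
# Reading it back through a UTC -> local conversion lands on the previous day.
shifted = read_as_local(stored_timestamp)

# A date stored as a string has no time component to shift.
stored_string = to_date_string(date(2020, 1, 1))
# The reader does one extra parsing step and recovers the same date.
recovered = date.fromisoformat(stored_string)
```

Here `shifted` ends up one day earlier than intended, while `recovered` matches the original date. The final parsing step plays the role of the extra string-to-date conversion that R has to perform.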