
2085 add proportions nhsn #2111

Merged: 25 commits merged into main on Feb 11, 2025

Conversation

aysim319
Contributor

Description

address #2085

Changelog

added new function that checks the metadata for the last update
-> the function also checks for 503 errors
-> added signals for total reporting hospitals

Associated Issue(s)

Contributor

@nmdefries nmdefries left a comment


Are you planning on adding the new RSV signals in this PR or in a separate one?

Comment on lines 29 to 31
updated_timestamp = datetime.utcfromtimestamp(int(response["rowsUpdatedAt"]))
now = datetime.utcnow()
recently_updated = (now - updated_timestamp) < timedelta(days=1)
Contributor


issue: I think this "recently-updated" logic is sufficient but not robust. For example, if we fail to pull data for multiple days, then on the next run we would not pull data we had never seen before if it wasn't posted within the last day.

The more robust solution would be to save last pull's updated_timestamp to a local file. We would then load that and compare updated_timestamp to that -- if exactly equal, skip update; if unequal, pull data.
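(A minimal sketch of what that could look like; the state-file location and helper names here are hypothetical, not part of this PR.)

from datetime import datetime
from pathlib import Path

# Hypothetical single-line state file holding the timestamp of the last successful pull.
STATE_FILE = Path("nhsn_last_updated.txt")

def should_pull(updated_timestamp: datetime, state_file: Path = STATE_FILE) -> bool:
    """Return True if the source's rowsUpdatedAt differs from the one saved on the last pull."""
    previous = state_file.read_text().strip() if state_file.exists() else None
    return updated_timestamp.isoformat() != previous

def record_pull(updated_timestamp: datetime, state_file: Path = STATE_FILE) -> None:
    """Overwrite the single-line state file with the timestamp of the update we just pulled."""
    state_file.write_text(updated_timestamp.isoformat())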

Contributor Author

@aysim319 aysim319 Feb 4, 2025


Definitely makes sense and something I didn't think about! The only thing I did differently was use the API instead of scanning the files, since I imagine the file list is going to grow and it doesn't make much sense to scan the file list every day.

Contributor


Yeah, checking the API could make sense, too. The one thing I'd caution is timezones -- your previous approach explicitly used UTC on both "old" and "now" timestamps, but I don't know what the API uses.

Second, the API only has dates, not times. Would that ever cause problems? E.g., if we want to check for updates multiple times a day.

Contributor Author

@aysim319 aysim319 Feb 5, 2025


Yeah, checking the API could make sense, too. The one thing I'd caution is timezones -- your previous approach explicitly used UTC on both "old" and "now" timestamps, but I don't know what the API uses.

Since the data and the dates are just dates and not datetimes, I didn't take timezones into account... hmm, I also don't know for sure which timezone; I believe it's EST, but I have to double check.

the API only has dates, not times. Would that ever cause problems? E.g. we want to check for updates multiple times a day.

Since this is data that generally updates weekly, I was planning on just running once a day, so I thought timezones wouldn't be as much of an issue.

Contributor

@nmdefries nmdefries Feb 5, 2025


Okay, given these complications, I'm thinking reading/writing to a file is easier. We wouldn't need to keep a complete list of all update datetimes ever, just the single most recent datetime. So the file wouldn't keep getting bigger and bigger, we could just read a single line.

This lets us store a UTC date (no timezones to worry about), there's no API date-processing to worry about, and we can store a full datetime to be extra precise.

Contributor Author

@aysim319 aysim319 Feb 6, 2025


I wasn't a fan of having metadata files; it seemed like overkill and introduced more complexity than I would like. So after talking things through with Nolan just now, I decided to simplify the logic and create backups daily, but still do a simple check for whether the data was recently updated before continuing processing and creating the CSV files. That way, if there are outages after the initial pulls, we can go back and do patches for them.

Nolan also mentioned that in the future we could look into creating a generic tool/script specifically to dedup things, and I like that direction since it would separate the complexity out of this codebase.

@aysim319
Contributor Author

aysim319 commented Feb 4, 2025

Are you planning on adding the new RSV signals in this PR or in a separate one?

I was originally planning for a separate one. I was considering adding the new signal in this PR, but when I tried to add just the new columns and ran the tests locally, I ran across issues and it looks like it might be more involved, so I think it'd be better to create a separate one. I also kinda shoved other issues (retry and daily checking) into this PR and didn't want to add more things on top.

@aysim319 aysim319 requested a review from nmdefries February 5, 2025 15:57
@aysim319 aysim319 requested a review from nolangormley February 5, 2025 20:19
Contributor

@nolangormley nolangormley left a comment


Some small questions

df = df.astype(TYPE_DICT)
try:
    df = df.astype(TYPE_DICT)
except KeyError:
Contributor


Why does this just pass?

Contributor Author

@aysim319 aysim319 Feb 7, 2025


The idea was that some of the older data in the source backups doesn't have the newer signals (rsv, reporting hospitals); in that case it would log, but resume the patching process.

Previously I tried modifying TYPE_DICT, but it being mutable caused some issues in patching runs. So this was the next solution... is it a good one? Ehhh... I log that the signal is unavailable earlier (line 150) and I thought I shouldn't log basically the same message twice.

Since you brought it up... I should also check whether the rest of the columns actually changed data types and maybe look into a less janky way.
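(For reference, one possible less janky variant; this is only a sketch that assumes TYPE_DICT maps column names to dtypes and logger is the indicator's logger, and the helper name is made up.)

import pandas as pd

def cast_available_columns(df: pd.DataFrame, type_dict: dict, logger) -> pd.DataFrame:
    """Cast only the columns present in this backup and log the ones that are missing."""
    missing = [col for col in type_dict if col not in df.columns]
    if missing:
        logger.info(f"signals unavailable in this backup: {', '.join(missing)}")
    available = {col: dtype for col, dtype in type_dict.items() if col in df.columns}
    return df.astype(available)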

Contributor

@nolangormley nolangormley left a comment


LGTM!

Contributor

@nmdefries nmdefries left a comment


A couple cleanup requests -- deduping pull_nhsn... and pull_preliminary... is the main one.

    if not custom_run
    else pull_data_from_file(backup_dir, issue_date, logger, prelim_flag=False)
)

keep_columns = list(TYPE_DICT.keys())
recently_updated = True if custom_run else check_last_updated(socrata_token, MAIN_DATASET_ID, logger)
Contributor


suggestion: if we put this before the pull_data logic, we could avoid fetching from the source API in most cases (since this is, or will be, running every day but the data only updates once a week).
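(A rough sketch of the suggested ordering; check_last_updated is the function added in this PR, while fetch_dataset below is a hypothetical stand-in for the source-API pull, not the module's actual function.)

def pull_if_fresh(socrata_token, dataset_id, logger, custom_run=False):
    # Check the Socrata metadata first; only fetch the full dataset when something
    # new has been posted. Custom/patch runs bypass the check and read from backups.
    recently_updated = True if custom_run else check_last_updated(socrata_token, dataset_id, logger)
    if not recently_updated:
        logger.info("No recent update; skipping source API fetch")
        return None
    return fetch_dataset(socrata_token, dataset_id)  # hypothetical fetch helper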

Contributor Author


That was the original thought, but previously you brought up what happens if there are multiple failures in a row, and the solution for now was to squirrel the data away regardless, but at least avoid duplicating both the raw and the processed data.

@@ -144,24 +210,31 @@ def pull_preliminary_nhsn_data(
    pd.DataFrame
        Dataframe as described above.
    """
    # Pull data from Socrata API
Contributor


issue: pull_preliminary_nhsn_data and pull_nhsn_data are really similar. I think it will become a maintenance issue to keep both. We should probably keep these two functions as wrappers of a shared fn that takes an is_prelim flag (or similar).

Diff of the two fns:

[screenshot: Screen Shot 2025-02-11 at 15 20 55]
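(A hedged sketch of the wrapper idea; _pull_nhsn_data_impl and the placeholder constants below are hypothetical, not the module's actual code.)

MAIN_DATASET_ID = "main-dataset-id"        # placeholder
PRELIM_DATASET_ID = "prelim-dataset-id"    # placeholder

def _pull_nhsn_data_impl(socrata_token, backup_dir, custom_run, issue_date, logger, prelim_flag=False):
    """Shared pull logic; the dataset id, type dict, etc. are chosen based on the flag."""
    dataset_id = PRELIM_DATASET_ID if prelim_flag else MAIN_DATASET_ID
    # ... shared fetch / backup / type-casting logic would go here ...
    return dataset_id  # stand-in return value for the sketch

def pull_nhsn_data(socrata_token, backup_dir, custom_run, issue_date, logger):
    return _pull_nhsn_data_impl(socrata_token, backup_dir, custom_run, issue_date, logger, prelim_flag=False)

def pull_preliminary_nhsn_data(socrata_token, backup_dir, custom_run, issue_date, logger):
    return _pull_nhsn_data_impl(socrata_token, backup_dir, custom_run, issue_date, logger, prelim_flag=True)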

Contributor Author

@aysim319 aysim319 Feb 11, 2025


I know they're similar; I thought about it and went back and forth, but I figured maybe in the future something different would be going on in each, so I kept them separate. I'm not too concerned about this, since we'll be slowly deprecating this codebase.

@nmdefries nmdefries self-requested a review February 11, 2025 21:35
@aysim319 aysim319 merged commit f543fe9 into main Feb 11, 2025
17 checks passed
@aysim319 aysim319 deleted the 2085-add-proportions-nhsn branch February 14, 2025 15:50
Development

Successfully merging this pull request may close these issues.

add retry for 50x errors for the Socrata API and also throw an error when that happens