2085 add proportions nhsn #2111
Changes from 10 commits
@@ -1,8 +1,12 @@
 # -*- coding: utf-8 -*-
 """Functions for pulling NSSP ER data."""
 import logging
+import random
+import time
+from datetime import datetime, timedelta
 from pathlib import Path
 from typing import Optional
+from urllib.error import HTTPError

 import pandas as pd
 from delphi_utils import create_backup_csv
@@ -11,20 +15,45 @@
 from .constants import MAIN_DATASET_ID, PRELIM_DATASET_ID, PRELIM_SIGNALS_MAP, PRELIM_TYPE_DICT, SIGNALS_MAP, TYPE_DICT


-def pull_data(socrata_token: str, dataset_id: str):
+def check_last_updated(client, dataset_id, logger):
+    """Check the last-updated timestamp to determine whether data should be pulled."""
+    try:
+        response = client.get_metadata(dataset_id)
+    except HTTPError as err:
+        if err.code == 503:
+            time.sleep(2 + random.randint(0, 1000) / 1000.0)
+            response = client.get_metadata(dataset_id)
+        else:
+            raise err
+
+    updated_timestamp = datetime.utcfromtimestamp(int(response["rowsUpdatedAt"]))
+    now = datetime.utcnow()
+    recently_updated = (now - updated_timestamp) < timedelta(days=1)
Review thread on the recently-updated check:

Reviewer: issue: I think this "recently-updated" logic is sufficient but not robust. For example, if we fail to pull data for multiple days, then on the next run we would not pull data we had never seen before if it was not posted in the last day. The more robust solution would be to save the last pull's timestamp.

Author (aysim319): Definitely makes sense, and something I didn't think about! The only thing I did differently was use the API instead of scanning the files, since I imagine the file list is going to grow, and it doesn't make much sense to scan the file list every day.

Reviewer: Yeah, checking the API could make sense, too. The one thing I'd caution is timezones -- your previous approach explicitly used UTC on both "old" and "now" timestamps, but I don't know what the API uses. Second, the API only has dates, not times. Would that ever cause problems? E.g., if we want to check for updates multiple times a day.

Author (aysim319): Since the data and the dates are just dates and not datetimes, I didn't take timezones into account... hmm, I also don't know for sure which timezone; I believe it's EST, but I have to double check. Since this is data that generally updates weekly, I was planning on running just once a day, so I thought timezones wouldn't be as much of an issue.

Reviewer: Okay, given these complications, I'm thinking reading/writing to a file is easier. We wouldn't need to keep a complete list of all update datetimes ever, just the single most recent datetime. So the file wouldn't keep getting bigger and bigger; we could just read a single line. This lets us store a UTC date (no timezones to worry about), with no API date-processing to worry about, and we can store a datetime to be extra precise.

Author (aysim319): I wasn't a fan of having metadata files; it seems like overkill and introduces more complexity than I would like. So after talking things through with Nolan just now, I decided to simplify the logic and create backups daily, but still do a simple recently-updated check before actually continuing processing and creating the CSV files, so that if there are outages after the initial pulls, we can go back and do patches for them. Nolan also mentioned that in the future we could look into creating a generic tool/script specifically to dedup things, and I like that direction since it would separate the complexity away from this code base.
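A minimal sketch of the file-based approach the reviewer describes, assuming a hypothetical single-line state file holding the most recent pull datetime in naive UTC (to match the utcfromtimestamp/utcnow usage in the diff); the path and function names are illustrative, not part of the PR:

from datetime import datetime
from pathlib import Path

# Hypothetical location of the single-line state file (not part of the PR).
LAST_PULL_FILE = Path("backup_dir/last_pull_utc.txt")


def read_last_pull() -> datetime:
    """Read the single most recent pull datetime (naive UTC); fall back to the epoch."""
    if LAST_PULL_FILE.exists():
        return datetime.fromisoformat(LAST_PULL_FILE.read_text().strip())
    return datetime(1970, 1, 1)


def record_pull(pulled_at: datetime) -> None:
    """Overwrite the file with the latest pull datetime, so the file never grows."""
    LAST_PULL_FILE.write_text(pulled_at.isoformat())


def should_pull(rows_updated_at: datetime) -> bool:
    """Pull whenever the source was updated after the last successful pull."""
    return rows_updated_at > read_last_pull()

This would make a missed day self-correcting: anything updated after the last successful pull gets picked up on the next run, however long the gap.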
+
+    prelim_prefix = "Preliminary " if dataset_id == PRELIM_DATASET_ID else ""
+    if recently_updated:
+        logger.info(f"{prelim_prefix}NHSN data was recently updated; Pulling data", updated_timestamp=updated_timestamp)
+    else:
+        logger.info(f"{prelim_prefix}NHSN data is stale; Skipping", updated_timestamp=updated_timestamp)
+    return recently_updated
+
+
+def pull_data(socrata_token: str, dataset_id: str, logger):
     """Pull data from Socrata API."""
     client = Socrata("data.cdc.gov", socrata_token)
-    results = []
-    offset = 0
-    limit = 50000  # maximum limit allowed by SODA 2.0
-    while True:
-        page = client.get(dataset_id, limit=limit, offset=offset)
-        if not page:
-            break  # exit the loop if no more results
-        results.extend(page)
-        offset += limit
-
-    df = pd.DataFrame.from_records(results)
+    recently_updated = check_last_updated(client, "ua7e-t2fy", logger)
+
+    df = pd.DataFrame()
+    if recently_updated:
+        results = []
+        offset = 0
+        limit = 50000  # maximum limit allowed by SODA 2.0
+        while True:
+            page = client.get(dataset_id, limit=limit, offset=offset)
+            if not page:
+                break  # exit the loop if no more results
+            results.extend(page)
+            offset += limit
+
+        df = pd.DataFrame.from_records(results)
     return df
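For orientation, a rough sketch of invoking the new pull_data signature standalone. The kwargs-style logger.info calls above presuppose a structured logger such as delphi_utils's get_structured_logger; the delphi_nhsn module path is an assumption here, not something stated in the diff:

from delphi_utils import get_structured_logger

from delphi_nhsn.constants import MAIN_DATASET_ID  # assumed module path
from delphi_nhsn.pull import pull_data  # assumed module path

logger = get_structured_logger(__name__)
# The first argument is a Socrata app token for data.cdc.gov.
df = pull_data("my-socrata-app-token", MAIN_DATASET_ID, logger)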
@@ -89,7 +118,7 @@ def pull_nhsn_data(
     """
     # Pull data from Socrata API
     df = (
-        pull_data(socrata_token, dataset_id=MAIN_DATASET_ID)
+        pull_data(socrata_token, MAIN_DATASET_ID, logger)
         if not custom_run
         else pull_data_from_file(backup_dir, issue_date, logger, prelim_flag=False)
     )
@@ -144,8 +173,9 @@ def pull_preliminary_nhsn_data(
     pd.DataFrame
         Dataframe as described above.
     """
+    # Pull data from Socrata API
Review thread on the preliminary pull (the reviewer noted the two pull functions are nearly identical; a hypothetical sketch of a shared helper follows this hunk):

Author (aysim319): I know they're similar; I thought about it and went back and forth on it, but I figured that in the future there might be something different going on, so I kept them separate. I'm not too concerned about this, since we'll be slowly deprecating this codebase.
     df = (
-        pull_data(socrata_token, dataset_id=PRELIM_DATASET_ID)
+        pull_data(socrata_token, PRELIM_DATASET_ID, logger)
         if not custom_run
         else pull_data_from_file(backup_dir, issue_date, logger, prelim_flag=True)
     )
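Since the two call sites differ only in the dataset id and the prelim flag, a shared helper along these lines (hypothetical, not in the PR; it assumes the names already defined in this module) could remove the duplication:

def _pull_df(socrata_token, backup_dir, custom_run, issue_date, logger, prelim: bool):
    """Hypothetical helper deduplicating pull_nhsn_data / pull_preliminary_nhsn_data."""
    dataset_id = PRELIM_DATASET_ID if prelim else MAIN_DATASET_ID
    return (
        pull_data(socrata_token, dataset_id, logger)
        if not custom_run
        else pull_data_from_file(backup_dir, issue_date, logger, prelim_flag=prelim)
    )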