-
Notifications
You must be signed in to change notification settings - Fork 16
lint: minor code clean to nwss #1954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,10 +9,9 @@ | |
SIGNALS, | ||
PROVIDER_NORMS, | ||
METRIC_SIGNALS, | ||
METRIC_DATES, | ||
SAMPLE_SITE_NAMES, | ||
SIG_DIGITS, | ||
NEWLINE, | ||
TYPE_DICT, | ||
TYPE_DICT_METRIC, | ||
) | ||
|
||
|
||
|
@@ -35,34 +34,29 @@ def sig_digit_round(value, n_digits): | |
return result | ||
|
||
|
||
def construct_typedicts(): | ||
"""Create the type conversion dictionary for both dataframes.""" | ||
# basic type conversion | ||
type_dict = {key: float for key in SIGNALS} | ||
type_dict["timestamp"] = "datetime64[ns]" | ||
# metric type conversion | ||
signals_dict_metric = {key: float for key in METRIC_SIGNALS} | ||
metric_dates_dict = {key: "datetime64[ns]" for key in METRIC_DATES} | ||
type_dict_metric = {**metric_dates_dict, **signals_dict_metric, **SAMPLE_SITE_NAMES} | ||
return type_dict, type_dict_metric | ||
|
||
|
||
def warn_string(df, type_dict): | ||
"""Format the warning string.""" | ||
return f""" | ||
def convert_df_type(df, type_dict, logger): | ||
"""Convert types and warn if there are unexpected columns.""" | ||
try: | ||
df = df.astype(type_dict) | ||
except KeyError as exc: | ||
newline = "\n" | ||
raise KeyError( | ||
f""" | ||
Expected column(s) missed, The dataset schema may | ||
have changed. Please investigate and amend the code. | ||
|
||
Columns needed: | ||
{NEWLINE.join(sorted(type_dict.keys()))} | ||
expected={newline.join(sorted(type_dict.keys()))} | ||
|
||
Columns available: | ||
{NEWLINE.join(sorted(df.columns))} | ||
received={newline.join(sorted(df.columns))} | ||
""" | ||
) from exc | ||
if new_columns := set(df.columns) - set(type_dict.keys()): | ||
logger.info("New columns found in NWSS dataset.", new_columns=new_columns) | ||
return df | ||
|
||
|
||
def reformat(df, df_metric): | ||
"""Add columns from df_metric to df, and rename some columns. | ||
"""Add columns from df_metric to df, and rename some columns. | ||
|
||
Specifically the population and METRIC_SIGNAL columns, and renames date_start to timestamp. | ||
""" | ||
|
@@ -80,27 +74,16 @@ def reformat(df, df_metric): | |
return df | ||
|
||
|
||
def drop_unnormalized(df): | ||
dshemetov marked this conversation as resolved.
Show resolved
Hide resolved
|
||
"""Drop unnormalized. | ||
|
||
mutate `df` to no longer have rows where the normalization scheme isn't actually identified, | ||
as we can't classify the kind of signal | ||
""" | ||
return df[~df["normalization"].isna()] | ||
|
||
|
||
def add_identifier_columns(df): | ||
"""Add identifier columns. | ||
|
||
Add columns to get more detail than key_plot_id gives; | ||
specifically, state, and `provider_normalization`, which gives the signal identifier | ||
""" | ||
df["state"] = df.key_plot_id.str.extract( | ||
r"_(\w\w)_" | ||
) # a pair of alphanumerics surrounded by _ | ||
df["provider"] = df.key_plot_id.str.extract( | ||
r"(.*)_[a-z]{2}_" | ||
) # anything followed by state ^ | ||
# a pair of alphanumerics surrounded by _ | ||
df["state"] = df.key_plot_id.str.extract(r"_(\w\w)_") | ||
# anything followed by state ^ | ||
df["provider"] = df.key_plot_id.str.extract(r"(.*)_[a-z]{2}_") | ||
df["signal_name"] = df.provider + "_" + df.normalization | ||
|
||
|
||
|
@@ -120,7 +103,7 @@ def check_endpoints(df): | |
) | ||
|
||
|
||
def pull_nwss_data(token: str): | ||
def pull_nwss_data(token: str, logger): | ||
"""Pull the latest NWSS Wastewater data, and conforms it into a dataset. | ||
|
||
The output dataset has: | ||
|
@@ -141,11 +124,6 @@ def pull_nwss_data(token: str): | |
pd.DataFrame | ||
Dataframe as described above. | ||
""" | ||
# Constants | ||
keep_columns = [*SIGNALS, *METRIC_SIGNALS] | ||
# concentration key types | ||
type_dict, type_dict_metric = construct_typedicts() | ||
|
||
# Pull data from Socrata API | ||
client = Socrata("data.cdc.gov", token) | ||
results_concentration = client.get("g653-rqe2", limit=10 ** 10) | ||
|
@@ -154,19 +132,14 @@ def pull_nwss_data(token: str): | |
df_concentration = pd.DataFrame.from_records(results_concentration) | ||
df_concentration = df_concentration.rename(columns={"date": "timestamp"}) | ||
|
||
try: | ||
df_concentration = df_concentration.astype(type_dict) | ||
except KeyError as exc: | ||
raise ValueError(warn_string(df_concentration, type_dict)) from exc | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd rather not clutter this function with repeated text when the contents of the warning don't really matter for this function. Definitely should be a |
||
# Schema checks. | ||
df_concentration = convert_df_type(df_concentration, TYPE_DICT, logger) | ||
df_metric = convert_df_type(df_metric, TYPE_DICT_METRIC, logger) | ||
|
||
try: | ||
df_metric = df_metric.astype(type_dict_metric) | ||
except KeyError as exc: | ||
raise ValueError(warn_string(df_metric, type_dict_metric)) from exc | ||
# Drop sites without a normalization scheme. | ||
df = df_concentration[~df_concentration["normalization"].isna()] | ||
|
||
# if the normalization scheme isn't recorded, why is it even included as a sample site? | ||
df = drop_unnormalized(df_concentration) | ||
# pull 2 letter state labels out of the key_plot_id labels | ||
# Pull 2 letter state labels out of the key_plot_id labels. | ||
add_identifier_columns(df) | ||
|
||
# move population and metric signals over to df | ||
|
@@ -180,13 +153,14 @@ def pull_nwss_data(token: str): | |
# otherwise, best to assume some value rather than break the data) | ||
df.population_served = df.population_served.ffill() | ||
check_endpoints(df) | ||
keep_columns.extend( | ||
[ | ||
"timestamp", | ||
"state", | ||
"population_served", | ||
"normalization", | ||
"provider", | ||
] | ||
) | ||
|
||
keep_columns = [ | ||
*SIGNALS, | ||
*METRIC_SIGNALS, | ||
"timestamp", | ||
"state", | ||
"population_served", | ||
"normalization", | ||
"provider", | ||
] | ||
return df[keep_columns] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.