Commit eba4ea2

minhkhul and melange396 authored

Delete secondary nssp signals (#2101)

* delete secondary
* remove docs
* Update nssp/DETAILS.md (Co-authored-by: george <[email protected]>)
* Update nssp/DETAILS.md (Co-authored-by: george <[email protected]>)

Co-authored-by: george <[email protected]>

1 parent 156e32e · commit eba4ea2

File tree

6 files changed: +5 −191 lines changed

nssp/DETAILS.md

Lines changed: 2 additions & 15 deletions
@@ -2,29 +2,16 @@
 
 We import the NSSP Emergency Department Visit data, including percentage and smoothed percentage of ER visits attributable to a given pathogen, from the CDC website. The data is provided at the county level, state level and national level; we do a population-weighted mean to aggregate from county data up to the HRR and MSA levels.
 
-There are 2 sources we grab data from for nssp:
-- Primary source: https://data.cdc.gov/Public-Health-Surveillance/NSSP-Emergency-Department-Visit-Trajectories-by-St/rdmq-nq56/data_preview
-- Secondary (2023RVR) source: https://data.cdc.gov/Public-Health-Surveillance/2023-Respiratory-Virus-Response-NSSP-Emergency-Dep/7mra-9cq9/data_preview
-There are 8 signals output from the primary source and 4 output from secondary. There are no smoothed signals from secondary source.
-
-Note that the data produced from secondary source are mostly the same as their primary source equivalent, with past analysis shows around 95% of datapoints having less than 0.1 value difference and the other 5% having a 0.1 to 1.2 value difference.
+NSSP source data: https://data.cdc.gov/Public-Health-Surveillance/NSSP-Emergency-Department-Visit-Trajectories-by-St/rdmq-nq56/data_preview
 
 ## Geographical Levels
-Primary source:
 * `state`: reported from source using two-letter postal code
 * `county`: reported from source using fips code
 * `national`: just `us` for now, reported from source
 * `hhs`, `hrr`, `msa`: not reported from source, so we computed them from county-level data using a weighted mean. Each county is assigned a weight equal to its population in the last census (2020).
 
-Secondary (2023RVR) source:
-* `state`: reported from source
-* `hhs`: reported from source
-* `national`: reported from source
-
 ## Metrics
 * `percent_visits_covid`, `percent_visits_rsv`, `percent_visits_influenza`: percentage of emergency department patient visits for specified pathogen.
 * `percent_visits_combined`: sum of the three percentages of visits for flu, rsv and covid.
 * `smoothed_percent_visits_covid`, `smoothed_percent_visits_rsv`, `smoothed_percent_visits_influenza`: 3 week moving average of the percentage of emergency department patient visits for specified pathogen.
-* `smoothed_percent_visits_combined`: 3 week moving average of the sum of the three percentages of visits for flu, rsv and covid.
-* `percent_visits_covid_2023RVR`, `percent_visits_rsv_2023RVR`, `percent_visits_influenza_2023RVR`: Taken from secondary source, percentage of emergency department patient visits for specified pathogen.
-* `percent_visits_combined_2023RVR`: Taken from secondary source, sum of the three percentages of visits for flu, rsv and covid.
+* `smoothed_percent_visits_combined`: 3 week moving average of the sum of the three percentages of visits for flu, rsv and covid.
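DETAILS.md above says county values are rolled up to HRR/MSA with a population-weighted mean, each county weighted by its 2020 census population. A minimal pandas sketch of that aggregation; the column names and sample numbers here are hypothetical, not the indicator's actual schema:

```python
import pandas as pd

# Hypothetical county-level rows: `val` is a percent of ED visits,
# `pop` the county's 2020 census population, `msa` its metro area.
counties = pd.DataFrame({
    "msa": ["A", "A", "B"],
    "val": [2.0, 4.0, 1.0],
    "pop": [100, 300, 50],
})

# Weight each county's value by its population, then divide the summed
# weighted values by the total population of each aggregate region.
counties["weighted"] = counties["val"] * counties["pop"]
msa = counties.groupby("msa").agg(weighted=("weighted", "sum"), pop=("pop", "sum"))
msa["val"] = msa["weighted"] / msa["pop"]

print(msa["val"])  # A -> 3.5, B -> 1.0
```

The same weighted-sum-over-total-weight shape applies whether the target geography is HRR, MSA, or HHS region; only the grouping column changes.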

nssp/README.md

Lines changed: 1 addition & 3 deletions
@@ -2,9 +2,7 @@
 
 We import the NSSP Emergency Department Visit data, currently only the smoothed concentration, from the CDC website, aggregate to the state and national level from the wastewater sample site level, and export the aggregated data.
 
-There are 2 sources we grab data from for nssp:
-- Primary source: https://data.cdc.gov/Public-Health-Surveillance/NSSP-Emergency-Department-Visit-Trajectories-by-St/rdmq-nq56/data_preview
-- Secondary source: https://data.cdc.gov/Public-Health-Surveillance/2023-Respiratory-Virus-Response-NSSP-Emergency-Dep/7mra-9cq9/data_preview
+NSSP source data: https://data.cdc.gov/Public-Health-Surveillance/NSSP-Emergency-Department-Visit-Trajectories-by-St/rdmq-nq56/data_preview
 
 For details see the `DETAILS.md` file in this directory.
 

nssp/delphi_nssp/constants.py

Lines changed: 0 additions & 26 deletions
@@ -41,29 +41,3 @@
         "fips": str,
     }
 )
-
-SECONDARY_COLS_MAP = {
-    "week_end": "timestamp",
-    "geography": "geo_value",
-    "percent_visits": "val",
-    "pathogen": "signal",
-}
-
-SECONDARY_SIGNALS_MAP = {
-    "COVID-19": "pct_ed_visits_covid_2023RVR",
-    "Influenza": "pct_ed_visits_influenza_2023RVR",
-    "RSV": "pct_ed_visits_rsv_2023RVR",
-    "Combined": "pct_ed_visits_combined_2023RVR",
-}
-
-SECONDARY_SIGNALS = [val for (key, val) in SECONDARY_SIGNALS_MAP.items()]
-SECONDARY_GEOS = ["state", "nation", "hhs"]
-
-SECONDARY_TYPE_DICT = {
-    "timestamp": "datetime64[ns]",
-    "geo_value": str,
-    "val": float,
-    "geo_type": str,
-    "signal": str,
-}
-SECONDARY_KEEP_COLS = [key for (key, val) in SECONDARY_TYPE_DICT.items()]

nssp/delphi_nssp/pull.py

Lines changed: 0 additions & 54 deletions
@@ -10,10 +10,6 @@
 
 from .constants import (
     NEWLINE,
-    SECONDARY_COLS_MAP,
-    SECONDARY_KEEP_COLS,
-    SECONDARY_SIGNALS_MAP,
-    SECONDARY_TYPE_DICT,
     SIGNALS,
     SIGNALS_MAP,
     TYPE_DICT,
@@ -96,53 +92,3 @@ def pull_nssp_data(socrata_token: str, backup_dir: str, custom_run: bool, logger
 
     keep_columns = ["timestamp", "geography", "county", "fips"]
     return df_ervisits[SIGNALS + keep_columns]
-
-
-def secondary_pull_nssp_data(
-    socrata_token: str, backup_dir: str, custom_run: bool, logger: Optional[logging.Logger] = None
-):
-    """Pull the latest NSSP ER visits secondary dataset.
-
-    https://data.cdc.gov/Public-Health-Surveillance/2023-Respiratory-Virus-Response-NSSP-Emergency-Dep/7mra-9cq9/data_preview
-
-    The output dataset has:
-
-    - Each row corresponds to a single observation
-
-    Parameters
-    ----------
-    socrata_token: str
-        My App Token for pulling the NSSP data (could be the same as the nchs data)
-
-    Returns
-    -------
-    pd.DataFrame
-        Dataframe as described above.
-    """
-    socrata_results = pull_with_socrata_api(socrata_token, "7mra-9cq9")
-    df_ervisits = pd.DataFrame.from_records(socrata_results)
-    create_backup_csv(df_ervisits, backup_dir, custom_run, sensor="secondary", logger=logger)
-    df_ervisits = df_ervisits.rename(columns=SECONDARY_COLS_MAP)
-
-    # geo_type is not provided in the dataset, so we infer it from the geo_value
-    # which is either state names, "National" or hhs region numbers
-    df_ervisits["geo_type"] = "state"
-
-    df_ervisits.loc[df_ervisits["geo_value"] == "National", "geo_type"] = "nation"
-
-    hhs_region_mask = df_ervisits["geo_value"].str.lower().str.startswith("region ")
-    df_ervisits.loc[hhs_region_mask, "geo_value"] = df_ervisits.loc[hhs_region_mask, "geo_value"].str.replace(
-        "Region ", ""
-    )
-    df_ervisits.loc[hhs_region_mask, "geo_type"] = "hhs"
-
-    df_ervisits["signal"] = df_ervisits["signal"].map(SECONDARY_SIGNALS_MAP)
-
-    df_ervisits = df_ervisits[SECONDARY_KEEP_COLS]
-
-    try:
-        df_ervisits = df_ervisits.astype(SECONDARY_TYPE_DICT)
-    except KeyError as exc:
-        raise ValueError(warn_string(df_ervisits, SECONDARY_TYPE_DICT)) from exc
-
-    return df_ervisits
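The removed `secondary_pull_nssp_data` had to infer `geo_type` from the raw geography string, since the 2023RVR dataset does not carry one: values are state names, "National", or "Region N" HHS labels. Restated as a standalone function (my own sketch of the rule, not code from the repo):

```python
def infer_geo(geo_value: str) -> tuple:
    """Classify a raw geography string as (geo_type, normalized_value).

    Mirrors the deleted pandas logic: "National" maps to nation,
    "Region N" maps to hhs with the "Region " prefix stripped, and
    anything else is assumed to be a state name.
    """
    if geo_value == "National":
        return ("nation", geo_value)
    if geo_value.lower().startswith("region "):
        return ("hhs", geo_value[len("Region "):])
    return ("state", geo_value)

print(infer_geo("Region 4"))  # ('hhs', '4')
```

The deleted test below checks exactly these invariants: no geo_value starting with "Region" survives, and every `nation` row carries the value "National".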

nssp/delphi_nssp/run.py

Lines changed: 2 additions & 49 deletions
@@ -31,8 +31,8 @@
 from delphi_utils.geomap import GeoMapper
 from delphi_utils.nancodes import add_default_nancodes
 
-from .constants import AUXILIARY_COLS, CSV_COLS, GEOS, SECONDARY_GEOS, SECONDARY_SIGNALS, SIGNALS
-from .pull import pull_nssp_data, secondary_pull_nssp_data
+from .constants import AUXILIARY_COLS, CSV_COLS, GEOS, SIGNALS
+from .pull import pull_nssp_data
 
 
 def add_needed_columns(df, col_names=None):
@@ -141,52 +141,5 @@ def run_module(params):
     if len(dates) > 0:
         run_stats.append((max(dates), len(dates)))
 
-    logger.info("Generating secondary signals")
-    secondary_df_pull = secondary_pull_nssp_data(socrata_token, backup_dir, custom_run, logger)
-    for signal in SECONDARY_SIGNALS:
-        secondary_df_pull_signal = secondary_df_pull[secondary_df_pull["signal"] == signal]
-        if secondary_df_pull_signal.empty:
-            logger.warning("No data found for signal", signal=signal)
-            continue
-        for geo in SECONDARY_GEOS:
-            df = secondary_df_pull_signal.copy()
-            logger.info("Generating signal and exporting to CSV", geo_type=geo, signal=signal)
-            if geo == "state":
-                df = df[(df["geo_type"] == "state")]
-                df["geo_id"] = df["geo_value"].apply(
-                    lambda x: (
-                        us.states.lookup(x).abbr.lower()
-                        if us.states.lookup(x)
-                        else ("dc" if x == "District of Columbia" else x)
-                    )
-                )
-                unexpected_state_names = df[df["geo_id"] == df["geo_value"]]
-                if unexpected_state_names.shape[0] > 0:
-                    logger.error(
-                        "Unexpected state names",
-                        unexpected_state_names=unexpected_state_names["geo_value"].unique(),
-                    )
-                    raise RuntimeError
-            elif geo == "nation":
-                df = df[(df["geo_type"] == "nation")]
-                df["geo_id"] = "us"
-            elif geo == "hhs":
-                df = df[(df["geo_type"] == "hhs")]
-                df["geo_id"] = df["geo_value"]
-            # add se, sample_size, and na codes
-            missing_cols = set(CSV_COLS) - set(df.columns)
-            df = add_needed_columns(df, col_names=list(missing_cols))
-            df_csv = df[CSV_COLS + ["timestamp"]]
-            # actual export
-            dates = create_export_csv(
-                df_csv,
-                geo_res=geo,
-                export_dir=export_dir,
-                sensor=signal,
-                weekly_dates=True,
-            )
-            if len(dates) > 0:
-                run_stats.append((max(dates), len(dates)))
-
     ## log this indicator run
     logging(start_time, run_stats, logger)
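The deleted state branch in run.py mapped full state names to lowercase postal codes via the `us` package, special-casing District of Columbia and letting unexpected names fall through unchanged so they could be detected (and a RuntimeError raised). A dependency-free sketch of the same normalization; the lookup table here is a hypothetical two-entry subset, not a full state table:

```python
# Hypothetical subset of a state-name -> postal-code table; the deleted
# code used us.states.lookup() from the `us` package instead.
STATE_ABBRS = {"Ohio": "oh", "California": "ca"}

def to_geo_id(name: str) -> str:
    """Return the lowercase postal code for a state name.

    District of Columbia is handled explicitly because the deleted
    lookup missed it. Unknown names are returned unchanged so callers
    can detect them, mirroring run.py's geo_id == geo_value check.
    """
    if name == "District of Columbia":
        return "dc"
    return STATE_ABBRS.get(name, name)

print(to_geo_id("Ohio"))  # "oh"
```

Returning the input unchanged (rather than raising inside the mapper) keeps the error handling where the original put it: the caller compares `geo_id` to `geo_value` and logs every unexpected name before failing.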

nssp/tests/test_pull.py

Lines changed: 0 additions & 44 deletions
@@ -7,16 +7,11 @@
 
 from delphi_nssp.pull import (
     pull_nssp_data,
-    secondary_pull_nssp_data,
     pull_with_socrata_api,
 )
 
 from delphi_nssp.constants import (
     NEWLINE,
-    SECONDARY_COLS_MAP,
-    SECONDARY_KEEP_COLS,
-    SECONDARY_SIGNALS_MAP,
-    SECONDARY_TYPE_DICT,
     SIGNALS,
     SIGNALS_MAP,
     TYPE_DICT,
@@ -81,44 +76,5 @@ def test_pull_nssp_data(self, mock_socrata, caplog):
         for file in backup_files:
             os.remove(file)
 
-    @patch("delphi_nssp.pull.Socrata")
-    def test_secondary_pull_nssp_data(self, mock_socrata):
-        today = pd.Timestamp.today().strftime("%Y%m%d")
-        backup_dir = 'test_raw_data_backups'
-
-        # Load test data
-        with open("test_data/secondary_page.txt", "r") as f:
-            test_data = json.load(f)
-
-        # Mock Socrata client and its get method
-        mock_client = MagicMock()
-        mock_client.get.side_effect = [test_data, []]  # Return test data on first call, empty list on second call
-        mock_socrata.return_value = mock_client
-
-        custom_run = False
-        logger = get_structured_logger()
-        # Call function with test token
-        test_token = "test_token"
-        result = secondary_pull_nssp_data(test_token, backup_dir, custom_run, logger)
-        # print(result)
-
-        # Check that Socrata client was initialized with correct arguments
-        mock_socrata.assert_called_once_with("data.cdc.gov", test_token)
-
-        # Check that get method was called with correct arguments
-        mock_client.get.assert_any_call("7mra-9cq9", limit=50000, offset=0)
-
-        for col in SECONDARY_KEEP_COLS:
-            assert result[col].notnull().all(), f"{col} has rogue NaN"
-
-        assert result[result['geo_value'].str.startswith('Region')].empty, "'Region ' need to be removed from geo_value for geo_type 'hhs'"
-        assert (result[result['geo_type'] == 'nation']['geo_value'] == 'National').all(), "All rows with geo_type 'nation' must have geo_value 'National'"
-
-        # Check that backup file was created
-        backup_files = glob.glob(f"{backup_dir}/{today}*")
-        assert len(backup_files) == 2, "Backup file was not created"
-        for file in backup_files:
-            os.remove(file)
-
 if __name__ == "__main__":
     unittest.main()

0 commit comments