nssp pipeline code #1952

Merged
merged 71 commits into from
Jun 10, 2024
Changes from 9 commits
d4ca5ca
to make nssp run in staging
minhkhul Mar 18, 2024
11ff7d0
add nssp to Jenkinsfile
minhkhul Mar 20, 2024
d76d6ce
nssp_token name change
minhkhul Mar 20, 2024
c85c5dd
et code
minhkhul Apr 17, 2024
3014997
Update nssp/delphi_nssp/run.py
minhkhul Apr 19, 2024
4a90591
Update nssp/README.md
minhkhul Apr 19, 2024
638c51d
Update nssp/DETAILS.md
minhkhul Apr 19, 2024
82367b4
Update nssp/delphi_nssp/__main__.py
minhkhul Apr 20, 2024
68a6154
Update nssp/delphi_nssp/pull.py
minhkhul Apr 22, 2024
8851bd6
Update nssp/delphi_nssp/run.py
minhkhul Apr 22, 2024
39e2cbd
readme update
minhkhul Apr 22, 2024
95bdac1
column names mapping + signals name standardization to fit with other…
minhkhul Apr 22, 2024
583e24e
improve readability
minhkhul Apr 23, 2024
971968e
Add type_dict constant
minhkhul Apr 24, 2024
de9ef62
more type_dict
minhkhul Apr 24, 2024
900fcc9
add more unit test pull
minhkhul Apr 25, 2024
0678564
data for unit test of pull
minhkhul Apr 25, 2024
38cd523
hrr + msa geos
minhkhul Apr 25, 2024
5807cdb
use enumerate for clarity
minhkhul Apr 25, 2024
b4ec831
Merge pull request #1950 from cmu-delphi/nssp_staging
minhkhul Apr 25, 2024
e974afb
set nssp sircal max_age to 13 days
minhkhul Apr 25, 2024
85e7b8b
set nssp sircal max_age to 15 days, to account for nighttime run
minhkhul Apr 25, 2024
2247e1b
set nssp sircal max_age to 15 days, to account for nighttime run
minhkhul Apr 25, 2024
2bfd5fc
add validation to params
minhkhul Apr 25, 2024
c65796f
Update nssp/DETAILS.md
minhkhul Apr 26, 2024
a7a869d
Update nssp/delphi_nssp/constants.py
minhkhul Apr 26, 2024
c074a45
et code
minhkhul Apr 17, 2024
1be5b28
Update nssp/delphi_nssp/run.py
minhkhul Apr 19, 2024
bd5c782
Update nssp/README.md
minhkhul Apr 19, 2024
86acc03
Update nssp/DETAILS.md
minhkhul Apr 19, 2024
9a53923
Update nssp/delphi_nssp/__main__.py
minhkhul Apr 20, 2024
54094eb
Update nssp/delphi_nssp/pull.py
minhkhul Apr 22, 2024
1552504
Update nssp/delphi_nssp/run.py
minhkhul Apr 22, 2024
6ccddfc
readme update
minhkhul Apr 22, 2024
e560393
column names mapping + signals name standardization to fit with other…
minhkhul Apr 22, 2024
309b6c7
improve readability
minhkhul Apr 23, 2024
813d289
Add type_dict constant
minhkhul Apr 24, 2024
db1f8ae
more type_dict
minhkhul Apr 24, 2024
2a357c8
add more unit test pull
minhkhul Apr 25, 2024
d968789
data for unit test of pull
minhkhul Apr 25, 2024
24a638d
hrr + msa geos
minhkhul Apr 25, 2024
b11c528
use enumerate for clarity
minhkhul Apr 25, 2024
8169655
to make nssp run in staging
minhkhul Mar 18, 2024
fde9264
add nssp to Jenkinsfile
minhkhul Mar 20, 2024
bd545c8
nssp_token name change
minhkhul Mar 20, 2024
7a3807e
set nssp sircal max_age to 15 days, to account for nighttime run
minhkhul Apr 25, 2024
3623b5f
set nssp sircal max_age to 13 days
minhkhul Apr 25, 2024
daee033
add validation to params
minhkhul Apr 25, 2024
566a826
Update nssp/DETAILS.md
minhkhul Apr 26, 2024
cfa1b94
Update nssp/delphi_nssp/constants.py
minhkhul Apr 26, 2024
425b1fe
nssp correlation rmd and general notebook folder
dsweber2 May 9, 2024
5ce26f0
making Black happy
dsweber2 May 9, 2024
9b87133
update to new geomapper function
dsweber2 May 10, 2024
d53ff83
following 120 line convention everywhere
dsweber2 May 10, 2024
8eb6055
happy linter
dsweber2 May 10, 2024
e32cc55
happy black formatter in nssp
dsweber2 May 10, 2024
adf5df4
drop unneeded nssp tests
dsweber2 May 10, 2024
3559664
updates borked old tests, caught by @dshemetov
dsweber2 May 10, 2024
67601aa
rebase woes and version consistency
dsweber2 May 13, 2024
a79cff8
Update nssp-params-prod.json.j2 min/max lag to 13
minhkhul May 14, 2024
9c6f31b
Update params.json.template min/max lag to 7 and 13
minhkhul May 14, 2024
33a188e
missed column renames for geo_mapper, unneeded index
dsweber2 May 14, 2024
8a4cd18
Merge branch 'main' into nssp
dsweber2 Jun 5, 2024
eb2f000
Merge branch 'main' into nssp
dshemetov Jun 5, 2024
24b25dd
lint+fix: update from linter changes
dshemetov Jun 5, 2024
8daefe6
ci: update ci to lint nssp
dshemetov Jun 5, 2024
ec39773
lint: linter happy
dshemetov Jun 5, 2024
2e178a8
lint: pydocstyle happy
dshemetov Jun 5, 2024
90081df
lint: pydocstyle happy
dshemetov Jun 5, 2024
91b759c
Resolved merge conflicts by accepting all incoming changes
minhkhul Jun 10, 2024
355d65b
pct_visits to pct_ed_visits
minhkhul Jun 10, 2024
22 changes: 22 additions & 0 deletions nssp/.pylintrc
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@

[MESSAGES CONTROL]

disable=logging-format-interpolation,
        too-many-locals,
        too-many-arguments,
        # Allow pytest functions to be part of a class.
        no-self-use,
        # Allow pytest classes to have one test.
        too-few-public-methods

[BASIC]

# Allow arbitrarily short-named variables.
variable-rgx=[a-z_][a-z0-9_]*
argument-rgx=[a-z_][a-z0-9_]*
attr-rgx=[a-z_][a-z0-9_]*

[DESIGN]

# Don't complain about pytest "unused" arguments.
ignored-argument-names=(_.*|run_as_module)
13 changes: 13 additions & 0 deletions nssp/DETAILS.md
@@ -0,0 +1,13 @@
# NSSP data

We import the NSSP Emergency Department Visit data, including percentage and smoothed percentage data, from the CDC website. The data is available at the county, state, and national levels.

## Geographical Levels
* `state`: reported using two-letter postal code
* `county`: reported using FIPS code
* `nation`: just `us` for now

## Metrics
* `percent_visits_covid`, `percent_visits_rsv`, `percent_visits_influenza`: percentage of emergency department patient visits for the specified pathogen.
* `percent_visits_combined`: sum of the three percentages of visits for flu, RSV, and COVID.
* `smoothed_percent_visits_covid`, `smoothed_percent_visits_rsv`, `smoothed_percent_visits_influenza`: 3-week moving average of the percentage of emergency department patient visits for the specified pathogen.
* `smoothed_percent_visits_combined`: 3-week moving average of the sum of the three percentages of visits for flu, RSV, and COVID.
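
As an illustration of the two derived metrics above, the following sketch computes a combined percentage and a 3-week moving average in plain Python. The sample values and the helper name `moving_average` are hypothetical, and the trailing-window convention is an assumption. The source dataset already supplies the smoothed columns (see `SIGNALS_MAP` in `constants.py`), so the pipeline does not need to compute them itself.

```python
from typing import List, Optional


def moving_average(values: List[float], window: int = 3) -> List[Optional[float]]:
    """Trailing moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical weekly percentages for one geography (not real NSSP data).
covid = [1.0, 2.0, 3.0, 4.0]
flu = [0.5, 0.5, 1.0, 1.5]
rsv = [0.5, 0.5, 0.5, 0.5]

# Combined metric: element-wise sum of the three pathogen percentages.
combined = [c + f + r for c, f, r in zip(covid, flu, rsv)]
smoothed_combined = moving_average(combined)
```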
29 changes: 29 additions & 0 deletions nssp/Makefile
@@ -0,0 +1,29 @@
.PHONY: venv lint test clean

dir = $(shell find ./delphi_* -name __init__.py | grep -o 'delphi_[_[:alnum:]]*' | head -1)
venv:
	python3.8 -m venv env

install: venv
	. env/bin/activate; \
	pip install wheel ; \
	pip install -e ../_delphi_utils_python ;\
	pip install -e .

install-ci: venv
	. env/bin/activate; \
	pip install wheel ; \
	pip install ../_delphi_utils_python ;\
	pip install .

lint:
	. env/bin/activate; pylint $(dir)
	. env/bin/activate; pydocstyle $(dir)

test:
	. env/bin/activate ;\
	(cd tests && ../env/bin/pytest --cov=$(dir) --cov-report=term-missing)

clean:
	rm -rf env
	rm -f params.json
75 changes: 75 additions & 0 deletions nssp/README.md
Contributor

Ran these as it suggests and things run fine. The linter is a little angry. Either way

@@ -0,0 +1,75 @@
# NSSP Emergency Department Visit data

We import the NSSP Emergency Department Visit data, including the percentage and smoothed percentage of visits, from the CDC website and export it at the county, state, and national levels.
For details see the `DETAILS.md` file in this directory.

## Create a MyAppToken
`MyAppToken` is required when fetching data from SODA Consumer API
(https://dev.socrata.com/foundry/data.cdc.gov/r8kw-7aab). Follow the
steps below to create a MyAppToken.
- Click the `Sign up for an app token` button in the linked website
- Sign In or Sign Up with Socrata ID
- Click the `Create New App Token` button
- Fill in `Application Name` and `Description` (You can just use delphi_wastewater
for both) and click `Save`
- Copy the `App Token`


## Running the Indicator

The indicator is run by directly executing the Python module contained in this
directory. The safest way to do this is to create a virtual environment,
install the common DELPHI tools, and then install the module and its
dependencies. To do this, run the following command from this directory:

```
make install
```

This command will install the package in editable mode, so you can make changes that
will automatically propagate to the installed package.

All of the user-changeable parameters are stored in `params.json`. To execute
the module and produce the output datasets (by default, in `receiving`), run
the following:

```
env/bin/python -m delphi_nssp
```
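
A minimal sketch of how such a parameter file might be read and checked before a run. The key names (`common.export_dir`, `indicator.socrata_token`) are assumptions for illustration and may not match this indicator's actual `params.json.template`; the module itself reads its parameters with `delphi_utils.read_params`, as shown in `__main__.py`.

```python
import json


def load_params(text: str) -> dict:
    """Parse params JSON and check that the (assumed) required keys are present."""
    params = json.loads(text)
    # Hypothetical required keys; adjust to the real template.
    required = [("common", "export_dir"), ("indicator", "socrata_token")]
    missing = [f"{section}.{key}" for section, key in required if key not in params.get(section, {})]
    if missing:
        raise ValueError(f"params.json is missing: {', '.join(missing)}")
    return params


# Example content with a placeholder token (never commit a real one).
example = '{"common": {"export_dir": "./receiving"}, "indicator": {"socrata_token": "MY_APP_TOKEN"}}'
params = load_params(example)
```

Failing early on a missing key gives a clearer message than a `KeyError` raised partway through a pull.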

If you want to enter the virtual environment in your shell,
you can run `source env/bin/activate`. Run `deactivate` to leave the virtual environment.

Once you are finished, you can remove the virtual environment and
params file with the following:

```
make clean
```

## Testing the code

To run static tests of the code style, run the following command:

```
make lint
```

Unit tests are also included in the module. To execute these, run the following
command from this directory:

```
make test
```

To run individual tests, run the following:

```
(cd tests && ../env/bin/pytest <your_test>.py --cov=delphi_nssp --cov-report=term-missing)
```

The output will show the number of unit tests that passed and failed, along
with the percentage of code covered by the tests.

None of the linting or unit tests should fail. The number of code lines not covered by unit tests should be small,
and those lines should not include critical sub-routines.
38 changes: 38 additions & 0 deletions nssp/REVIEW.md
@@ -0,0 +1,38 @@
## Code Review (Python)

A code review of this module should include a careful look at the code and the
output. To assist in the process, but certainly not in place of it, please
check the following items.

**Documentation**

- [ ] the README.md file template is filled out and currently accurate; it is
possible to load and test the code using only the instructions given
- [ ] minimal docstrings (one line describing what the function does) are
included for all functions; full docstrings describing the inputs and expected
outputs should be given for non-trivial functions

**Structure**

- [ ] code should pass lint checks (`make lint`)
- [ ] any required metadata files are checked into the repository and placed
within the directory `static`
- [ ] any intermediate files that are created and stored by the module should
be placed in the directory `cache`
- [ ] final expected output files to be uploaded to the API are placed in the
`receiving` directory; output files should not be committed to the repository
- [ ] all options and API keys are passed through the file `params.json`
- [ ] template parameter file (`params.json.template`) is checked into the
code; no personal (i.e., usernames) or private (i.e., API keys) information is
included in this template file

**Testing**

- [ ] module can be installed in a new virtual environment (`make install`)
- [ ] reasonably high level of unit test coverage covering all of the main logic
of the code (e.g., missing coverage for raised errors that do not currently seem
possible to reach are okay; missing coverage for options that will be needed are
not)
- [ ] all unit tests run without errors (`make test`)
- [ ] indicator directory has been added to GitHub CI
(`covidcast-indicators/.github/workflows/python-ci.yml`)
Empty file added nssp/cache/.gitignore
Empty file.
14 changes: 14 additions & 0 deletions nssp/delphi_nssp/__init__.py
@@ -0,0 +1,14 @@
# -*- coding: utf-8 -*-
"""Module to pull and clean indicators from the NSSP source.

This file defines the functions that are made public by the module. As the
module is intended to be executed through the main method, these are primarily
for testing.
"""

from __future__ import absolute_import

from . import pull
from . import run

__version__ = "0.1.0"
12 changes: 12 additions & 0 deletions nssp/delphi_nssp/__main__.py
@@ -0,0 +1,12 @@
# -*- coding: utf-8 -*-
"""Call the function run_module when executed.

This file indicates that calling the module (`python -m delphi_nssp`) will
call the function `run_module` found within the run.py file. There should be
no need to change this template.
"""

from delphi_utils import read_params
from .run import run_module # pragma: no cover

run_module(read_params()) # pragma: no cover
34 changes: 34 additions & 0 deletions nssp/delphi_nssp/constants.py
@@ -0,0 +1,34 @@
"""Registry for variations."""

GEOS = [
    "nation",
    "state",
    "county",
]

SIGNALS_MAP = {
    "percent_visits_covid": "pct_visits_covid",
    "percent_visits_influenza": "pct_visits_influenza",
    "percent_visits_rsv": "pct_visits_rsv",
    "percent_visits_combined": "pct_visits_combined",
    "percent_visits_smoothed_covid": "smoothed_pct_visits_covid",
    "percent_visits_smoothed_1": "smoothed_pct_visits_influenza",
    "percent_visits_smoothed_rsv": "smoothed_pct_visits_rsv",
    "percent_visits_smoothed": "smoothed_pct_visits_combined",
}

SIGNALS = ["pct_visits_covid", "pct_visits_influenza", "pct_visits_rsv", "pct_visits_combined",
           "smoothed_pct_visits_covid", "smoothed_pct_visits_influenza",
           "smoothed_pct_visits_rsv", "smoothed_pct_visits_combined"]

NEWLINE = "\n"

CSV_COLS = [
    "geo_id",
    "val",
    "se",
    "sample_size",
    "missing_val",
    "missing_se",
    "missing_sample_size"
]
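
To illustrate how `SIGNALS_MAP` is meant to be used, here is a small standard-library sketch that renames the source column names in one record to the standardized signal names. The record values are made up and `rename_keys` is a hypothetical helper; the pipeline itself does this with `df_ervisits.rename(columns=SIGNALS_MAP)` in `pull.py`.

```python
# Two entries copied from SIGNALS_MAP above, to keep the sketch short.
SIGNALS_MAP = {
    "percent_visits_covid": "pct_visits_covid",
    "percent_visits_smoothed_1": "smoothed_pct_visits_influenza",
}


def rename_keys(record: dict, mapping: dict) -> dict:
    """Return a copy of record with keys renamed; unmapped keys are kept as-is."""
    return {mapping.get(key, key): value for key, value in record.items()}


# Made-up record in the shape of one row from the source dataset.
row = {"week_end": "2024-04-20", "percent_visits_covid": "0.6", "percent_visits_smoothed_1": "0.9"}
renamed = rename_keys(row, SIGNALS_MAP)
```

The opaque source name `percent_visits_smoothed_1` is the reason an explicit mapping is useful here: the standardized name records that it is the smoothed influenza series.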
84 changes: 84 additions & 0 deletions nssp/delphi_nssp/pull.py
@@ -0,0 +1,84 @@
# -*- coding: utf-8 -*-
"""Functions for pulling NSSP ER data."""

import numpy as np
import pandas as pd
from sodapy import Socrata

from .constants import (
SIGNALS,
NEWLINE,
SIGNALS_MAP,
)


def construct_typedicts():
    """Create the type conversion dictionary for dataframe."""
    # basic type conversion
    type_dict = {key: float for key in SIGNALS}
    type_dict["timestamp"] = "datetime64[ns]"
    type_dict["geography"] = str
    type_dict["county"] = str
    type_dict["fips"] = int
    return type_dict


def warn_string(df, type_dict):
    """Format the warning string."""
    return f"""
Expected column(s) missing. The dataset schema may
have changed. Please investigate and amend the code.

Columns needed:
{NEWLINE.join(sorted(type_dict.keys()))}

Columns available:
{NEWLINE.join(sorted(df.columns))}
"""


def pull_nssp_data(socrata_token: str):
    """Pull the latest NSSP ER visits data and conform it into a dataset.

    The output dataset has:

    - Each row corresponds to a single observation
    - Each row additionally has columns for the signals in SIGNALS

    Parameters
    ----------
    socrata_token: str
        My App Token for pulling the NSSP data (could be the same as the nchs data)

    Returns
    -------
    pd.DataFrame
        Dataframe as described above.
    """
    type_dict = construct_typedicts()

    # Pull data from Socrata API
    client = Socrata("data.cdc.gov", socrata_token)
    results = []
    offset = 0
    limit = 50000  # maximum limit allowed by SODA 2.0
    while True:
        page = client.get("rdmq-nq56", limit=limit, offset=offset)
        if not page:
            break  # exit the loop if no more results
        results.extend(page)
        offset += limit
Comment on lines +54 to +60
Contributor

Suggested change
-    limit = 50000  # maximum limit allowed by SODA 2.0
-    while True:
-        page = client.get("rdmq-nq56", limit=limit, offset=offset)
-        if not page:
-            break  # exit the loop if no more results
-        results.extend(page)
-        offset += limit
+    limit = 50_000
+    for ii in range(100):
+        page = client.get("rdmq-nq56", limit=limit, offset=offset)
+        if not page:
+            max_ii = ii
+            break  # exit the loop if no more results
+        results.extend(page)
+        offset += limit
+    if max_ii == 100:
+        raise ValueError("client has pulled 100x the socrata limit")

This is probably fine, but while true freaks me out. Feel free to use or not

Contributor


@dsweber2 why did you choose 100 here for the limit in your rewrite? I believe 5k is the per-page item limit, not the total limit. So theoretically we could get an infinite-page result.

Contributor


100 was maybe too low, though it would correspond to 50,000,000 items. If we're pulling more than that it should be quite a while down the road, or something has gone wrong. (or I may be misunderstanding how item counts work).
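
One way to reconcile the two concerns in this thread is a bounded loop with a `for`/`else` clause, which raises only when the page cap is reached without ever seeing an empty page. This is a sketch, not the code that was merged: `fetch_page` is a hypothetical stand-in for `client.get("rdmq-nq56", limit=limit, offset=offset)`, and the `max_pages` default is arbitrary. Note that in the suggestion as written, `max_ii` is assigned only on `break`, so the final check would hit an unbound name if the cap were actually reached, and would never equal 100 otherwise.

```python
from typing import Callable, List


def pull_all_pages(fetch_page: Callable[[int, int], List[dict]],
                   limit: int = 50_000, max_pages: int = 100) -> List[dict]:
    """Collect pages until an empty one is returned, up to max_pages requests."""
    results: List[dict] = []
    for page_number in range(max_pages):
        page = fetch_page(limit, page_number * limit)
        if not page:
            break  # no more results
        results.extend(page)
    else:
        # Runs only if the loop finished without hitting `break`.
        raise ValueError(f"no empty page after {max_pages} requests of {limit} rows")
    return results


# Hypothetical in-memory data source standing in for the Socrata client.
rows = [{"id": i} for i in range(5)]


def fake_fetch(limit: int, offset: int) -> List[dict]:
    """Return one slice of `rows`, mimicking limit/offset paging."""
    return rows[offset : offset + limit]


all_rows = pull_all_pages(fake_fetch, limit=2)
```

The cap keeps the loop from running forever if the API misbehaves, while the `else` branch keeps a legitimate large pull from being silently truncated.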

    df_ervisits = pd.DataFrame.from_records(results)
    print(df_ervisits.columns)
    df_ervisits = df_ervisits.rename(columns={"week_end": "timestamp"})
    df_ervisits = df_ervisits.rename(columns=SIGNALS_MAP)

    try:
        df_ervisits = df_ervisits.astype(type_dict)
    except KeyError as exc:
        raise ValueError(warn_string(df_ervisits, type_dict)) from exc

    keep_columns = ["timestamp", "geography", "county", "fips"]
    return df_ervisits[SIGNALS + keep_columns]
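
The `try`/`except KeyError` above relies on `astype` raising when a column named in `type_dict` is absent from the dataframe. An equivalent explicit check could look like the sketch below; `missing_columns` is a hypothetical helper, and the column lists are a shortened, illustrative subset rather than the full schema.

```python
from typing import Iterable, List


def missing_columns(available: Iterable[str], needed: Iterable[str]) -> List[str]:
    """Return the needed column names that are not available, sorted."""
    return sorted(set(needed) - set(available))


# Illustrative subset of the columns named in type_dict.
needed = ["timestamp", "geography", "county", "fips", "pct_visits_covid"]
# Column names as they might look if the week_end rename had been skipped.
available = ["week_end", "geography", "county", "fips", "pct_visits_covid"]

missing = missing_columns(available, needed)
if missing:
    message = "Expected column(s) missing: " + ", ".join(missing)
```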