Commit 30b90fb

Merge pull request #966 from cmu-delphi/main
Deploy validation params to production
2 parents: cdf6995 + 67a10e0

40 files changed (+694, -4463 lines)

.github/CONTRIBUTING.md

Lines changed: 9 additions & 21 deletions

@@ -4,11 +4,11 @@
 
 * `main`
 
-  The primary/authoritative branch of this repository is called `main`, and contains up-to-date code and supporting libraries. This should be your starting point when creating a new indicator. It is protected so that only reviewed pull requests can be merged in.
+  The primary branch of this repository is called `main`, and contains the version of the code and supporting libraries currently under development. This should be your starting point when creating a new indicator. It is protected so that only reviewed pull requests can be merged in. The main branch is configured to deploy to our staging environment on push. CI is set up to build and test all indicators on PR.
 
-* `deploy-*`
+* `prod`
 
-  Each automated pipeline has a corresponding branch which automatically deploys to a runtime host which runs the pipeline at a designated time each day. New features and bugfixes are merged into this branch using a pull request, so that our CI system can run the lint and test cycles and make sure the package will run correctly on the runtime host. If an indicator does not have a branch named after it starting with `deploy-`, that means the indicator has not yet been automated, and has a designated human keeper who is responsible for making sure the indicator runs each day -- whether that is manually or using a scheduler like cron is the keeper's choice.
+  The production branch is configured to automatically deploy to our production environment on push, and is protected so that only administrators can push or merge. CI is set up to build and test all indicators on PR.
 
 * everything else
 
@@ -22,15 +22,6 @@ If you ensure that each issue deals with a single topic (ie a single new propose
 
 Admins will assign issues to one or more people based on balancing expediency, expertise, and team robustness. It may be faster for one person to fix something, but we can reduce the risk of having too many single points of failure if two people work on it together.
 
-## Project Boards
-
-The Delphi Engineering team uses project boards to structure its weekly calls and track active tasks.
-
-Immediate work is tracked on [Release Planning](https://github.com/cmu-delphi/covidcast-indicators/projects/2)
-
-Long-term work and modeling collaborations are tracked on [Refactoring](https://github.com/cmu-delphi/covidcast-indicators/projects/3)
-
-
 ## General workflow for indicators creation and deployment
 
 So, how does one go about developing a pipeline for a new data source?
@@ -40,13 +31,11 @@ So, how does one go about developing a pipeline for a new data source?
 1. Create your new indicator branch from `main`.
 2. Build it using the appropriate template, following the guidelines in the included README.md and REVIEW.md files.
 3. Make some stuff!
-4. When your stuff works, push your `dev-*` branch to remote for review.
-5. Consult with a platform engineer for the remaining production setup needs. They will create a branch called `deploy-*` for your indicator.
-6. Initiate a pull request against this new branch.
-7. Following [the source documentation template](https://github.com/cmu-delphi/delphi-epidata/blob/main/docs/api/covidcast-signals/_source-template.md), create public API documentation for the source. You can submit this as a pull request against the delphi-epidata repository.
-8. If your peers like the code, the documentation is ready, and Jenkins approves, deploy your changes by merging the PR.
-9. An admin will propagate your successful changes to `main`.
-10. Rejoice!
+4. When your stuff works, push your development branch to remote, and open a PR against `main` for review.
+5. Once your PR has been merged, consult with a platform engineer for the remaining production setup needs. They will create a deployment workflow for your indicator including any necessary production parameters. Production secrets are encrypted in the Ansible vault. This workflow will be tested in staging by admins, who will consult you about any problems they encounter.
+6. Following [the source documentation template](https://github.com/cmu-delphi/delphi-epidata/blob/main/docs/api/covidcast-signals/_source-template.md), create public API documentation for the source. You can submit this as a pull request against the delphi-epidata repository.
+7. If your peers like the code, the documentation is ready, and the staging runs are successful, work with admins to schedule your indicator in production, merge the documentation, and announce the new indicator to the mailing list.
+8. Rejoice!
 
 
 ### Starting out
 
@@ -86,12 +75,11 @@ becomes available to the public.
 
 Once you have your branch set up you should get in touch with a platform engineer to pair up on the remaining production needs. These include:
 
-- Creating the corresponding `deploy-*` branch in the repo.
 - Adding the necessary Jenkins scripts for your indicator.
 - Preparing the runtime host with any Automation configuration necessities.
 - Reviewing the workflow to make sure it meets the general guidelines and will run as expected on the runtime host.
 
-Once all the last mile configuration is in place you can create a pull request against the correct `deploy-*` branch to initiate the CI/CD pipeline which will build, test, and package your indicator for deployment.
+Once all the last mile configuration is in place you can create a pull request against `prod` to initiate the CI/CD pipeline which will build, test, and package your indicator for deployment.
 
 If everything looks ok, you've drafted source documentation, platform engineering has validated the last mile, and the pull request is accepted, you can merge the PR. Deployment will start automatically.

.github/workflows/python-ci.yml

Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ jobs:
     if: github.event.pull_request.draft == false
     strategy:
       matrix:
-        packages: [_delphi_utils_python, cdc_covidnet, changehc, claims_hosp, combo_cases_and_deaths, covid_act_now, doctor_visits, google_symptoms, hhs_hosp, hhs_facilities, jhu, nchs_mortality, nowcast, quidel, quidel_covidtest, safegraph, safegraph_patterns, usafacts]
+        packages: [_delphi_utils_python, changehc, claims_hosp, combo_cases_and_deaths, covid_act_now, doctor_visits, google_symptoms, hhs_hosp, hhs_facilities, jhu, nchs_mortality, nowcast, quidel, quidel_covidtest, safegraph, safegraph_patterns, usafacts]
     defaults:
       run:
         working-directory: ${{ matrix.packages }}

_delphi_utils_python/delphi_utils/validator/README.md

Lines changed: 4 additions & 3 deletions

@@ -53,18 +53,19 @@ Please update the follow settings:
 
 * `common`: global validation settings
   * `data_source`: should match the [formatting](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html) as used in COVIDcast API calls
-  * `end_date`: specifies the last date to be checked; this can be specified as `YYYY-MM-DD` or as `today-{num}`. The latter is interpretted as `num` days before the current date (with `today-0` being today).
+  * `end_date`: specifies the last date to be checked; this can be specified as `YYYY-MM-DD`, `today`, or `today-{num}`. The latter is interpretted as `num` days before the current date.
   * `span_length`: specifies the number of days before the `end_date` to check. `span_length` should be long enough to contain all recent source data that is still in the process of being updated (i.e. in the backfill period), for example, if the data source of interest has a 2-week lag before all reports are in for a given date, `span_length` should be 14 days
   * `suppressed_errors`: list of objects specifying errors that have been manually verified as false positives or acceptable deviations from expected. These errors can be specified with the following variables, where omitted values are interpreted as a wildcard, i.e., not specifying a date applies to all dates:
-    * `check_name` (required): name of the check, as specified in the validation output
+    * `check_name`: name of the check, as specified in the validation output
     * `date`: date in `YYYY-MM-DD` format
     * `geo_type`: geo resolution of the data
     * `signal`: name of COVIDcast API signal
   * `test_mode`: boolean; `true` checks only a small number of data files
 * `static`: settings for validations that don't require comparison with external COVIDcast API data
   * `minimum_sample_size` (default: 100): threshold for flagging small sample sizes as invalid
   * `missing_se_allowed` (default: False): whether signals with missing standard errors are valid
-  * `misisng_sample_size_allowed` (default: False): whether signals with missing sample sizes are valid
+  * `missing_sample_size_allowed` (default: False): whether signals with missing sample sizes are valid
+  * `additional_valid_geo_values` (default: `{}`): map of geo type names to lists of geo values that are not recorded in the GeoMapper but are nonetheless valid for this indicator
 * `dynamic`: settings for validations that require comparison with external COVIDcast API data
   * `ref_window_size` (default: 7): number of days over which to look back for comparison
   * `smoothed_signals`: list of the names of the signals that are smoothed (e.g. 7-day average)
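
To make the settings above concrete, here is a hedged sketch of a validator params structure written as a Python dict (the params.json.template diff later in this commit shows the same `static` block); the data source, dates, and check/signal names are illustrative placeholders, not taken from a real indicator.

```python
# Illustrative params sketch matching the README settings above.
# Source, signal, and date values are hypothetical.
params = {
    "common": {
        "data_source": "example-source",
        "end_date": "today-1",        # YYYY-MM-DD, "today", or "today-{num}"
        "span_length": 14,            # long enough to cover the backfill period
        "suppressed_errors": [
            # Omitted fields act as wildcards: this entry suppresses every
            # check for the given signal on the given date.
            {"date": "2021-03-01", "signal": "example_signal"}
        ],
        "test_mode": False
    },
    "static": {
        "minimum_sample_size": 100,
        "missing_se_allowed": False,
        "missing_sample_size_allowed": False,
        "additional_valid_geo_values": {"state": ["xyz"]}
    },
    "dynamic": {
        "ref_window_size": 7,
        "smoothed_signals": ["example_signal_7dav"]
    }
}
```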

_delphi_utils_python/delphi_utils/validator/dynamic.py

Lines changed: 17 additions & 21 deletions

@@ -119,27 +119,23 @@ def validate(self, all_frames, report):
                 continue
 
             # Outlier dataframe
-            if (signal_type in ["confirmed_7dav_cumulative_num", "confirmed_7dav_incidence_num",
-                                "confirmed_cumulative_num", "confirmed_incidence_num",
-                                "deaths_7dav_cumulative_num",
-                                "deaths_cumulative_num"]):
-                earliest_available_date = geo_sig_df["time_value"].min()
-                source_df = geo_sig_df.query(
-                    'time_value <= @self.params.time_window.end_date & '
-                    'time_value >= @self.params.time_window.start_date'
-                )
-
-                # These variables are interpolated into the call to `api_df_or_error.query()`
-                # below but pylint doesn't recognize that.
-                # pylint: disable=unused-variable
-                outlier_start_date = earliest_available_date - outlier_lookbehind
-                outlier_end_date = earliest_available_date - timedelta(days=1)
-                outlier_api_df = api_df_or_error.query(
-                    'time_value <= @outlier_end_date & time_value >= @outlier_start_date')
-                # pylint: enable=unused-variable
-
-                self.check_positive_negative_spikes(
-                    source_df, outlier_api_df, geo_type, signal_type, report)
+            earliest_available_date = geo_sig_df["time_value"].min()
+            source_df = geo_sig_df.query(
+                'time_value <= @self.params.time_window.end_date & '
+                'time_value >= @self.params.time_window.start_date'
+            )
+
+            # These variables are interpolated into the call to `api_df_or_error.query()`
+            # below but pylint doesn't recognize that.
+            # pylint: disable=unused-variable
+            outlier_start_date = earliest_available_date - outlier_lookbehind
+            outlier_end_date = earliest_available_date - timedelta(days=1)
+            outlier_api_df = api_df_or_error.query(
+                'time_value <= @outlier_end_date & time_value >= @outlier_start_date')
+            # pylint: enable=unused-variable
+
+            self.check_positive_negative_spikes(
+                source_df, outlier_api_df, geo_type, signal_type, report)
 
             # Check data from a group of dates against recent (previous 7 days,
             # by default) data from the API.
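
The block above leans on pandas' `DataFrame.query`, where local variables referenced with `@` are interpolated into the expression string, which is also why pylint flags them as unused. A minimal standalone sketch of that pattern, with made-up data:

```python
import pandas as pd

# Made-up frame standing in for geo_sig_df / api_df_or_error.
df = pd.DataFrame({
    "time_value": pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-10"]),
    "val": [1.0, 2.0, 3.0],
})

# Locals referenced via `@` inside the query string; pylint cannot see
# these uses, hence the disable/enable pragmas in the diff above.
outlier_end_date = pd.Timestamp("2021-01-06")
outlier_start_date = outlier_end_date - pd.Timedelta(days=3)
subset = df.query("time_value <= @outlier_end_date & time_value >= @outlier_start_date")
print(subset)  # keeps only the 2021-01-05 row
```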

_delphi_utils_python/delphi_utils/validator/errors.py

Lines changed: 4 additions & 3 deletions

@@ -23,7 +23,7 @@ class ValidationFailure:
     """Structured report of single validation failure."""
 
     def __init__(self,
-                 check_name: str,
+                 check_name: Optional[str]=None,
                  date: Optional[Union[str, dt.date]]=None,
                  geo_type: Optional[str]=None,
                  signal: Optional[str]=None,
@@ -33,8 +33,9 @@ def __init__(self,
 
         Parameters
         ----------
-        check_name: str
-            Name of check at which the failure happened.
+        check_name: Optional[str]
+            Name of check at which the failure happened. A value of `None` is used to express all
+            possible checks with a given `date`, `geo_type`, and/or `signal`.
         date: Optional[Union[str, dt.date]]
            Date corresponding to the data over which the failure happened.
            Strings are interpretted in ISO format ("YYYY-MM-DD").
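
Since `check_name` may now be `None`, a suppressed-errors entry can match a whole family of failures. Below is a hedged sketch of the matching idea only, not the library's actual implementation; the `Failure` class and `check_val_lt_0` name are simplified stand-ins.

```python
from dataclasses import dataclass, astuple
from typing import Optional

@dataclass
class Failure:
    """Simplified stand-in for ValidationFailure."""
    check_name: Optional[str] = None
    date: Optional[str] = None
    geo_type: Optional[str] = None
    signal: Optional[str] = None

def is_suppressed(failure: Failure, suppressed: Failure) -> bool:
    """Treat None fields on the suppressed entry as wildcards."""
    return all(s is None or s == f
               for s, f in zip(astuple(suppressed), astuple(failure)))

# Suppress all checks for one signal on one date:
entry = Failure(date="2021-03-01", signal="confirmed_incidence_num")
hit = Failure("check_val_lt_0", "2021-03-01", "county", "confirmed_incidence_num")
print(is_suppressed(hit, entry))  # True
```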

_delphi_utils_python/delphi_utils/validator/params.json.template

Lines changed: 3 additions & 1 deletion

@@ -18,7 +18,9 @@
         "minimum_sample_size": 100,
         "missing_sample_size_allowed": true,
         "missing_se_allowed": true,
-        "validator_static_file_dir": "../validator/static"
+        "additional_valid_geo_values": {
+            "state": ["xyz"]
+        }
     },
     "dynamic": {
         "expected_lag": {

_delphi_utils_python/delphi_utils/validator/static.py

Lines changed: 23 additions & 9 deletions

@@ -1,12 +1,13 @@
 """Static file checks."""
-from os.path import join
 import re
 from datetime import datetime
 from dataclasses import dataclass
+from typing import Dict, List
 import pandas as pd
 from .datafetcher import FILENAME_REGEX
 from .errors import ValidationFailure
 from .utils import GEO_REGEX_DICT, TimeWindow
+from ..geomap import GeoMapper
 
 class StaticValidator:
     """Class for validation of static properties of individual datasets."""
@@ -15,8 +16,6 @@ class StaticValidator:
     class Parameters:
         """Configuration parameters."""
 
-        # Place to find the data files
-        validator_static_file_dir: str
         # Span of time over which to perform checks
         time_window: TimeWindow
         # Threshold for reporting small sample sizes
@@ -25,6 +24,8 @@ class Parameters:
         missing_se_allowed: bool
         # Whether to report missing sample sizes
         missing_sample_size_allowed: bool
+        # Valid geo values not found in the GeoMapper
+        additional_valid_geo_values: Dict[str, List[str]]
 
     def __init__(self, params):
         """
@@ -37,13 +38,12 @@ def __init__(self, params):
         static_params = params.get("static", dict())
 
         self.params = self.Parameters(
-            validator_static_file_dir = static_params.get('validator_static_file_dir',
-                                                          '../validator/static'),
             time_window = TimeWindow.from_params(common_params["end_date"],
                                                  common_params["span_length"]),
             minimum_sample_size = static_params.get('minimum_sample_size', 100),
            missing_se_allowed = static_params.get('missing_se_allowed', False),
-            missing_sample_size_allowed = static_params.get('missing_sample_size_allowed', False)
+            missing_sample_size_allowed = static_params.get('missing_sample_size_allowed', False),
+            additional_valid_geo_values = static_params.get('additional_valid_geo_values', {})
         )
 
 
@@ -134,6 +134,22 @@ def check_df_format(self, df_to_test, nameformat, report):
 
         report.increment_total_checks()
 
+    def _get_valid_geo_values(self, geo_type):
+        # geomapper uses slightly different naming conventions for geo_types
+        if geo_type == "state":
+            geomap_type = "state_id"
+        elif geo_type == "county":
+            geomap_type = "fips"
+        else:
+            geomap_type = geo_type
+
+        gmpr = GeoMapper()
+        valid_geos = gmpr.get_geo_values(geomap_type)
+        valid_geos |= set(self.params.additional_valid_geo_values.get(geo_type, []))
+        if geo_type == "county":
+            valid_geos |= set(x + "000" for x in gmpr.get_geo_values("state_code"))
+        return valid_geos
+
     def check_bad_geo_id_value(self, df_to_test, filename, geo_type, report):
         """
         Check for bad geo_id values, by comparing to a list of known historical values.
@@ -143,9 +159,7 @@ def check_bad_geo_id_value(self, df_to_test, filename, geo_type, report):
         - geo_type: string from CSV name specifying geo type (state, county, msa, etc.) of data
         - report: ValidationReport; report where results are added
         """
-        file_path = join(self.params.validator_static_file_dir, geo_type + '_geo.csv')
-        valid_geo_df = pd.read_csv(file_path, dtype={'geo_id': str})
-        valid_geos = valid_geo_df['geo_id'].values
+        valid_geos = self._get_valid_geo_values(geo_type)
         unexpected_geos = [geo for geo in df_to_test['geo_id']
                            if geo.lower() not in valid_geos]
         if len(unexpected_geos) > 0:
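
A hedged usage sketch of the new geo validation flow, with made-up values: known geos come from `GeoMapper` (the diff itself calls `get_geo_values` and unions its result with a set), get extended by `additional_valid_geo_values`, and anything else in a file's `geo_id` column is flagged. The `"xyz"` and `"zz"` values are illustrative.

```python
import pandas as pd
from delphi_utils.geomap import GeoMapper

gmpr = GeoMapper()
valid_geos = gmpr.get_geo_values("state_id")   # known state ids, e.g. "ak", "al", ...
valid_geos |= {"xyz"}                          # from additional_valid_geo_values["state"]

# Made-up frame standing in for a loaded CSV.
df_to_test = pd.DataFrame({"geo_id": ["ak", "xyz", "zz"]})
unexpected = [g for g in df_to_test["geo_id"] if g.lower() not in valid_geos]
print(unexpected)  # ["zz"] -> would be reported as a check_bad_geo_id_value failure
```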
