cmu-delphi · krivard · Mar 26, 2021 · Feb 5, 2021 · Feb 8, 2021 · Feb 19, 2021
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -4,11 +4,11 @@
 
 * `main`
 
-The primary/authoritative branch of this repository is called `main`, and contains up-to-date code and supporting libraries. This should be your starting point when creating a new indicator. It is protected so that only reviewed pull requests can be merged in.
+The primary branch of this repository is called `main`, and contains the version of the code and supporting libraries currently under development. This should be your starting point when creating a new indicator. It is protected so that only reviewed pull requests can be merged in. The main branch is configured to deploy to our staging environment on push. CI is set up to build and test all indicators on PR.
 
-* `deploy-*`
+* `prod`
 
-Each automated pipeline has a corresponding branch which automatically deploys to a runtime host which runs the pipeline at a designated time each day. New features and bugfixes are merged into this branch using a pull request, so that our CI system can run the lint and test cycles and make sure the package will run correctly on the runtime host. If an indicator does not have a branch named after it starting with `deploy-`, that means the indicator has not yet been automated, and has a designated human keeper who is responsible for making sure the indicator runs each day -- whether that is manually or using a scheduler like cron is the keeper's choice.
+The production branch is configured to automatically deploy to our production environment on push, and is protected so that only administrators can push or merge. CI is set up to build and test all indicators on PR.
 
 * everything else
 
@@ -22,15 +22,6 @@ If you ensure that each issue deals with a single topic (ie a single new propose
 
 Admins will assign issues to one or more people based on balancing expediency, expertise, and team robustness. It may be faster for one person to fix something, but we can reduce the risk of having too many single points of failure if two people work on it together.
 
-## Project Boards
-
-The Delphi Engineering team uses project boards to structure its weekly calls and track active tasks.
-
-Immediate work is tracked on [Release Planning](https://github.com/cmu-delphi/covidcast-indicators/projects/2)
-
-Long-term work and modeling collaborations are tracked on [Refactoring](https://github.com/cmu-delphi/covidcast-indicators/projects/3)
-
-
 ## General workflow for indicators creation and deployment
 
 So, how does one go about developing a pipeline for a new data source?
@@ -40,13 +31,11 @@ So, how does one go about developing a pipeline for a new data source?
 1. Create your new indicator branch from `main`.
 2. Build it using the appropriate template, following the guidelines in the included README.md and REVIEW.md files.
 3. Make some stuff!
-4. When your stuff works, push your `dev-*` branch to remote for review.
-5. Consult with a platform engineer for the remaining production setup needs. They will create a branch called `deploy-*` for your indicator.
-6. Initiate a pull request against this new branch.
-7. Following [the source documentation template](https://github.com/cmu-delphi/delphi-epidata/blob/main/docs/api/covidcast-signals/_source-template.md), create public API documentation for the source. You can submit this as a pull request against the delphi-epidata repository.
-8. If your peers like the code, the documentation is ready, and Jenkins approves, deploy your changes by merging the PR.
-9. An admin will propagate your successful changes to `main`.
-10. Rejoice!
+4. When your stuff works, push your development branch to remote, and open a PR against `main` for review.
+5. Once your PR has been merged, consult with a platform engineer for the remaining production setup needs. They will create a deployment workflow for your indicator including any necessary production parameters. Production secrets are encrypted in the Ansible vault. This workflow will be tested in staging by admins, who will consult you about any problems they encounter.
+6. Following [the source documentation template](https://github.com/cmu-delphi/delphi-epidata/blob/main/docs/api/covidcast-signals/_source-template.md), create public API documentation for the source. You can submit this as a pull request against the delphi-epidata repository.
+7. If your peers like the code, the documentation is ready, and the staging runs are successful, work with admins to schedule your indicator in production, merge the documentation, and announce the new indicator to the mailing list.
+8. Rejoice!
 
 ### Starting out
 
@@ -86,12 +75,11 @@ becomes available to the public.
 
 Once you have your branch set up you should get in touch with a platform engineer to pair up on the remaining production needs. These include:
 
-- Creating the corresponding `deploy-*` branch in the repo.
 - Adding the necessary Jenkins scripts for your indicator.
 - Preparing the runtime host with any Automation configuration necessities.
 - Reviewing the workflow to make sure it meets the general guidelines and will run as expected on the runtime host.
 
-Once all the last mile configuration is in place you can create a pull request against the correct `deploy-*` branch to initiate the CI/CD pipeline which will build, test, and package your indicator for deployment.
+Once all the last mile configuration is in place you can create a pull request against `prod` to initiate the CI/CD pipeline which will build, test, and package your indicator for deployment.
 
 If everything looks ok, you've drafted source documentation, platform engineering has validated the last mile, and the pull request is accepted, you can merge the PR. Deployment will start automatically.
 

diff --git a/.github/workflows/python-ci.yml b/.github/workflows/python-ci.yml
@@ -16,7 +16,7 @@ jobs:
     if: github.event.pull_request.draft == false
     strategy:
       matrix:
-        packages: [_delphi_utils_python, cdc_covidnet, changehc, claims_hosp, combo_cases_and_deaths, covid_act_now, doctor_visits, google_symptoms, hhs_hosp, hhs_facilities, jhu, nchs_mortality, nowcast, quidel, quidel_covidtest, safegraph, safegraph_patterns, usafacts]
+        packages: [_delphi_utils_python, changehc, claims_hosp, combo_cases_and_deaths, covid_act_now, doctor_visits, google_symptoms, hhs_hosp, hhs_facilities, jhu, nchs_mortality, nowcast, quidel, quidel_covidtest, safegraph, safegraph_patterns, usafacts]
     defaults:
       run:
         working-directory: ${{ matrix.packages }}

diff --git a/_delphi_utils_python/delphi_utils/validator/README.md b/_delphi_utils_python/delphi_utils/validator/README.md
@@ -53,18 +53,19 @@ Please update the follow settings:
 
 * `common`: global validation settings
    * `data_source`: should match the [formatting](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html) as used in COVIDcast API calls
-   * `end_date`: specifies the last date to be checked; this can be specified as `YYYY-MM-DD` or as `today-{num}`.  The latter is interpretted as `num` days before the current date (with `today-0` being today).
+   * `end_date`: specifies the last date to be checked; this can be specified as `YYYY-MM-DD`, `today`, or `today-{num}`.  The last form is interpreted as `num` days before the current date.
    * `span_length`: specifies the number of days before the `end_date` to check. `span_length` should be long enough to contain all recent source data that is still in the process of being updated (i.e. in the backfill period), for example, if the data source of interest has a 2-week lag before all reports are in for a given date, `span_length` should be 14 days
    * `suppressed_errors`: list of objects specifying errors that have been manually verified as false positives or acceptable deviations from expected.  These errors can be specified with the following variables, where omitted values are interpreted as a wildcard, i.e., not specifying a date applies to all dates:
-       * `check_name` (required):  name of the check, as specified in the validation output
+       * `check_name`:  name of the check, as specified in the validation output
        * `date`:  date in `YYYY-MM-DD` format
        * `geo_type`:  geo resolution of the data
        * `signal`:  name of COVIDcast API signal
    * `test_mode`: boolean; `true` checks only a small number of data files
 * `static`: settings for validations that don't require comparison with external COVIDcast API data
    * `minimum_sample_size` (default: 100): threshold for flagging small sample sizes as invalid
    * `missing_se_allowed` (default: False): whether signals with missing standard errors are valid
-   * `misisng_sample_size_allowed` (default: False): whether signals with missing sample sizes are valid
+   * `missing_sample_size_allowed` (default: False): whether signals with missing sample sizes are valid
+   * `additional_valid_geo_values` (default: `{}`): map of geo type names to lists of geo values that are not recorded in the GeoMapper but are nonetheless valid for this indicator
 * `dynamic`: settings for validations that require comparison with external COVIDcast API data
    * `ref_window_size` (default: 7): number of days over which to look back for comparison 
    * `smoothed_signals`: list of the names of the signals that are smoothed (e.g. 7-day average)
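
Reviewer note: the three accepted `end_date` forms described above can be illustrated with a small sketch. This is not the validator's actual parsing code; `parse_end_date` and its `today` argument are hypothetical names introduced only for illustration.

```python
import re
from datetime import date, timedelta


def parse_end_date(value, today=None):
    """Resolve an `end_date` setting to a concrete date (hypothetical helper)."""
    # Allow the caller to pin "today" so the behaviour is reproducible.
    today = today or date.today()
    if value == "today":
        return today
    # `today-{num}` means `num` days before the current date.
    match = re.fullmatch(r"today-(\d+)", value)
    if match:
        return today - timedelta(days=int(match.group(1)))
    # Anything else is expected to be an ISO `YYYY-MM-DD` string.
    return date.fromisoformat(value)
```

With `today` pinned to 2021-03-26, `"today-3"` resolves to 2021-03-23, and an explicit `"2021-02-05"` is returned unchanged.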

diff --git a/_delphi_utils_python/delphi_utils/validator/dynamic.py b/_delphi_utils_python/delphi_utils/validator/dynamic.py
@@ -119,27 +119,23 @@ def validate(self, all_frames, report):
                 continue
 
             # Outlier dataframe
-            if (signal_type in ["confirmed_7dav_cumulative_num", "confirmed_7dav_incidence_num",
-                                "confirmed_cumulative_num", "confirmed_incidence_num",
-                                "deaths_7dav_cumulative_num",
-                                "deaths_cumulative_num"]):
-                earliest_available_date = geo_sig_df["time_value"].min()
-                source_df = geo_sig_df.query(
-                    'time_value <= @self.params.time_window.end_date & '
-                    'time_value >= @self.params.time_window.start_date'
-                )
-
-                # These variables are interpolated into the call to `api_df_or_error.query()`
-                # below but pylint doesn't recognize that.
-                # pylint: disable=unused-variable
-                outlier_start_date = earliest_available_date - outlier_lookbehind
-                outlier_end_date = earliest_available_date - timedelta(days=1)
-                outlier_api_df = api_df_or_error.query(
-                    'time_value <= @outlier_end_date & time_value >= @outlier_start_date')
-                # pylint: enable=unused-variable
-
-                self.check_positive_negative_spikes(
-                    source_df, outlier_api_df, geo_type, signal_type, report)
+            earliest_available_date = geo_sig_df["time_value"].min()
+            source_df = geo_sig_df.query(
+                'time_value <= @self.params.time_window.end_date & '
+                'time_value >= @self.params.time_window.start_date'
+            )
+
+            # These variables are interpolated into the call to `api_df_or_error.query()`
+            # below but pylint doesn't recognize that.
+            # pylint: disable=unused-variable
+            outlier_start_date = earliest_available_date - outlier_lookbehind
+            outlier_end_date = earliest_available_date - timedelta(days=1)
+            outlier_api_df = api_df_or_error.query(
+                'time_value <= @outlier_end_date & time_value >= @outlier_start_date')
+            # pylint: enable=unused-variable
+
+            self.check_positive_negative_spikes(
+                source_df, outlier_api_df, geo_type, signal_type, report)
 
             # Check data from a group of dates against recent (previous 7 days,
             # by default) data from the API.
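
Reviewer note: with the signal-name filter removed, the spike check now runs for every signal, comparing the source data against an API window that ends the day before the earliest available source date. A minimal sketch of that window arithmetic follows; `outlier_window` is a hypothetical helper, and the 14-day default is an assumed value, since `outlier_lookbehind` is defined outside this hunk.

```python
from datetime import date, timedelta


def outlier_window(earliest_available_date, lookbehind=timedelta(days=14)):
    """Return (start, end) of the reference window preceding the source data.

    The 14-day `lookbehind` default is an assumption made for this sketch.
    """
    start = earliest_available_date - lookbehind
    # The window stops one day before the first date present in the source data.
    end = earliest_available_date - timedelta(days=1)
    return start, end
```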

diff --git a/_delphi_utils_python/delphi_utils/validator/errors.py b/_delphi_utils_python/delphi_utils/validator/errors.py
@@ -23,7 +23,7 @@ class ValidationFailure:
     """Structured report of single validation failure."""
 
     def __init__(self,
-                 check_name: str,
+                 check_name: Optional[str]=None,
                  date: Optional[Union[str, dt.date]]=None,
                  geo_type: Optional[str]=None,
                  signal: Optional[str]=None,
@@ -33,8 +33,9 @@ def __init__(self,
 
         Parameters
         ----------
-        check_name: str
-            Name of check at which the failure happened.
+        check_name: Optional[str]
+            Name of check at which the failure happened.  A value of `None` acts as a wildcard,
+            matching every check with the given `date`, `geo_type`, and/or `signal`.
         date: Optional[Union[str, dt.date]]
             Date corresponding to the data over which the failure happened.
             Strings are interpretted in ISO format ("YYYY-MM-DD").
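
Reviewer note: making `check_name` optional means every field of a suppression entry can now act as a wildcard. The sketch below shows one way such matching could work; `FailureKey` and `is_suppressed` are hypothetical names and do not reproduce the comparison logic in `errors.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureKey:
    """Hypothetical stand-in for the identifying fields of a validation failure."""
    check_name: Optional[str] = None
    date: Optional[str] = None
    geo_type: Optional[str] = None
    signal: Optional[str] = None


def is_suppressed(failure, rule):
    """Return True if `rule` matches `failure`, treating None in the rule as a wildcard."""
    pairs = zip(
        (rule.check_name, rule.date, rule.geo_type, rule.signal),
        (failure.check_name, failure.date, failure.geo_type, failure.signal),
    )
    return all(expected is None or expected == actual for expected, actual in pairs)
```

Under this sketch, a rule that sets only `geo_type="state"` suppresses every check on state-level data, whatever the date or signal.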

diff --git a/_delphi_utils_python/delphi_utils/validator/params.json.template b/_delphi_utils_python/delphi_utils/validator/params.json.template
@@ -18,7 +18,9 @@
       "minimum_sample_size": 100,
       "missing_sample_size_allowed": true,
       "missing_se_allowed": true,
-      "validator_static_file_dir": "../validator/static"
+      "additional_valid_geo_values": {
+        "state": ["xyz"]
+      }
     },
     "dynamic": {
       "expected_lag": {

diff --git a/_delphi_utils_python/delphi_utils/validator/static.py b/_delphi_utils_python/delphi_utils/validator/static.py
@@ -1,12 +1,13 @@
 """Static file checks."""
-from os.path import join
 import re
 from datetime import datetime
 from dataclasses import dataclass
+from typing import Dict, List
 import pandas as pd
 from .datafetcher import FILENAME_REGEX
 from .errors import ValidationFailure
 from .utils import GEO_REGEX_DICT, TimeWindow
+from ..geomap import GeoMapper
 
 class StaticValidator:
     """Class for validation of static properties of individual datasets."""
@@ -15,8 +16,6 @@ class StaticValidator:
     class Parameters:
         """Configuration parameters."""
 
-        # Place to find the data files
-        validator_static_file_dir: str
         # Span of time over which to perform checks
         time_window: TimeWindow
         # Threshold for reporting small sample sizes
@@ -25,6 +24,8 @@ class Parameters:
         missing_se_allowed: bool
         # Whether to report missing sample sizes
         missing_sample_size_allowed: bool
+        # Valid geo values not found in the GeoMapper
+        additional_valid_geo_values: Dict[str, List[str]]
 
     def __init__(self, params):
         """
@@ -37,13 +38,12 @@ def __init__(self, params):
         static_params = params.get("static", dict())
 
         self.params = self.Parameters(
-            validator_static_file_dir = static_params.get('validator_static_file_dir',
-                                                             '../validator/static'),
             time_window = TimeWindow.from_params(common_params["end_date"],
                                                  common_params["span_length"]),
             minimum_sample_size = static_params.get('minimum_sample_size', 100),
             missing_se_allowed = static_params.get('missing_se_allowed', False),
-            missing_sample_size_allowed = static_params.get('missing_sample_size_allowed', False)
+            missing_sample_size_allowed = static_params.get('missing_sample_size_allowed', False),
+            additional_valid_geo_values = static_params.get('additional_valid_geo_values', {})
         )
 
 
@@ -134,6 +134,22 @@ def check_df_format(self, df_to_test, nameformat, report):
 
         report.increment_total_checks()
 
+    def _get_valid_geo_values(self, geo_type):
+        # geomapper uses slightly different naming conventions for geo_types
+        if geo_type == "state":
+            geomap_type = "state_id"
+        elif geo_type == "county":
+            geomap_type = "fips"
+        else:
+            geomap_type = geo_type
+
+        gmpr = GeoMapper()
+        valid_geos = gmpr.get_geo_values(geomap_type)
+        valid_geos |= set(self.params.additional_valid_geo_values.get(geo_type, []))
+        if geo_type == "county":
+            valid_geos |= set(x + "000" for x in gmpr.get_geo_values("state_code"))
+        return valid_geos
+
     def check_bad_geo_id_value(self, df_to_test, filename, geo_type, report):
         """
         Check for bad geo_id values, by comparing to a list of known historical values.
@@ -143,9 +159,7 @@ def check_bad_geo_id_value(self, df_to_test, filename, geo_type, report):
             - geo_type: string from CSV name specifying geo type (state, county, msa, etc.) of data
             - report: ValidationReport; report where results are added
         """
-        file_path = join(self.params.validator_static_file_dir, geo_type + '_geo.csv')
-        valid_geo_df = pd.read_csv(file_path, dtype={'geo_id': str})
-        valid_geos = valid_geo_df['geo_id'].values
+        valid_geos = self._get_valid_geo_values(geo_type)
         unexpected_geos = [geo for geo in df_to_test['geo_id']
                            if geo.lower() not in valid_geos]
         if len(unexpected_geos) > 0:
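
Reviewer note: the new `_get_valid_geo_values` builds the valid set from three sources: the GeoMapper values, the configured `additional_valid_geo_values`, and, for counties, each state code with `000` appended. The sketch below mirrors that logic without depending on `GeoMapper`; `KNOWN_GEOS` is a tiny made-up table standing in for `GeoMapper.get_geo_values`, and the function name is hypothetical.

```python
# Made-up sample values standing in for GeoMapper lookups (assumption for this sketch).
KNOWN_GEOS = {
    "state_id": {"pa", "ny"},
    "fips": {"42003", "36061"},
    "state_code": {"42", "36"},
}

# Validator geo types that are named differently in the lookup table.
NAME_MAP = {"state": "state_id", "county": "fips"}


def get_valid_geo_values(geo_type, additional=None):
    """Return the set of geo values accepted for `geo_type`."""
    additional = additional or {}
    valid = set(KNOWN_GEOS[NAME_MAP.get(geo_type, geo_type)])
    # Indicator-specific extras configured under `additional_valid_geo_values`.
    valid |= set(additional.get(geo_type, []))
    if geo_type == "county":
        # Also accept the state code followed by "000" for each state.
        valid |= {code + "000" for code in KNOWN_GEOS["state_code"]}
    return valid
```

With the template's `{"state": ["xyz"]}` setting, `"xyz"` would be accepted as a state value while county validation is unaffected.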