* Negative ‘val’ values
* Out-of-range ‘val’ values (>0 for all signals, <=100 for percents, <=100,000 for proportions)
* Missing ‘se’ values
-* Appropriate ‘se’ values, within a calculated reasonable range
* Stderr != 0
* If signal and stderr are both 0 (seen in Quidel data due to the lack of a Jeffreys correction, [issue 255](https://github.com/cmu-delphi/covidcast-indicators/issues/255#issuecomment-692196541))
* Missing ‘sample_size’ values
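
To make the row-level checks above concrete, here is a minimal pandas sketch of how they could be expressed. It assumes covidcast-style columns (`val`, `se`, `sample_size`) and is an illustration only, not the validator's actual implementation.

```python
# Illustrative only: a minimal version of the row-level checks listed above,
# assuming a covidcast-style data frame with "val", "se", and "sample_size"
# columns. Thresholds mirror the list; the real validator differs in detail.
import pandas as pd

def row_level_failures(df: pd.DataFrame, signal_type: str = "raw") -> dict:
    """Return counts of failing rows per check name (empty dict if clean)."""
    max_val = {"percent": 100, "proportion": 100_000}.get(signal_type, float("inf"))
    counts = {
        "negative_val": int((df["val"] < 0).sum()),
        "val_out_of_range": int((df["val"] > max_val).sum()),
        "missing_se": int(df["se"].isna().sum()),
        "zero_se": int((df["se"] == 0).sum()),
        # signal and stderr both 0, e.g. Quidel without a Jeffreys correction
        "val_and_se_both_zero": int(((df["val"] == 0) & (df["se"] == 0)).sum()),
        "missing_sample_size": int(df["sample_size"].isna().sum()),
    }
    return {name: n for name, n in counts.items() if n > 0}
```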

[...]

## Current features

-* Errors and warnings are summarized in class attribute and printed on exit
-* If any non-suppressed errors are raised, the validation process exits with non-zero status
+* Errors and warnings are summarized in a class attribute and stored in log files (file path to be specified in params)
+* If any non-suppressed errors are raised and dry-run is set to False, the validation process exits with a non-zero status
* Various check settings are controllable via indicator-specific params.json files (see the params sketch after this list)
* User can manually disable specific checks for specific datasets using a field in the params.json file
* User can enable test mode (checks only a small number of data files) using a field in the params.json file
+* User can enable dry-run mode (prevents a system exit with an error and ensures that the success() method returns True) using a field in the params.json file
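
The params.json-driven features above might be wired up roughly as follows. The key names and nesting in this sketch are assumptions for illustration, not the validator's confirmed schema.

```python
# Hypothetical params.json layout for the features above. Key names and
# nesting are illustrative assumptions, not the validator's confirmed schema.
import json

raw = """
{
  "validation": {
    "common": {
      "log_filename": "validation.log",
      "dry_run": true,
      "test_mode": true,
      "suppressed_errors": [["check_bad_se", "county", "some_signal"]]
    }
  }
}
"""

common = json.loads(raw)["validation"]["common"]
dry_run = common.get("dry_run", False)            # report errors but never exit non-zero
test_mode = common.get("test_mode", False)        # only check a small number of data files
suppressed = common.get("suppressed_errors", [])  # checks disabled for specific datasets
print(dry_run, test_mode, suppressed)
```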

## Checks + features wishlist, and problems to think about

[...]

### Larger issues

-* Set up validator to use Sir-complains-a-lot alerting functionality on a signal-by-signal basis (should send alert output as a slack message and "@" a set person), as a stop-gap before the logging server is ready
-  * This is [how Sir-CAL works](https://github.com/benjaminysmith/covidcast-indicators/blob/main/sir_complainsalot/delphi_sir_complainsalot/run.py)
-  * [Example output](https://delphi-org.slack.com/archives/C01E81A3YKF/p1605793508000100)
* Expand the framework to support nchs_mortality, which is provided on a weekly basis and has some differences from the daily data, e.g. filenames use a different format ("weekly_YYYYWW_geotype_signalname.csv")
* Make a backtesting framework so new checks can be run individually on historical indicator data to tune false positives and output verbosity, understand how often errors are raised, etc. Should pull data from the API the first time and save it locally in the `cache` dir.
* Add a DETAILS.md doc with detailed descriptions of what each check does and how. This will be especially important for statistical/anomaly-detection checks.
-* Improve errors and error report
-  * Check if [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, not overly verbose or repetitive
-  * Easier suppression of many errors at once
-  * Maybe store errors as dict of dicts. Keys could be check strings (e.g. "check_bad_se"), then next layer geo type, etc
-  * Nicer formatting for error “report”.
-  * Potentially set `__print__()` method in ValidationError class
-  * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
-* Check for erratic data sources that wrongly report all zeroes
-  * E.g. the error with the Wisconsin data for the 10/26 forecasts
+* Easier-to-read error report (see the error-report sketch after this list)
+  * Potentially define a `__str__()` method in the ValidationError class
+  * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each one individually
+  * Consider adding summary counts of each type of error, rather than just a combined number
+* Check for data sources that wrongly report all zeroes
+  * E.g. the error with the Wisconsin data for the 10/26/2020 forecasts
  * Wary of a purely static check for this
-  * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
-  * This test is partially captured by checking avgs in source vs reference data, unless erroneous zeroes continue for more than a week
-  * Also partially captured by outlier checking, depending on `size_cut` setting. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
-* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
-* If can't get data from API, do we want to use substitute data for the comparative checks instead?
-  * Currently, any API fetch problems just doesn't do comparative checks at all.
-  * E.g. most recent successful API pull -- might end up being a couple weeks older
-* Improve performance and reduce runtime (no particular goal, just avoid being painfully slow!)
+  * Regions with small populations (e.g. small counties or MSAs) and rare signals (e.g. deaths, since deaths << cases) are likely to cause false positives
+  * This test is captured by `check_avg_val_vs_reference`, as long as erroneous zeroes persist for less than the reference period (1-2 weeks)
+  * Also partially captured by `check_positive_negative_spikes`, depending on the `size_cut` setting; however, that check has limited applicability and only applies to incident cases and deaths signals
+* Instead of failing validation for a single check error, compare the rate of check failures to the historical rate? Requires caching and updating historical failure rates by signal, data source, and geo region. Unclear if worthwhile.
+* Improve performance and reduce runtime (no particular goal, just handle the low-hanging fruit and avoid being painfully slow!)
  * Profiling (iterate)
  * Save intermediate files?
  * Currently a bottleneck in the "individual file checks" section. Parallelize?
  * Make `all_frames` MultiIndexed by geo type and signal name? Or make a dict of data frames indexed by geo type and signal name? May improve performance or may just make access more readable.
-* Ensure validator runs on signals that require AWS credentials (iterate)
+* Revisit the tuning of thresholds for outlier-related checks (`check_positive_negative_spikes`, `check_avg_val_vs_reference`) or the parameters set in params.json.template (see the z-score sketch after this list)
+  * Thresholds are currently z-scores manually tuned on 1-2 months of data (June-July 2021), but signal behavior may change
+  * Certain signals (e.g. locally monotonic signals, sparse signals) exhibit different behavior and may require signal-specific parameters for checks such as z-scores
+  * Use caching to store these parameters and update them dynamically using recent data?
+* Create different error levels for checks beyond warning and critical: useful because certain checks clearly indicate some form of data corruption (e.g. `check_missing_date_files` identifying missing data), while other checks just report abnormal behavior that may turn out to be explainable.
+* Compare the current validator against known instances of data issues to evaluate performance (may be difficult if data corrections have since been issued)
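
For the threshold-tuning item above, a generic z-score flagger of the kind being tuned might look like the following. This is a sketch only, not the actual logic of `check_positive_negative_spikes` or `check_avg_val_vs_reference`.

```python
# Generic z-score outlier flagging, sketched to illustrate the kind of
# threshold the tuning item above refers to. Not the actual logic of
# check_positive_negative_spikes or check_avg_val_vs_reference.
import pandas as pd

def flag_outliers(series: pd.Series, window: int = 14, z_threshold: float = 3.0) -> pd.Series:
    """Flag points deviating from a trailing-window mean by more than
    z_threshold standard deviations. Sparse or locally monotonic signals
    may need a different threshold, or a different statistic entirely."""
    rolling = series.rolling(window, min_periods=window // 2)
    z = (series - rolling.mean()) / rolling.std()
    return z.abs() > z_threshold
```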
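One possible shape for the easier-to-read error report discussed above; the field names here are assumptions, and the real ValidationError class may look different.

```python
# A sketch of a friendlier error report, per the "Easier-to-read error
# report" item above. Field names are assumptions; the real ValidationError
# class may differ.
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class ValidationError:
    check_name: str   # e.g. "check_bad_se"
    geo_type: str
    signal: str
    message: str

    def __str__(self) -> str:
        return f"[{self.check_name}] {self.geo_type}/{self.signal}: {self.message}"

def summarize(errors: List[ValidationError]) -> str:
    """Collapse repeated errors into per-check counts plus one example each,
    so one failing check across many datasets prints as a single line."""
    counts = Counter(e.check_name for e in errors)
    example = {}
    for e in errors:
        example.setdefault(e.check_name, e)
    return "\n".join(
        f"{name}: {n} occurrence(s), e.g. {example[name]}"
        for name, n in counts.items()
    )
```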

-### Longer-term issues
+### Longer-term features

* Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
* Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days.